Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 94,912 bioRxiv papers from 404,161 authors.
Most downloaded bioRxiv papers, all time
in category systems biology
2,384 results found.
5,003 downloads systems biology
Understanding the regulation and structure of ribosomes is essential to understanding protein synthesis and its dysregulation in disease. While ribosomes are believed to have a fixed stoichiometry among their core ribosomal proteins (RPs), some experiments suggest a more variable composition. Testing such variability requires direct and precise quantification of RPs. We used mass-spectrometry to directly quantify RPs across monosomes and polysomes of mouse embryonic stem cells (ESC) and budding yeast. Our data show that the stoichiometry among core RPs in wild-type yeast cells and ESC depends both on the growth conditions and on the number of ribosomes bound per mRNA. Furthermore, we find that the fitness of cells with a deleted RP-gene is inversely proportional to the enrichment of the corresponding RP in polysomes. Together, our findings support the existence of ribosomes with distinct protein composition and physiological function.
4,935 downloads systems biology
Understanding how gene expression in single cells progresses over time is vital for revealing the mechanisms governing cell fate transitions. RNA velocity, which infers immediate changes in gene expression by comparing levels of new (unspliced) versus mature (spliced) transcripts (La Manno et al. 2018), represents an important advance to these efforts. A key question remaining is whether it is possible to predict the most probable cell state backward or forward over arbitrary time-scales. To this end, we introduce an inclusive model (termed Dynamo) capable of predicting cell states over extended time periods, that incorporates promoter state switching, transcription, splicing, translation and RNA/protein degradation by taking advantage of scRNA-seq and the co-assay of transcriptome and proteome. We also implement scSLAM-seq by extending SLAM-seq to plate-based scRNA-seq (Hendriks et al. 2018; Erhard et al. 2019; Cao, Zhou, et al. 2019) and augment the model by explicitly incorporating the metabolic labelling of nascent RNA. We show that through careful design of labelling experiments and an efficient mathematical framework, the entire kinetic behavior of a cell from this model can be robustly and accurately inferred. Aided by the improved framework, we show that it is possible to reconstruct the transcriptomic vector field from sparse and noisy vector samples generated by single cell experiments. The reconstructed vector field further enables global mapping of potential landscapes that reflect the relative stability of a given cell state, and the minimal transition time and most probable paths between any cell states in the state space. This work thus foreshadows the possibility of predicting long-term trajectories of cells during a dynamic process instead of short time velocity estimates. Our methods are implemented as an open source tool, dynamo (https://github.com/aristoteleo/dynamo-release).
4,772 downloads systems biology
DNA double-strand breaks are lesions that form during metabolism, DNA replication and exposure to mutagens. When a double-strand break occurs one of a number of repair mechanisms is recruited, all of which have differing propensities for mutational events. Despite DNA repair being of crucial importance, the relative contribution of these mechanisms and their regulatory interactions remain to be fully elucidated. Understanding these mutational processes will have a profound impact on our knowledge of genomic instability, with implications across health, disease and evolution. Here we present a new method to model the combined activation of non-homologous end joining, single strand annealing and alternative end joining, following exposure to ionizing radiation. We use Bayesian statistics to integrate eight biological data sets of double-strand break repair curves under varying genetic knockouts and confirm that our model is predictive by re-simulating and comparing to additional data. Analysis of the model suggests that there are at least three disjoint modes of repair, which we assign as fast, slow and intermediate. Our results show that when multiple data sets are combined, the rate for intermediate repair is variable amongst genetic knockouts. Further analysis suggests that the ratio between slow and intermediate repair depends on the presence or absence of DNA-PKcs and Ku70, which implies that non-homologous end joining and alternative end joining are not independent. Finally, we consider the proportion of double-strand breaks within each mechanism as a time series and predict activity as a function of repair rate. We outline how our insights can be directly tested using imaging and sequencing techniques and conclude that there is evidence of variable dynamics in alternative repair pathways. Our approach is an important step towards providing a unifying theoretical framework for the dynamics of DNA repair processes.
4,641 downloads systems biology
Massively multiplexed sequencing of RNA in individual cells is transforming basic and clinical life sciences. However, in standard experiments, tissues must first be dissociated. Thus, after sequencing, information about the spatial relationships between cells is lost, although this knowledge is crucial for understanding cellular and tissue-level function. Recent attempts to overcome this fundamental challenge rely on employing additional in situ gene expression imaging data which can guide spatial mapping of sequenced cells. Here we present a conceptually different approach that allows the spatial positions of cells to be reconstructed in a variety of tissues without using reference imaging data. We first show for several complex biological systems that distances of single cells in expression space monotonically increase with their physical distances across tissues. We therefore seek to map cells to tissue space such that this principle is optimally preserved, while matching existing imaging data when available. We show that this optimization problem can be cast as a generalized optimal transport problem and solved efficiently. We apply our approach successfully to reconstruct the mammalian liver and intestinal epithelium as well as fly and zebrafish embryos. Our results demonstrate a simple spatial expression organization principle and that this principle (or future refined principles) can be used to infer, for individual cells, meaningful spatial position probabilities from the sequencing data alone.
4,267 downloads systems biology
Non-genetic factors can cause individual cells to fluctuate substantially in gene expression levels over time. Yet it remains unclear whether these fluctuations can persist for much longer than the time of one cell division. Current methods for measuring gene expression in single cells mostly rely on single time point measurements, making the duration of gene expression fluctuations or cellular memory difficult to measure. Here, we report a method combining Luria and Delbrück’s fluctuation analysis with population-based RNA sequencing (MemorySeq) for identifying genes transcriptome-wide whose fluctuations persist for several cell divisions. MemorySeq revealed multiple gene modules that are expressed together in rare cells within otherwise homogeneous clonal populations. Further, we found that these rare cell subpopulations are associated with biologically distinct behaviors, such as the ability to proliferate in the face of anti-cancer therapeutics, in different cancer cell lines. The identification of non-genetic, multigenerational fluctuations has the potential to reveal new forms of biological memory at the level of single cells and suggests that non-genetic heritability of cellular state may be a quantitative property.
4,207 downloads systems biology
Benoit Lehallier, David Gate, Nicholas Schaum, Tibor Nanasi, Song Eun Lee, Hanadie Yousef, Patricia Moran Losada, Daniela Berdnik, Andreas Keller, Joe Verghese, Sanish Sathyan, Claudio Franceschi, Sofiya Milman, Nir Barzilai, Tony Wyss-Coray
Aging is the predominant risk factor for numerous chronic diseases that limit healthspan. Mechanisms of aging are thus increasingly recognized as therapeutic targets. Blood from young mice reverses aspects of aging and disease across multiple tissues, pointing to the intriguing possibility that age-related molecular changes in blood can provide novel insight into disease biology. We measured 2,925 plasma proteins from 4,331 young adults to nonagenarians and developed a novel bioinformatics approach which uncovered profound non-linear alterations in the human plasma proteome with age. Waves of changes in the proteome in the fourth, seventh, and eighth decades of life reflected distinct biological pathways, and revealed differential associations with the genome and proteome of age-related diseases and phenotypic traits. This new approach to the study of aging led to the identification of unexpected signatures and pathways of aging and disease and offers potential pathways for aging interventions.
4,168 downloads systems biology
Traver Hart, Amy Tong, Katie Chan, Jolanda Van Leeuwen, Ashwin Seetharaman, Michael Aregger, Megha Chandrashekhar, Nicole Hustedt, Sahil Seth, Avery Noonan, Andrea Habsid, Olga Sizova, Lyudmila Nedyalkova, Ryan Climie, Keith Lawson, Maria Augusta Sartori, Sabriyeh Alibeh, David Tieu, Sanna Masud, Patricia Mero, Alexander Weiss, Kevin R. Brown, Matej Usaj, Maximilian Billmann, Mahfuzur Rahman, Michael Constanzo, Chad L. Myers, Brenda J. Andrews, Charles Boone, Daniel Durocher, Jason Moffat
The adaptation of CRISPR/Cas9 technology to mammalian cell lines is transforming the study of human functional genomics. Pooled libraries of CRISPR guide RNAs (gRNAs), targeting human protein-coding genes and encoded in viral vectors, have been used to systematically create gene knockouts in a variety of human cancer and immortalized cell lines, in an effort to identify whether these knockouts cause cellular fitness defects. Previous work has shown that CRISPR screens are more sensitive and specific than pooled library shRNA screens in similar assays, but currently there exists significant variability across CRISPR library designs and experimental protocols. In this study, we re-analyze 17 genome-scale knockout screens in human cell lines from three research groups using three different genome-scale gRNA libraries, using the Bayesian Analysis of Gene Essentiality (BAGEL) algorithm to identify essential genes, to refine and expand our previously defined set of human core essential genes, from 360 to 684 genes. We use this expanded set of reference Core Essential Genes (CEG2), plus empirical data from six CRISPR knockout screens, to guide the design of a sequence-optimized gRNA library, the Toronto KnockOut version 3.0 (TKOv3) library. We demonstrate the high effectiveness of the library relative to reference sets of essential and nonessential genes as well as other screens using similar approaches. The optimized TKOv3 library, combined with the CEG2 reference set, provides an efficient, highly optimized platform for performing and assessing gene knockout screens in human cell lines.
4,166 downloads systems biology
Janine Arloth, Gökcen Eraslan, Till FM Andlauer, Jade Martins, Stella Iurato, Brigitte Kühnel, Melanie Waldenberger, Josef Frank, Ralf Gold, Bernhard Hemmer, Felix Luessi, Sandra Nischwitz, Friedemann Paul, Heinz Wiendl, Christian Gieger, Stefanie Heilmann-Heimbach, Tim Kacprowski, Matthias Laudes, Thomas Meitinger, Annette Peters, Rajesh Rawal, Konstantin Strauch, Susanne Lucae, Bertram Müller-Myhsok, Marcella Rietschel, Fabian J. Theis, Elisabeth B. Binder, Nikola S. Mueller
Genome-wide association studies (GWAS) identify genetic variants associated with quantitative traits or disease. Thus, GWAS never directly link variants to regulatory mechanisms, which, in turn, are typically inferred during post-hoc analyses. In parallel, a recent deep learning-based method allows for prediction of regulatory effects per variant on currently up to 1,000 cell type-specific chromatin features. We here describe "DeepWAS", a new approach that directly integrates predictions of these regulatory effects of single variants into a multivariate GWAS setting. As a result, single variants associated with a trait or disease are, by design, coupled to their impact on a chromatin feature in a cell type. Up to 40,000 regulatory single-nucleotide polymorphisms (SNPs) were associated with multiple sclerosis (MS, 4,888 cases and 10,395 controls), major depressive disorder (MDD, 1,475 cases and 2,144 controls), and height (5,974 individuals) to each identify 43-61 regulatory SNPs, called deepSNPs, which are shown to reach at least nominal significance in large GWAS. MS- and height-specific deepSNPs resided in active chromatin and introns, whereas MDD-specific deepSNPs located mostly to intragenic regions and repressive chromatin states. We found deepSNPs to be enriched in public or cohort-matched expression and methylation quantitative trait loci and demonstrate the potential of the DeepWAS method to directly generate testable functional hypotheses based on genotype data alone. DeepWAS is an innovative GWAS approach with the power to identify individual SNPs in non-coding regions with gene regulatory capacity with a joint contribution to disease risk. DeepWAS is available at https://github.com/cellmapslab/DeepWAS.
4,144 downloads systems biology
Piper longum L. (P. longum, also known as long pepper) is a common culinary herb and has been extensively used as an important constituent of various indigenous medicines, specifically in the traditional Indian medicinal system known as Ayurveda. Towards obtaining a global regulatory framework of P. longum's constituents, in this work we first reviewed the phytochemicals present in this herb and then studied their pharmacological and medicinal features using a network pharmacology approach. We developed high-confidence tripartite networks of phytochemical, protein target and disease associations and explain the role of its phytochemicals in various chronic diseases. Seven drug-like phytochemicals in this herb were found to be potential regulators of 5 FDA-approved drug targets, and 28 novel drug targets were also reported. A total of 105 phytochemicals were linked with immunomodulatory potency by pathway-level mapping in the human metabolic network. A sub-network of human PPI regulated by its phytochemicals was derived, and various modules in this sub-network were successfully associated with specific diseases.
4,119 downloads systems biology
Wyler Emanuel, Mösbauer Kirstin, Vedran Franke, Diag Asija, Gottula Lina Theresa, Arsie Roberto, Klironomos Filippos, Koppstein David, Ayoub Salah, Buccitelli Christopher, Richter Anja, Legnini Ivano, Ivanov Andranik, Mari Tommaso, Del Giudice Simone, Papies Jan Patrick, Müller Marcel Alexander, Niemeyer Daniela, Selbach Matthias, Altuna Akalin, Nikolaus Rajewsky, Drosten Christian, Landthaler Markus
The coronavirus disease 2019 (COVID-19) pandemic, caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is an ongoing global health threat with more than two million infected people since its emergence in late 2019. Detailed knowledge of the molecular biology of the infection is indispensable for understanding viral replication, host responses, and disease progression. We provide gene expression profiles of SARS-CoV and SARS-CoV-2 infections in three human cell lines (H1299, Caco-2 and Calu-3 cells), using bulk and single-cell transcriptomics. Small RNA profiling showed strong expression of the immunity and inflammation-associated microRNA miRNA-155 upon infection with both viruses. SARS-CoV-2 elicited approximately two-fold higher stimulation of the interferon response compared to SARS-CoV in the permissive human epithelial cell line Calu-3, and induction of cytokines such as CXCL10 or IL6. Single cell RNA sequencing data showed that canonical interferon stimulated genes such as IFIT2 or OAS2 were broadly induced, whereas interferon beta (IFNB1) and lambda (IFNL1-4) were expressed only in a subset of infected cells. In addition, temporal resolution of transcriptional responses suggested that the activity of interferon regulatory factors (IRFs) precedes that of nuclear factor-κB (NF-κB). Lastly, we identified heat shock protein 90 (HSP90) as a protein relevant for the infection. Inhibition of the HSP90 chaperone activity by Tanespimycin/17-N-allylamino-17-demethoxygeldanamycin (17-AAG) resulted in a reduction of viral replication, and of TNF and IL1B mRNA levels. In summary, our study established in vitro cell culture models to study SARS-CoV-2 infection and identified HSP90 protein as a potential drug target for therapeutic intervention of SARS-CoV-2 infection. Competing Interest Statement: The authors have declared no competing interest.
3,993 downloads systems biology
Dongxue Wang, Basak Eraslan, Thomas Wieland, Björn Hallström, Thomas Hopf, Daniel Paul Zolg, Jana Zecha, Anna Asplund, Li-hua Li, Chen Meng, Martin Frejno, Tobias Schmidt, Karsten Schnatbaum, Mathias Wilhelm, Frederik Ponten, Mathias Uhlén, Julien Gagneur, Hannes Hahne, Bernhard Kuster
Genome-, transcriptome- and proteome-wide measurements provide valuable insights into how biological systems are regulated. However, even fundamental aspects relating to which human proteins exist, where they are expressed and in which quantities are not fully understood. Therefore, we have generated a systematic, quantitative and deep proteome and transcriptome abundance atlas from 29 paired healthy human tissues from the Human Protein Atlas Project and representing human genes by 17,615 transcripts and 13,664 proteins. The analysis revealed that few proteins show truly tissue-specific expression, that vast differences between mRNA and protein quantities within and across tissues exist and that the expression levels of proteins are often more stable across tissues than those of transcripts. In addition, only ~2% of all exome and ~7% of all mRNA variants could be confidently detected at the protein level showing that proteogenomics remains challenging, requires rigorous validation using synthetic peptides and needs more sophisticated computational methods. Many uses of this resource can be envisaged ranging from the study of gene/protein expression regulation to protein biomarker specificity evaluation to name a few.
3,794 downloads systems biology
Transcriptional regulation occurs via changes to the rates of various biochemical processes. Sequencing based approaches that average together many cells have suggested that polymerase binding and polymerase release from promoter proximal pausing are two key regulated steps in the transcriptional process. However, single cell studies have revealed that transcription occurs in short, discontinuous bursts, suggesting that transcriptional burst initiation and termination might also be regulated steps. Here, we develop and apply a quantitative framework to connect changes in both Pol II ChIP sequencing and single cell transcriptional measurements to changes in the rates of specific steps of transcription. Using a number of global and targeted transcriptional regulatory perturbations, we show that burst initiation rate is indeed a key regulated step, demonstrating that transcriptional activity can be frequency modulated. Polymerase pause release is a second key regulated step, but the rate of polymerase binding is not changed by any of the biological perturbations we examined. Our results establish an important role for transcriptional burst regulation in the control of gene expression.
3,636 downloads systems biology
Although mRNAs are key molecules for understanding life, there exists no method to determine the full-length sequence of endogenous mRNAs including their poly(A) tails. Moreover, although poly(A) tails can be modified in functionally important ways, there also exists no method to accurately sequence them. Here, we present FLAM-seq, a rapid and simple method for high-quality sequencing of entire mRNAs. We report a cDNA library preparation method coupled to single-molecule sequencing to perform FLAM-seq. Using human cell lines, brain organoids, and C. elegans we show that FLAM-seq delivers high-quality full-length mRNA sequences for thousands of different genes per sample. We find that (a) 3' UTR length is correlated with poly(A) tail length, (b) alternative polyadenylation sites and alternative promoters for the same gene are linked to different tail lengths, (c) tails contain a significant number of cytosines. Thus, we provide a widely useful method and fundamental insights into poly(A) tail regulation.
3,509 downloads systems biology
Maximilian Strunz, Lukas M. Simon, Meshal Ansari, Laura F Mattner, Ilias Angelidis, Christoph H Mayr, Jaymin Kathiriya, Min Yee, Paulina Ogar, Arunima Sengupta, Igor Kukhtevich, Robert Schneider, Zhongming Zhao, Jens H.L. Neumann, Jürgen Behr, Carola Voss, Tobias Stöger, Mareike Lehmann, Melanie Königshoff, Gerald Burgstaller, Michael O’Reilly, Harold A. Chapman, Fabian J. Theis, Herbert B Schiller
Lung injury activates quiescent stem and progenitor cells to regenerate alveolar structures. The sequence and coordination of transcriptional programs during this process have largely remained elusive. Using single cell RNA-seq, we first generated a whole-organ bird’s-eye view on cellular dynamics and cell-cell communication networks during mouse lung regeneration from ~30,000 cells at six timepoints. We discovered an injury-specific progenitor cell state characterized by Krt8 in flat epithelial cells covering alveolar surfaces. The number of these cells peaked during fibrogenesis in independent mouse models, as well as in human acute lung injury and fibrosis. Krt8+ alveolar progenitors featured a highly distinct connectome of receptor-ligand pairs with endothelial cells, fibroblasts, and macrophages. To ‘sky dive’ into epithelial differentiation dynamics, we sequenced >30,000 sorted epithelial cells at 18 timepoints and computationally derived cell state trajectories that were validated by lineage tracing genetic reporter mice. Airway stem cells within the club cell lineage and alveolar type-2 cells underwent transcriptional convergence onto the same Krt8+ progenitor cell state, which later resolved by terminal differentiation into alveolar type-1 cells. We derived distinct transcriptional regulators as key switch points in this process and show that induction of NFkB, p53, and hypoxia driven gene expression programs precedes a Sox4, Ctnnb1, and Wwtr1 driven commitment towards alveolar type-1 cell fate. We show that epithelial cell plasticity can induce non-gradual transdifferentiation, involving intermediate progenitor cell states that may persist and promote disease if checkpoint signals for terminal differentiation are perturbed.
3,390 downloads systems biology
Determining the three dimensional structures of macromolecules is a major goal of biological research because of the close relationship between structure and function. Structure determination usually relies on physical techniques including x-ray crystallography, NMR spectroscopy and cryo-electron microscopy. Here we present a method that allows the high-resolution three-dimensional structure of a biological macromolecule to be determined only from measurements of the activity of mutant variants of the molecule. This genetic approach to structure determination relies on the quantification of genetic interactions (epistasis) between mutations and the discrimination of direct from indirect interactions. This provides a new experimental strategy for structure determination, with the potential to reveal functional and in vivo structural conformations at low cost and high throughput.
3,349 downloads systems biology
Bulk-tissue RNA-Seq is seeing increasing use in the study of physiological and pathophysiological processes in the kidney. However, the presence of multiple cell types in kidney complicates the interpretation of the data. Here we address the question, "What cell types are represented in whole-kidney RNA-Seq data?", to identify circumstances in which bulk-kidney RNA-Seq can successfully be interpreted. We carried out RNA-Seq in mouse whole kidneys and microdissected proximal S2 segments. To aid in the interpretation of the data, we compiled a database of cell-type selective protein markers for 43 cell types believed to be present in kidney tissue. The whole-kidney RNA-Seq analysis identified transcripts corresponding to 17,742 genes, distributed over 5 orders of magnitude of expression level. Markers for all 43 curated cell types were detectable. Analysis of the cellular makeup of a mouse kidney, calculated from published literature, suggests that proximal tubule cells likely account for more than half of the mRNA in a kidney. Comparison of RNA-Seq data from microdissected proximal tubules with whole-kidney data supports this view. RNA-Seq data for cell-type selective markers in bulk-kidney samples provide a valid means to identify changes in minority-cell abundances in kidney tissue. Although proximal tubules make up a substantial fraction of whole-kidney samples, changes in proximal tubule gene expression could be obscured by the presence of mRNA from other cell types.
3,320 downloads systems biology
Nathanael G. Lintner, Kim F. McClure, Donna Petersen, Allyn T. Londregan, David W. Piotrowski, Liuqing Wei, Jun Xiao, Michael Bolt, Paula M. Loria, Bruce Maguire, Kieran F. Geoghegan, Austin Huang, Tim Rolph, Spiros Liras, Jennifer Doudna, Robert G. Dullea, Jamie H. D. Cate
Proprotein Convertase Subtilisin/Kexin Type 9 (PCSK9) plays a key role in regulating the levels of plasma low density lipoprotein cholesterol (LDL-C). Here we demonstrate that the compound PF-06446846 inhibits translation of PCSK9 by inducing the ribosome to stall around codon 34, mediated by the sequence of the nascent chain within the exit tunnel. We further show that PF-06446846 reduces plasma PCSK9 and total cholesterol levels in rats following oral dosing. Using ribosome profiling, we demonstrate that PF-06446846 is highly selective for the inhibition of PCSK9 translation. The mechanism of action employed by PF-06446846 reveals a previously unexpected tunability of the human ribosome, which allows small molecules to specifically block translation of individual transcripts.
3,266 downloads systems biology
Florian Meier, Andreas-David Brunner, Max Frank, Annie Ha, Isabell Bludau, Eugenia Voytik, Stephanie Kaspar-Schoenefeld, Markus Lubeck, Oliver Raether, Ruedi Aebersold, Ben C. Collins, Hannes Röst, Matthias Mann
Data independent acquisition (DIA) modes isolate and concurrently fragment populations of different precursors by cycling through segments of a predefined precursor m/z range. Although these selection windows collectively cover the entire m/z range, overall only a few percent of all incoming ions are sampled. Making use of the correlation of molecular weight and ion mobility in a trapped ion mobility device (timsTOF Pro), we here devise a novel scan mode that samples up to 100% of the peptide precursor ion current. We extend an established targeted data extraction workflow by including the ion mobility dimension for both signal extraction and scoring, thereby increasing the specificity for precursor identification. Data acquired from whole proteome digests and mixed organism samples demonstrate deep proteome coverage and a very high degree of reproducibility as well as quantitative accuracy, even from 10 ng sample amounts.
3,261 downloads systems biology
In morphological profiling, quantitative data are extracted from microscopy images of cells to identify biologically relevant similarities and differences among samples based on these profiles. This protocol describes the design and execution of experiments using Cell Painting, a morphological profiling assay multiplexing six fluorescent dyes imaged in five channels, to reveal eight broadly relevant cellular components or organelles. Cells are plated in multi-well plates, perturbed with the treatments to be tested, stained, fixed, and imaged on a high-throughput microscope. Then, automated image analysis software identifies individual cells and measures ~1,500 morphological features (various measures of size, shape, texture, intensity, etc.) to produce a rich profile suitable for detecting subtle phenotypes. Profiles of cell populations treated with different experimental perturbations can be compared to suit many goals, such as identifying the phenotypic impact of chemical or genetic perturbations, grouping compounds and/or genes into functional pathways, and identifying signatures of disease. Cell culture and image acquisition take two weeks; feature extraction and data analysis take an additional 1-2 weeks.
3,238 downloads systems biology
RNA profiling is an excellent phenotype of cellular responses and tissue states, but can be costly to generate at the massive scale required for studies of regulatory circuits, genetic states or perturbation screens. Here, we draw on a series of advances over the last decade in the field of mathematics to establish a rigorous link between biological structure, data compressibility, and efficient data acquisition. We propose that very few random composite measurements - in which gene abundances are combined in a random linear combination - are needed to approximate the high-dimensional similarity between any pair of gene abundance profiles. We then show how finding latent, sparse representations of gene expression data would enable us to 'decompress' a small number of random composite measurements and recover high-dimensional gene expression levels that were not measured (unobserved). We present a new algorithm for finding sparse, modular structure, which improves the ability to interpret samples in terms of small numbers of active modules, and show that the modular structure we find is sufficient to recover gene expression profiles from composite measurements (with ~100-fold fewer composite measurements than genes). Moreover, the knowledge that sparse, modular structures exist allows us to recover expression profiles from composite measurements, even without access to any training data. Finally, we present a proof-of-concept experiment for making composite measurements in the laboratory, involving the measurement of linear combinations of RNA abundances. Altogether, our results suggest new compressive modalities in experimental biology that can form a foundation for massive scaling in high-throughput measurements, while also offering new insights into the interpretation of high-dimensional data. A recorded seminar presentation of this work is available at: https://www.youtube.com/watch?v=2dBZEOXqKHs
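The abstract above claims that a small number of random composite measurements (random linear combinations of gene abundances) can approximate the similarity between two expression profiles. The toy sketch below illustrates only that one idea, using a generic Gaussian random projection on synthetic data. It is not the authors' algorithm: the function names, the profile sizes, and the exponential abundance model are all assumptions made for illustration.

```python
import math
import random


def random_composite_measurements(profile, weights):
    """Return one random linear combination of all gene abundances per weight row."""
    return [sum(w * x for w, x in zip(row, profile)) for row in weights]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def demo(n_genes=2000, n_measurements=200, seed=0):
    rng = random.Random(seed)
    # Two synthetic abundance profiles (hypothetical data, not from the paper).
    profile_a = [rng.expovariate(1.0) for _ in range(n_genes)]
    profile_b = [rng.expovariate(1.0) for _ in range(n_genes)]
    # Gaussian weights scaled so squared distances are preserved in expectation.
    scale = 1.0 / math.sqrt(n_measurements)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_genes)]
               for _ in range(n_measurements)]
    full = euclidean(profile_a, profile_b)
    compressed = euclidean(random_composite_measurements(profile_a, weights),
                           random_composite_measurements(profile_b, weights))
    return full, compressed


full_distance, compressed_distance = demo()
ratio = compressed_distance / full_distance
print(f"full: {full_distance:.2f}  compressed: {compressed_distance:.2f}  ratio: {ratio:.2f}")
```

With 200 measurements standing in for 2,000 genes, the ratio of compressed to full distance should land close to 1. Recovering the unmeasured expression levels themselves, as the abstract describes, would additionally require a sparse, modular representation, which this sketch does not attempt.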
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!