Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 62,198 bioRxiv papers from 276,129 authors.
Most downloaded bioRxiv papers, all time
in category genomics
4,215 results found.
12,248 downloads genomics
Single cell RNA-seq (scRNA-seq) has emerged as a transformative tool to discover and define cellular phenotypes. While computational scRNA-seq methods are currently well suited for experiments representing a single condition, technology, or species, analyzing multiple datasets simultaneously raises new challenges. In particular, traditional analytical workflows struggle to align subpopulations that are present across datasets, limiting the possibility for integrated or comparative analysis. Here, we introduce a new computational strategy for scRNA-seq alignment, utilizing common sources of variation to identify shared subpopulations between datasets as part of our R toolkit Seurat. We demonstrate our approach by aligning scRNA-seq datasets of PBMCs under resting and stimulated conditions, hematopoietic progenitors sequenced across two profiling technologies, and pancreatic cell 'atlases' generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across datasets, and can identify subpopulations that could not be detected by analyzing datasets independently. We anticipate that these methods will serve not only to correct for batch or technology-dependent effects, but also to facilitate general comparisons of scRNA-seq datasets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution. Availability: Installation instructions, documentation, and tutorials are available at http://www.satijalab.org/seurat
11,962 downloads genomics
Constructing an atlas of cell types in complex organisms will require a collective effort to characterize billions of individual cells. Single cell RNA sequencing (scRNA-seq) has emerged as the main tool for characterizing cellular diversity, but current methods use custom microfluidics or microwells to compartmentalize single cells, limiting scalability and widespread adoption. Here we present Split Pool Ligation-based Transcriptome sequencing (SPLiT-seq), a scRNA-seq method that labels the cellular origin of RNA through combinatorial indexing. SPLiT-seq is compatible with fixed cells, scales exponentially, uses only basic laboratory equipment, and costs one cent per cell. We used this approach to analyze 109,069 single cell transcriptomes from an entire postnatal day 5 mouse brain, providing the first global snapshot at this stage of development. We identified 13 main populations comprising different types of neurons, glia, immune cells, endothelia, as well as types in the blood-brain-barrier. Moreover, we resolve substructure within these clusters corresponding to cells at different stages of development. As sequencing capacity increases, SPLiT-seq will enable profiling of billions of cells in a single experiment.
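The exponential scaling of combinatorial indexing described in this abstract can be illustrated with some simple arithmetic. The following is a minimal sketch, not the SPLiT-seq protocol itself: the plate size (96 wells), the number of barcoding rounds (3), the uniform-random assignment model, and both function names are illustrative assumptions that do not come from the abstract.

```python
def barcode_combinations(wells_per_round: int, rounds: int) -> int:
    """Number of distinct barcode combinations after repeated split-pool rounds.

    Each round multiplies the barcode space by the number of wells,
    which is why capacity grows exponentially with the number of rounds.
    """
    return wells_per_round ** rounds


def expected_collision_fraction(n_cells: int, n_barcodes: int) -> float:
    """Expected fraction of cells that share a barcode with at least one other cell.

    Assumes (hypothetically) that every cell receives one of `n_barcodes`
    combinations uniformly and independently at random.
    """
    if n_cells < 1:
        return 0.0
    return 1.0 - (1.0 - 1.0 / n_barcodes) ** (n_cells - 1)


if __name__ == "__main__":
    # Assumed values for illustration only: 96-well plates, 3 rounds.
    space = barcode_combinations(96, 3)
    print(space)
    print(expected_collision_fraction(10_000, space))
```

Under these assumptions, adding one more round multiplies the barcode space by the plate size, so the collision fraction for a fixed number of cells falls by roughly the same factor.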
11,443 downloads genomics
Grace X.Y. Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, Tobias D. Wheeler, Geoff P. McDermott, Junjie Zhu, Mark T. Gregory, Joe Shuga, Luz Montesclaros, Donald A Masquelier, Stefanie Y. Nishimura, Michael Schnall-Levin, Paul W Wyatt, Christopher M. Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness, Lan W. Beppu, H. Joachim Deeg, Christopher McFarland, Keith R. Loeb, William J. Valente, Nolan G. Ericson, Emily A. Stevens, Jerald P. Radich, Tarjei S. Mikkelsen, Benjamin J. Hindson, Jason H Bielas
Characterizing the transcriptome of individual cells is fundamental to understanding complex biological systems. We describe a droplet-based system that enables 3′ mRNA counting of up to tens of thousands of single cells per sample. Cell encapsulation in droplets takes place in ~6 minutes, with ~50% cell capture efficiency, up to 8 samples at a time. The speed and efficiency allow the processing of precious samples while minimizing stress to cells. To demonstrate the system's technical performance and its applications, we collected transcriptome data from ~¼ million single cells across 29 samples. First, we validate the sensitivity of the system and its ability to detect rare populations using cell lines and synthetic RNAs. Then, we profile 68k peripheral blood mononuclear cells (PBMCs) to demonstrate the system's ability to characterize large immune populations. Finally, we use sequence variation in the transcriptome data to determine host and donor chimerism at single cell resolution in bone marrow mononuclear cells (BMMCs) of transplant patients. This analysis enables characterization of the complex interplay between donor and host cells and monitoring of treatment response. This high-throughput system is robust and enables characterization of diverse biological systems with single cell mRNA analysis.
11,412 downloads genomics
The spatial organization of cells in tissue has a profound influence on their function, yet a high-throughput, genome-wide readout of gene expression with cellular resolution is lacking. Here, we introduce Slide-seq, a highly scalable method that enables facile generation of large volumes of unbiased spatial transcriptomes with 10 micron spatial resolution, comparable to the size of individual cells. In Slide-seq, RNA is transferred from freshly frozen tissue sections onto a surface covered in DNA-barcoded beads with known positions, allowing the spatial locations of the RNA to be inferred by sequencing. To demonstrate Slide-seq's utility, we localized cell types identified by large-scale scRNA-seq datasets within the cerebellum and hippocampus. We next systematically characterized spatial gene expression patterns in the Purkinje layer of mouse cerebellum, identifying new axes of variation across Purkinje cell compartments. Finally, we used Slide-seq to define the temporal evolution of cell-type-specific responses in a mouse model of traumatic brain injury. Slide-seq will accelerate biological discovery by enabling routine, high-resolution spatial mapping of gene expression.
11,344 downloads genomics
Eric L Van Nostrand, Peter Freese, Gabriel A Pratt, Xiaofeng Wang, Xintao Wei, Rui Xiao, Steven M. Blue, Jia-Yu Chen, Neal A.L. Cody, Daniel Dominguez, Sara Olson, Balaji Sundararaman, Lijun Zhan, Cassandra Bazile, Louis Philip Benoit Bouvrette, Julie Bergalet, Michael O Duff, Keri E. Garcia, Chelsea Gelboin-Burkhart, Myles Hochman, Nicole J Lambert, Hairi Li, Thai B Nguyen, Tsultrim Palden, Ines Rabano, Shashank Sathe, Rebecca Stanton, Amanda Su, Ruth Wang, Brian A Yee, Bing Zhou, Ashley L Louie, Stefan Aigner, Xiang-dong Fu, Eric Lécuyer, Christopher B. Burge, Brenton R. Graveley, Gene W Yeo
Genomes encompass all the information necessary to specify the development and function of an organism. In addition to genes, genomes also contain a myriad of functional elements that control various steps in gene expression. A major class of these elements function only when transcribed into RNA as they serve as the binding sites for RNA binding proteins (RBPs), which act to control post-transcriptional processes including splicing, cleavage and polyadenylation, RNA editing, RNA localization, translation, and RNA stability. Despite the importance of these functional RNA elements encoded in the genome, they have been much less studied than genes and DNA elements. Here, we describe the mapping and characterization of RNA elements recognized by a large collection of human RBPs in K562 and HepG2 cells. These data expand the catalog of functional elements encoded in the human genome by addition of a large set of elements that function at the RNA level through interaction with RBPs.
11,087 downloads genomics
Despite rapid developments in single cell sequencing technology, sample-specific batch effects, detection of cell doublets, and the cost of generating massive datasets remain outstanding challenges. Here, we introduce cell "hashing", where oligo-tagged antibodies against ubiquitously expressed surface proteins are used to uniquely label cells from distinct samples, which can be subsequently pooled. By sequencing these tags alongside the cellular transcriptome, we can assign each cell to its sample of origin, and robustly identify doublets originating from multiple samples. We demonstrate our approach by pooling eight human PBMC samples on a single run of the 10x Chromium system, substantially reducing our per-cell costs for library generation. Cell "hashing" is inspired by, and complementary to, elegant multiplexing strategies based on genetic variation, which we also leverage to validate our results. We therefore envision that our approach will help to generalize the benefits of single cell multiplexing to diverse samples and experimental designs.
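The idea of assigning each cell to a sample from its hashtag counts, and flagging cells that carry tags from more than one sample, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' classification procedure: the fixed count threshold, the tag names, and the function `classify_cell` are all hypothetical.

```python
from typing import Dict


def classify_cell(hashtag_counts: Dict[str, int], threshold: int) -> str:
    """Assign a cell to a sample from its per-hashtag read counts.

    A hashtag is called "positive" when its count reaches `threshold`
    (a fixed cutoff is an assumption made for this sketch).
    - no positive hashtag   -> "negative"
    - exactly one positive  -> that hashtag's sample name
    - two or more positive  -> "doublet" (cells from multiple samples)
    """
    positive = sorted(
        tag for tag, count in hashtag_counts.items() if count >= threshold
    )
    if not positive:
        return "negative"
    if len(positive) == 1:
        return positive[0]
    return "doublet"


if __name__ == "__main__":
    # Hypothetical counts for three cells and two hashtags.
    print(classify_cell({"sample_A": 120, "sample_B": 3}, threshold=50))
    print(classify_cell({"sample_A": 120, "sample_B": 90}, threshold=50))
    print(classify_cell({"sample_A": 1, "sample_B": 2}, threshold=50))
```

Only doublets that combine cells from different samples are detectable this way. Two cells from the same sample carry the same hashtag and would be assigned as a single cell.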
11,019 downloads genomics
Genetic privacy is an area of active research. While it is important to identify new risks, it is equally crucial to supply policymakers with accurate information based on scientific evidence. Recently, Lippert et al. (PNAS, 2017) investigated the status of genetic privacy using trait-predictions from whole genome sequencing. The authors sequenced a cohort of about 1000 individuals and collected a range of demographic, visible, and digital traits such as age, sex, height, face morphology, and a voice signature. They attempted to use the genetic features in order to predict those traits and re-identify the individuals from a small pool using the trait predictions. Here, I report major flaws in the Lippert et al. manuscript. In short, the authors' technique performs similarly to a simple baseline procedure, does not utilize the power of whole genome markers, uses technically wrong metrics, and finally does not really identify anyone.
10,829 downloads genomics
Thiseas C. Lamnidis, Kerttu Majander, Choongwon Jeong, Elina Salmela, Anna Wessman, Vyacheslav Moiseyev, Valery Khartanovich, Oleg Balanovsky, Matthias Ongyerth, Antje Weihmann, Antti Sajantila, Janet Kelso, Svante Pääbo, Päivi Onkamo, Wolfgang Haak, Johannes Krause, Stephan Schiffels
European history has been shaped by migrations of people, and their subsequent admixture. Recently, evidence from ancient DNA has brought new insights into migration events that could be linked to the advent of agriculture, and possibly to the spread of Indo-European languages. However, little is known so far about the ancient population history of north-eastern Europe, in particular about populations speaking Uralic languages, such as Finns and Saami. Here we analyse ancient genomic data from 11 individuals from Finland and Northwest Russia. We show that the specific genetic makeup of northern Europe traces back to migrations from Siberia that began at least 3,500 years ago. This ancestry was subsequently admixed into many modern populations in the region, in particular populations speaking Uralic languages today. In addition, we show that ancestors of modern Saami inhabited a larger territory during the Iron Age than today, which adds to historical and linguistic evidence for the population history of Finland.
10,600 downloads genomics
Zuzana Hofmanová, Susanne Kreutzer, Garrett Hellenthal, Christian Sell, Yoan Diekmann, David Díez-del-Molino, Lucy van Dorp, Saioa López, Athanasios Kousathanas, Vivian Link, Karola Kirsanow, Lara M. Cassidy, Rui Martiniano, Melanie Strobel, Amelie Scheu, Kostas Kotsakis, Paul Halstead, Sevi Triantaphyllou, Nina Kyparissi-Apostolika, Dushanka-Christina Urem-Kotsou, Christina Ziota, Fotini Adaktylou, Shyamalika Gopalan, Dean M Bobo, Laura Winkelbach, Jens Blöcher, Martina Unterländer, Christoph Leuenberger, Çiler Çilingiroğlu, Barbara Horejs, Fokke Gerritsen, Stephen Shennan, Daniel G. Bradley, Mathias Currat, Krishna R Veeramah, Daniel Wegmann, Mark G. Thomas, Christina Papageorgopoulou, Joachim Burger
Farming and sedentism first appear in southwest Asia during the early Holocene and later spread to neighboring regions, including Europe, along multiple dispersal routes. Conspicuous uncertainties remain about the relative roles of migration, cultural diffusion and admixture with local foragers in the early Neolithisation of Europe. Here we present paleogenomic data for five Neolithic individuals from northwestern Turkey and northern Greece, spanning the time and region of the earliest spread of farming into Europe. We observe striking genetic similarity both among Aegean early farmers and with those from across Europe. Our study demonstrates a direct genetic link between Mediterranean and Central European early farmers and those of Greece and Anatolia, extending the European Neolithic migratory chain all the way back to southwestern Asia.
10,391 downloads genomics
Oxford Nanopore Technologies' nanopore sequencing device, the MinION, holds the promise of sequencing ultra-long DNA fragments >100kb. An obstacle to realizing this promise is delivering ultra-long DNA molecules to the nanopores. We present our progress in developing cost-effective ways to overcome this obstacle and our resulting MinION data, including multiple reads >100kb.
10,369 downloads genomics
Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from 'regularized negative binomial regression', where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform (https://github.com/ChristophH/sctransform), with a direct interface to our single-cell toolkit Seurat.
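The Pearson residual of a negative binomial model with sequencing depth as a covariate can be written out in a few lines. This is a minimal sketch under stated assumptions, not the sctransform implementation: it uses the standard negative binomial variance mu + mu^2 / theta, assumes a log link with log10 of depth as the covariate, takes the parameters as given, and omits the regularization step (pooling information across genes) entirely. The function names and parameter values are hypothetical.

```python
import math


def expected_count(beta0: float, beta1: float, depth: float) -> float:
    """Expected count for one gene in one cell under a log-link GLM.

    Assumes log(mu) = beta0 + beta1 * log10(depth); the choice of log10
    for the depth covariate is an assumption of this sketch.
    """
    return math.exp(beta0 + beta1 * math.log10(depth))


def nb_pearson_residual(count: float, mu: float, theta: float) -> float:
    """Pearson residual under a negative binomial model.

    Uses the standard parameterization Var = mu + mu**2 / theta,
    where theta is the (inverse) dispersion parameter.
    """
    variance = mu + mu ** 2 / theta
    return (count - mu) / math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical per-gene parameters and a cell with 10,000 detected molecules.
    mu = expected_count(beta0=-8.0, beta1=2.3, depth=10_000)
    print(nb_pearson_residual(count=5, mu=mu, theta=10.0))
```

A count equal to its expected value gives a residual of zero, and the residual is scaled by the model's standard deviation, so deeply and shallowly sequenced cells are placed on a comparable scale.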
10,369 downloads genomics
Directed differentiation of cells in vitro is a powerful approach for dissection of developmental pathways, disease modeling and regenerative medicine, but analysis of such systems is complicated by heterogeneous and asynchronous cellular responses to differentiation-inducing stimuli. To enable deep characterization of heterogeneous cell populations, we developed an efficient digital gene expression profiling protocol that enables surveying of mRNA in thousands of single cells at a time. We then applied this protocol to profile 12,832 cells collected at multiple time points during directed adipogenic differentiation of human adipose-derived stem/stromal cells in vitro. The resulting data reveal the major axes of cell-to-cell variation within and between time points, and an inverse relationship between inflammatory gene expression and lipid accumulation across cells from a single donor.
10,349 downloads genomics
MicroRNAs (miRNAs) are predominantly negative regulators of gene expression that act through the RNA-induced Silencing Complex (RISC) to suppress the translation of protein coding mRNAs. Despite intense study of these regulatory molecules, the specific molecular functions of most miRNAs remain unknown, largely due to the challenge of accurately identifying miRNA targets. Reporter gene assays can determine direct interactions, but are laborious and do not scale to genome-wide screens. Genomic scale methods such as HITS-CLIP do not preserve the direct interactions, and rely on computationally derived predictions of interactions that are plagued by high false positive rates. Here we describe a protocol for the isolation of direct targets of a mature miRNA, using synthetic biotinylated miRNA duplexes. This approach allows sensitive and specific detection of miRNA-mRNA interactions, isolating high quality mRNA suitable for analysis by microarray or RNAseq.
10,229 downloads genomics
Tardigrades are meiofaunal ecdysozoans that are key to understanding the origins of Arthropoda. Many species of Tardigrada can survive extreme conditions through cryptobiosis. In a recent paper (Boothby TC et al (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proc Natl Acad Sci USA 112:15976-15981) the authors concluded that the tardigrade Hypsibius dujardini had an unprecedented proportion (17%) of genes originating through functional horizontal gene transfer (fHGT), and speculated that fHGT was likely formative in the evolution of cryptobiosis. We independently sequenced the genome of H. dujardini. As expected from whole-organism DNA sampling, our raw data contained reads from non-target genomes. Filtering using metagenomics approaches generated a draft H. dujardini genome assembly of 135 Mb with superior assembly metrics to the previously published assembly. Additional microbial contamination likely remains. We found no support for extensive fHGT. Among 23,021 gene predictions we identified 0.2% strong candidates for fHGT from bacteria, and 0.2% strong candidates for fHGT from non-metazoan eukaryotes. Cross-comparison of assemblies showed that the overwhelming majority of HGT candidates in the Boothby et al. genome derived from contaminants. We conclude that fHGT into H. dujardini accounts for at most 1-2% of genes and that the proposal that one sixth of tardigrade genes originate from functional HGT events is an artefact of undetected contamination.
9,865 downloads genomics
Junyue Cao, Jonathan S Packer, Vijay Ramani, Darren A. Cusanovich, Chau Huynh, Riza Daza, Xiaojie Qiu, Choli Lee, Scott N. Furlan, Frank J Steemers, Andrew Adey, Robert H. Waterston, Cole Trapnell, Jay Shendure
Conventional methods for profiling the molecular content of biological samples fail to resolve heterogeneity that is present at the level of single cells. In the past few years, single cell RNA sequencing has emerged as a powerful strategy for overcoming this challenge. However, its adoption has been limited by a paucity of methods that are at once simple to implement and cost effective to scale massively. Here, we describe a combinatorial indexing strategy to profile the transcriptomes of large numbers of single cells or single nuclei without requiring the physical isolation of each cell (Single cell Combinatorial Indexing RNA-seq or sci-RNA-seq). We show that sci-RNA-seq can be used to efficiently profile the transcriptomes of tens-of-thousands of single cells per experiment, and demonstrate that we can stratify cell types from these data. Key advantages of sci-RNA-seq over contemporary alternatives such as droplet-based single cell RNA-seq include sublinear cost scaling, a reliance on widely available reagents and equipment, the ability to concurrently process many samples within a single workflow, compatibility with methanol fixation of cells, cell capture based on DNA content rather than cell size, and the flexibility to profile either cells or nuclei. As a demonstration of sci-RNA-seq, we profile the transcriptomes of 42,035 single cells from C. elegans at the L2 stage, effectively 50-fold "shotgun cellular coverage" of the somatic cell composition of this organism at this stage. We identify 27 distinct cell types, including rare cell types such as the two distal tip cells of the developing gonad, estimate consensus expression profiles and define cell-type specific and selective genes. Given that C. elegans is the only organism with a fully mapped cellular lineage, these data represent a rich resource for future methods aimed at defining cell types and states. 
They will advance our understanding of developmental biology, and constitute a major leap towards a comprehensive, single-cell molecular atlas of a whole animal.
9,763 downloads genomics
Genome-wide, targeted loss-of-function pooled screens using the CRISPR (clustered regularly interspaced short palindromic repeats)-associated nuclease Cas9 in human and mouse cells provide an alternative screening system to RNA interference (RNAi). Initial lentiviral delivery systems for CRISPR screening had low viral titer or required a cell line already expressing Cas9, limiting the range of biological systems amenable to screening. In this work, we present 1- and 2-vector lentiCRISPR systems capable of producing higher viral titers and, in these vectors, new human and mouse libraries for genome-scale CRISPR knock-out (GeCKO) screening.
9,699 downloads genomics
Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used and numerous publications are based on data produced with this technology. However, RNA-Seq and scRNA-seq data are markedly different. In particular, unlike RNA-Seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by sequencing technology). Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results.
Finally, we demonstrate and discuss how batch-effects and confounded experiments can intensify the problem.
9,603 downloads genomics
Since its debut in 2009, single-cell RNA-seq has been a major driver of progress in biomedical research. Developmental biology and stem cell studies especially benefit from the ability to profile single cells. While most studies still focus on individual tissues or organs, the recent development of ultra-high-throughput single-cell RNA-seq has demonstrated the potential to depict more complex systems or even the entire body. Though multiple ultra-high-throughput single-cell RNA-seq systems have attracted attention, a systematic comparison of these systems is not yet available. Here we focus on three prevalent droplet-based ultra-high-throughput single-cell RNA-seq systems: inDrop, Drop-seq, and 10X Genomics Chromium. While each system is capable of profiling single-cell transcriptomes, detailed comparison revealed distinguishing features and suitable application scenarios for each system.
9,439 downloads genomics
Aaron M Wenger, Paul Peluso, William J Rowell, Pi-Chuan Chang, Richard Hall, Gregory T. Concepcion, Jana Ebler, Arkarachai Fungtammasan, Alexey Kolesnikov, Nathan D Olson, Armin Toepfer, Michael Alonge, Medhat Mahmoud, Yufeng Qian, Chen-Shan Chin, Adam M Phillippy, Michael C. Schatz, Gene Myers, Mark A. DePristo, Jue Ruan, Tobias Marschall, Fritz J. Sedlazeck, Justin M Zook, Heng Li, Sergey Koren, Andrew Carroll, David R Rank, Michael W Hunkapiller
The major DNA sequencing technologies in use today produce either highly-accurate short reads or noisy long reads. We developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly-accurate (99.8%) long reads averaging 13.5 kb and applied it to sequence the well-characterized human HG002/NA24385. We optimized existing tools to comprehensively detect variants, achieving precision and recall above 99.91% for SNVs, 95.98% for indels, and 95.99% for structural variants. We estimate that 2,434 discordances are correctable mistakes in the high-quality Genome in a Bottle benchmark. Nearly all (99.64%) variants are phased into haplotypes, which further improves variant detection. De novo assembly produces a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of 99.998%. CCS reads match short reads for small variant detection, while enabling structural variant detection and de novo assembly at similar contiguity and markedly higher concordance than noisy long reads.
9,351 downloads genomics
High-throughput single cell RNA sequencing (scRNA-seq) has become an established and powerful method to investigate transcriptomic cell-to-cell variation, and has revealed new cell types and new insights into developmental processes and stochasticity in gene expression. There are now several published scRNA-seq protocols, which all sequence transcriptomes from a minute amount of starting material. Therefore, a key question is how these methods compare in terms of sensitivity of detection of mRNA molecules, and accuracy of quantification of gene expression. Here, we assessed the sensitivity and accuracy of many published data sets based on standardized spike-ins with a uniform raw data processing pipeline. We developed a flexible and fast UMI counting tool (https://github.com/vals/umis) which is compatible with all UMI based protocols. This allowed us to relate these parameters to sequencing depth, and discuss the trade-offs between the different methods. To confirm our results, we performed experiments on cells from the same population using three different protocols. We also investigated the effect of RNA degradation on spike-in molecules, and the average efficiency of scRNA-seq on spike-in molecules versus endogenous RNAs.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!