Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 83,713 bioRxiv papers from 360,555 authors.
Most downloaded bioRxiv papers, all time
in category genomics
5,313 results found.
7,117 downloads genomics
Nanopore sequencing technology can rapidly and directly interrogate native DNA molecules. Often we are interested only in interrogating specific areas at high depth, but conventional enrichment methods have thus far proved unsuitable for long reads. Existing strategies are currently limited by high input DNA requirements, low yield, short (<5kb) reads, time-intensive protocols, and/or amplification or cloning (losing base modification information). In this paper, we describe a technique utilizing the ability of Cas9 to introduce cuts at specific locations and ligating nanopore sequencing adaptors directly to those sites, a method we term ‘nanopore Cas9 Targeted-Sequencing’ (nCATS). We have demonstrated this using an Oxford Nanopore MinION flow cell (Capacity >10Gb+) to generate a median 165X coverage at 10 genomic loci with a median length of 18kb, representing a several hundred-fold improvement over the 2-3X coverage achieved without enrichment. We performed a pilot run on the smaller Flongle flow cell (Capacity ~1Gb), generating a median coverage of 30X at 11 genomic loci with a median length of 18kb. Using panels of guide RNAs, we show that the high coverage data from this method enables us to (1) profile DNA methylation patterns at cancer driver genes, (2) detect structural variations at known hot spots, and (3) survey for the presence of single nucleotide mutations. Together, this provides a low-cost method that can be applied even in low resource settings to directly examine cellular DNA. This technique has extensive clinical applications for assessing medically relevant genes and has the versatility to be a rapid and comprehensive diagnostic tool. We demonstrate applications of this technique by examining the well-characterized GM12878 cell line as well as three breast cell lines (MCF-10A, MCF-7, MDA-MB-231) with varying tumorigenic potential as a model for cancer. Contributions TG and WT constructed the study. TG performed the experiments. 
TG, IL, and FS analyzed the data. TG, JG, ER, RB and AH developed the method. TG and WT wrote the paper.
7,027 downloads genomics
Brendan K. Bulik-Sullivan, Po-Ru Loh, Hilary Finucane, Jian Yang, Schizophrenia Working Group of the Psychiatric Genomics Consortium, Nick Patterson, Mark Daly, Alkes L. Price, Benjamin M. Neale
Both polygenicity (i.e. many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield inflated distributions of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from bias and true signal from polygenicity. We have developed an approach that quantifies the contributions of each by examining the relationship between test statistics and linkage disequilibrium (LD). We term this approach LD Score regression. LD Score regression provides an upper bound on the contribution of confounding bias to the observed inflation in test statistics and can be used to estimate a more powerful correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of test statistic inflation in many GWAS of large sample size.
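As a rough illustration of the idea in this abstract, the sketch below regresses per-SNP test statistics on LD scores with ordinary least squares. It assumes a simple linear relationship in which the slope tracks polygenic signal and an intercept above 1 tracks confounding. The numbers and the helper `fit_line` are made up for illustration; this is not the authors' software or their exact estimator.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: LD score per SNP bin and the mean chi-square
# statistic observed in that bin (generated as 1.05 + 0.001 * LD score).
ld_scores = [10.0, 50.0, 100.0, 200.0, 400.0]
chi2 = [1.06, 1.10, 1.15, 1.25, 1.45]

intercept, slope = fit_line(ld_scores, chi2)

# Under the assumed model, inflation that grows with LD score (the slope)
# is attributed to polygenicity, while inflation shared by all SNPs
# (intercept minus 1) is attributed to confounding bias.
confounding_inflation = intercept - 1.0
```

In this toy case the fitted intercept is about 1.05, so only a small part of the inflation would be attributed to confounding. A real analysis would also need regression weights and standard errors, which this sketch omits.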
6,951 downloads genomics
Nanopore sequencing instruments measure the change in electric current caused by DNA transiting through the pore. In experimental and prototype nanopore sequencing devices it has been shown that the electrolytic current signals are sensitive to base modifications, such as 5-methylcytosine. Here we quantify the strength of this effect for the Oxford Nanopore Technologies MinION sequencer. Using synthetically methylated DNA we are able to train a hidden Markov model to distinguish 5-methylcytosine from unmethylated cytosine in DNA. We demonstrate by sequencing natural human DNA, without any special library preparation, that global patterns of methylation can be detected from low-coverage sequencing and that the methylation status of CpG islands can be reliably predicted from single MinION reads. Our trained model and prediction software are open source and freely available to the community under the MIT license.
6,927 downloads genomics
We propose an efficient framework for genetic subtyping of a pandemic virus, with application to the novel coronavirus SARS-CoV-2. Efficient identification of subtypes is particularly important for tracking the geographic distribution and temporal dynamics of infectious spread in real-time. In this paper, we utilize an entropy analysis to identify nucleotide sites within SARS-CoV-2 genome sequences that are highly informative of genetic variation, and thereby define an Informative Subtype Marker (ISM) for each sequence. We further apply an error correction technique to the ISMs, for more robust subtype definition given ambiguity and noise in sequence data. We show that, by analyzing the ISMs of global SARS-CoV-2 sequence data, we can distinguish interregional differences in viral subtype distribution, and track the emergence of subtypes in different regions over time. Based on publicly available data up to April 5, 2020, we show, for example: (1) distinct genetic subtypes of infections in Europe, with earlier transmission linked to subtypes prevalent in Italy with later development of subtypes specific to other countries over time; (2) within the United States, the emergence of an endogenous U.S. subtype that is distinct from the outbreak in New York, which is linked instead to subtypes found in Europe; and (3) dynamic emergence of SARS-CoV-2 from localization in China to a pattern of distinct regional subtypes in different countries around the world over time. Our results demonstrate that utilizing ISMs for genetic subtyping can be an important complement to conventional phylogenetic tree-based analyses of the COVID-19 pandemic. Particularly, because ISMs are efficient and compact subtype identifiers, they will be useful for modeling, data-mining, and machine learning tools to help enhance containment, therapeutic, and vaccine targeting strategies for fighting the COVID-19 pandemic. 
We have made the subtype identification pipeline described in this paper publicly available at https://github.com/EESI/ISM. Competing Interest Statement: The authors have declared no competing interest.
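The entropy step described in this abstract can be sketched roughly as follows: compute the Shannon entropy of each column of an alignment, keep the columns above a threshold, and concatenate the bases at those columns into a short marker per sequence. The function names, the toy alignment, and the threshold are assumptions made for illustration. The sketch omits the error correction and ambiguous-base handling that the abstract mentions, and it is not the code from the linked repository.

```python
from collections import Counter
from math import log2


def site_entropy(column):
    """Shannon entropy (in bits) of the bases observed at one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def informative_sites(sequences, threshold):
    """Indices of alignment columns whose entropy exceeds the threshold."""
    length = len(sequences[0])
    entropies = [site_entropy([s[i] for s in sequences]) for i in range(length)]
    return [i for i, h in enumerate(entropies) if h > threshold]


def subtype_marker(sequence, sites):
    """Concatenate the bases at the informative sites into a compact marker."""
    return "".join(sequence[i] for i in sites)


# Hypothetical toy alignment: columns 1 and 3 vary, columns 0 and 2 do not.
aligned = ["ACGT", "ACGA", "ATGA", "ATGT"]
sites = informative_sites(aligned, threshold=0.5)
markers = [subtype_marker(s, sites) for s in aligned]
```

Here the two variable columns each have an entropy of 1 bit, so each four-base sequence is reduced to a two-base marker.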
6,880 downloads genomics
Anders Bergström, Shane A. McCarthy, Ruoyun Hui, Mohamed A. Almarri, Qasim Ayub, Petr Danecek, Yuan Chen, Sabine Felkel, Pille Hallast, Jack Kamm, Hélène Blanché, Jean-François Deleuze, Howard Cann, Swapan Mallick, David Reich, Manjinder S Sandhu, Pontus Skoglund, Aylwyn Scally, Yali Xue, Richard Durbin, Chris Tyler-Smith
Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented private genetic variation in southern and central Africa and in Oceania and the Americas, but an absence of fixed, private variants between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the last 10,000 years, a potentially major population growth episode after the peopling of the Americas, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations. We also demonstrate benefits to the study of population relationships of genome sequences over ascertained array genotypes. These genome sequences are freely available as a resource with no access or analysis restrictions.
6,851 downloads genomics
Anne Senabouth, Stacey Andersen, Qianyu Shi, Lei Shi, Feng Jiang, Wenwei Zhang, Kristof Wing, Maciej Daniszewski, Samuel W. Lukowski, Sandy SC Hung, Quan Nguyen, Lynn Fink, Ant Beckhouse, Alice Pébay, Alex W. Hewitt, Joseph E. Powell
The libraries generated by high-throughput single cell RNA-sequencing platforms such as the Chromium from 10x Genomics require considerable amounts of sequencing, typically due to the large number of cells. The ability to use this data to address biological questions is directly impacted by the quality of the sequence data. Here we have compared the performance of the Illumina NextSeq 500 and NovaSeq 6000 against the BGI MGISEQ-2000 platform using identical Single Cell libraries consisting of over 70,000 cells. Our results demonstrate a highly comparable performance between the NovaSeq 6000 and MGISEQ-2000 in sequencing quality, and cell, UMI, and gene detection. However, compared with the NextSeq 500, the MGISEQ-2000 platform performs consistently better, identifying more cells, genes, and UMIs at equalised read depth. We were able to call an additional 1,065,659 SNPs from sequence data generated by the BGI platform, enabling an additional 14% of cells to be assigned to the correct donor from a multiplexed library. However, both the NextSeq 500 and MGISEQ-2000 detected similar frequencies of gRNAs from a pooled CRISPR single cell screen. Our study provides a benchmark for high capacity sequencing platforms applied to high-throughput single cell RNA-seq libraries.
6,771 downloads genomics
Single nucleus RNA-Seq (sNuc-Seq) profiles RNA from tissues that are preserved or cannot be dissociated, but does not provide the throughput required to analyse many cells from complex tissues. Here, we develop DroNc-Seq, massively parallel sNuc-Seq with droplet technology. We profile 29,543 nuclei from mouse and human archived brain samples to demonstrate sensitive, efficient and unbiased classification of cell types, paving the way for charting systematic cell atlases.
6,645 downloads genomics
We assembled the sequences from 9,795 RNA sequencing experiments, collected from 31 human tissues and hundreds of subjects as part of the GTEx project, to create a new, comprehensive catalog of human genes and transcripts. The new human gene database contains 43,162 genes, of which 21,306 are protein-coding and 21,856 are noncoding, and a total of 323,824 transcripts, for an average of 7.5 transcripts per gene. Our expanded gene list includes 4,998 novel genes (1,178 coding and 3,819 noncoding) and 97,511 novel splice variants of protein-coding genes as compared to the most recent human gene catalogs. We detected over 30 million additional transcripts at more than 650,000 sites, nearly all of which are likely to be nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells.
6,600 downloads genomics
Large-scale sequencing of RNAs from individual cells can reveal patterns of gene, isoform and allelic expression across cell types and states. However, current single-cell RNA-sequencing (scRNA-seq) methods have limited ability to count RNAs at allele- and isoform resolution, and long-read sequencing techniques lack the depth required for large-scale applications across cells. Here, we introduce Smart-seq3 that combines full-length transcriptome coverage with a 5' unique molecular identifier (UMI) RNA counting strategy that enabled in silico reconstruction of thousands of RNA molecules per cell. Importantly, a large portion of counted and reconstructed RNA molecules could be directly assigned to specific isoforms and allelic origin, and we identified significant transcript isoform regulation in mouse strains and human cell types. Moreover, Smart-seq3 showed a dramatic increase in sensitivity and typically detected thousands more genes per cell than Smart-seq2. Altogether, we developed a short-read sequencing strategy for single-cell RNA counting at isoform and allele-resolution applicable to large-scale characterization of cell types and states across tissues and organisms.
6,519 downloads genomics
Advances in CRISPR-Cas9 technology have enabled the flexible modulation of gene expression at large scale. In particular, the creation of genome-wide libraries for CRISPR knockout (CRISPRko), CRISPR interference (CRISPRi), and CRISPR activation (CRISPRa) has allowed gene function to be systematically interrogated. Here, we evaluate numerous CRISPRko libraries and show that our recently-described CRISPRko library (Brunello) is more effective than previously published libraries at distinguishing essential and non-essential genes, providing approximately the same perturbation-level performance improvement over GeCKO libraries as GeCKO provided over RNAi. Additionally, we developed genome-wide libraries for CRISPRi (Dolcetto) and CRISPRa (Calabrese). Negative selection screens showed that Dolcetto substantially outperforms existing CRISPRi libraries with fewer sgRNAs per gene and achieves comparable performance to CRISPRko in the detection of gold-standard essential genes. We also conducted positive selection CRISPRa screens and show that Calabrese outperforms the SAM library approach at detecting vemurafenib resistance genes. We further compare CRISPRa to genome-scale libraries of open reading frames (ORFs). Together, these libraries represent a suite of genome-wide tools to efficiently interrogate gene function with multiple modalities.
6,486 downloads genomics
Drosophila is a premier model system for understanding the molecular mechanisms of development. By the onset of morphogenesis, ~6000 cells express distinct gene combinations according to embryonic position. Despite extensive mRNA in situ screens, combinatorial gene expression within individual cells is largely unknown. Therefore, it is difficult to comprehensively identify the coding and non-coding transcripts that drive patterning and to decipher the molecular basis of cellular identity. Here, we single-cell sequence precisely staged embryos, measuring >3100 genes per cell. We produce a ‘transcriptomic blueprint’ of development - a virtual embryo where 3D locations of sequenced cells are confidently identified. Our Drosophila-Virtual-Expression-eXplorer performs virtual in situ hybridizations and computes expression gradients (http://dvex.org). Using DVEX, we predict spatial expression and discover patterned lncRNAs. DVEX is sensitive enough to detect subtle evolutionary changes in expression patterns between Drosophila species. We believe DVEX is a prototype for powerful single cell studies in complex tissues.
6,486 downloads genomics
Genomics has recently celebrated reaching the $1000 genome milestone, making affordable DNA sequencing a reality. With this goal successfully completed, the next goal of the sequencing revolution can be sequencing sensors - miniaturized sequencing devices that are manufactured for real time applications and deployed in large quantities at low costs. The first part of this manuscript envisions applications that will benefit from moving the sequencers to the samples in a range of domains. In the second part, the manuscript outlines the critical barriers that need to be addressed in order to reach the goal of ubiquitous sequencing sensors.
6,467 downloads genomics
Timothy R. Fallon, Sarah E Lower, Ching-Ho Chang, Manabu Bessho-Uehara, Gavin J Martin, Adam J Bewick, Megan Behringer, Humberto J Debat, Isaac Wong, John C Day, Anton Suvorov, Christian J Silva, Kathrin F Stanger-Hall, David W Hall, Robert J. Schmitz, David R. Nelson, Sara M. Lewis, Shuji Shigenobu, Seth M. Bybee, Amanda M. Larracuente, Yuichi Oba, Jing-Ke Weng
Fireflies and their fascinating luminous courtships have inspired centuries of scientific study. Today firefly luciferase is widely used in biotechnology, but the evolutionary origin of their bioluminescence remains unclear. To shed light on this long-standing question, we sequenced the genomes of two firefly species that diverged over 100 million-years-ago: the North American Photinus pyralis and Japanese Aquatica lateralis. We also sequenced the genome of a related click-beetle, the Caribbean Ignelater luminosus, with bioluminescent biochemistry near-identical to fireflies, but anatomically unique light organs, suggesting the intriguing but contentious hypothesis of parallel gains of bioluminescence. Our analyses support two independent gains of bioluminescence between fireflies and click-beetles, and provide new insights into the genes, chemical defenses, and symbionts that evolved alongside their luminous lifestyle.
6,399 downloads genomics
We recently described CUT&Tag, a general strategy for epigenomic profiling in which antibody-tethered Tn5 transposase integrates DNA sequencing adapters at sites of specific chromatin protein binding or histone modification in intact cells or nuclei. Here we introduce a simplified CUT&Tag method that can be performed at home to help ameliorate the interruption of bench research caused by COVID-19 physical distancing requirements. All steps beginning with frozen nuclei are performed in single PCR tubes through to barcoded library amplification and clean-up, ready for pooling and DNA sequencing. Our CUT&Tag@home protocol has minimal equipment, reagent and supply needs and does not require handling of toxic or biologically active materials. We show that data quality and reproducibility for samples down to ~100 nuclei compare favorably to datasets produced using lab-based CUT&Tag and other chromatin profiling methods. We use CUT&Tag@home with antibodies to trimethylated histone H3 lysine-4, -36, -27 and -9 to comprehensively profile the epigenome of human K562 cells, consisting respectively of active gene regulatory elements, transcribed gene bodies, developmentally silenced domains and constitutively silenced parasitic elements. Competing Interest Statement: The authors have declared no competing interest.
6,393 downloads genomics
CRISPR-based genetic screens have revolutionized the search for new gene functions and biological mechanisms. However, widely used pooled screens are limited to simple read-outs of cell proliferation or the production of a selectable marker protein. Arrayed screens allow for more complex molecular read-outs such as transcriptome profiling, but they provide much lower throughput. Here we demonstrate CRISPR genome editing together with single-cell RNA sequencing as a new screening paradigm that combines key advantages of pooled and arrayed screens. This approach allowed us to link guide-RNA expression to the associated transcriptome responses in thousands of single cells using a straightforward and broadly applicable screening workflow.
6,349 downloads genomics
Given increasing numbers of patients who are undergoing exome or genome sequencing, it is critical to establish tools and methods to interpret the impact of genetic variation. While the ability to predict deleteriousness for any given variant is limited, missense variants remain a particularly challenging class of variation to interpret, since they can have drastically different effects depending on both the precise location and specific amino acid substitution of the variant. In order to better evaluate missense variation, we leveraged the exome sequencing data of 60,706 individuals from the Exome Aggregation Consortium (ExAC) dataset to identify sub-genic regions that are depleted of missense variation. We further used this depletion as part of a novel missense deleteriousness metric named MPC. We applied MPC to de novo missense variants and identified a category of de novo missense variants with the same impact on neurodevelopmental disorders as truncating mutations in intolerant genes, supporting the value of incorporating regional missense constraint in variant interpretation.
6,335 downloads genomics
Patrick Turley, Raymond K Walters, Omeed Maghzian, Aysu Okbay, James J Lee, Mark Alan Fontana, Tuan Anh Nguyen-Viet, Robbee Wedow, Meghan Zacher, Nicholas A. Furlotte, 23andMe Research Team, Social Science Genetic Association Consortium, Patrik K.E. Magnusson, Sven Oskarsson, Magnus Johannesson, Peter M Visscher, David Laibson, David Cesarini, Benjamin M. Neale, Daniel J Benjamin
We introduce Multi-Trait Analysis of GWAS (MTAG), a method for joint analysis of summary statistics from GWASs of different traits, possibly from overlapping samples. We apply MTAG to summary statistics for depressive symptoms (Neff = 354,862), neuroticism (N = 168,105), and subjective well-being (N = 388,538). Compared to 32, 9, and 13 genome-wide significant loci in the single-trait GWASs (most of which are themselves novel), MTAG increases the number of loci to 64, 37, and 49, respectively. Moreover, association statistics from MTAG yield more informative bioinformatics analyses and increase variance explained by polygenic scores by approximately 25%, matching theoretical expectations.
6,316 downloads genomics
Cannabis has been cultivated for millennia with distinct cultivars providing either fiber and grain or tetrahydrocannabinol. Recent demand for cannabidiol rather than tetrahydrocannabinol has favored the breeding of admixed cultivars with extremely high cannabidiol content. Despite several draft Cannabis genomes, the genomic structure of cannabinoid synthase loci has remained elusive. A genetic map derived from a tetrahydrocannabinol/cannabidiol segregating population and a complete chromosome assembly from a high-cannabidiol cultivar together resolve the linkage of cannabidiolic and tetrahydrocannabinolic acid synthase gene clusters which are associated with transposable elements. High-cannabidiol cultivars appear to have been generated by integrating hemp-type cannabidiolic acid synthase gene clusters into a background of marijuana-type cannabis. Quantitative trait locus mapping suggests that overall drug potency, however, is associated with other genomic regions needing additional study.
6,313 downloads genomics
Sanja Vickovic, Gökcen Eraslan, Fredrik Salmén, Johanna Klughammer, Linnea Stenbeck, Tarmo Äijö, Richard Bonneau, Ludvig Bergenstråhle, José Fernandéz Navarro, Joshua Gould, Mostafa Ronaghi, Jonas Frisén, Joakim Lundeberg, Aviv Regev, Patrik L Ståhl
Tissue function relies on the precise spatial organization of cells characterized by distinct molecular profiles. Single-cell RNA-Seq captures molecular profiles but not spatial organization. Conversely, spatial profiling assays to date have lacked global transcriptome information, throughput or single-cell resolution. Here, we develop High-Density Spatial Transcriptomics (HDST), a method for RNA-Seq at high spatial resolution. Spatially barcoded reverse transcription oligonucleotides are coupled to beads that are randomly deposited into tightly packed individual microsized wells on a slide. The position of each bead is decoded with sequential hybridization using complementary oligonucleotides providing a unique bead-specific spatial address. We then capture, and spatially in situ barcode, RNA from the histological tissue sections placed on the HDST array. HDST recovers hundreds of thousands of transcript-coupled spatial barcodes per experiment at 2 μm resolution. We demonstrate HDST in the mouse brain, use it to resolve spatial expression patterns and cell types, and show how to combine it with histological stains to relate expression patterns to tissue architecture and anatomy. HDST opens the way to spatial analysis of tissues at high resolution.
6,307 downloads genomics
Single cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero-inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.
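To make "feature selection using deviance" more concrete, the sketch below scores a gene by how far its UMI counts depart from a null model in which the gene takes the same fraction of every cell's total counts. It uses a per-gene binomial deviance, which is an assumption chosen here for simplicity. The function name and the toy counts are hypothetical, and the authors' implementation may differ.

```python
from math import log


def binomial_deviance(gene_counts, cell_totals):
    """Deviance of one gene's counts against a constant-fraction null model.

    gene_counts[i] is the gene's UMI count in cell i and cell_totals[i] is
    the total UMI count of cell i. Larger values mean the gene's share of
    counts varies more across cells than the null model predicts.
    """
    pi = sum(gene_counts) / sum(cell_totals)  # pooled fraction under the null
    dev = 0.0
    for y, n in zip(gene_counts, cell_totals):
        if y > 0:
            dev += y * log(y / (n * pi))
        if n - y > 0:
            dev += (n - y) * log((n - y) / (n * (1 - pi)))
    return 2 * dev


# Hypothetical toy data: four cells with 100 UMIs each.
cell_totals = [100, 100, 100, 100]
flat_gene = [10, 10, 10, 10]      # same fraction in every cell
variable_gene = [0, 20, 0, 20]    # expressed in only half of the cells

flat_score = binomial_deviance(flat_gene, cell_totals)
variable_score = binomial_deviance(variable_gene, cell_totals)
```

Ranking genes by this score and keeping the top ones would favor `variable_gene` over `flat_gene`, without the log transformation of normalized counts that the abstract criticizes.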
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!