Most downloaded biology preprints, all time
in category genomics
6,710 results found.
8,391 downloads bioRxiv genomics
We describe MULTI-seq: A rapid, modular, and universal scRNA-seq sample multiplexing strategy using lipid-tagged indices. MULTI-seq reagents can barcode any cell type from any species with an accessible plasma membrane. The method is compatible with enzymatic tissue dissociation, and also preserves viability and endogenous gene expression patterns. We leverage these features to multiplex the analysis of multiple solid tissues comprising human and mouse cells isolated from patient-derived xenograft mouse models. We also utilize MULTI-seq's modular design to perform a 96-plex perturbation experiment with human mammary epithelial cells. MULTI-seq also enables robust doublet identification, which improves data quality and increases scRNA-seq cell throughput by minimizing the negative effects of Poisson loading. We anticipate that the sample throughput and reagent savings enabled by MULTI-seq will expand the purview of scRNA-seq and democratize the application of these technologies within the scientific community.
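The "negative effects of Poisson loading" mentioned above can be made concrete with a short calculation. This is a minimal sketch, assuming that the number of cells captured per droplet follows a Poisson distribution with a given mean. The function name and the example loading values are illustrative and do not come from the preprint.

```python
import math


def multiplet_fraction(mean_cells_per_droplet: float) -> float:
    """Fraction of cell-containing droplets that hold two or more cells.

    Assumes cells per droplet ~ Poisson(mean_cells_per_droplet).
    """
    lam = mean_cells_per_droplet
    if lam <= 0:
        raise ValueError("mean_cells_per_droplet must be positive")
    p_empty = math.exp(-lam)         # P(0 cells)
    p_single = lam * math.exp(-lam)  # P(exactly 1 cell)
    # Condition on the droplet being occupied (at least one cell).
    return (1.0 - p_empty - p_single) / (1.0 - p_empty)


if __name__ == "__main__":
    # Hypothetical loading densities, chosen only to show the trend.
    for lam in (0.05, 0.1, 0.3):
        frac = multiplet_fraction(lam)
        print(f"mean load {lam:.2f}: {frac:.1%} of occupied droplets are multiplets")
```

Under this model, loading more cells per run raises the multiplet fraction. A method that identifies doublets after the fact therefore allows denser loading for the same final data quality.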
8,386 downloads bioRxiv genomics
Allelic expression (AE) analysis has become an important tool for integrating genome and transcriptome data to characterize various biological phenomena such as cis-regulatory variation and nonsense-mediated decay. In this paper, we systematically analyze the properties of AE read count data and technical sources of error, such as low-quality or double-counted RNA-seq reads, genotyping errors, allelic mapping bias, and technical covariates due to sample preparation and sequencing, and variation in total read depth. We provide guidelines for correcting and filtering for such errors, and show that the resulting AE data has extremely low technical noise. Finally, we introduce novel software for high-throughput production of AE data from RNA-sequencing data, implemented in the GATK framework. These improved tools and best practices for AE analysis yield higher quality AE data by reducing technical bias. This provides a practical framework for wider adoption of AE analysis by the genomics community.
8,341 downloads bioRxiv genomics
Cell differentiation and function are regulated across multiple layers of gene regulation, including the modulation of gene expression by changes in chromatin accessibility. However, differentiation is an asynchronous process precluding a temporal understanding of the regulatory events leading to cell fate commitment. Here, we developed SHARE-seq, a highly scalable approach for measurement of chromatin accessibility and gene expression within the same single cell. Using 34,774 joint profiles from mouse skin, we develop a computational strategy to identify cis-regulatory interactions and define Domains of Regulatory Chromatin (DORCs), which significantly overlap with super-enhancers. We show that during lineage commitment, chromatin accessibility at DORCs precedes gene expression, suggesting changes in chromatin accessibility may prime cells for lineage commitment. We therefore develop a computational strategy (chromatin potential) to quantify chromatin lineage-priming and predict cell fate outcomes. Together, SHARE-seq provides an extensible platform to study regulatory circuitry across diverse cells within tissues.
### Competing Interest Statement
A.R. is a founder of and equity holder in Celsius therapeutics, an equity holder in Immunitas, and an SAB member of ThermoFisher Scientific, Syros Pharmaceutical, Asimov, and Neogene Therapeutics. J.D.B. holds patents related to ATAC-seq and is an SAB member of Camp4. J.D.B., A.R., and S.M. submitted a provisional patent application based on this work.
8,336 downloads bioRxiv genomics
Mark Chaisson, Ashley D. Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J Gardner, Oscar Rodriguez, Li Guo, Ryan L. Collins, Xian Fan, Jia Wen, Robert E Handsaker, Susan Fairley, Zev N. Kronenberg, Xiangmeng Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M. Wenger, Alex Hastie, Danny Antaki, Peter Audano, Harrison Brand, Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen, Chen-Shan Chin, Zechen Chong, Nelson T. Chuang, Christine C. Lambert, Deanna M Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur Galeev, David Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes Heaton, Jonas Korlach, Sushant Kumar, Jee Young Kwon, Jong Eun Lee, Joyce Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li, Patrick Marks, Karine Viaud-Martinez, Sascha Meiers, Katherine M. Munson, Fabio Navarro, Bradley J Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou, Andy Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian Stütz, Diana C.J. Spierings, Alistair Ward, AnneMarie E. Welch, Ming Xiao, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Ernesto Lowy, Sergei Yakneen, Steven McCarroll, Goo Jun, Li Ding, Chong Lek Koh, Bing Ren, Paul Flicek, Ken Chen, Mark Gerstein, Pui-Yan Kwok, Peter M. Lansdorp, Gabor Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E. Devine, Michael Talkowski, Ryan E. Mills, Tobias Marschall, Jan Korbel, Evan E. Eichler, Charles Lee
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent-child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per human genome. We also discover 156 inversions per genome - most of which previously escaped detection. Fifty-eight of the inversions we discovered intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The method and the dataset serve as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.
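The abstract above separates indels from SVs at a 50 bp threshold. This is a minimal sketch of that size rule, assuming sequence-resolved REF and ALT alleles such as those in a VCF record. The function name and the "substitution" label are illustrative. Balanced events such as inversions cannot be classified from allele lengths alone and fall outside this sketch.

```python
SV_MIN_LENGTH_BP = 50  # threshold from the abstract: indels are <50 bp, SVs are >=50 bp


def classify_variant(ref: str, alt: str) -> str:
    """Label an insertion or deletion by the length difference of its alleles.

    Only the 50 bp cut-off comes from the abstract; the labels are
    hypothetical and chosen for illustration.
    """
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same-length change (for example a SNV); not an indel or a size-based SV.
        return "substitution"
    return "SV" if size >= SV_MIN_LENGTH_BP else "indel"


if __name__ == "__main__":
    print(classify_variant("A", "A" + "T" * 49))  # 49 bp insertion
    print(classify_variant("A" + "C" * 50, "A"))  # 50 bp deletion
```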
8,274 downloads bioRxiv genomics
The excision of introns from pre-mRNA is an essential step in mRNA processing. We developed LeafCutter to study sample and population variation in intron splicing. LeafCutter identifies variable intron splicing events from short-read RNA-seq data and finds alternative splicing events of high complexity. Our approach obviates the need for transcript annotations and circumvents the challenges in estimating relative isoform or exon usage in complex splicing events. LeafCutter can be used both for detecting differential splicing between sample groups, and for mapping splicing quantitative trait loci (sQTLs). Compared to contemporary methods, we find 1.4-2.1 times more sQTLs, many of which help us ascribe molecular effects to disease-associated variants. Strikingly, transcriptome-wide associations between LeafCutter intron quantifications and 40 complex traits increased the number of associated disease genes at 5% FDR by an average of 2.1-fold as compared to using gene expression levels alone. LeafCutter is fast, scalable, easy to use, and available at https://github.com/davidaknowles/leafcutter.
8,253 downloads bioRxiv genomics
Travis C. Glenn, Roger A. Nilsen, Troy Kieran, Jon G Sanders, Natalia J. Bayona-Vásquez, John W. Finger, Todd W. Pierson, Kerin E. Bentley, Sandra L. Hoffberg, Swarnali Louha, Francisco J. García-De León, Miguel Angel Del Río-Portilla, Kurt D. Reed, Jennifer L. Anderson, Jennifer K. Meece, Samuel E. Aggrey, Romdhane Rekaya, Magdy Alabady, Myriam Bélanger, Kevin Winker, Brant C. Faircloth
Next-generation DNA sequencing (NGS) offers many benefits, but major factors limiting NGS include reducing costs of: 1) start-up (i.e., doing NGS for the first time); 2) buy-in (i.e., getting the smallest possible amount of data from a run); and 3) sample preparation. Reducing sample preparation costs is commonly addressed, but start-up and buy-in costs are rarely addressed. We present dual-indexing systems to address all three of these issues. By breaking the library construction process into universal, re-usable, combinatorial components, we reduce all costs, while increasing the number of samples and the variety of library types that can be combined within runs. We accomplish this by extending the Illumina TruSeq dual-indexing approach to 768 (384 + 384) indexed primers that produce 384 unique dual-indexes or 147,456 (384 x 384) unique combinations. We maintain eight nucleotide indexes, with many that are compatible with Illumina index sequences. We synthesized these indexing primers, purifying them with only standard desalting and placing small aliquots in replicate plates. In qPCR validation tests, 206 of 208 primers tested passed (99% success). We then created hundreds of libraries in various scenarios. Our approach reduces start-up and per-sample costs by requiring only one universal adapter that works with indexed PCR primers to uniquely identify samples. Our approach reduces buy-in costs because: 1) relatively few oligonucleotides are needed to produce a large number of indexed libraries; and 2) the large number of possible primers allows researchers to use unique primer sets for different projects, which facilitates pooling of samples during sequencing. Our libraries make use of standard Illumina sequencing primers and index sequence length and are demultiplexed with standard Illumina software, thereby minimizing customization headaches. 
In subsequent Adapterama papers, we use these same primers with different adapter stubs to construct amplicon and restriction-site associated DNA libraries, but their use can be expanded to any type of library sequenced on Illumina platforms.
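The primer counts quoted in the abstract (768 primers, 384 unique dual-indexes, 147,456 combinations) follow from simple pairing rules, sketched below. The `i7`/`i5` labels follow the usual Illumina naming for the two index reads and are an assumption here, as are the primer names. Neither appears in the abstract.

```python
from itertools import product


def combinatorial_pairs(i7_primers, i5_primers):
    """Every i7 primer paired with every i5 primer (combinatorial indexing)."""
    return list(product(i7_primers, i5_primers))


def unique_dual_pairs(i7_primers, i5_primers):
    """Each primer used in exactly one pair (unique dual indexing)."""
    return list(zip(i7_primers, i5_primers))


# Hypothetical primer names; only the counts (384 + 384) come from the abstract.
i7 = [f"i7_{n:03d}" for n in range(1, 385)]
i5 = [f"i5_{n:03d}" for n in range(1, 385)]

if __name__ == "__main__":
    print("primers synthesized:", len(i7) + len(i5))
    print("unique dual-indexes:", len(unique_dual_pairs(i7, i5)))
    print("combinatorial pairs:", len(combinatorial_pairs(i7, i5)))
```

The number of usable sample tags grows with the product of the two primer sets, while the number of oligonucleotides to buy grows only with their sum. This is the basis for the reduced buy-in cost described above.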
8,232 downloads bioRxiv genomics
An approach for generating high-resolution a priori maximum parsimony Y-chromosome (chrY) phylogenies based on SNP and small INDEL variant data from massively-parallel short-read (next-generation) sequencing data is described; the tree-generation methodology produces annotations localizing mutations to individual branches of the tree, along with indications of mutation placement uncertainty in cases for which "no-calls" (through lack of mapped reads or otherwise) at particular sites precludes precise phylogenetic placement of mutations. The approach leverages careful variant site filtering and a novel iterative reweighting procedure to generate high-accuracy trees while considering variants in regions of chrY that had previously been excluded from analyses based on short-read sequencing data. It is argued that the proposed approach is also superior to previous region-based filtering approaches in that it adapts to the quality of the underlying data and will automatically allow the scope of sites considered to expand as the underlying data quality improves (e.g. through longer read lengths). Key related issues, including calling of genotypes for the hemizygous chrY, reliability of variant results, read mismappings and "heterozygous" genotype calls, and the mutational stability of different variants are discussed and taken into account. The methodology is demonstrated through application to a dataset consisting of 1292 male samples from diverse populations and haplogroups, with the majority coming from low-coverage sequencing by the 1000 Genomes Project. Application of the tree-generation approach to these data produces a tree involving over 120,000 chrY variant sites (about 45,000 sites if singletons are excluded). The utility of this approach in refining the Y-chromosome phylogenetic tree is demonstrated by examining results for several haplogroups. 
The results indicate a number of new branches on the Y-chromosome phylogenetic tree, many of them subdividing known branches, but also including some that inform the presence of additional levels along the trunk of the tree. Finally, opportunities for extensions of this phylogenetic analysis approach to other types of genetic data are noted.
8,203 downloads bioRxiv genomics
Nanopore sequencing technology can rapidly and directly interrogate native DNA molecules. Often we are interested only in interrogating specific areas at high depth, but conventional enrichment methods have thus far proved unsuitable for long reads. Existing strategies are currently limited by high input DNA requirements, low yield, short (<5kb) reads, time-intensive protocols, and/or amplification or cloning (losing base modification information). In this paper, we describe a technique utilizing the ability of Cas9 to introduce cuts at specific locations and ligating nanopore sequencing adaptors directly to those sites, a method we term ‘nanopore Cas9 Targeted-Sequencing’ (nCATS). We have demonstrated this using an Oxford Nanopore MinION flow cell (Capacity >10Gb+) to generate a median 165X coverage at 10 genomic loci with a median length of 18kb, representing a several hundred-fold improvement over the 2-3X coverage achieved without enrichment. We performed a pilot run on the smaller Flongle flow cell (Capacity ~1Gb), generating a median coverage of 30X at 11 genomic loci with a median length of 18kb. Using panels of guide RNAs, we show that the high coverage data from this method enables us to (1) profile DNA methylation patterns at cancer driver genes, (2) detect structural variations at known hot spots, and (3) survey for the presence of single nucleotide mutations. Together, this provides a low-cost method that can be applied even in low resource settings to directly examine cellular DNA. This technique has extensive clinical applications for assessing medically relevant genes and has the versatility to be a rapid and comprehensive diagnostic tool. We demonstrate applications of this technique by examining the well-characterized GM12878 cell line as well as three breast cell lines (MCF-10A, MCF-7, MDA-MB-231) with varying tumorigenic potential as a model for cancer. Contributions TG and WT constructed the study. TG performed the experiments. 
TG, IL, and FS analyzed the data. TG, JG, ER, RB, and AH developed the method. TG and WT wrote the paper.
8,132 downloads bioRxiv genomics
Background: SARS-CoV-2 most likely evolved from a bat beta-coronavirus and started infecting humans in December 2019. Since then it has rapidly infected people around the world, with more than 3 million confirmed cases by the end of April 2020. Early genome sequencing of the virus has enabled the development of molecular diagnostics and the commencement of therapy and vaccine development. The analysis of the early sequences showed relatively few evolutionary selection pressures. However, with the rapid worldwide expansion into diverse human populations, significant genetic variations are becoming increasingly likely. The current limitations on social movement between countries also offer the opportunity for these viral variants to become distinct strains with potential implications for diagnostics, therapies and vaccines. Methods: We used the current sequencing archives (NCBI and GISAID) to investigate 5,349 whole genomes, looking for evidence of strain diversification and selective pressure. Results: We used 3,958 SNPs to build a phylogenetic tree of SARS-CoV-2 diversity and noted strong evidence for the existence of two major clades and six sub-clades, unevenly distributed across the world. We also noted that convergent evolution has potentially occurred across several locations in the genome, showing selection pressures, including on the spike glycoprotein where we noted a potentially critical mutation that could affect its binding to the ACE2 receptor. We also report on mutations that could prevent current molecular diagnostics from detecting some of the sub-clades. Conclusions: The worldwide whole genome sequencing effort is revealing the challenge of developing SARS-CoV-2 containment tools suitable for everyone and the need for data to be continually evaluated to ensure accuracy in outbreak estimations.
### Competing Interest Statement
The authors have declared no competing interest.
8,125 downloads bioRxiv genomics
The COVID-19 pandemic is caused by the coronavirus SARS-CoV-2, which jumped into the human population in late 2019 from a currently uncharacterised animal reservoir. Due to this extremely recent association with humans, SARS-CoV-2 may not yet be fully adapted to its human host. This has led to speculations that some lineages of SARS-CoV-2 may be evolving towards higher transmissibility. The most plausible candidate mutations under putative natural selection are those which have emerged repeatedly and independently (homoplasies). Here, we formally test whether any of the recurrent mutations that have been observed in SARS-CoV-2 are significantly associated with increased viral transmission. To do so, we develop a phylogenetic index to quantify the relative number of descendants in sister clades with and without a specific allele. We apply this index to a carefully curated set of recurrent mutations identified within a dataset of 46,723 SARS-CoV-2 genomes isolated from patients worldwide. We do not identify a single recurrent mutation in this set convincingly associated with increased viral transmission. Instead, recurrent SARS-CoV-2 mutations currently in circulation appear to be evolutionarily neutral. Recurrent mutations also seem primarily induced by the human immune system via host RNA editing, rather than being signatures of adaptation to the novel human host. In conclusion, we find no evidence at this stage for the emergence of significantly more transmissible lineages of SARS-CoV-2 due to recurrent mutations.
### Competing Interest Statement
The authors have declared no competing interest.
8,021 downloads bioRxiv genomics
Many chromatin features play critical roles in regulating gene expression. A complete understanding of gene regulation will require the mapping of specific chromatin features in small samples of cells at high resolution. Here we describe Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components. In CUT&Tag, a chromatin protein is bound in situ by a specific antibody, which then tethers a protein A-Tn5 transposase fusion protein. Activation of the transposase efficiently generates fragment libraries with high resolution and exceptionally low background. All steps from live cells to sequencing-ready libraries can be performed in a single tube on the benchtop or a microwell in a high-throughput pipeline, and the entire procedure can be performed in one day. We demonstrate the utility of CUT&Tag by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.
7,997 downloads bioRxiv genomics
Large-scale sequencing of RNAs from individual cells can reveal patterns of gene, isoform and allelic expression across cell types and states. However, current single-cell RNA-sequencing (scRNA-seq) methods have limited ability to count RNAs at allele- and isoform resolution, and long-read sequencing techniques lack the depth required for large-scale applications across cells. Here, we introduce Smart-seq3 that combines full-length transcriptome coverage with a 5' unique molecular identifier (UMI) RNA counting strategy that enabled in silico reconstruction of thousands of RNA molecules per cell. Importantly, a large portion of counted and reconstructed RNA molecules could be directly assigned to specific isoforms and allelic origin, and we identified significant transcript isoform regulation in mouse strains and human cell types. Moreover, Smart-seq3 showed a dramatic increase in sensitivity and typically detected thousands more genes per cell than Smart-seq2. Altogether, we developed a short-read sequencing strategy for single-cell RNA counting at isoform and allele-resolution applicable to large-scale characterization of cell types and states across tissues and organisms.
7,960 downloads bioRxiv genomics
Shahar Alon, Daniel R Goodwin, Anubhav Sinha, Asmamaw T. Wassie, Fei Chen, Evan R Daugharthy, Yosuke Bando, Atsushi Kajita, Andrew G Xue, Karl Marrett, Robert Prior, Yi Cui, Andrew C Payne, Chun-Chen Yao, Ho-Jun Suk, Ru Wang, Chih-Chieh (Jay) Yu, Paul Tillberg, Paul Reginato, Nikita Pak, Songlei Liu, Sukanya Punthambaker, Eswar P. R. Iyer, Richie E. Kohman, Jeremy A. Miller, Ed S Lein, Ana Lako, Nicole Cullen, Scott Rodig, Karla Helvie, Daniel L Abravanel, Nikhil Wagle, Bruce E. Johnson, Johanna Klughammer, Michal Slyper, Julia Waldman, Judit Jané-Valbuena, Orit Rozenblatt-Rosen, Aviv Regev, IMAXT Consortium, George Church, Adam H Marblestone, Edward S. Boyden
Methods for highly multiplexed RNA imaging are limited in spatial resolution, and thus in their ability to localize transcripts to nanoscale and subcellular compartments. We adapt expansion microscopy, which physically expands biological specimens, for long-read untargeted and targeted in situ RNA sequencing. We applied untargeted expansion sequencing (ExSeq) to mouse brain, yielding readout of thousands of genes, including splice variants and novel transcripts. Targeted ExSeq yielded nanoscale-resolution maps of RNAs throughout dendrites and spines in neurons of the mouse hippocampus, revealing patterns across multiple cell types; layer-specific cell types across mouse visual cortex; and the organization and position-dependent states of tumor and immune cells in a human metastatic breast cancer biopsy. Thus ExSeq enables highly multiplexed mapping of RNAs, from nanoscale to system scale.
### Competing Interest Statement
The authors have declared no competing interest.
7,946 downloads bioRxiv genomics
The human pathogen severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the major pandemic of the 21st century. We analyzed >4,700 SARS-CoV-2 genomes and associated meta-data retrieved from public repositories. SARS-CoV-2 sequences have a high sequence identity (>99.9%), which drops to >96% when compared to bat coronavirus. We built a mutation-annotated reference SARS-CoV-2 phylogeny with two main macro-haplogroups, A and B, both of Asian origin, and >160 sub-branches representing virus strains of variable geographical origins worldwide, revealing a uniform mutation occurrence along branches that could complicate the design of future vaccines. The root of SARS-CoV-2 genomes locates at the Chinese haplogroup B1, with a TMRCA dating to 12 November 2019 - thus matching epidemiological records. Sub-haplogroup A2a originates in China and represents the major non-Asian outbreak. Multiple founder effects, most likely associated with super-spreader hosts, explain the COVID-19 pandemic to a large extent.
### Competing Interest Statement
The authors have declared no competing interest.
7,931 downloads bioRxiv genomics
Daniel Taliun, Daniel N. Harris, Michael D Kessler, Jedidiah Carlson, Zachary A. Szpiech, Raul Torres, Sarah A. Gagliano Taliun, André Corvelo, Stephanie M Gogarten, Hyun Min Kang, Achilleas N Pitsillides, Jonathon LeFaive, Seung-been Lee, Xiaowen Tian, Brian L. Browning, Sayantan Das, Anne-Katrin Emde, Wayne E. Clarke, Douglas P. Loesch, Amol C. Shetty, Thomas W Blackwell, Quenna Wong, Francois Aguet, Christine Albert, Alvaro Alonso, Kristin G. Ardlie, Stella Aslibekyan, Paul L. Auer, John Barnard, R. Graham Barr, Lewis C. Becker, Rebecca L Beer, Emelia J. Benjamin, Lawrence F Bielak, John Blangero, Michael Boehnke, Donald W Bowden, Jennifer A. Brody, Esteban G. Burchard, Brian E Cade, James F. Casella, Brandon Chalazan, Yii-Der Ida Chen, Michael H Cho, Seung Hoan Choi, Mina K. Chung, Clary B. Clish, Adolfo Correa, Joanne E. Curran, Brian Custer, Dawood Darbar, Michelle Daya, Mariza de Andrade, Dawn L DeMeo, Susan K. Dutcher, Patrick T. Ellinor, Leslie S Emery, Diane Fatkin, Lukas Forer, Myriam Fornage, Nora Franceschini, Christian Fuchsberger, Stephanie M Fullerton, Soren Germer, Mark T Gladwin, Daniel J Gottlieb, Xiuqing Guo, Michael E Hall, Jiang He, Nancy L. Heard-Costa, Susan R. Heckbert, Marguerite R Irvin, Jill M Johnsen, Andrew D. Johnson, Sharon LR Kardia, Tanika Kelly, Shannon Kelly, Eimear Kenny, Douglas P Kiel, Robert Klemmer, Barbara A Konkle, Charles Kooperberg, Anna Köttgen, Leslie A Lange, Jessica Lasky-Su, Daniel Levy, Xihong Lin, Keng-Han Lin, Chunyu Liu, Ruth J.F. Loos, Lori Garman, Robert Gerszten, Steven A. Lubitz, Kathryn L. Lunetta, Angel Mak, Ani Manichaikul, Alisa K Manning, Rasika A. Mathias, David D McManus, Stephen T McGarvey, James B Meigs, Deborah A Meyers, Julie L Mikulla, Mollie A Minear, Braxton Mitchell, Sanghamitra Mohanty, May E Montasser, Courtney Montgomery, Alanna C. Morrison, Joanne M Murabito, Andrea Natale, Pradeep Natarajan, Sarah C. Nelson, Kari E. 
North, Jeffrey R O’Connell, Nicholette D Palmer, Nathan Pankratz, Gina M Peloso, Patricia A. Peyser, Wendy S. Post, Bruce M Psaty, DC Rao, Susan Redline, Alexander P Reiner, Dan Roden, Jerome I. Rotter, Ingo Ruczinski, Chloé Sarnowski, Sebastian Schoenherr, Jeong-Sun Seo, Sudha Seshadri, Vivien A Sheehan, M. Benjamin Shoemaker, Albert V Smith, Jennifer A Smith, Jennifer A. Smith, Nona Sotoodehnia, Adrienne M. Stilp, Weihong Tang, Kent D Taylor, Marilyn Telen, Timothy A. Thornton, Russell P Tracy, David J. Van Den Berg, Ramachandran S Vasan, Karine A Viaud-Martinez, Scott Vrieze, Daniel E Weeks, Bruce S. Weir, Scott T Weiss, Lu-Chen Weng, Cristen J. Willer, Yingze Zhang, Xutong Zhao, Donna K. Arnett, Allison E Ashley-Koch, Kathleen C Barnes, Eric Boerwinkle, Stacey Gabriel, Richard Gibbs, Kenneth M Rice, Stephen S Rich, Edwin Silverman, Pankaj Qasba, Weiniu Gan, Trans-Omics for Precision Medicine (TOPMed) Program, TOPMed Population Genetics Working Group, George J Papanicolaou, Deborah A. Nickerson, Sharon R. Browning, Michael C. Zody, Sebastian Zöllner, James G Wilson, L. Adrienne Cupples, Cathy C Laurie, Cashell E Jaquish, Ryan D Hernandez, Timothy D. O’Connor, Gonçalo R. Abecasis
The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency <1% and 46% are singletons. These rare variants provide insights into mutational processes and recent human evolutionary history. The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and extends the reach of nearly all genome-wide association studies to include variants down to ~0.01% in frequency.
7,863 downloads bioRxiv genomics
We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the "missing heritability" problem - i.e., the gap between prediction R-squared and SNP heritability. The ~20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
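The two height figures quoted above (a correlation of ~0.65 and ~40 percent of variance captured) can be related directly. When actual values are regressed linearly on predicted values, the fraction of variance explained equals the squared Pearson correlation. This is a minimal sketch of that arithmetic, with an illustrative function name, using only the rounded numbers from the abstract.

```python
def variance_explained(correlation: float) -> float:
    """Variance fraction explained by a linear predictor with this correlation.

    Uses the identity R^2 = r^2, which holds for the linear regression of the
    observed trait on the predicted trait.
    """
    if not -1.0 <= correlation <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return correlation ** 2


if __name__ == "__main__":
    r_height = 0.65  # approximate correlation quoted in the abstract
    print(f"variance explained: {variance_explained(r_height):.3f}")
```

A correlation of about 0.65 therefore corresponds to roughly 0.42 of the variance, consistent with the "~40 percent" reported for height.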
7,854 downloads bioRxiv genomics
Given increasing numbers of patients who are undergoing exome or genome sequencing, it is critical to establish tools and methods to interpret the impact of genetic variation. While the ability to predict deleteriousness for any given variant is limited, missense variants remain a particularly challenging class of variation to interpret, since they can have drastically different effects depending on both the precise location and specific amino acid substitution of the variant. In order to better evaluate missense variation, we leveraged the exome sequencing data of 60,706 individuals from the Exome Aggregation Consortium (ExAC) dataset to identify sub-genic regions that are depleted of missense variation. We further used this depletion as part of a novel missense deleteriousness metric named MPC. We applied MPC to de novo missense variants and identified a category of de novo missense variants with the same impact on neurodevelopmental disorders as truncating mutations in intolerant genes, supporting the value of incorporating regional missense constraint in variant interpretation.
7,782 downloads bioRxiv genomics
Anders Bergström, Shane A. McCarthy, Ruoyun Hui, Mohamed A. Almarri, Qasim Ayub, Petr Danecek, Yuan Chen, Sabine Felkel, Pille Hallast, Jack Kamm, Hélène Blanché, Jean-François Deleuze, Howard Cann, Swapan Mallick, David Reich, Manjinder S Sandhu, Pontus Skoglund, Aylwyn Scally, Yali Xue, Richard Durbin, Chris Tyler-Smith
Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented private genetic variation in southern and central Africa and in Oceania and the Americas, but an absence of fixed, private variants between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the last 10,000 years, a potentially major population growth episode after the peopling of the Americas, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations. We also demonstrate benefits to the study of population relationships of genome sequences over ascertained array genotypes. These genome sequences are freely available as a resource with no access or analysis restrictions.
7,690 downloads bioRxiv genomics
In recent years, the assay for transposase-accessible chromatin using sequencing (ATAC-Seq) has become a fundamental tool of epigenomic research. However, it has proven difficult to perform this technique on frozen samples because freezing cells before extracting nuclei impairs nuclear integrity and alters chromatin structure. We describe a protocol for freezing cells that is compatible with ATAC-Seq, producing results that compare well with those generated from fresh cells. We found that while flash-frozen samples are not suitable for ATAC-Seq, the assay is successful with slow-cooled cryopreserved samples. Using this method, we were able to isolate high quality, intact nuclei, and we verified that epigenetic results from fresh and cryopreserved samples agree quantitatively. We developed our protocol on a disease-relevant cell type, namely motor neurons differentiated from induced pluripotent stem cells from a patient affected by spinal muscular atrophy.
7,600 downloads bioRxiv genomics
Chuan-Chao Wang, Hui-Yuan Yeh, Alexander N Popov, Hu-Qin Zhang, Hirofumi Matsumura, Kendra Sirak, Olivia Cheronet, Alexey Kovalev, Nadin Rohland, Alexander M Kim, Rebecca Bernardos, Dashtseveg Tumen, Jing Zhao, Yi-Chang Liu, Jiun-Yu Liu, Matthew Mah, Swapan Mallick, Ke Wang, Zhao Zhang, Nicole Adamski, Nasreen Broomandkhoshbacht, Kimberly Callan, Brendan J Culleton, Laurie Eccles, Ann Marie Lawson, Megan Michel, Jonas Oppenheimer, Kristin Stewardson, Shaoqing Wen, Shi Yan, Fatma Zalzala, Richard Chuang, Ching-Jung Huang, Chung-Ching Shiung, Yuri G Nikitin, Andrei V Tabarev, Alexey A Tishkin, Song Lin, Zhou-Yong Sun, Xiao-Ming Wu, Tie-Lin Yang, Xi Hu, Liang Chen, Hua Du, Jamsranjav Bayarsaikhan, Enkhbayar Mijiddorj, Diimaajav Erdenebaatar, Tumur-Ochir Iderkhangai, Erdene Myagmar, Hideaki Kanzawa-Kiriyama, Msato Nishino, Ken-ichi Shinoda, Olga A Shubina, Jianxin Guo, Qiongying Deng, Longli Kang, Dawei Li, Dongna Li, Rong Lin, Wangwei Cai, Rukesh Shrestha, Ling-Xiang Wang, Lanhai Wei, Guangmao Xie, Hongbing Yao, Manfei Zhang, Guanglin He, Xiaomin Yang, Rong Hu, Martine Robbeets, Stephan Schiffels, Douglas J. Kennett, Li Jin, Hui Li, Johannes Krause, Ron Pinhasi, David Reich
The deep population history of East Asia remains poorly understood due to a lack of ancient DNA data and sparse sampling of present-day people. We report genome-wide data from 191 individuals from Mongolia, northern China, Taiwan, the Amur River Basin and Japan dating to 6000 BCE - 1000 CE, many from contexts never previously analyzed with ancient DNA. We also report 383 present-day individuals from 46 groups mostly from the Tibetan Plateau and southern China. We document how 6000-3600 BCE people of Mongolia and the Amur River Basin were from populations that expanded over Northeast Asia, likely dispersing the ancestors of Mongolic and Tungusic languages. In a time transect of 89 Mongolians, we reveal how Yamnaya steppe pastoralists spread from the west by 3300-2900 BCE in association with the Afanasievo culture, although we also document a boy buried in an Afanasievo barrow with ancestry entirely from local Mongolian hunter-gatherers, representing a unique case of someone of entirely non-Yamnaya ancestry interred in this way. The second spread of Yamnaya-derived ancestry came via groups that harbored about a third of their ancestry from European farmers, which nearly completely displaced unmixed Yamnaya-related lineages in Mongolia in the second millennium BCE, but did not replace Afanasievo lineages in western China where Afanasievo ancestry persisted, plausibly acting as the source of the early-splitting Tocharian branch of Indo-European languages. Analyzing 20 Yellow River Basin farmers dating to ~3000 BCE, we document a population that was a plausible vector for the spread of Sino-Tibetan languages both to the Tibetan Plateau and to the central plain where they mixed with southern agriculturalists to form the ancestors of Han Chinese.
We show that the individuals in a time transect of 52 ancient Taiwan individuals spanning at least 1400 BCE to 600 CE were consistent with being nearly direct descendants of Yangtze Valley first farmers, who likely spread Austronesian, Tai-Kadai and Austroasiatic languages across Southeast and South Asia, mixing with the people they encountered and contributing to a four-fold reduction of genetic differentiation during the emergence of complex societies. Finally, we report data from Jomon hunter-gatherers from Japan who harbored one of the earliest splitting branches of East Eurasian variation, and show an affinity among Jomon, Amur River Basin, ancient Taiwan, and Austronesian-speakers, as expected if they all had ancestry contributions from a Late Pleistocene coastal route migration to East Asia.