
Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 65,106 bioRxiv papers from 288,545 authors.

Most downloaded bioRxiv papers, since the beginning of last month

in category genomics

4,366 results found.

61: Software as a Service for the Genomic Prediction of Complex Diseases

Posted to bioRxiv 11 Sep 2019

409 downloads genomics

Alessandro Bolli, Paolo Di Domenico, Giordano Bottà

In the last decade, the scientific community has witnessed a large increase in Genome-Wide Association Study sample sizes, in the availability of large biobanks, and in improvements to statistical methods for modelling genomic features. This has paved the way for the development of new predictive medicine tools that use genomic data to estimate disease risk. One of these tools is the Polygenic Risk Score (PRS), a metric that estimates the genetic risk of an individual to develop a disease, based on a combination of a large number of genetic variants. Using the largest prospective genotyped cohort available to date, the UK Biobank, we built a new PRS for Coronary Artery Disease (CAD) and assessed its predictive performance along with two additional PRSs for Breast Cancer (BC) and Prostate Cancer (PC). When compared with previously published PRSs, the newly developed PRS for CAD displayed higher AUC and positive predictive value. PRSs were able to stratify disease risks from 1.34% to 25.7% (CAD in men), from 0.26% to 8.62% (CAD in women), from 1.6% to 24.6% (BC), and from 1.4% to 24.3% (PC) in the lowest and highest percentiles, respectively. Additionally, the three PRSs were able to identify the 5% of the UK Biobank population with a relative risk for the diseases at least 3 times higher than the average. Family history is a well recognised risk factor for CAD, BC, and PC and it is currently used to identify individuals at high risk of developing the diseases. We show that individuals with family history can have completely different disease risks based on PRS stratification: from 2.1% to 33% (CAD in men), from 0.56% to 10% (CAD in women), from 2.3% to 35.8% (BC), and from 1.0% to 34.0% (PC) in the lowest and highest percentiles, respectively.
Additionally, the PRSs demonstrated higher predictive performance (AUCs (including age) CAD: 0.81, PC: 0.80, and BC: 0.68) than family history (AUCs (including age) CAD: 0.79, PC: 0.73, and BC: 0.61) in predicting the onset of diseases. Hyperlipidemia is well known to be associated with higher CAD risk, but the predictive performance of each lipoprotein has never been compared with that of a CAD PRS. PRS shows higher discrimination capacity and odds ratio per standard deviation than LDL, HDL, total cholesterol-HDL ratio, ApoA, ApoB, ApoB-ApoA ratio, and Lipoprotein(a). Comparing the empirical risk distribution between PRS and each lipoprotein, we show that lipoprotein thresholds, currently used in clinical practice, identify a population equal to or smaller than what can be identified with the PRS at the same CAD risk threshold. Moreover, there is no correlation (max ρ: 0.137) between PRS and each lipoprotein, indicating that PRS captures a different component of CAD etiology and identifies different high-risk people than those identified by lipoproteins, which makes it an invaluable tool in CAD prevention. One of the major obstacles to PRS usage in clinical practice is the computational complexity of calculating per-individual PRSs. Deep bioinformatics expertise is required to run the entire pipeline, from imputing genomic data, through quality control, to result visualisation. For these reasons we developed a Software as a Service (SaaS) for genomic risk prediction of complex diseases. The SaaS is fully automated, GDPR compliant and has been certified as a CE marked medical device. We made the SaaS freely available for research purposes. Researchers willing to use the SaaS can contact research{at}genomicriskscore.io
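
To make the two quantities in this abstract concrete, here is a minimal sketch of a polygenic risk score as a weighted sum of risk-allele dosages, together with an AUC computed by pairwise comparison of cases and controls. The function names, the toy weights and the toy scores are illustrative assumptions; this is not the authors' pipeline, which also involves imputation, quality control and age as a covariate.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant).

    `weights` stands in for per-variant effect sizes (hypothetical values here).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case scores higher than a random control.

    Ties count as half a win. Quadratic time, fine for a small illustration.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Toy individual carrying 0, 1 and 2 risk alleles at three variants.
    print(polygenic_risk_score([0, 1, 2], [0.1, 0.2, 0.3]))
    # Toy scores for two cases and two controls.
    print(auc([3.0, 2.0], [1.0, 2.0]))
```

An AUC of 0.5 means the score ranks cases no better than chance, and 1.0 means every case outranks every control, which is the scale on which the reported values (for example 0.81 for CAD) should be read.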

62: Stacks 2: Analytical Methods for Paired-end Sequencing Improve RADseq-based Population Genomics

Posted to bioRxiv 22 Apr 2019

404 downloads genomics

Nicolas C Rochette, Angel G Rivera-Colón, Julian M Catchen

For half a century population genetics studies have put type II restriction endonucleases to work. Now, coupled with massively-parallel, short-read sequencing, the family of RAD protocols that wields these enzymes has generated vast genetic knowledge from the natural world. Here we describe the first software capable of using paired-end sequencing to derive short contigs from de novo RAD data natively. Stacks version 2 employs a de Bruijn graph assembler to build contigs from paired-end reads and overlap those contigs with the corresponding single-end loci. The new architecture allows all the individuals in a metapopulation to be considered at the same time as each RAD locus is processed. This enables a Bayesian genotype caller to provide precise SNPs, and a robust algorithm to phase those SNPs into long haplotypes -- generating RAD loci that are 400-800 bp in length. To prove its recall and precision, we test the software with simulated data and compare reference-aligned and de novo analyses of three empirical datasets. We show that the latest version of Stacks is highly accurate and outperforms other software in assembling and genotyping paired-end de novo datasets.
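
As a rough illustration of the de Bruijn graph idea mentioned in this abstract, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers seen in the reads, then extends a contig while the path is unambiguous. The function names and toy reads are assumptions made for illustration; this is not the Stacks 2 assembler, which additionally handles sequencing errors, coverage and overlap with single-end loci.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:  # stop on cycles instead of looping forever
            break
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


if __name__ == "__main__":
    # Two toy reads that overlap on "GTG".
    graph = de_bruijn_graph(["ACGTG", "GTGCA"], k=3)
    print(walk_unambiguous(graph, "AC"))  # ACGTGCA
```

The walk stops at any branch point, which is where a real assembler would need extra evidence such as read coverage to choose a path.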

63: Inherited Causes of Clonal Hematopoiesis of Indeterminate Potential in TOPMed Whole Genomes

Posted to bioRxiv 27 Sep 2019

403 downloads genomics

Alexander G Bick, Joshua Weinstock, Satish K Nandakumar, Charles P Fulco, Matthew J Leventhal, Erik L Bao, Joseph Nasser, Seyedeh M. Zekavat, Mindy D Szeto, Cecelia Laurie, Margaret Taub, Braxton D Mitchell, Kathleen Barnes, Arden Moscati, Myriam Fornage, Susan Redline, Bruce M Psaty, Edwin Silverman, Scott T Weiss, Nicolette Palmer, Ramachandran S Vasan, Esteban Burchard, Sharon Kardia, Jiang He, Robert Kaplan, Nicholas L Smith, Donna Arnett, David Schwartz, Adolfo Correa, Mariza de Andrade, Xiuqing Guo, Barbara A Konkle, Brian Custer, Juan Peralta, Hongsheng Gui, Deborah Meyers, Stephen T McGarvey, Ida Chen, M Benjamin Shoemaker, Patricia A Peyser, Jai Broome, Stephanie M. Gogarten, Fei Fei Wang, Quenna Wong, May Montasser, Michelle Daya, Eimear E Kenny, Kari North, Lenore J Launer, Brian E Cade, Joshua C Bis, Michael Cho, Jessica Lasky-Su, Donald W. Bowden, L Adrienne Cupples, Angel CY Mak, Lewis C. Becker, Jennifer A. Smith, Tanika N Kelly, Stella Aslibekyan, Susan R Heckbert, Hermant Tiwari, Ivana V. Yang, John Heit, Steven Lubitz, Steve Rich, Jill Johnsen, Joanne E. Curran, Sally E Wenzel, Daniel E Weeks, Dabeeru C Rao, Dawood Darbar, Jee-Young Moon, Russell P Tracy, Erin J Buth, Nicholas Rafaels, Ruth JF Loos, Lifang Hou, Jiwon Lee, Priyadarshini Kachroo, Barry I. Freedman, Daniel Levy, Lawrence F Bielak, James Hixson, James S Floyd, Eric A Whitsel, Patrick Ellinor, Marguerite R Irvin, Tasha E. Fingerlin, Laura M Raffield, Sebastian M Armasu, Jerome I Rotter, Marsha Wheeler, Ester C. Sabino, John Blangero, L. Keoki Williams, Bruce D. Levy, Wayne Huey-Herng Sheu, Dan Roden, Eric Boerwinkle, JoAnn E Manson, Rasika A. Mathias, Pinkal Desai, Kent D Taylor, Andrew D. Johnson, Paul Auer, Charles Kooperberg, Cathy C. Laurie, Tom Blackwell, Albert V Smith, Hong-yu Zhao, Ethan Lange, Leslie Lange, James G Wilson, Eric S Lander, Jesse M Engreitz, Benjamin L Ebert, Alexander P Reiner, Vijay G Sankaran, Sidd Jaiswal, Goncalo Abecasis, Pradeep Natarajan, Sekar Kathiresan, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium

Age is the dominant risk factor for most chronic human diseases; yet the mechanisms by which aging confers this risk are largely unknown. Recently, the age-related acquisition of somatic mutations in regenerating hematopoietic stem cell populations was associated with both hematologic cancer incidence and coronary heart disease prevalence. Somatic mutations with leukemogenic potential may confer selective cellular advantages leading to clonal expansion, a phenomenon termed 'Clonal Hematopoiesis of Indeterminate Potential' (CHIP). Simultaneous germline and somatic whole genome sequence analysis now provides the opportunity to identify root causes of CHIP. Here, we analyze high-coverage whole genome sequences from 97,691 participants of diverse ancestries in the NHLBI TOPMed program and identify 4,229 individuals with CHIP. We identify associations with blood cell, lipid, and inflammatory traits specific to different CHIP genes. Association of a genome-wide set of germline genetic variants identified three genetic loci associated with CHIP status, including one locus at TET2 that was African ancestry specific. In silico-informed in vitro evaluation of the TET2 germline locus identified a causal variant that disrupts a TET2 distal enhancer. Aggregates of rare germline loss-of-function variants in CHEK2, a DNA damage repair gene, predisposed to CHIP acquisition. Overall, we observe that germline genetic variation altering hematopoietic stem cell function and the fidelity of DNA-damage repair increase the likelihood of somatic mutations leading to CHIP.

64: miqoGraph : Fitting admixture graphs using mixed-integer quadratic optimization

Posted to bioRxiv 11 Oct 2019

399 downloads genomics

Julia Yan, Nick Patterson, Vagheesh M Narasimhan

Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this paper we present the Julia package miqoGraph, which uses mixed-integer quadratic optimization to fit topology, drift lengths, and admixture proportions simultaneously. Inference of topology is particularly powerful, with integer optimization automating what is usually an arduous manual process.

65: Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Posted to bioRxiv 06 Mar 2019

383 downloads genomics

Daniel Taliun, Daniel N. Harris, Michael D Kessler, Jedidiah Carlson, Zachary A. Szpiech, Raul Torres, Sarah A Gagliano Taliun, André Corvelo, Stephanie M. Gogarten, Hyun Min Kang, Achilleas N Pitsillides, Jonathon LeFaive, Seung-been Lee, Xiaowen Tian, Brian L Browning, Sayantan Das, Anne-Katrin Emde, Wayne E. Clarke, Douglas P. Loesch, Amol C Shetty, Thomas W Blackwell, Quenna Wong, François Aguet, Christine Albert, Alvaro Alonso, Kristin G Ardlie, Stella Aslibekyan, Paul L Auer, John Barnard, R. Graham Barr, Lewis C. Becker, Rebecca L Beer, Emelia J. Benjamin, Lawrence F Bielak, John Blangero, Michael Boehnke, Donald W. Bowden, Jennifer A Brody, Esteban G Burchard, Brian E Cade, James F. Casella, Brandon Chalazan, Yii-Der Ida Chen, Michael H Cho, Seung Hoan Choi, Mina K. Chung, Clary B Clish, Adolfo Correa, Joanne E. Curran, Brian Custer, Dawood Darbar, Michelle Daya, Mariza de Andrade, Dawn L DeMeo, Susan K Dutcher, Patrick T Ellinor, Leslie S Emery, Diane Fatkin, Lukas Forer, Myriam Fornage, Nora Franceschini, Christian Fuchsberger, Stephanie M Fullerton, Soren Germer, Mark T Gladwin, Daniel J Gottlieb, Xiuqing Guo, Michael E Hall, Jiang He, Nancy L. Heard-Costa, Susan R Heckbert, Marguerite R Irvin, Jill M Johnsen, Andrew D. Johnson, Sharon LR Kardia, Tanika Kelly, Shannon Kelly, Eimear E Kenny, Douglas P Kiel, Robert Klemmer, Barbara A Konkle, Charles Kooperberg, Anna Köttgen, Leslie A Lange, Jessica Lasky-Su, Daniel Levy, Xihong Lin, Keng-Han Lin, Chunyu Liu, Ruth JF Loos, Lori Garman, Robert Gerszten, Steven A. Lubitz, Kathryn L. Lunetta, Angel CY Mak, Ani Manichaikul, Alisa K Manning, Rasika A. Mathias, David D McManus, Stephen T McGarvey, James B Meigs, Deborah A Meyers, Julie L Mikulla, Mollie A Minear, Braxton Mitchell, Sanghamitra Mohanty, May E Montasser, Courtney Montgomery, Alanna C. Morrison, Joanne M. Murabito, Andrea Natale, Pradeep Natarajan, Sarah C. Nelson, Kari E. North, Jeffrey R. O’Connell, Nicholette D Palmer, Nathan Pankratz, Gina M Peloso, Patricia A Peyser, Wendy S. Post, Bruce M Psaty, DC Rao, Susan Redline, Alexander P Reiner, Dan Roden, Jerome I Rotter, Ingo Ruczinski, Chloé Sarnowski, Sebastian Schoenherr, Jeong-Sun Seo, Sudha Seshadri, Vivien A Sheehan, M Benjamin Shoemaker, Albert V Smith, Nicholas L Smith, Jennifer A. Smith, Nona Sotoodehnia, Adrienne M. Stilp, Weihong Tang, Kent D Taylor, Marilyn Telen, Timothy A Thornton, Russell P Tracy, David J. Van Den Berg, Ramachandran S Vasan, Karine A Viaud-Martinez, Scott Vrieze, Daniel E Weeks, Bruce S. Weir, Scott T Weiss, Lu-Chen Weng, Cristen J. Willer, Yingze Zhang, Xutong Zhao, Donna K. Arnett, Allison E Ashley-Koch, Kathleen C Barnes, Eric Boerwinkle, Stacey Gabriel, Richard Gibbs, Kenneth M Rice, Stephen S Rich, Edwin Silverman, Pankaj Qasba, Weiniu Gan, Trans-Omics for Precision Medicine (TOPMed) Program, TOPMed Population Genetics Working Group, George J Papanicolaou, Deborah A. Nickerson, Sharon R Browning, Michael C. Zody, Sebastian Zöllner, James G Wilson, L Adrienne Cupples, Cathy C. Laurie, Cashell E Jaquish, Ryan D. Hernandez, Timothy D. O’Connor, Gonçalo R Abecasis

The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency <1% and 46% are singletons. These rare variants provide insights into mutational processes and recent human evolutionary history. The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and extends the reach of nearly all genome-wide association studies to include variants down to ~0.01% in frequency.

66: Dissociation of solid tumour tissues with cold active protease for single-cell RNA-seq minimizes conserved collagenase associated stress responses

Posted to bioRxiv 27 Jun 2019

379 downloads genomics

Ciara H O’Flanagan, Kieran R Campbell, Allen W Zhang, Farhia Kabeer, Jamie LP Lim, Justina Biele, Peter Eirew, Daniel Lai, Andrew McPherson, Esther Kong, Cherie Bates, Kelly Borkowski, Matt Wiens, James Hopkins, Brittany Hewitson, Nicholas Ceglia, Richard Moore, Andy J. Mungall, Jessica N. McAlpine, The CRUK IMAXT Grand Challenge Team, Sohrab P Shah, Samuel Aparicio

Background: Single-cell RNA sequencing (scRNAseq) is a powerful tool for studying complex biological systems, such as tumour heterogeneity and tissue microenvironments. However, the sources of technical and biological variation in primary solid tumour tissues and patient-derived mouse xenografts for scRNAseq are not well understood. Here, we used low temperature (6°C) protease and collagenase (37°C) to identify the transcriptional signatures associated with tissue dissociation across a diverse scRNAseq dataset comprising 128,481 cells from patient cancer tissues, patient-derived breast cancer xenografts and cancer cell lines. Results: We observe substantial variation in standard quality control (QC) metrics of cell viability across conditions and tissues. From FACS sorted populations gated for cell viability, we identify a subpopulation of dead cells that would pass standard data filtering practices, and quantify the extent to which their transcriptomes differ from live cells. We identify a further subpopulation of transcriptomically "dying" cells that exhibit up-regulation of MHC class I transcripts, in contrast with live and fully dead cells. From the contrast between tissue protease dissociation at 37°C or 6°C, we observe that collagenase digestion results in a stress response. We derive a core gene set of 512 heat shock and stress response genes, including FOS and JUN, induced by collagenase (37°C), which are minimized by dissociation with a cold active protease (6°C). While induction of these genes was highly conserved across all cell types, cell type-specific responses to collagenase digestion were observed in patient tissues. We observe that the yield of cancer and non-cancer cell types varies between tissues and dissociation methods. Conclusions: The method and conditions of tumour dissociation influence cell yield and transcriptome state and are both tissue and cell type dependent.
Interpretation of stress pathway expression differences in cancer single cell studies, including components of surface immune recognition such as MHC class I, may be especially confounded. We define a core set of 512 genes that can assist with identification of such effects in dissociated scRNA-seq experiments.

67: A sorghum Practical Haplotype Graph facilitates genome-wide imputation and cost-effective genomic prediction

Posted to bioRxiv 03 Oct 2019

376 downloads genomics

Sarah E. Jensen, Jean Rigaud Charles, Kebede Muleta, Peter Bradbury, Terry Casstevens, Santosh P. Deshpande, Michael A. Gore, Rajeev Gupta, Daniel C. Ilut, Lynn Johnson, Roberto Lozano, Zachary Miller, Punna Ramu, Abhishek Rathore, Cinta Romay, Hari D. Upadhyaya, Rajeev Varshney, Geoffrey P. Morris, Gael Pressoir, Edward S Buckler, Guillaume Paul Ramstein

Successful management and utilization of increasingly large genomic datasets is essential for breeding programs to increase genetic gain and accelerate cultivar development. To help with data management and storage, we developed a sorghum Practical Haplotype Graph (PHG) pangenome database that stores all identified haplotypes and variant information for a given set of individuals. We developed two PHGs in sorghum, one with 24 individuals and another with 374 individuals, that reflect the diversity across genic regions of the sorghum genome. 24 founders of the Chibas sorghum breeding program were sequenced at low coverage (0.01x) and processed through the PHG to identify genome-wide variants. The PHG called SNPs with only 5.9% error at 0.01x coverage - only 3% lower than its accuracy when calling SNPs from 8x coverage sequence. Additionally, 207 progeny from the Chibas genomic selection (GS) training population were sequenced and processed through the PHG. Missing genotypes in the progeny were imputed from the parental haplotypes available in the PHG and used for genomic prediction. Mean prediction accuracies with PHG SNP calls range from 0.57-0.73 for different traits, and are similar to prediction accuracies obtained with genotyping-by-sequencing (GBS) or markers from sequencing targeted amplicons (rhAmpSeq). This study provides a proof of concept for using a sorghum PHG to call and impute SNPs from low-coverage sequence data and also shows that the PHG can unify genotype calls from different sequencing platforms. By reducing the amount of input sequence needed, the PHG has the potential to decrease the cost of genotyping for genomic selection, making GS more feasible and facilitating larger breeding populations that can capture maximum recombination. 
Our results demonstrate that the PHG is a useful research and breeding tool that can maintain variant information from a diverse group of taxa, store sequence data in a condensed but readily accessible format, unify genotypes from different genotyping methods, and provide a cost-effective option for genomic selection for any species.

68: A Large-Scale Binding and Functional Map of Human RNA Binding Proteins

Posted to bioRxiv 23 Aug 2017

373 downloads genomics

Eric L Van Nostrand, Peter Freese, Gabriel A Pratt, Xiaofeng Wang, Xintao Wei, Rui Xiao, Steven M. Blue, Jia-Yu Chen, Neal A.L. Cody, Daniel Dominguez, Sara Olson, Balaji Sundararaman, Lijun Zhan, Cassandra Bazile, Louis Philip Benoit Bouvrette, Julie Bergalet, Michael O Duff, Keri E. Garcia, Chelsea Gelboin-Burkhart, Myles Hochman, Nicole J Lambert, Hairi Li, Thai B Nguyen, Tsultrim Palden, Ines Rabano, Shashank Sathe, Rebecca Stanton, Amanda Su, Ruth Wang, Brian A Yee, Bing Zhou, Ashley L Louie, Stefan Aigner, Xiang-dong Fu, Eric Lécuyer, Christopher B. Burge, Brenton R. Graveley, Gene W Yeo

Genomes encompass all the information necessary to specify the development and function of an organism. In addition to genes, genomes also contain a myriad of functional elements that control various steps in gene expression. A major class of these elements function only when transcribed into RNA as they serve as the binding sites for RNA binding proteins (RBPs), which act to control post-transcriptional processes including splicing, cleavage and polyadenylation, RNA editing, RNA localization, translation, and RNA stability. Despite the importance of these functional RNA elements encoded in the genome, they have been much less studied than genes and DNA elements. Here, we describe the mapping and characterization of RNA elements recognized by a large collection of human RBPs in K562 and HepG2 cells. These data expand the catalog of functional elements encoded in the human genome by addition of a large set of elements that function at the RNA level through interaction with RBPs.

69: The genetic origins of Saint Helena's liberated Africans

Posted to bioRxiv 01 Oct 2019

371 downloads genomics

Marcela Sandoval-Velasco, Anuradha Jagadeesan, María C Ávila-Arcos, Shyam Gopalakrishnan, Jazmín Ramos-Madrigal, J. Víctor Moreno-Mayar, Gabriel Renaud, Diana I. Cruz-Dávalos, Erna Johannesdóttir, Judy Watson, Kate Robson-Brown, Andrew Pearson, Agnar Helgason, M. Thomas P. Gilbert, Hannes Schroeder

From 1500 to 1900, an estimated 12 million Africans were transported to the Americas as part of the transatlantic slave trade. Following Britain's abolition of the slave trade in 1807, the Royal Navy patrolled the Atlantic and intercepted slave ships that continued to operate. During this period, the island of St Helena in the South Atlantic served as a depot for "liberated" Africans. Between 1840 and 1867, approximately 27,000 Africans were disembarked on the island. To investigate their origins, we generated genome-wide ancient DNA data for 20 individuals recovered from St Helena. The genetic data show they came from West Central Africa, possibly the area of present-day Gabon and Angola. The data imply that they did not belong to a single population, confirming historical reports of cultural heterogeneity in the island's African community. Our results shed new light on the origins of enslaved Africans during the final stages of the slave trade and illustrate how genetic data can be used to complement and validate historical sources.

70: Whole-genome sequencing of rare disease patients in a national healthcare system

Posted to bioRxiv 01 Jan 2019

363 downloads genomics

The NIHR BioResource, on behalf of the 100,000 Genomes Project

Most patients with hereditary rare diseases do not receive a molecular diagnosis and the aetiological variants and mediating genes for half of such disorders remain to be discovered. We implemented whole-genome sequencing (WGS) in a national healthcare system to streamline diagnosis and to discover unknown aetiological variants in the coding and non-coding regions of the genome. In a pilot study for the 100,000 Genomes Project, we generated WGS data for 13,037 participants, of whom 9,802 had a rare disease, and provided a genetic diagnosis to 1,040 of the 7,065 patients with detailed phenotypic data. We identified 99 Mendelian associations between genes and rare diseases, of which at least 80 are confirmed aetiological. Using WGS of UK Biobank, we showed that rare alleles can explain the presence of some individuals in the tails of a quantitative red blood cell (RBC) trait. Finally, we reported novel non-coding variants which cause disease through the disruption of transcription of ARPC1B, GATA1, LRBA and MPL. Our study demonstrates a synergy by using WGS for diagnosis and aetiological discovery in routine healthcare.

71: Admixture-enabled selection for rapid adaptive evolution in the Americas

Posted to bioRxiv 28 Sep 2019

361 downloads genomics

Emily T. Norris, Lavanya Rishishwar, Aroon T. Chande, Andrew B. Conley, Kaixiong Ye, Augusto Valderrama-Aguirre, I. King Jordan

Background: Admixture occurs when previously isolated populations come together and exchange genetic material. We hypothesized that admixture can enable rapid adaptive evolution in human populations by introducing novel genetic variants (haplotypes) at intermediate frequencies, and we tested this hypothesis via the analysis of whole genome sequences sampled from admixed Latin American populations in Colombia, Mexico, Peru, and Puerto Rico. Results: Our screen for admixture-enabled selection relies on the identification of loci that contain more or less ancestry from a given source population than would be expected given the genome-wide ancestry frequencies. We employed a combined evidence approach to evaluate levels of ancestry enrichment at (1) single loci across multiple populations and (2) multiple loci that function together to encode polygenic traits. We found cross-population signals of African ancestry enrichment at the major histocompatibility locus on chromosome 6, consistent with admixture-enabled selection for enhanced adaptive immune response. Several of the human leukocyte antigen genes at this locus (HLA-A, HLA-DRB51 and HLA-DRB5) showed independent evidence of positive selection prior to admixture, based on extended haplotype homozygosity in African populations. A number of traits related to inflammation, blood metabolites, and both the innate and adaptive immune system showed evidence of admixture-enabled polygenic selection in Latin American populations. Conclusions: The results reported here, considered together with the ubiquity of admixture in human evolution, suggest that admixture serves as a fundamental mechanism that drives rapid adaptive evolution in human populations.
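
The screening idea described in this abstract, flagging loci whose ancestry from one source population departs from the genome-wide expectation, can be sketched as a per-locus z-score. The function name, the toy fractions and the use of a simple standard-deviation scaling are assumptions for illustration; the authors' combined-evidence approach across populations and polygenic traits is more involved.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def ancestry_enrichment(local_fractions: Sequence[float]) -> List[float]:
    """Z-score of each locus's ancestry fraction against the genome-wide distribution.

    `local_fractions[i]` is the fraction of haplotypes at locus i assigned to one
    source population (hypothetical input; real data come from local ancestry inference).
    """
    genome_wide = mean(local_fractions)
    spread = pstdev(local_fractions)
    if spread == 0:
        raise ValueError("ancestry fractions do not vary across loci")
    return [(fraction - genome_wide) / spread for fraction in local_fractions]


if __name__ == "__main__":
    # Three loci near the genome-wide level and one with elevated ancestry.
    for locus, z in enumerate(ancestry_enrichment([0.2, 0.2, 0.2, 0.6])):
        print(locus, round(z, 3))
```

Positive values mark loci with more of the given ancestry than expected and negative values mark depletion, matching the "more or less ancestry" wording of the abstract.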

72: Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis

Posted to bioRxiv 19 Oct 2018

354 downloads genomics

Urmo Võsa, Annique Claringbould, Harm-Jan Westra, Marc Jan Bonder, Patrick Deelen, Biao Zeng, Holger Kirsten, Ashis Saha, Roman Kreuzhuber, Silva Kasela, Natalia Pervjakova, Isabel Alvaes, Marie-Julie Fave, Mawusse Agbessi, Mark Christiansen, Rick Jansen, Ilkka Seppälä, Lin Tong, Alexander Teumer, Katharina Schramm, Gibran Hemani, Joost Verlouw, Hanieh Yaghootkar, Reyhan Sönmez, Andrew Brown, Viktorija Kukushkina, Anette Kalnapenkis, Sina Rüeger, Eleonora Porcu, Jaanika Kronberg-Guzman, Johannes Kettunen, Joseph Powell, Bernett Lee, Futao Zhang, Wibowo Arindrarto, Frank Beutner, BIOS Consortium, Harm Brugge, i2QTL Consortium, Julia Dmitreva, Mahmoud Elansary, Benjamin P. Fairfax, Michel Georges, Bastiaan T. Heijmans, Mika Kähönen, Yungil Kim, Julian C Knight, Peter Kovacs, Knut Krohn, Shuang Li, Markus Loeffler, Urko M Marigorta, Hailang Mei, Yukihide Momozawa, Martina Müller-Nurasyid, Matthias Nauck, Michel Nivard, Brenda Penninx, Jonathan Pritchard, Olli Raitakari, Olaf Rotzchke, Eline P Slagboom, Coen D.A. Stehouwer, Michael Stumvoll, Patrick Sullivan, Peter A.C. ‘t Hoen, Joachim Thiery, Anke Tönjes, Jenny van Dongen, Maarten van Iterson, Jan Veldink, Uwe Völker, Cisca Wijmenga, Morris Swertz, Anand Andiappan, Grant W. Montgomery, Samuli Ripatti, Markus Perola, Zoltan Kutalik, Emmanouil Dermitzakis, Sven Bergmann, Timothy Frayling, Joyce van Meurs, Holger Prokisch, Habibul Ahsan, Brandon Pierce, Terho Lehtimäki, Dorret Boomsma, Bruce M Psaty, Sina A. Gharib, Philip Awadalla, Lili Milani, Willem Ouwehand, Kate Downes, Oliver Stegle, Alexis Battle, Jian Yang, Peter M. Visscher, Markus Scholz, Gregory Gibson, Tõnu Esko, Lude Franke

While many disease-associated variants have been identified through genome-wide association studies, their downstream molecular consequences remain unclear. To identify these effects, we performed cis- and trans-expression quantitative trait locus (eQTL) analysis in blood from 31,684 individuals through the eQTLGen Consortium. We observed that cis-eQTLs can be detected for 88% of the studied genes, but that they have a different genetic architecture compared to disease-associated variants, limiting our ability to use cis-eQTLs to pinpoint causal genes within susceptibility loci. In contrast, trans-eQTLs (detected for 37% of 10,317 studied trait-associated variants) were more informative. Multiple unlinked variants associated with the same complex trait often converged on trans-genes that are known to play central roles in disease etiology. We observed the same when ascertaining the effect of polygenic scores calculated for 1,263 genome-wide association study (GWAS) traits. Expression levels of 13% of the studied genes correlated with polygenic scores, and many resulting genes are known to drive these traits.
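
At its simplest, an eQTL test asks whether expression of a gene changes with the number of alternate alleles carried at a variant. The sketch below estimates that effect as an ordinary least-squares slope; the function name and toy data are assumptions, and it omits the covariate adjustment, meta-analysis across cohorts and multiple-testing control that a study of this kind relies on.

```python
from typing import Sequence


def eqtl_slope(dosages: Sequence[float], expression: Sequence[float]) -> float:
    """Least-squares slope of expression on genotype dosage (0, 1 or 2 alleles)."""
    if len(dosages) != len(expression) or len(dosages) < 2:
        raise ValueError("need matched dosage and expression values for >= 2 samples")
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("all samples share the same genotype")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    return sxy / sxx


if __name__ == "__main__":
    # Toy data: expression rises by 0.5 units per alternate allele.
    print(eqtl_slope([0, 1, 2, 1], [1.0, 1.5, 2.0, 1.5]))
```

The same calculation applies to cis- and trans-eQTLs; the distinction is whether the variant lies near the gene or elsewhere in the genome.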

73: Nanopore native RNA sequencing of a human poly(A) transcriptome

Posted to bioRxiv 09 Nov 2018

342 downloads genomics

Rachael E. Workman, Alison D Tang, Paul S. Tang, Miten Jain, John R Tyson, Philip C Zuzarte, Timothy Gilpatrick, Roham Razaghi, Joshua Quick, Norah Sadowski, Nadine Holmes, Jaqueline Goes de Jesus, Karen L. Jones, Terrance P Snutch, Nicholas J Loman, Benedict Paten, Matthew Loose, Jared T Simpson, Hugh E Olsen, Angela N Brooks, Mark Akeson, Winston Timp

High throughput RNA sequencing technologies have dramatically advanced our understanding of transcriptome complexity and regulation. However, these cDNA-based methods lose information contained in biological RNA because the copied reads are short or because modifications are not carried forward in cDNA. Here we address these limitations using a native poly(A) RNA sequencing strategy developed by Oxford Nanopore Technologies (ONT). Our study focused on poly(A) RNA isolated from the human cell line GM12878, from which we sequenced approximately 9.9 million individual aligned strands. These native RNA sequence reads had an N50 length of 1,334 bases and a maximum length of 22,000 bases. A total of 78,199 high-confidence isoforms were identified by combining long nanopore reads with short higher accuracy Illumina reads. Among these isoforms, over 50% are not present in GENCODE v24. We describe strategies for assessing 3' poly(A) tail length, base modifications and transcript haplotypes using this single molecule technology. Together, these nanopore-based techniques are poised to deliver new insights into RNA biology.
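
The N50 statistic quoted in this abstract is the read length at which reads of that length or longer contain at least half of all sequenced bases. A minimal sketch of that definition follows; the function name and toy lengths are illustrative and not taken from the paper.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that reads of length >= L cover at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # 20 bases in total; the two longest reads (6 + 5) already cover half.
    print(n50([2, 3, 4, 5, 6]))  # 5
```

Because it weights reads by the bases they contribute, N50 is less sensitive to a large number of very short reads than the mean or median read length.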

74: A complete Cannabis chromosome assembly and adaptive admixture for elevated cannabidiol (CBD) content

Posted to bioRxiv 31 Oct 2018

339 downloads genomics

Christopher J Grassa, Jonathan P Wenger, Clemon Dabney, Shane G Poplawski, S Timothy Motley, Todd P Michael, C.J. Schwartz, George D Weiblen

Cannabis has been cultivated for millennia with distinct cultivars providing either fiber and grain or tetrahydrocannabinol. Recent demand for cannabidiol rather than tetrahydrocannabinol has favored the breeding of admixed cultivars with extremely high cannabidiol content. Despite several draft Cannabis genomes, the genomic structure of cannabinoid synthase loci has remained elusive. A genetic map derived from a tetrahydrocannabinol/cannabidiol segregating population and a complete chromosome assembly from a high-cannabidiol cultivar together resolve the linkage of cannabidiolic and tetrahydrocannabinolic acid synthase gene clusters which are associated with transposable elements. High-cannabidiol cultivars appear to have been generated by integrating hemp-type cannabidiolic acid synthase gene clusters into a background of marijuana-type cannabis. Quantitative trait locus mapping suggests that overall drug potency, however, is associated with other genomic regions needing additional study.

75: Great ape mutation spectra vary across the phylogeny and the genome due to distinct mutational processes that evolve at different rates

Posted to bioRxiv 15 Oct 2019

338 downloads genomics

Michael E. Goldberg, Kelley Harris

Recent studies of hominoid variation have shown that mutation rates and spectra can evolve rapidly, contradicting the fixed molecular clock model. The relative mutation rates of three-base-pair motifs differ significantly among great ape species, suggesting the action of unknown modifiers of DNA replication fidelity. To illuminate the footprints of these hypothetical mutators, we measured mutation spectra of several functional compartments (such as late-replicating regions) that are likely targeted by localized mutational processes. Using genetic diversity from 88 great apes, we find that compartment-specific mutational signatures appear largely conserved between species. These signatures layer with species-specific signatures to create rich mutational portraits: for example, late-replicating regions in gorillas contain an identifiable mixture of a replication timing signature and a gorilla-specific signature. Our results suggest that cis-acting mutational modifiers are highly conserved between species and trans-acting modifiers are driving rapid mutation spectrum evolution.

76: Citizen-Centered, Auditable, and Privacy-Preserving Population Genomics

Posted to bioRxiv 10 Oct 2019

326 downloads genomics

Dennis Grishin, Jean Louis Raisaro, Juan Ramón Troncoso-Pastoriza, Kamal Obbad, Kevin Quinn, Mickaël Misbach, Jared Gollhardt, Joao Sa, Jacques Fellay, George M Church, Jean-Pierre Hubaux

The growing number of health-data breaches, the use of genomic databases for law enforcement purposes and the lack of transparency of personal-genomics companies are raising unprecedented privacy concerns. To enable a secure exploration of genomic datasets with controlled and transparent data access, we propose a novel approach that combines cryptographic privacy-preserving technologies, such as homomorphic encryption and secure multi-party computation, with the auditability of blockchains. This approach provides strong security guarantees against realistic threat models by empowering individual citizens to decide who can query and access their genomic data and by ensuring end-to-end data confidentiality. Our open-source implementation supports queries on the encrypted genomic data of hundreds of thousands of individuals, with minimal overhead. Our work opens a path towards multi-functional, privacy-preserving genomic-data analysis.

77: Minor allele frequency thresholds strongly affect population structure inference with genomic datasets

Posted to bioRxiv 14 Sep 2017

322 downloads genomics

Ethan B. Linck, C.J. Battey

Across the genome, the effects of different evolutionary processes and historical events can result in different classes of genetic variants (or alleles) characterized by their relative frequency in a given population. As a result, population genetic inference can be strongly affected by biases in laboratory and bioinformatics treatments that affect the site frequency spectrum, or SFS. Yet despite the widespread use of reduced-representation genomic datasets with nonmodel organisms, the potential consequences of these biases for downstream analyses remain poorly examined. Here, we assess the influence of minor allele frequency (MAF) thresholds implemented during variant detection on inference of population structure. We use simulated and empirical datasets to evaluate the effect of MAF thresholds on the ability to discriminate among populations and quantify admixture with both model-based and non-model-based clustering methods. We find model-based inference of population structure is highly sensitive to choice of MAF, and may be confounded by either including singletons or excluding all rare alleles. In contrast, non-model-based clustering is largely robust to MAF choice. Our results suggest that model-based inference of population structure can fail due to either natural demographic processes or assembly artifacts, with broad consequences for phylogeographic and population genetic studies using NGS data. We propose a simple hypothesis to explain this behavior and recommend a set of best practices for researchers seeking to describe population structure using reduced-representation libraries.
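
As a rough illustration of what a MAF threshold does, the sketch below computes the minor allele frequency of each biallelic site and keeps only sites at or above a cutoff. It is a simplified assumption-laden example, not the authors' workflow: genotypes are assumed to be diploid alternate-allele counts (0, 1, 2, or `None` for missing), and the function names and toy data are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF for one biallelic site.

    genotypes: alternate-allele counts per diploid individual
    (0, 1 or 2), with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(sites, min_maf):
    """Keep only the sites whose MAF is at least min_maf."""
    return [s for s in sites if minor_allele_frequency(s) >= min_maf]


# Toy data: three sites genotyped in four individuals.
toy_sites = [
    [0, 0, 0, 1],  # singleton: one minor allele among 8 chromosomes
    [1, 1, 2, 0],  # common variant
    [2, 2, 2, 2],  # fixed for the alternate allele, so MAF is 0
]
print(filter_by_maf(toy_sites, min_maf=0.2))
```

In this toy example a cutoff of 0.2 drops the singleton site, which is the kind of change to the site frequency spectrum the abstract describes.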

78: Common alleles of CMT2 and NRPE1 are major determinants of de novo DNA methylation variation in Arabidopsis thaliana

Posted to bioRxiv 25 Oct 2019

318 downloads genomics

Eriko Sasaki, Taiji Kawakatsu, Joseph R. Ecker, Magnus Nordborg

DNA cytosine methylation is an epigenetic mark associated with silencing of transposable elements (TEs) and heterochromatin formation. In plants, it occurs in three sequence contexts: CG, CHG, and CHH (where H is A, T, or C). The latter does not allow direct inheritance of methylation during DNA replication due to lack of symmetry, and methylation must therefore be re-established every cell generation. Genome-wide association studies (GWAS) have previously shown that CMT2 and NRPE1 are major determinants of genome-wide patterns of TE CHH-methylation. Here we instead focus on CHH-methylation of individual TEs and TE-families, allowing us to identify the pathways involved in CHH-methylation simply from natural variation and confirm the associations by comparing them with mutant phenotypes. Methylation at TEs targeted by the RNA-directed DNA methylation (RdDM) pathway is unaffected by CMT2 variation, but is strongly affected by variation at NRPE1, which is largely responsible for the longitudinal cline in this phenotype. In contrast, CMT2-targeted TEs are affected by both loci, which jointly explain 7.3% of the phenotypic variation (13.2% of total genetic effects). There is no longitudinal pattern for this phenotype, however, because the geographic patterns appear to compensate for each other in a pattern suggestive of stabilizing selection.
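
The three sequence contexts defined above (CG, CHG, and CHH, where H is A, T, or C) can be made concrete with a small sketch that classifies a cytosine by the two bases that follow it. This is an illustrative example only: it considers the forward strand alone, and the function name `methylation_context` is hypothetical rather than part of any published tool.

```python
H_BASES = frozenset("ACT")  # IUPAC code H: any base except G


def methylation_context(sequence, position):
    """Classify the cytosine at `position` as 'CG', 'CHG' or 'CHH'.

    Returns None if the base is not a C, or if the downstream
    bases are missing or ambiguous (for example 'N').
    """
    seq = sequence.upper()
    if position < 0 or position >= len(seq) or seq[position] != "C":
        return None
    downstream = seq[position + 1:position + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2 or downstream[0] not in H_BASES:
        return None
    if downstream[1] == "G":
        return "CHG"
    if downstream[1] in H_BASES:
        return "CHH"
    return None


for triplet in ("CGA", "CAG", "CTT", "CCG"):
    print(triplet, methylation_context(triplet, 0))
```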

79: High-quality chromosome-scale assembly of the walnut (Juglans regia L) reference genome

Posted to bioRxiv 17 Oct 2019

318 downloads genomics

Annarita Marrano, Monica Britton, Paulo Adriano Zaini, Aleksey Zimin, Rachael Workman, Daniela Puiu, Luca Bianco, Erica Adele Di Pierro, Brian Allen, Sandeep Chakraborty, Michela Troggio, Charles Leslie, Winston Timp, Abhaya Dendekar, Steven Salzberg, David B. Neale

The release of the first reference genome of walnut (Juglans regia L.) enabled many achievements in the characterization of walnut genetic and functional variation. However, it is highly fragmented, preventing the integration of genetic, transcriptomic and proteomic information needed for a full understanding of walnut biological processes. Here we report the new chromosome-scale assembly of the walnut reference genome (Chandler v2.0), obtained by combining Oxford Nanopore long-read sequencing with chromosome conformation capture (Hi-C) technology. Relative to the previous reference genome, the new assembly features an 84.4-fold increase in N50 size and 16 fully assembled chromosomal pseudomolecules, nine of which present telomere sequences at both ends. Using full-length transcripts from single-molecule real-time sequencing, we predicted 40,491 gene models, with a mean gene length higher than that of the previous gene annotations. Most of the new protein-coding genes (90%) are full-length, which represents a great improvement compared to Chandler v1.0 (only 48%). We then tested the potential impact of the new chromosome-level genome on different areas of walnut research. By studying the proteome changes occurring during catkin development, we observed that the virtual proteome obtained from Chandler v2.0 presents fewer artifacts than the previous reference genome, enabling the identification of a new potential pollen allergen in walnut. Also, the new chromosome-scale genome facilitates in-depth studies of intraspecies genetic diversity by revealing previously undetected autozygous regions in Chandler, likely resulting from inbreeding, and 195 genomic regions highly differentiated between western and eastern walnut cultivars. Overall, Chandler v2.0 is a valuable resource for better understanding walnut biology.

80: Contribution of unfixed transposable element insertions to human regulatory variation

Posted to bioRxiv 03 Oct 2019

316 downloads genomics

Clement Goubert, Nicolas Arce Zevallos, Cedric Feschotte

Thousands of unfixed transposable element (TE) insertions segregate in the human population, but little is known about their impact on genome function. Recently, a few studies associated polymorphic TE insertions with mRNA levels of adjacent genes, but the biological significance of these associations, their replicability across cell types, and the mechanisms by which they may regulate genes remain largely unknown. Here we performed a TE-expression QTL analysis of 444 lymphoblastoid cell lines and 294 induced pluripotent stem cells using a newly developed set of genotypes for 2,806 polymorphic TE insertions. We identified 211 and 324 TE-eQTL acting in cis in the respective cell types. Approximately one-fourth were shared across cell types with strongly correlated effects. Furthermore, analysis of chromatin accessibility QTL in a subset of the lymphoblastoid cell lines suggests that unfixed TEs often modulate the activity of enhancers and other distal regulatory DNA elements, which tend to lose accessibility when a TE inserts within them. We also document a case of an unfixed TE likely influencing gene expression at the post-transcriptional level. Our study points to broad and diverse cis-regulatory effects of unfixed TEs in the human population and underscores their plausible contribution to phenotypic variation.

