
Rxivist.org combines preprints from bioRxiv.org with data from Twitter to help you find the papers being discussed in your field.
Currently indexing 83,751 bioRxiv papers from 360,740 authors.

Most downloaded bioRxiv papers, all time

Results 1 through 20 out of 5312

in category genomics


1: The Genomic Formation of South and Central Asia

Vagheesh Narasimhan, Nick Patterson et al.

89,595 downloads (posted 31 Mar 2018)

The genetic formation of Central and South Asian populations has been unclear because of an absence of ancient DNA. To address this gap, we generated genome-wide data from 362 ancient individuals, including the first from eastern Iran, Turan (Uzbekistan, Turkmenistan, and Tajikistan), Bronze Age Kazakhstan, and South Asia. Our data reveal a complex set of genetic sources that ultimately combined to form the ancestry of South Asians today. We document a southward spread of genetic ancestry from the Eurasian Steppe, correlating with the archaeologically known expansion of pastoralist sites from the Steppe to Turan in the Middle Bronze Age (2300-1500 BCE). These Steppe communities mixed genetically with peoples of the Bactria Margiana Archaeological Complex (BMAC) whom they encountered in Turan (primarily descendants of earlier agriculturalists of Iran), but there is no evidence that the main BMAC population contributed genetically to later South Asians. Instead, Steppe communities integrated farther south throughout the 2nd millennium BCE, and we show that they mixed with a more southern population that we document at multiple sites as outlier individuals exhibiting a distinctive mixture of ancestry related to Iranian agriculturalists and South Asian hunter-gatherers. We call this group Indus Periphery because they were found at sites in cultural contact with the Indus Valley Civilization (IVC) and along its northern fringe, and also because they were genetically similar to post-IVC groups in the Swat Valley of Pakistan.
By co-analyzing ancient DNA and genomic data from diverse present-day South Asians, we show that Indus Periphery-related people are the single most important source of ancestry in South Asia, consistent with the idea that the Indus Periphery individuals are providing us with the first direct look at the ancestry of peoples of the IVC, and we develop a model for the formation of present-day South Asians in terms of the temporally and geographically proximate sources of Indus Periphery-related, Steppe, and local South Asian hunter-gatherer-related ancestry. Our results show how ancestry from the Steppe genetically linked Europe and South Asia in the Bronze Age, and identify the populations that almost certainly were responsible for spreading Indo-European languages across much of Eurasia.


2: Patterns of genetic differentiation and the footprints of historical migrations in the Iberian Peninsula

Clare Bycroft, Ceres Fernandez-Rozadilla et al.

42,244 downloads (posted 12 Mar 2018)

Genetic differences within or between human populations (population structure) have been studied using a variety of approaches over many years. Recently there has been an increasing focus on studying genetic differentiation at fine geographic scales, such as within countries. Identifying such structure allows the study of recent population history, and identifies the potential for confounding in association studies, particularly when testing rare, often recently arisen variants. The Iberian Peninsula is linguistically di...


3: Creating a universal SNP and small indel variant caller with deep neural networks

Ryan Poplin, Pi-Chuan Chang et al.

36,565 downloads (posted 14 Dec 2016)

Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual's genome by calling genetic variants present in an individual using billions of short, errorful sequence reads. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome. Here we show that a deep convolutional neural networ...


4: Comprehensive integration of single cell data

Tim Stuart, Andrew Butler et al.

34,145 downloads (posted 02 Nov 2018)

Single cell transcriptomics (scRNA-seq) has transformed our ability to discover and annotate cell types and states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, including high-dimensional immunophenotypes, chromatin accessibility, and spatial positioning, a key analytical challenge is to integrate these datasets into a harmonized atlas that can be used to better understand cellular identity and function. Here, we devel...


5: Common methods for fecal sample storage in field studies yield consistent signatures of individual identity in microbiome sequencing data

Ran Blekhman, Karen Tang et al.

33,699 downloads (posted 04 Feb 2016)

Field studies of wild vertebrates are frequently associated with extensive collections of banked fecal samples, which are often collected from known individuals and sometimes also sampled longitudinally across time. Such collections represent unique resources for understanding ecological, behavioral, and phylogenetic effects on the gut microbiome, especially for species of particular conservation concern. However, we do not understand whether sample storage methods confound the ability to investigate interindividual var...


6: A guide to performing Polygenic Risk Score analyses

Shing Wan Choi, Timothy Shin Heng Mak et al.

32,985 downloads (posted 14 Sep 2018)

The application of polygenic risk scores (PRS) has become routine in genetic epidemiological studies. Among a range of applications, PRS are commonly used to assess shared aetiology among different phenotypes and to evaluate the predictive power of genetic data, while they are also now being exploited as part of study design, in which experiments are performed on individuals, or their biological samples (e.g. tissues, cells), at the tails of the PRS distribution and contrasted. As GWAS sample sizes increase and PRS becom...


7: Introductions and early spread of SARS-CoV-2 in France

Fabiana Gámbaro, Sylvie Behillil et al.

32,211 downloads (posted 24 Apr 2020)

Following the emergence of coronavirus disease (COVID-19) in Wuhan, China in December 2019, specific COVID-19 surveillance was launched in France on January 10, 2020. Two weeks later, the first three imported cases of COVID-19 into Europe were diagnosed in France. We sequenced 97 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes from samples collected between January 24 and March 24, 2020 from infected patients in France. Phylogenetic analysis identified several early independent SARS-CoV-2 introducti...


8: Analysis of protein-coding genetic variation in 60,706 humans

Exome Aggregation Consortium, Monkol Lek et al.

21,713 downloads (posted 30 Oct 2015)

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities generated as part of the Exome Aggregation Consortium (ExAC). The resulting catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of w...


9: The mutational constraint spectrum quantified from variation in 141,456 humans

Konrad Karczewski, Laurent C Francioli et al.

20,388 downloads (posted 28 Jan 2019)

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample size...


10: Complemented palindrome small RNAs first discovered from SARS coronavirus

Chang Liu, Ze Chen et al.

19,444 downloads (posted 07 Sep 2017)

In this study, we reported for the first time the existence of complemented palindrome small RNAs (cpsRNAs) and proposed cpsRNAs and palindrome small RNAs (psRNAs) as a novel class of small RNAs. The first discovered cpsRNA UCUUUAACAAGCUUGUUAAAGA from SARS coronavirus named SARS-CoV-cpsR-22 contained 22 nucleotides perfectly matching its reverse complementary sequence. Further sequence analysis supported that SARS-CoV-cpsR-22 originated from bat betacoronavirus. The results of RNAi experiments showed that one 19-nt segm...
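The defining property stated in this abstract, a sequence that perfectly matches its own reverse complement, can be checked directly. The following is a minimal sketch, not the authors' code; the function names are hypothetical and it assumes plain Watson-Crick RNA pairing (A-U, C-G) with no wobble pairs.

```python
# Illustrative sketch only: function names are assumptions, not from the paper.
# Assumes standard RNA base pairing (A<->U, C<->G).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]


def is_complemented_palindrome(rna: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return rna == reverse_complement(rna)


# The 22-nt sequence quoted in the abstract (SARS-CoV-cpsR-22).
seq = "UCUUUAACAAGCUUGUUAAAGA"
print(len(seq), is_complemented_palindrome(seq))
```

Changing any single base without changing its partner at the mirrored position breaks the property, which is why the test is an exact string comparison.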


11: Quantitative analysis of population-scale family trees using millions of relatives

Joanna Kaplanis, Assaf Gordon et al.

18,972 downloads (posted 07 Feb 2017)

Family trees have vast applications in multiple fields from genetics to anthropology and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. Here, we collected 86 million profiles from publicly-available online data from genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data ...


12: Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

Christoph Hafemeister, Rahul Satija

17,279 downloads (posted 14 Mar 2019)

Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from 'regularized negative binomial regression', where cellular sequencing depth is utilized as a covaria...


13: Vibrio natriegens, a new genomic powerhouse

Henry H Lee, Nili Ostrov et al.

16,863 downloads (posted 12 Jun 2016)

Recombinant DNA technology has revolutionized biomedical research with continual innovations advancing the speed and throughput of molecular biology. Nearly all these tools, however, are reliant on Escherichia coli as a host organism, and its lengthy growth rate increasingly dominates experimental time. Here we report the development of Vibrio natriegens, a free-living bacterium with the fastest generation time known, into a genetically tractable host organism. We systematically characterize its growth properties to esta...


14: Comparative analysis of single-cell RNA sequencing methods

Christoph Ziegenhain, Beate Vieth et al.

16,651 downloads (posted 05 Jan 2016)

Background: Single-cell RNA sequencing (scRNA‑seq) offers exciting possibilities to address biological and medical questions, but a systematic comparison of recently developed protocols is still lacking. Results: We generated data from 447 mouse embryonic stem cells using Drop‑seq, SCRB‑seq, Smart‑seq (on Fluidigm C1) and Smart‑seq2 and analyzed existing data from 35 mouse embryonic stem cells prepared with CEL‑seq. We find that Smart‑seq2 is the most sensitive method as it detects the most genes per cell and across cel...


15: When to use Quantile Normalization?

Stephanie C. Hicks, Rafael Irizarry

16,155 downloads (posted 04 Dec 2014)

Normalization and preprocessing are essential steps for the analysis of high-throughput data including next-generation sequencing and microarrays. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation from noisy data. These methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Transforming the data to remove these differences has the potential to remove interesting biologically d...
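To make the assumption discussed in this abstract concrete, here is a minimal sketch of textbook quantile normalization in plain Python. It is not the authors' implementation; the function name is hypothetical, samples are assumed to have equal length, and ties are broken by position rather than averaged.

```python
# Illustrative sketch only: a basic quantile normalization, with simplified
# tie handling. Names and structure are assumptions, not from the paper.
def quantile_normalize(samples):
    """Map every sample onto the same reference distribution.

    `samples` is a list of equal-length lists, one list per sample.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same length")

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(col) / len(samples) for col in zip(*sorted_samples)]

    normalized = []
    for s in samples:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


result = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(result)
```

After normalization every sample contains exactly the same set of values, so any genuine global shift between samples is erased along with the technical one. That is the risk the abstract describes.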


16: Mass-spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation

Bogdan Budnik, Ezra Levy et al.

16,028 downloads (posted 24 Jan 2017)

Cellular heterogeneity is important to biological processes, including cancer and development. However, proteome heterogeneity is largely unexplored because of the limitations of existing methods for quantifying protein levels in single cells. To alleviate these limitations, we developed Single Cell ProtEomics by Mass Spectrometry (SCoPE-MS), and validated its ability to identify distinct human cancer cell types based on their proteomes. We used SCoPE-MS to quantify over a thousand proteins in differentiating mouse embr...


17: The Beaker Phenomenon And The Genomic Transformation Of Northwest Europe

Iñigo Olalde, Selina Brace et al.

15,561 downloads (posted 09 May 2017)

Bell Beaker pottery spread across western and central Europe beginning around 2750 BCE before disappearing between 2200 and 1800 BCE. The mechanism of its expansion is a topic of long-standing debate, with support for both cultural diffusion and human migration. We present new genome-wide ancient DNA data from 170 Neolithic, Copper Age and Bronze Age Europeans, including 100 Beaker-associated individuals. In contrast to the Corded Ware Complex, which has previously been identified as arriving in central Europe following mig...


18: Highly parallel direct RNA sequencing on an array of nanopores

Daniel R Garalde, Elizabeth A Snell et al.

15,186 downloads (posted 12 Aug 2016)

Ribonucleic acid sequencing can allow us to monitor the RNAs present in a sample. This enables us to detect the presence and nucleotide sequence of viruses, or to build a picture of how active transcriptional processes are changing -- information that is useful for understanding the status and function of a sample. Nanopore-based sequencing technology is capable of electronically analysing a sample's DNA directly, and in real-time. In this manuscript we demonstrate the ability of an array of nanopores to sequence RNA di...


19: Regulation of Life Span by the Gut Microbiota in The Short-Lived African Turquoise Killifish

Patrick Smith, David Willemsen et al.

15,014 downloads (posted 27 Mar 2017)

Gut bacteria occupy the interface between the organism and the external environment, contributing to homeostasis and disease. Yet, the causal role of the gut microbiota during host aging is largely unexplored. Here, using the African turquoise killifish (Nothobranchius furzeri), a naturally short-lived vertebrate, we show that the gut microbiota plays a key role in modulating vertebrate life span. Recolonizing the gut of middle-age individuals with bacteria from young donors resulted in life span extension and delayed b...


20: RNA velocity in single cells

Gioele La Manno, Ruslan Soldatov et al.

14,044 downloads (posted 19 Oct 2017)

RNA abundance is a powerful indicator of the state of individual cells, but does not directly reveal dynamic processes such as cellular differentiation. Here we show that RNA velocity - the time derivative of RNA abundance - can be estimated by distinguishing unspliced and spliced mRNAs in standard single-cell RNA sequencing protocols. We show that RNA velocity is a vector that predicts the future state of individual cells on a timescale of hours. We validate the accuracy of RNA velocity in the neural crest lineage, dem...