1: The Genomic Formation of South and Central Asia

Vagheesh M Narasimhan, Nick Patterson et al.

79,746 downloads (posted 31 Mar 2018)

The genetic formation of Central and South Asian populations has been unclear because of an absence of ancient DNA. To address this gap, we generated genome-wide data from 362 ancient individuals, including the first from eastern Iran, Turan (Uzbekistan, Turkmenistan, and Tajikistan), Bronze Age Kazakhstan, and South Asia. Our data reveal a complex set of genetic sources that ultimately combined to form the ancestry of South Asians today. We document a southward spread of genetic ancestry from the Eurasian Steppe, correlating with the archaeologically known expansion of pastoralist sites from the Steppe to Turan in the Middle Bronze Age (2300-1500 BCE). These Steppe communities mixed genetically with peoples of the Bactria Margiana Archaeological Complex (BMAC) whom they encountered in Turan (primarily descendants of earlier agriculturalists of Iran), but there is no evidence that the main BMAC population contributed genetically to later South Asians. Instead, Steppe communities integrated farther south throughout the 2nd millennium BCE, and we show that they mixed with a more southern population that we document at multiple sites as outlier individuals exhibiting a distinctive mixture of ancestry related to Iranian agriculturalists and South Asian hunter-gatherers. We call this group Indus Periphery because they were found at sites in cultural contact with the Indus Valley Civilization (IVC) and along its northern fringe, and also because they were genetically similar to post-IVC groups in the Swat Valley of Pakistan.
By co-analyzing ancient DNA and genomic data from diverse present-day South Asians, we show that Indus Periphery-related people are the single most important source of ancestry in South Asia (consistent with the idea that the Indus Periphery individuals are providing us with the first direct look at the ancestry of peoples of the IVC), and we develop a model for the formation of present-day South Asians in terms of the temporally and geographically proximate sources of Indus Periphery-related, Steppe, and local South Asian hunter-gatherer-related ancestry. Our results show how ancestry from the Steppe genetically linked Europe and South Asia in the Bronze Age, and identify the populations that almost certainly were responsible for spreading Indo-European languages across much of Eurasia.


2: Patterns of genetic differentiation and the footprints of historical migrations in the Iberian Peninsula

Clare Bycroft, Ceres Fernández-Rozadilla et al.

40,309 downloads (posted 12 Mar 2018)

Genetic differences within or between human populations (population structure) have been studied using a variety of approaches over many years. Recently there has been an increasing focus on studying genetic differentiation at fine geographic scales, such as within countries. Identifying such structure allows the study of recent population history, and identifies the potential for confounding in association studies, particularly when testing rare, often recently arisen variants. The Iberian Peninsula is linguistically di...


3: Creating a universal SNP and small indel variant caller with deep neural networks

Ryan Poplin, Pi-Chuan Chang et al.

34,679 downloads (posted 14 Dec 2016)

Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual's genome by calling genetic variants present in an individual using billions of short, errorful sequence reads. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome. Here we show that a deep convolutional neural networ...


4: Common methods for fecal sample storage in field studies yield consistent signatures of individual identity in microbiome sequencing data

Ran Blekhman, Karen Tang et al.

33,621 downloads (posted 04 Feb 2016)

Field studies of wild vertebrates are frequently associated with extensive collections of banked fecal samples, which are often collected from known individuals and sometimes also sampled longitudinally across time. Such collections represent unique resources for understanding ecological, behavioral, and phylogenetic effects on the gut microbiome, especially for species of particular conservation concern. However, we do not understand whether sample storage methods confound the ability to investigate interindividual var...


5: Comprehensive integration of single cell data

Tim Stuart, Andrew Butler et al.

22,616 downloads (posted 02 Nov 2018)

Single cell transcriptomics (scRNA-seq) has transformed our ability to discover and annotate cell types and states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, including high-dimensional immunophenotypes, chromatin accessibility, and spatial positioning, a key analytical challenge is to integrate these datasets into a harmonized atlas that can be used to better understand cellular identity and function. Here, we devel...


6: Analysis of protein-coding genetic variation in 60,706 humans

Exome Aggregation Consortium, Monkol Lek et al.

21,445 downloads (posted 30 Oct 2015)

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities generated as part of the Exome Aggregation Consortium (ExAC). The resulting catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of w...


7: A guide to performing Polygenic Risk Score analyses

Shing Wan Choi, Timothy Mak et al.

19,534 downloads (posted 14 Sep 2018)

The application of polygenic risk scores (PRS) has become routine in genetic epidemiological studies. Among a range of applications, PRS are commonly used to assess shared aetiology among different phenotypes and to evaluate the predictive power of genetic data, while they are also now being exploited as part of study design, in which experiments are performed on individuals, or their biological samples (e.g., tissues, cells), at the tails of the PRS distribution and contrasted. As GWAS sample sizes increase and PRS becom...


8: Complemented palindrome small RNAs first discovered from SARS coronavirus

Chang Liu, Ze Chen et al.

19,250 downloads (posted 07 Sep 2017)

In this study, we reported for the first time the existence of complemented palindrome small RNAs (cpsRNAs) and proposed cpsRNAs and palindrome small RNAs (psRNAs) as a novel class of small RNAs. The first discovered cpsRNA UCUUUAACAAGCUUGUUAAAGA from SARS coronavirus named SARS-CoV-cpsR-22 contained 22 nucleotides perfectly matching its reverse complementary sequence. Further sequence analysis supported that SARS-CoV-cpsR-22 originated from bat betacoronavirus. The results of RNAi experiments showed that one 19-nt segm...
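
The palindrome property stated for SARS-CoV-cpsR-22 can be checked directly. The following is a minimal sketch, not the authors' analysis code: it assumes standard Watson-Crick pairing for RNA (A-U, G-C), and the function names `reverse_complement` and `is_complemented_palindrome` are hypothetical, chosen only for illustration. The sequence is copied from the abstract above.

```python
# Watson-Crick base pairing for RNA (assumed; no wobble pairs considered).
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (hypothetical helper)."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def is_complemented_palindrome(rna):
    """True if the sequence reads the same as its own reverse complement."""
    return rna == reverse_complement(rna)


# Sequence of SARS-CoV-cpsR-22 as given in the abstract.
CPSR_22 = "UCUUUAACAAGCUUGUUAAAGA"

if __name__ == "__main__":
    print(len(CPSR_22), is_complemented_palindrome(CPSR_22))
```

A sequence with this property can, in principle, base-pair with a second copy of itself along its full length, which is what distinguishes it from an ordinary (non-complemented) palindrome.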


9: Quantitative analysis of population-scale family trees using millions of relatives

Joanna Kaplanis, Assaf Gordon et al.

18,433 downloads (posted 07 Feb 2017)

Family trees have vast applications in multiple fields from genetics to anthropology and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. Here, we collected 86 million profiles from publicly-available online data from genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data ...


10: Comparative analysis of single-cell RNA sequencing methods

Christoph Ziegenhain, Beate Vieth et al.

15,924 downloads (posted 05 Jan 2016)

Background: Single-cell RNA sequencing (scRNA‑seq) offers exciting possibilities to address biological and medical questions, but a systematic comparison of recently developed protocols is still lacking. Results: We generated data from 447 mouse embryonic stem cells using Drop‑seq, SCRB‑seq, Smart‑seq (on Fluidigm C1) and Smart‑seq2 and analyzed existing data from 35 mouse embryonic stem cells prepared with CEL‑seq. We find that Smart‑seq2 is the most sensitive method as it detects the most genes per cell and across cel...


11: Mass-spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation

Bogdan Budnik, Ezra Levy et al.

15,404 downloads (posted 24 Jan 2017)

Cellular heterogeneity is important to biological processes, including cancer and development. However, proteome heterogeneity is largely unexplored because of the limitations of existing methods for quantifying protein levels in single cells. To alleviate these limitations, we developed Single Cell ProtEomics by Mass Spectrometry (SCoPE-MS), and validated its ability to identify distinct human cancer cell types based on their proteomes. We used SCoPE-MS to quantify over a thousand proteins in differentiating mouse embr...


12: Vibrio natriegens, a new genomic powerhouse

Henry H Lee, Nili Ostrov et al.

15,138 downloads (posted 12 Jun 2016)

Recombinant DNA technology has revolutionized biomedical research with continual innovations advancing the speed and throughput of molecular biology. Nearly all these tools, however, are reliant on Escherichia coli as a host organism, and its slow growth rate increasingly dominates experimental time. Here we report the development of Vibrio natriegens, a free-living bacterium with the fastest generation time known, into a genetically tractable host organism. We systematically characterize its growth properties to esta...


13: Regulation of Life Span by the Gut Microbiota in The Short-Lived African Turquoise Killifish

Patrick Smith, David Willemsen et al.

14,780 downloads (posted 27 Mar 2017)

Gut bacteria occupy the interface between the organism and the external environment, contributing to homeostasis and disease. Yet, the causal role of the gut microbiota during host aging is largely unexplored. Here, using the African turquoise killifish (Nothobranchius furzeri), a naturally short-lived vertebrate, we show that the gut microbiota plays a key role in modulating vertebrate life span. Recolonizing the gut of middle-age individuals with bacteria from young donors resulted in life span extension and delayed b...


14: The Beaker Phenomenon And The Genomic Transformation Of Northwest Europe

Iñigo Olalde, Selina Brace et al.

14,548 downloads (posted 09 May 2017)

Bell Beaker pottery spread across western and central Europe beginning around 2750 BCE before disappearing between 2200-1800 BCE. The mechanism of its expansion is a topic of long-standing debate, with support for both cultural diffusion and human migration. We present new genome-wide ancient DNA data from 170 Neolithic, Copper Age and Bronze Age Europeans, including 100 Beaker-associated individuals. In contrast to the Corded Ware Complex, which has previously been identified as arriving in central Europe following mig...


15: Highly parallel direct RNA sequencing on an array of nanopores

Daniel R Garalde, Elizabeth A Snell et al.

14,287 downloads (posted 12 Aug 2016)

Ribonucleic acid sequencing can allow us to monitor the RNAs present in a sample. This enables us to detect the presence and nucleotide sequence of viruses, or to build a picture of how active transcriptional processes are changing -- information that is useful for understanding the status and function of a sample. Nanopore-based sequencing technology is capable of electronically analysing a sample's DNA directly, and in real-time. In this manuscript we demonstrate the ability of an array of nanopores to sequence RNA di...


16: RNA velocity in single cells

Gioele La Manno, Ruslan Soldatov et al.

13,272 downloads (posted 19 Oct 2017)

RNA abundance is a powerful indicator of the state of individual cells, but does not directly reveal dynamic processes such as cellular differentiation. Here we show that RNA velocity - the time derivative of RNA abundance - can be estimated by distinguishing unspliced and spliced mRNAs in standard single-cell RNA sequencing protocols. We show that RNA velocity is a vector that predicts the future state of individual cells on a timescale of hours. We validate the accuracy of RNA velocity in the neural crest lineage, dem...


17: When to use Quantile Normalization?

Stephanie C Hicks, Rafael A. Irizarry

13,130 downloads (posted 04 Dec 2014)

Normalization and preprocessing are essential steps for the analysis of high-throughput data including next-generation sequencing and microarrays. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation from noisy data. These methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Transforming the data to remove these differences has the potential to remove interesting biologically d...
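
For readers unfamiliar with the method named above, quantile normalization can be sketched in a few lines. This is a textbook illustration and not code from the preprint: the function name `quantile_normalize` is hypothetical, samples are assumed to have equal length, and ties are broken by position rather than averaged as some implementations do.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length samples (lists of numbers).

    Each value is replaced by the mean, across samples, of the values that
    share its rank, so every sample ends up with the same distribution.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same length")

    # Mean of the k-th smallest value across all samples, for each rank k.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(col[k] for col in sorted_samples) / len(samples) for k in range(n)
    ]

    normalized = []
    for s in samples:
        # Indices of s ordered by value; ties keep their original order.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

Because every sample is forced onto the same set of values, any genuine global shift between samples is erased along with the technical one, which is the risk the abstract raises.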


18: Nanopore sequencing and assembly of a human genome with ultra-long reads

Miten Jain, S Koren et al.

12,955 downloads (posted 20 Apr 2017)

Nanopore sequencing is a promising technique for genome sequencing due to its portability, its ability to sequence long reads from single molecules, and its ability to simultaneously assay DNA methylation. However, until recently, nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/CEPH human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (~30x theor...


19: An Atlas of Genetic Correlations across Human Diseases and Traits

Brendan Bulik-Sullivan, Hilary K Finucane et al.

12,875 downloads (posted 27 Jan 2015)

Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique for estimating genetic correlation that requires only GWAS summar...


20: Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes

Konrad Karczewski, Laurent C Francioli et al.

12,643 downloads (posted 28 Jan 2019)

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample size...