Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects
Allison A Regier,
Hyun Min Kang,
Daniel P Howrigan,
Darren C Ames,
Adam C English,
the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program,
Goncalo R. Abecasis,
Michael C. Zody,
Benjamin M Neale,
Ira M Hall
Posted 22 Feb 2018
bioRxiv DOI: 10.1101/269316 (published DOI: 10.1038/s41467-018-06159-4)
Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years to interrogate a broad range of traits, across diverse populations. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power for trait mapping, and will enable studies of genome biology, population genetics and genome function at unprecedented scale. A central challenge for joint analysis is that different WGS data processing and analysis pipelines cause substantial batch effects in combined datasets, necessitating computationally expensive reprocessing and harmonization prior to variant calling. This approach is no longer tenable given the scale of current studies and data volumes. Here, in a collaboration across multiple genome centers and NIH programs, we define WGS data processing standards that allow different groups to produce "functionally equivalent" (FE) results suitable for joint variant calling with minimal batch effects. Our approach promotes broad harmonization of upstream data processing steps, while allowing for diverse variant callers. Importantly, it allows each group to continue innovating on data processing pipelines, as long as results remain compatible. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results — including single nucleotide (SNV), insertion/deletion (indel) and structural variation (SV) — and produce significantly less variability than sequencing replicates. Residual inter-pipeline variability is concentrated at low quality sites and repetitive genomic regions prone to stochastic effects. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for broad data sharing and community-wide "big-data" human genetics studies.