A robust benchmark for germline structural variant detection
Justin M. Zook,
Nancy F. Hansen,
Nathan D. Olson,
Lesley M. Chapman,
James C. Mullikin,
Adam M. Phillippy,
Paul C. Boutros,
Sayed Mohammad E. Sahraeian,
Ian T. Fiddes,
Alvaro Martinez Barrio,
Oscar L. Rodriguez,
Shaun D. Jackman (ORCID: 0000-0002-9275-5966),
John J. Farrell,
Aaron M. Wenger,
Michael C. Schatz,
Adam C. English,
Jeffrey A. Rosenfeld,
Ryan E. Mills,
Jay M. Sage,
Jennifer R. Davis,
Michael D. Kaiser,
John S. Oliver,
Anthony P. Catalano,
Fritz J. Sedlazeck,
the Genome in a Bottle Consortium
Posted 09 Jun 2019
bioRxiv DOI: 10.1101/664623
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12,745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least two technologies or five callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and include 9,641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90% of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3%, and genotype concordance with manual curation was >98.7%. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
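The inclusion criterion described above (a call must be discovered by at least two technologies or five callsets, and genotyped as heterozygous or homozygous-alternate by long reads) can be sketched as a simple predicate. This is an illustrative sketch only; the function and field names are hypothetical and not taken from the GIAB pipeline.

```python
def passes_benchmark_filter(technologies, callsets, long_read_genotype):
    """Illustrative sketch of the discovery/genotype criterion from the
    abstract. All names here are hypothetical, not GIAB's implementation.

    technologies: set of sequencing technologies that discovered the call
    callsets: set of callsets that discovered the call
    long_read_genotype: genotype assigned from long reads, e.g. "0/1"
    """
    discovered = len(technologies) >= 2 or len(callsets) >= 5
    genotyped = long_read_genotype in {"0/1", "1/1"}  # het or hom-alt
    return discovered and genotyped

# A deletion seen by two technologies and genotyped heterozygous passes;
# one seen by a single technology and only three callsets does not.
print(passes_benchmark_filter({"pacbio", "10x"}, {"cs1", "cs2"}, "0/1"))
print(passes_benchmark_filter({"pacbio"}, {"cs1", "cs2", "cs3"}, "0/1"))
```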
- Downloaded 5,295 times
- Download rankings, all-time:
- Site-wide: 816 out of 103,749
- In genomics: 156 out of 6,382
- Year to date:
- Site-wide: 851 out of 103,749
- Since beginning of last month:
- Site-wide: 3,977 out of 103,749