De novo assembly of two Swedish genomes reveals missing segments from the human GRCh38 reference and improves variant calling of population-scale sequencing data
Posted 18 Feb 2018
bioRxiv DOI: 10.1101/267062 (published DOI: 10.3390/genes9100486)
We have performed de novo assembly of two Swedish genomes using long-read sequencing and optical mapping, resulting in total assembly sizes of nearly 3 Gb and hybrid scaffold N50 values of over 45 Mb. Further analysis revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have elevated GC content and are primarily located in centromeric or telomeric regions. A BLAST search showed that 31% of the NS differ from all sequences deposited in nucleotide databases. The remaining NS correspond to human (62%) or primate (6%) nucleotide entries, while 1% of hits show the highest similarity to other species, including mouse and a few classes of parasitic worms. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 on chromosomes 14, 17 and 21. Including these novel sequences in the GRCh38 reference radically improves the alignment and variant calling of whole-genome sequencing data at several genomic loci. Through a re-analysis of 200 samples from a Swedish population-scale sequencing project, we obtained over 75,000 putative novel SNVs per individual when using a custom version of GRCh38 extended with 17.3 Mb of NS. In addition, the re-analysis removed about 10,000 false positive SNV calls per individual from the GRCh38 autosomes and sex chromosomes, some of them located in protein-coding regions.
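The abstract reports assembly contiguity as a hybrid scaffold N50. For readers unfamiliar with the metric, the sketch below shows how N50 is conventionally computed: it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the paper or from its assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp).

    Conventional definition: sort scaffolds from longest to shortest and
    accumulate their lengths; N50 is the length of the scaffold at which
    the running sum first reaches half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running against total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical toy scaffold lengths, NOT data from the paper.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

With real assemblies, the lengths would come from the scaffold FASTA records, so a reported N50 of over 45 Mb means that at least half of the assembled bases lie in scaffolds of at least that length.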