The goal of strain-aware genome assembly is to reconstruct all individual haplotypes from a mixed sample at the strain level and to provide abundance estimates for the strains. Given that the use of a reference genome can introduce significant biases, de novo approaches are most suitable for this task. So far, reference-genome-independent assemblers have been shown to reconstruct haplotypes for mixed samples of limited complexity and genomes not exceeding 10000 bp in length. Here, we present VG-Flow, a de novo approach that enables full-length haplotype reconstruction from pre-assembled contigs of complex mixed samples. Our method increases contiguity of the input assembly and, at the same time, it performs haplotype abundance estimation. VG-Flow is the first approach to require polynomial, and not exponential runtime in terms of the underlying graphs. Since runtime increases only linearly in the length of the genomes in practice, it enables the reconstruction also of genomes that are longer by orders of magnitude, thereby establishing the first de novo solution to strain-aware full-length genome assembly applicable to bacterial sized genomes. VG-Flow is based on the flow variation graph as a novel concept that both captures all diversity present in the sample and enables to cast the central contig abundance estimation problem as a flow-like, polynomial time solvable optimization problem. As a consequence, we are in position to compute maximal-length haplotypes in terms of decomposing the resulting flow efficiently using a greedy algorithm, and obtain accurate frequency estimates for the reconstructed haplotypes through linear programming techniques. Benchmarking experiments show that our method outperforms state-of-the-art approaches on mixed samples from short genomes in terms of assembly accuracy as well as abundance estimation. Experiments on longer, bacterial sized genomes demonstrate that VG-Flow is the only current approach that can reconstruct full-length haplotypes from mixed samples at the strain level in human-affordable runtime.
- Downloaded 1,431 times
- Download rankings, all-time:
- Site-wide: 9,557 out of 116,126
- In bioinformatics: 1,295 out of 9,552
- Year to date:
- Site-wide: 5,187 out of 116,126
- Since beginning of last month:
- Site-wide: 21,938 out of 116,126
Downloads over time
Distribution of downloads per paper, site-wide
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!