Strain-aware assembly of genomes from mixed samples using flow variation graphs

By Jasmijn A Baaijens, Leen Stougie, Alexander Schönhuth

Posted 24 May 2019
bioRxiv DOI: 10.1101/645721

The goal of strain-aware genome assembly is to reconstruct all individual haplotypes from a mixed sample at the strain level and to provide abundance estimates for the strains. Because the use of a reference genome can introduce significant biases, de novo approaches are most suitable for this task. So far, reference-genome-independent assemblers have been shown to reconstruct haplotypes only for mixed samples of limited complexity and for genomes not exceeding 10,000 bp in length. Here, we present VG-Flow, a de novo approach that enables full-length haplotype reconstruction from pre-assembled contigs of complex mixed samples. Our method increases the contiguity of the input assembly and simultaneously estimates haplotype abundances. VG-Flow is the first approach to require polynomial, rather than exponential, runtime in terms of the underlying graphs. Since in practice runtime increases only linearly with genome length, it also enables the reconstruction of genomes that are longer by orders of magnitude, thereby establishing the first de novo solution to strain-aware full-length genome assembly applicable to bacterial-sized genomes. VG-Flow is based on the flow variation graph, a novel concept that both captures all diversity present in the sample and allows the central contig abundance estimation problem to be cast as a flow-like optimization problem solvable in polynomial time. As a consequence, we can compute maximal-length haplotypes by efficiently decomposing the resulting flow with a greedy algorithm, and we obtain accurate frequency estimates for the reconstructed haplotypes through linear programming techniques. Benchmarking experiments show that our method outperforms state-of-the-art approaches on mixed samples from short genomes in terms of both assembly accuracy and abundance estimation. Experiments on longer, bacterial-sized genomes demonstrate that VG-Flow is the only current approach that can reconstruct full-length haplotypes from mixed samples at the strain level in an affordable runtime.
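
To make the idea of greedily decomposing a flow into paths concrete, the following is a minimal sketch. It is not the VG-Flow implementation: it assumes a small acyclic graph with a single source and sink, uses a "widest path first" heuristic (one common greedy choice, which may differ from the one the authors use), and all function names and the toy graph are hypothetical.

```python
from collections import defaultdict


def widest_path(flow, source, sink):
    """Return (bottleneck, path) for the source-sink path whose smallest
    edge flow is largest, or (0.0, None) if no path with positive flow exists.
    Assumes the graph is acyclic (hypothetical helper, for illustration)."""
    adj = defaultdict(list)
    for (u, v), f in flow.items():
        if f > 0:
            adj[u].append(v)

    memo = {}

    def best(u):
        if u == sink:
            return (float("inf"), [sink])
        if u in memo:
            return memo[u]
        result = (0.0, None)
        for v in adj[u]:
            bottleneck, path = best(v)
            if path is None:
                continue
            bottleneck = min(bottleneck, flow[(u, v)])
            if bottleneck > result[0]:
                result = (bottleneck, [u] + path)
        memo[u] = result
        return result

    return best(source)


def greedy_decompose(flow, source, sink):
    """Repeatedly extract the widest path and subtract its bottleneck
    from the residual flow until no positive-flow path remains."""
    residual = dict(flow)
    paths = []
    while True:
        bottleneck, path = widest_path(residual, source, sink)
        if path is None:
            break
        for u, v in zip(path, path[1:]):
            residual[(u, v)] -= bottleneck
        paths.append((path, bottleneck))
    return paths


# Toy flow on a graph with two variant sites (integer flow values, conserved at each node).
toy_flow = {
    ("s", "a"): 7, ("s", "b"): 3,
    ("a", "c"): 7, ("b", "c"): 3,
    ("c", "d"): 6, ("c", "e"): 4,
    ("d", "t"): 6, ("e", "t"): 4,
}

for path, weight in greedy_decompose(toy_flow, "s", "t"):
    print("-".join(path), weight)
```

In this toy example, each extracted path plays the role of a candidate haplotype and its weight the role of a first abundance estimate. The abstract describes refining the frequencies with linear programming, which this sketch does not attempt.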

Download data

  • Downloaded 1,431 times
  • Download rankings, all-time:
    • Site-wide: 9,557 out of 116,126
    • In bioinformatics: 1,295 out of 9,552
  • Year to date:
    • Site-wide: 5,187 out of 116,126
  • Since beginning of last month:
    • Site-wide: 21,938 out of 116,126
