Strain Level Microbial Detection and Quantification with Applications to Single Cell Metagenomics
By
Kaiyuan Zhu,
Welles Robinson,
Alejandro A. Schäffer,
Junyan Xu,
Eytan Ruppin,
A. Funda Ergun,
Yuzhen Ye,
S. Cenk Sahinalp
Posted 14 Jun 2020
bioRxiv DOI: 10.1101/2020.06.12.149245
The identification and quantification of microbial abundance at the species or strain level from sequencing data is crucial for our understanding of human health and disease. Existing approaches for microbial abundance estimation rely either on accurate but computationally expensive alignment-based methods for species-level estimation, or on faster but less accurate alignment-free methods that fail to classify many reads accurately at the species or strain level. Here we introduce CAMMiQ, a novel combinatorial solution to the microbial identification and abundance estimation problem, which performs better than the best used tools on simulated and real datasets with respect to the number of correctly classified reads (i.e., specificity) by an order of magnitude and resolves possible mixtures of similar genomes. As we demonstrate, CAMMiQ can distinguish between single cells deliberately infected with distinct Salmonella strains and sequenced using scRNA-seq reads better than alternative approaches can. We also show that CAMMiQ is more accurate than the best used approaches on a variety of synthetic genomic read data involving some of the most challenging bacterial genomes derived from the NCBI RefSeq database; it can distinguish not only distinct species but also closely related strains of bacteria. The key methodological innovation of CAMMiQ is its use of arbitrary-length, doubly-unique substrings, i.e., substrings that appear in exactly two genomes in the input database, instead of fixed-length, unique substrings. To resolve the ambiguity in the genomic origin of doubly-unique substrings, CAMMiQ employs a combinatorial optimization formulation, which can be solved surprisingly quickly. CAMMiQ's index consists of a sparsified subset of the shortest unique and doubly-unique substrings of each genome in the database, within a user-specified length range, and as such it is fairly compact.
In short, CAMMiQ offers more accurate genomic identification and abundance estimation than the best used alternatives while using similar computational resources.
Availability: <https://github.com/algo-cancer/CAMMiQ>
Competing Interest Statement: The authors have declared no competing interest.
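To make the notion of unique and doubly-unique substrings from the abstract concrete, here is a minimal brute-force sketch on toy sequences. It is not CAMMiQ's index construction: the function names, the toy genomes, and the length range are all hypothetical, reverse complements are ignored, and no sparsification or "shortest substring" selection is attempted. It only shows which substrings occur in exactly one genome and which occur in exactly two.

```python
from collections import defaultdict


def substring_genome_sets(genomes, min_len, max_len):
    """Map every substring with length in [min_len, max_len] to the set of
    genome names that contain it (brute force; toy inputs only)."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                owners[seq[start:start + length]].add(name)
    return owners


def classify_substrings(genomes, min_len, max_len):
    """Split substrings into unique (exactly one genome) and
    doubly-unique (exactly two genomes)."""
    owners = substring_genome_sets(genomes, min_len, max_len)
    unique = {s: next(iter(g)) for s, g in owners.items() if len(g) == 1}
    doubly_unique = {s: tuple(sorted(g)) for s, g in owners.items() if len(g) == 2}
    return unique, doubly_unique


# Hypothetical toy "genomes"; real genomes are millions of bases long.
toy_genomes = {
    "A": "ACGTAC",
    "B": "ACGTTG",
    "C": "TTGCAA",
}

unique, doubly_unique = classify_substrings(toy_genomes, min_len=3, max_len=4)

for substring, pair in sorted(doubly_unique.items()):
    print(substring, "is shared by exactly", pair)
```

In this toy case a read containing `TTG` could come from genome B or genome C but not from A. That is the kind of ambiguity the abstract says CAMMiQ resolves with its combinatorial optimization step, which this sketch does not cover.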