Data structures based on k-mers for querying large collections of sequencing datasets

By Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, Rayan Chikhi

Posted 06 Dec 2019
bioRxiv DOI: 10.1101/866756

High-throughput sequencing datasets are usually deposited in public repositories, e.g. the European Nucleotide Archive, to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not support online sequence searches; yet such a feature would be highly useful to investigators. Towards this goal, several computational approaches have been introduced in the last few years to index and query large collections of datasets. Here we propose an accessible survey of these approaches, which are generally based on representing datasets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.

Competing Interest Statement: The authors have declared no competing interest.

Download data

  • Downloaded 1,453 times
  • Download rankings, all-time:
    • Site-wide: 7,443 out of 103,809
    • In bioinformatics: 1,257 out of 9,474
  • Year to date:
    • Site-wide: 4,164 out of 103,809
  • Since beginning of last month:
    • Site-wide: 4,919 out of 103,809

