Rxivist logo

GaKCo: a Fast GApped k-mer string Kernel using COunting

By Ritambhara Singh, Arshdeep Sekhon, Jack Lanchantin, Kamran Kowsari, Beilun Wang, Yanjun Qi

Posted 24 May 2018
bioRxiv DOI: 10.1101/329425

String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size (Σ) or allow more mismatches (M). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings resulting in a time cost proportional to O(ΣM). We propose a fast algorithm for calculating Gapped k-mer Kernel using Counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger Σ and M, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the ΣM term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets), respectively. GaKCo is shared as an open source tool at https://github.com/QData/GaKCo-SVM .

Download data

  • Downloaded 201 times
  • Download rankings, all-time:
    • Site-wide: 68,782 out of 89,036
    • In bioinformatics: 7,061 out of 8,413
  • Year to date:
    • Site-wide: 85,639 out of 89,036
  • Since beginning of last month:
    • Site-wide: 80,297 out of 89,036

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide


Sign up for the Rxivist weekly newsletter! (Click here for more details.)