Enhancing Sensitivity And Controlling False Discovery Rate In Somatic Indel Discovery Using A Latent Variable Model

By Louis J. Dijkstra, Johannes Köster, Tobias Marschall, Alexander Schönhuth

Posted 29 Mar 2017
bioRxiv DOI: 10.1101/121954

Cancer is first and foremost a genetic disorder. Discovery of somatically acquired genetic variants based on next-generation sequencing (NGS) has therefore gained widespread attention. Computational prediction of somatic variants, however, is affected by a variety of confounding factors. In addition to the uncertainties commonly encountered in germline variant prediction, such as misplaced or inaccurate read alignments, cancer heterogeneity and impure samples add significantly to the difficulty. As a result, state-of-the-art indel discovery tools fail to discover somatic indels at operable performance rates, although they perform excellently when calling germline indels. While all size ranges are affected, common and cancer-specific problems interact in particularly unfavorable ways in the prediction of somatic midsize (30-150 bp) insertions and deletions. Here, we present a latent variable model that accounts for the major confounding factors and uncertainties in a unified way. Using this modeling framework, we first demonstrate how to efficiently compute the probability that a (putative) indel is somatic, thereby resolving a principal runtime bottleneck in Bayesian uncertainty quantification. Second, we show how to reliably estimate the allele frequencies for a given list of indels. Third, we present an intuitive and effective way to control the false discovery rate, an issue in genetic variant discovery that has proven notoriously hard to deal with. As a tool that implements all of the methodology developed, we present PROSIC (PROcessing Somatic Indel Calls). When applied to the deletion call sheets provided by prevalent state-of-the-art tools, PROSIC achieves significant improvements, particularly in recall, over those tools' integrated somatic indel calling routines.

Download data

  • Downloaded 454 times
  • Download rankings, all-time:
    • Site-wide: 47,012 out of 116,126
    • In bioinformatics: 5,100 out of 9,552
  • Year to date:
    • Site-wide: 101,872 out of 116,126
  • Since beginning of last month:
    • Site-wide: 104,734 out of 116,126
