binomialRF: Interpretable combinatoric efficiency of random forests to identify biomarker interactions

By Samir Rachid Zaim, Colleen Kenost, Joanne Berghout, Wesley Chiu, Liam Wilson, Hao Helen Zhang, Yves A. Lussier

Posted 26 Jun 2019
bioRxiv DOI: 10.1101/681973

Background: In this era of data science-driven bioinformatics, machine learning research has focused on feature selection, as users want more interpretation and post-hoc analyses for biomarker detection. However, when a study has more features (e.g., transcripts) than samples (e.g., mice or human samples), biomarker detection poses major statistical challenges because traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest [1] (RF) classifiers are widely used [2]–[7] due to their flexibility, powerful performance, robustness to “P predictors ≫ subjects N” difficulties, and ability to rank features. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.

Methods: binomialRF treats each tree in an RF as a correlated but exchangeable binary trial. It determines importance by constructing a test statistic based on a feature’s selection frequency, from which it computes the feature’s rank, nominal p-value, and multiplicity-adjusted q-value using a one-sided hypothesis test with a correlated binomial distribution. A distributional adjustment addresses the co-dependencies among trees, which arise because the trees subsample from the same dataset. The proposed algorithm efficiently identifies multiway nonlinear interactions by generalizing the test statistic to count sub-trees.

Results: In simulations and in studies on the Madelon benchmark dataset, binomialRF showed computational gains (30 to 600 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously published, relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as using their statistical interactions alone.

Conclusion: binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.

Availability: GitHub: <https://github.com/SamirRachidZaim/binomialRF>

Supplementary information: Supplementary analyses and results are available at <https://github.com/SamirRachidZaim/binomialRF_simulationStudy>
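
The selection-frequency test summarized in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: it uses an ordinary (independent) binomial distribution in place of the correlated binomial the abstract describes, it takes the null selection probability `p0` as a user-supplied value, and it uses the Benjamini-Hochberg procedure as one possible multiplicity adjustment. The function names, feature names, and counts are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p).

    Assumption: trees are treated as independent trials here; the
    abstract describes a correlated binomial adjustment instead.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted q-values, returned in input order.

    Assumption: BH is used only as an example of a multiplicity adjustment.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def selection_frequency_test(counts, n_trees, p0):
    """Map each feature to (p-value, q-value) from its selection count.

    counts:  number of trees in which each feature was selected
    n_trees: total number of trees in the forest
    p0:      assumed probability of selection in one tree under the null
    """
    names = list(counts)
    pvals = [binomial_upper_tail(counts[name], n_trees, p0) for name in names]
    qvals = benjamini_hochberg(pvals)
    return {name: (p, q) for name, p, q in zip(names, pvals, qvals)}


# Hypothetical example: 100 trees, 4 candidate features, uniform null p0 = 1/4.
counts = {"gene_a": 60, "gene_b": 20, "gene_c": 15, "gene_d": 5}
results = selection_frequency_test(counts, n_trees=100, p0=0.25)
for name, (p, q) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name}: p={p:.3g}, q={q:.3g}")
```

In this sketch a feature selected far more often than `p0` predicts receives a small p-value, and features are ranked by that value. Replacing `binomial_upper_tail` with a tail probability from a correlated binomial model would bring it closer to the approach the abstract describes.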

Download data

  • Downloaded 406 times
  • Download rankings, all-time:
    • Site-wide: 54,382 out of 118,150
    • In bioinformatics: 5,585 out of 9,572
  • Year to date:
    • Site-wide: 39,047 out of 118,150
  • Since beginning of last month:
    • Site-wide: 51,177 out of 118,150

