Rxivist logo

Mitigating Data Scarcity in Protein Binding Prediction Using Meta-Learning

By Yunan Luo, Jianzhu Ma, Xiaoming Zhao, Yufeng Su, Yang Liu, Trey Ideker, Jian Peng

Posted 12 Jan 2019
bioRxiv DOI: 10.1101/519413

A plethora of biological functions are performed through various types of protein-peptide binding. Prime examples include the protein kinase phosphorylation on peptide substrates and the binding of major histocompatibility complex to neoantigens in the immune system. Understanding the specificity of protein-peptide interactions is critical for unraveling the architectures of functional pathways and the mechanisms of cellular processes in human cells. Despite mass-spectrometric techniques were developed for the identification of protein-peptide interactions, our understanding of the preferences of proteins on their binding peptides is still rudimentary. As a complementary direction, a line of computational prediction methods has been recently proposed to predict protein-peptide bindings which efficiently provide rich functional annotations on a large scale. To achieve a high prediction accuracy, these computational methods require a sufficient amount of data to build the prediction model. However, the number of experimentally verified protein-peptide bindings is often limited in real cases. For example, a majority of protein kinases have very few experimentally verified phosphorylation sites (e.g., less than 30 sites) in existing databases. These methods are thus limited to building accurate prediction models for only well-characterized proteins with a large volume of known binding peptides and cannot be extended to predict new binding peptides for less-studied proteins. In this paper, we introduce a generic framework to address this issue of data scarcity in protein binding prediction. We demonstrate the applicability of our framework in predicting kinase-specific phosphorylation sites. Our method uses an effective training strategy to build a prediction model with robust transferability. The model is able to predict the phosphorylation sites of a less-studied kinase, even if there is only a small number of phosphorylation sites known for this kinase. To achieve this, we train the model via a meta-learning phase followed by a few-shot learning phase. We demonstrate our framework has better transferability than state-of-the-art methods and is effective in utilizing limited data to accurately predict phosphorylation sites for less-characterized kinases. The implementation of our framework is available at https://github.com/luoyunan/MetaKinase.

Download data

  • Downloaded 986 times
  • Download rankings, all-time:
    • Site-wide: 22,995
    • In bioinformatics: 2,676
  • Year to date:
    • Site-wide: 46,833
  • Since beginning of last month:
    • Site-wide: 32,799

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide