Rxivist logo

Annotating Gene Ontology terms for protein sequences with the Transformer model

By Dat Duong, Lisa Gai, Ankith Uppunda, Don Le, Eleazar Eskin, Jingyi Jessica Li, Kai-Wei Chang

Posted 02 Feb 2020
bioRxiv DOI: 10.1101/2020.01.31.929604

Predicting functions for novel amino acid sequences is a long-standing research problem. The Uniprot database which contains protein sequences annotated with Gene Ontology (GO) terms, is one commonly used training dataset for this problem. Predicting protein functions can then be viewed as a multi-label classification problem where the input is an amino acid sequence and the output is a set of GO terms. Recently, deep convolutional neural network (CNN) models have been introduced to annotate GO terms for protein sequences. However, the CNN architecture can only model close-range interactions between amino acids in a sequence. In this paper, first, we build a novel GO annotation model based on the Transformer neural network. Unlike the CNN architecture, the Transformer models all pairwise interactions for the amino acids within a sequence, and so can capture more relevant information from the sequences. Indeed, we show that our adaptation of Transformer yields higher classification accuracy when compared to the recent CNN-based method DeepGO. Second, we modify our model to take motifs in the protein sequences found by BLAST as additional input features. Our strategy is different from other ensemble approaches that average the outcomes of BLAST-based andmachine learning predictors. Third, we integrate into our Transformer the metadata about the protein sequences such as 3D structure and protein-protein interaction (PPI) data. We show that such information can greatly improve the prediction accuracy, especially for rare GO labels.

Download data

  • Downloaded 449 times
  • Download rankings, all-time:
    • Site-wide: 56,074
    • In bioinformatics: 5,699
  • Year to date:
    • Site-wide: 28,810
  • Since beginning of last month:
    • Site-wide: 21,587

Altmetric data


Downloads over time

Distribution of downloads per paper, site-wide


PanLingua

Sign up for the Rxivist weekly newsletter! (Click here for more details.)


News