Learning Representations for Gene Ontology Terms by Contextualized Text Encoders

bioRxiv (2019)

Citations: 8 | Views: 39
Abstract
Protein functions are annotated with Gene Ontology (GO) terms. Because new sequences are being collected faster than they are being annotated with GO terms, there have been efforts to develop better annotation techniques. When annotating protein sequences with GO terms, one key auxiliary resource is the GO data itself. GO terms have definitions consisting of a few sentences that describe biological events, and they are arranged in a tree structure in which specific terms are child nodes of more generic terms. The definitions and positions of the GO terms on the GO tree can be used to construct vector representations for the GO terms. These GO vectors can then be integrated into existing prediction models to improve classification accuracy. In this paper, we adapt Bidirectional Encoder Representations from Transformers (BERT) to encode GO definitions into vectors. We evaluate BERT against previous GO encoders in three tasks: (1) measuring similarity between GO terms, (2) asserting relationships between orthologs and between interacting proteins based on their GO annotations, and (3) predicting GO terms for protein sequences. For task 3, we show that using GO vectors as additional prediction features increases accuracy, primarily for GO terms with low occurrences in the manually annotated dataset. In all three tasks, BERT often outperforms the previous GO encoders.
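
As an illustration of task 1, the sketch below turns term definitions into vectors and compares them. It is a minimal sketch under stated assumptions, not the method of the paper: a toy bag-of-words encoder stands in for BERT, cosine similarity is assumed as the similarity measure, and the function names (`encode_definition`, `cosine_similarity`), the term labels, and the definition texts are all hypothetical.

```python
import math
from collections import Counter


def encode_definition(text):
    """Toy stand-in for a text encoder (NOT BERT): a sparse word-count vector."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return Counter(tokens)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical definitions written for this example, not quoted from GO.
definitions = {
    "term_a": "The process of repairing damaged DNA in the cell.",
    "term_b": "The process of replicating DNA in the cell.",
    "term_c": "Transport of glucose across a membrane.",
}

# One vector per term, built from its definition text.
vectors = {name: encode_definition(text) for name, text in definitions.items()}

sim_ab = cosine_similarity(vectors["term_a"], vectors["term_b"])
sim_ac = cosine_similarity(vectors["term_a"], vectors["term_c"])
print(f"term_a vs term_b: {sim_ab:.3f}")
print(f"term_a vs term_c: {sim_ac:.3f}")
```

With these toy inputs, the two DNA-related definitions score higher than the unrelated pair. A contextualized encoder would replace `encode_definition` with a function that returns a dense vector, and the comparison step would stay the same.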