Hydropathy and Conformational Similarity-Based Distributed Representation of Protein Sequences for Properties Prediction

SN Computer Science (2021)

Abstract
In the natural language processing (NLP) community, conventional features such as TF-IDF are commonly employed for text mining and other applications, but they lack semantic and syntactic information. Researchers in text mining found that distributed representations of words can capture this information and increase the predictive power of algorithms. Word2Vec, which learns word embeddings from text, is a very popular distributed representation in NLP tasks. Recently, researchers applied such distributed representations, namely ProtVec, to various protein function annotation tasks with considerable success. In this work, we develop reduced protein alphabet representations using two different reduction schemes and evaluate them on four regression tasks. Using all annotated Swiss-Prot sequences, we extract embedding vectors with skip-gram models across different embedding vector sizes, k-mer sizes, and context window sizes. We then use these vectors as input to Support Vector Machine regression to build regression models. In this way we build seven models: the original ProtVec model, a hydropathy-based reduced alphabet model, a conformational similarity-based reduced alphabet model, and all possible combinations of these three. The performance improvement on the absorption and enantioselectivity tasks indicates that grouping the amino acid alphabet on an appropriate basis can play a major role in enhancing algorithm capabilities. Our results on all four tasks indicate that individual reduced alphabet representations and certain synergistic combinations can considerably increase prediction performance. The new method offers multiple advantages, including improved semantic/syntactic information and more compact reduced representations. It can also provide important domain information that can be used in further experiments to develop sequences with desired properties.
Keywords
ProtVec, RA2Vec, SVM, Protein property predictions
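
The preprocessing described in the abstract (map each residue to a reduced alphabet, split the sequence into k-mers, then combine the resulting models) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the three-group hydropathy mapping is hypothetical and is not the grouping used in the paper, the shifted non-overlapping k-mer split follows the convention commonly associated with ProtVec, and the helper names `reduce_sequence`, `kmer_sentences`, and `model_combinations` are invented. Skip-gram training and SVM regression are not shown.

```python
from itertools import combinations

# ASSUMPTION: an illustrative three-group hydropathy alphabet. The abstract
# does not list the paper's actual groupings; these are placeholders.
HYDROPATHY_GROUPS = {
    "H": "AVLIMFWC",  # treated here as hydrophobic
    "N": "GSTPYH",    # treated here as neutral
    "P": "DENQKR",    # treated here as hydrophilic
}

# Invert the grouping into a residue -> group-symbol lookup table.
RESIDUE_TO_GROUP = {
    residue: symbol
    for symbol, members in HYDROPATHY_GROUPS.items()
    for residue in members
}


def reduce_sequence(seq, mapping=RESIDUE_TO_GROUP, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues missing from the mapping become `unknown`.
    """
    return "".join(mapping.get(residue, unknown) for residue in seq.upper())


def kmer_sentences(seq, k=3):
    """Split a sequence into k 'sentences' of non-overlapping k-mers.

    Sentence number `offset` starts reading at position `offset`, so the
    k sentences together cover every reading frame.
    """
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


def model_combinations(names):
    """Enumerate every non-empty combination of the given base models."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


if __name__ == "__main__":
    reduced = reduce_sequence("MKVLAAGIT")
    print(reduced)
    print(kmer_sentences(reduced, k=3))
    print(model_combinations(["protvec", "hydropathy", "conformational"]))
```

The k-mer sentences produced this way would serve as the corpus for a skip-gram trainer, and three base representations yield the seven models mentioned in the abstract (three single, three pairwise, one triple). How the combined models merge their embeddings is not stated in the abstract, so that step is left out.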