FoldToken: Learning Protein Language via Vector Quantization and Beyond
CoRR(2024)
Abstract
Is there a foreign language describing protein sequences and structures
simultaneously? Protein structures, represented as continuous 3D points, have
long been hard to model jointly with sequences, which are discrete and follow
a different modeling paradigm. We introduce FoldTokenizer to represent protein
sequence-structure as discrete symbols. The approach projects
residue types and structures into a discrete space, guided by a
reconstruction loss for information preservation. We refer to the learned
discrete symbols as FoldToken, and the sequence of FoldTokens serves
as a new protein language, transforming the protein sequence-structure into a
unified modality. We apply the resulting protein language to general backbone
inpainting and antibody design tasks, building the first GPT-style model
(FoldGPT) for sequence-structure co-generation with promising results.
Key to these results is a substantially improved vector quantization
module, Soft Conditional Vector Quantization (SoftCVQ).
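
To make the idea of projecting continuous features into a discrete space concrete, here is a minimal sketch of plain nearest-neighbor vector quantization. It is not the paper's SoftCVQ module, whose details the abstract does not give. The names `quantize`, `reconstruction_loss`, `codebook`, and `residue_embeddings`, along with the toy values, are assumptions chosen for illustration.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Assign each embedding row to its nearest codebook vector.

    Returns the token indices and the corresponding codebook vectors.
    This is plain vector quantization, used here only as an illustration.
    """
    # Pairwise squared Euclidean distances, shape (n_embeddings, n_codes).
    diff = embeddings[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    tokens = np.argmin(dist, axis=1)   # discrete symbols
    quantized = codebook[tokens]       # continuous vectors recovered from the symbols
    return tokens, quantized

def reconstruction_loss(original, reconstructed):
    """Mean squared error between the inputs and their quantized versions."""
    return float(np.mean((original - reconstructed) ** 2))

# Hypothetical toy codebook: 8 codes in 4 dimensions, code i = [i, i, i, i].
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# Hypothetical per-residue embeddings lying close to codes 3, 1, 3 and 7.
residue_embeddings = codebook[[3, 1, 3, 7]] + 0.1

tokens, quantized = quantize(residue_embeddings, codebook)
loss = reconstruction_loss(residue_embeddings, quantized)
print(tokens.tolist(), loss)
```

In this sketch the token sequence plays the role of the discrete protein language, and the reconstruction loss measures how much information the discretization discards.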