Triplet Attention: Rethinking the similarity in Transformers

KDD 2021 (2021)

Cited by 5 | Viewed 58
Abstract
The Transformer model has benefited various real-world applications, where the dot-product self-attention mechanism shows a superior ability to align tokens and build long-range dependencies. However, attending only to pairwise interactions limits further performance improvement on challenging tasks. To the best of our knowledge, this is the first work to define Triplet Attention (A^3) for the Transformer, which introduces triplet connections as a complementary form of dependency. Specifically, we define triplet attention based on the scalar triple product; it can be used interchangeably with canonical attention heads within multi-head attention. This allows the self-attention mechanism to attend to diverse triplets and capture complex dependencies. We then use a permuted formulation and kernel tricks to establish a linear approximation to A^3. The proposed architecture can be smoothly integrated into pre-training by modifying the head configuration. Extensive experiments show that our method achieves significant performance improvements on various tasks across two benchmarks.
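To make the idea concrete, below is a minimal, naive sketch of attention over key pairs rather than single keys, written as a plain O(n^3) computation. It is not the paper's formulation: the element-wise generalization of the scalar triple product, the pairwise merging of values, and the function name `triplet_attention_naive` are all assumptions for illustration, and the linear kernel approximation described in the abstract is not implemented here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def triplet_attention_naive(Q, K, V):
    """Naive O(n^3) triplet-attention sketch (assumed form, not the paper's).

    Each query i is scored against *pairs* of keys (j, k) via a triple product
    s[i, j, k] = sum_d Q[i, d] * K[j, d] * K[k, d], the scores are normalised
    over all (j, k) pairs, and pairwise-combined values are aggregated.
    """
    n, d = Q.shape
    # Triple-product scores between query i and key pair (j, k), scaled by sqrt(d)
    s = np.einsum('id,jd,kd->ijk', Q, K, K) / np.sqrt(d)
    # Softmax over the flattened (j, k) pairs for each query i
    a = softmax(s.reshape(n, -1), axis=-1).reshape(n, n, n)
    # Merge the values of the attended pair; the element-wise product is one
    # plausible (assumed) choice for combining v_j and v_k
    pair_values = np.einsum('jd,kd->jkd', V, V)
    return np.einsum('ijk,jkd->id', a, pair_values)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 16
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    out = triplet_attention_naive(Q, K, V)
    print(out.shape)  # (8, 16)
```

The cubic cost of enumerating all (j, k) pairs is exactly what motivates the linear approximation via the permuted formulation and kernel tricks mentioned in the abstract.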
Keywords
Neural Network, Large-scale Model, Transformer, Self-attention