A Comprehensive Understanding of Code-Mixed Language Semantics Using Hierarchical Transformer

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (2024)

Abstract
Being a popular mode of text-based communication in multilingual communities, code-mixing in online social media has become an important subject of study. Learning the semantics and morphology of code-mixed language remains a key challenge due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character-, subword-, and word-level embeddings, which aid in learning meaningful correlations. In this article, we explore a hierarchical transformer (HIT)-based architecture to learn the semantics of code-mixed languages. HIT consists of multiheaded self-attention (MSA) and outer product attention components that simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across six Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu, and Malayalam) and Spanish on nine tasks over 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models on 13 datasets across eight tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling (MLM)-based pretraining, zero-shot learning (ZSL), and transfer learning approaches. Our empirical results show that the pretraining objectives significantly improve performance on downstream tasks.
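The abstract names HIT's core components, MSA and outer product attention, without giving the formulation. The sketch below is only a plausible PyTorch illustration of fusing the two signals, not the authors' implementation: the class name, the bilinear readout of the outer product (scoring a token pair as q_i^T W k_j, a learned linear readout of the outer product q_i k_j^T), and the scalar gate are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedAttentionSketch(nn.Module):
    """Mixes standard multiheaded self-attention with an outer-product-style
    attention score. Illustrative only; not the paper's exact formulation."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Bilinear form W: scoring q_i^T W k_j is a learned linear readout of
        # the outer product q_i k_j^T (hypothetical stand-in for HIT's OPA).
        self.W = nn.Parameter(torch.randn(d_model, d_model) / d_model ** 0.5)
        # Learnable gate balancing the two attention outputs (assumed).
        self.gate = nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        msa_out, _ = self.msa(x, x, x)

        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # scores[b, i, j] = q[b, i]^T W k[b, j], i.e., the outer product of
        # q_i and k_j flattened and projected to a scalar.
        scores = torch.einsum("bid,de,bje->bij", q, self.W, k)
        opa_out = F.softmax(scores / q.size(-1) ** 0.5, dim=-1) @ v

        g = torch.sigmoid(self.gate)
        return g * msa_out + (1.0 - g) * opa_out

# Usage: one fused-attention layer over a batch of token embeddings.
layer = FusedAttentionSketch(d_model=64, n_heads=4)
out = layer(torch.randn(2, 10, 64))  # -> shape (2, 10, 64)
```

In a hierarchical setup such as the one the abstract describes, layers like this would be stacked at the character, subword, and word levels, with each level's output feeding the next.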
Keywords
Task analysis, Semantics, Transformers, Representation learning, Vectors, Tagging, Machine translation, Code-mixed classification, hierarchical attention, representation learning, zero-shot learning (ZSL)