A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification

Guillaume Lefebvre, Haytham Elghazel, Theodore Guillet, Alexandre Aussem, Matthieu Sonnati

Data & Knowledge Engineering (2024)

Abstract
In recent years, Natural Language Processing (NLP) has advanced significantly thanks to general-purpose language embeddings, enabling breakthroughs in tasks such as semantic similarity and text classification. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such settings, and on domain-specific texts such as those of the education and professional training domain, general language embedding models often fail to represent the terminology and contextual nuances of the specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This framework chains multiple classifiers, each built on a novel sentence-embedding method, BERTEPro, which is based on existing Transformer models whose pre-training has been extended on education and professional training texts before being fine-tuned on several NLP tasks. Each classifier is responsible for the predictions at a given hierarchical level and propagates its local probability predictions, augmented with the input feature vectors, to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments on three domain-specific textual HMC datasets indicate that HMCCCProbT compares favorably with state-of-the-art HMC algorithms in terms of classification accuracy, and that BERTEPro yields better probability predictions, well suited to HMCCCProbT, than three other vector representation techniques.
Keywords
NLP, Transformers, Sentence similarity, Sentence embedding, Education and professional training domain, Information retrieval, Classification, Hierarchical Multi-label Classification
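
The chaining idea described in the abstract, one classifier per hierarchy level, each passing its probability predictions together with the input features to the next level, can be illustrated with a minimal sketch. This is not the authors' implementation of HMCCCProbT: the names `LevelClassifier` and `ProbabilityChain`, the one-vs-rest logistic regression used at each level, the choice to forward only the immediately preceding level's probabilities, and the 2-dimensional toy vectors standing in for BERTEPro sentence embeddings are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LevelClassifier:
    """Hypothetical per-level model: one-vs-rest logistic regression (assumption)."""

    def __init__(self, n_features, n_labels, lr=0.5, epochs=300, seed=0):
        rng = random.Random(seed)
        # One weight vector per label; the last entry of each vector is the bias.
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(n_features + 1)]
                  for _ in range(n_labels)]
        self.lr, self.epochs = lr, epochs

    def predict_proba(self, x):
        # zip stops at len(x), so the bias w[-1] is added separately.
        return [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
                for w in self.w]

    def fit(self, X, Y):
        # Plain stochastic gradient descent on the logistic loss.
        for _ in range(self.epochs):
            for x, y in zip(X, Y):
                p = self.predict_proba(x)
                for k, w in enumerate(self.w):
                    g = p[k] - y[k]
                    for j, xj in enumerate(x):
                        w[j] -= self.lr * g * xj
                    w[-1] -= self.lr * g
        return self


class ProbabilityChain:
    """Chain of level classifiers; level l+1 sees features + level-l probabilities."""

    def __init__(self, n_features, labels_per_level):
        self.levels = []
        extra = 0
        for n_labels in labels_per_level:
            self.levels.append(LevelClassifier(n_features + extra, n_labels))
            extra = n_labels  # the next level receives this level's probabilities

    def fit(self, X, Y_levels):
        inputs = X
        for clf, Y in zip(self.levels, Y_levels):
            clf.fit(inputs, Y)
            # Assumption: propagate predicted probabilities, not ground-truth labels.
            probs = [clf.predict_proba(x) for x in inputs]
            inputs = [list(x) + p for x, p in zip(X, probs)]
        return self

    def predict_proba(self, x):
        out, inp = [], list(x)
        for clf in self.levels:
            p = clf.predict_proba(inp)
            out.append(p)
            inp = list(x) + p  # original features augmented with local probabilities
        return out


# Toy data: 2-d vectors stand in for sentence embeddings (assumption).
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
Y1 = [[1, 0], [1, 0], [0, 1], [0, 1]]                          # level 1: A, B
Y2 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # level 2: A1, A2, B1, B2

chain = ProbabilityChain(n_features=2, labels_per_level=[2, 4]).fit(X, [Y1, Y2])
probs = chain.predict_proba([1.0, 1.0])  # one probability list per hierarchy level
```

Here `probs[0]` holds the level-1 probabilities and `probs[1]` the level-2 probabilities; thresholding each list (for example at 0.5) gives a multi-label prediction per level. In practice the per-level model and the embedding would be replaced by whatever classifier and sentence encoder are actually used.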