MCCD: Generating Human Natural Language Conversational Datasets

ICEIS: Proceedings of the 24th International Conference on Enterprise Information Systems - Vol 2 (2022)

Abstract
In recent years, state-of-the-art problems in Natural Language Processing (NLP) have been extensively explored, leading to better models for text generation and text understanding. These solutions depend heavily on data, such as dialogues, to train models. The lack of data in a specific language significantly limits the datasets available, and the problem worsens when large amounts of data are required to build a solution for a particular domain. This investigation proposes MCCD, a methodology for extracting human conversational datasets from several data sources. MCCD identifies different answers to the same message, distinguishing the separate conversation flows they start. This allows the resulting dataset to be used in more applications. Datasets generated by MCCD can train models for different purposes, such as Question Answering (QA) and open-domain conversational agents. We developed a complete software tool to implement and evaluate our proposal, and we applied it to extract human conversations from two datasets in Portuguese.
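The idea of treating different answers to the same message as separate conversation flows can be illustrated with a minimal sketch. This is not the MCCD implementation, which the abstract does not detail. It assumes each message is a hypothetical `(msg_id, parent_id, text)` tuple, where `parent_id` is the message being answered, and the function name `extract_flows` is likewise an illustrative choice.

```python
from collections import defaultdict


def extract_flows(messages):
    """Return every root-to-leaf reply chain as its own conversation flow.

    `messages` is a list of (msg_id, parent_id, text) tuples. This record
    layout is an assumption made for illustration only.
    """
    children = defaultdict(list)  # parent id -> ids of direct replies
    text = {}                     # message id -> message text
    roots = []                    # messages that answer nothing

    for msg_id, parent_id, body in messages:
        text[msg_id] = body
        if parent_id is None:
            roots.append(msg_id)
        else:
            children[parent_id].append(msg_id)

    flows = []

    def walk(node, path):
        path = path + [text[node]]
        replies = children.get(node, [])
        if not replies:
            # No further answers: the path so far is one complete flow.
            flows.append(path)
            return
        # Each distinct answer to the same message starts its own branch.
        for reply in replies:
            walk(reply, path)

    for root in roots:
        walk(root, [])
    return flows


# Hypothetical sample thread: message 1 receives two different answers.
messages = [
    (1, None, "Which library parses dates?"),
    (2, 1, "Try dateutil."),
    (3, 1, "The standard datetime module works."),
    (4, 2, "Thanks, that worked."),
]
flows = extract_flows(messages)
```

Under these assumptions, the sample thread yields two flows that share the opening question, so each answer can serve as a separate training example, for instance as a question and answer pair or as a multi-turn dialogue.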
Keywords
Natural Language Processing, Data Wrangling, Data Acquisition, Human Conversation, Model Learning, Tool