Towards automatic identification of self-reported COVID-19 tweets: Introducing a multilingual manually annotated dataset, baseline systems and exploratory evaluations.

2023 IEEE International Conference on Big Data (BigData) (2023)

Abstract
In recent times, social networks like Twitter have emerged as vital platforms for sharing personal thoughts, opinions, and, most importantly, health-related information, especially pertaining to COVID-19. Users tend to share detailed, personal narratives that researchers could use to capture true self-reported health data. While the data is easily accessible, differentiating between health-related self-reports and informal discussion is difficult, as it relies on either manual curation or the availability of large manually annotated datasets on which machine learning models can be trained. Manually annotating data is immensely time-consuming because it generally requires a subject matter expert, and even more so in languages other than English, such as Spanish. In this work, we release two manually annotated datasets, one in English and one in Spanish, comprising 36,548 tweets containing self-reported COVID-19 symptoms, to aid machine learning models in extracting self-reported COVID-19 tweets. Through an extensive set of experiments, we demonstrate how these datasets can be leveraged with classical and modern machine learning algorithms to identify unlabeled self-report tweets. Additionally, we perform a stratified analysis of how (and if) data augmentation and automatic translation could help train more generalizable models.
Keywords
COVID-19, machine learning, healthcare, self-reported symptoms, pandemic study