Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts.

Antônio Pereira De Souza Júnior,Pablo Cecilio,Felipe Viegas,Washington Cunha, Elisa Tuler de Albergaria,Leonardo Chaves Dutra da Rocha

Brazilian Symposium on Multimedia and the Web (WebMedia)(2022)

引用 0|浏览4
暂无评分
摘要
Topic Modeling (TM) is among the most exploited approaches to extracting and organizing information from large amounts of data. Basically, these approaches aim to find semantic topics from textual documents (e.g., product reviews, tweets). Despite the good results of these approaches in English texts, we do not observe the same semantic quality when applied in Portuguese Texts since they are more verbose, presenting varied and complex verb conjugations and many homonyms, among other specific particularities. This work intends to fill this scientific gap by exploiting and evaluating different Topic Modeling Pre-processing Pipelines for Portuguese texts, which correspond to sequences of tasks that needed to be performed before the TM strategies. More specifically, we evaluate different pre-processing pipeline configurations using different semantic data representations to overcome the challenges faced by TM strategies in Portuguese Text. In our experimentation evaluation, considering two datasets collected from Twitter and Reddit related to Brazilian political discussion, we show that our proposed extended pre-processing pipeline, especially considering semantic representations, can achieve significant gains in effectiveness when compared to the TM approaches originally proposed for English texts (up to 9x better).
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要