TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
arXiv (2024)
Abstract
News headline generation is a crucial task in increasing productivity for
both the readers and producers of news. This task can easily be aided by
automated news headline generation models. However, the presence of irrelevant
headlines in scraped news articles results in sub-optimal performance of
generation models. We propose that relevance-based headline classification can
greatly aid the task of generating relevant headlines. Relevance-based headline
classification involves categorizing news headlines based on their relevance to
the corresponding news articles. While this task is well-established in
English, it remains under-explored in low-resource languages like Telugu due to
a lack of annotated data. To address this gap, we present TeClass, the
first-ever human-annotated Telugu news headline classification dataset,
containing 78,534 annotations across 26,178 article-headline pairs. We
experiment with various baseline models and provide a comprehensive analysis of
their results. We further demonstrate the impact of this work by fine-tuning
various headline generation models using the TeClass dataset. The headlines
generated by the models fine-tuned on highly relevant article-headline pairs
showed an improvement of about 5 points in ROUGE-L scores. To encourage future
research, the annotated dataset as well as the annotation guidelines will be
made publicly available.