Document vector embedding based extractive text summarization system for Hindi and English text

Applied Intelligence (2022)

Abstract
Several automatic text summarization (ATS) methods have been proposed for resource-rich languages such as English and Chinese. However, resource-limited languages like Hindi have received very little attention from researchers, and the lack of resources still makes ATS for Hindi a challenging and open problem. Capturing semantic features and capturing hidden relationships among text units are the two main characteristics of an informative summary. In the current work, we propose an ATS model based on the document vector method to explore the semantic relations existing in the document. Moreover, we suggest two algorithms, sentence ranking and summary generation, based on three main characteristics (redundancy, diversity, and compression rate) to create a clear and coherent summary. The proposed model is language-independent apart from some language-specific preprocessing. We evaluate our model on datasets in two languages: literary novels in Hindi and DUC 2007 news articles in English. We apply the ROUGE metric to measure the quality of the generated summaries, and we compare the proposed model against four baseline methods: TextRank, LexRank, Latent Semantic Analysis (LSA), and the model of Mudasir et al. For very short summaries at 5% and 15% compression rates, the overall macro-average F-score of our model (18.5% for Hindi, 26% for English) is higher than that of the baseline approaches. For very long summaries at a 50% compression rate, our model has the highest macro-average values among all the compared methods: 18% for the Hindi novels and 25% for the English news articles. The result analysis shows that the proposed model outperforms all the baselines and produces diverse, minimally redundant, semantically rich, and compressed summaries.
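The abstract does not spell out the two algorithms, so the following is only a minimal sketch of the general idea (rank sentences against a document vector, then select greedily under a redundancy check and a compression-rate budget), not the authors' method. Everything in it is an assumption made for illustration: the function names `embed`, `cosine`, and `summarize`, the `redundancy_threshold` parameter, and especially the bag-of-words count vector, which stands in for a learned document embedding.

```python
import math
from collections import Counter


def embed(text):
    # Assumption: a bag-of-words count vector stands in for a learned
    # document/sentence embedding. Whitespace splitting is a placeholder
    # for the language-specific preprocessing a real system would need.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(sentences, compression_rate=0.5, redundancy_threshold=0.8):
    # Hypothetical pipeline: rank by similarity to the whole document,
    # then pick sentences greedily, skipping near-duplicates.
    if not sentences:
        return []
    doc_vec = embed(" ".join(sentences))
    vecs = [embed(s) for s in sentences]

    # Sentence ranking: most document-like sentences first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vecs[i], doc_vec),
        reverse=True,
    )

    # Compression rate here means the fraction of sentences to keep.
    budget = max(1, math.ceil(compression_rate * len(sentences)))

    chosen = []
    for i in ranked:
        if len(chosen) == budget:
            break
        # Redundancy check: skip sentences too similar to ones already kept.
        if all(cosine(vecs[i], vecs[j]) < redundancy_threshold for j in chosen):
            chosen.append(i)

    # Restore original document order for readability.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    sents = [
        "the cat sat on the mat",
        "the cat sat on the mat",
        "dogs bark loudly at night",
        "the mat was red",
    ]
    for line in summarize(sents, compression_rate=0.5):
        print(line)
```

In this sketch the duplicate second sentence is dropped by the redundancy check, which is the behavior the redundancy and diversity criteria are meant to encourage. Swapping `embed` for a trained document-vector model would change the scores but not the control flow.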
Keywords
Text summarizer, Hindi novels, Sentence vectors, Gist diversity, Redundancy ratio