Leveraging Topic Models To Develop Metrics For Evaluating The Quality Of Narrative Threads Extracted From News Stories

6th International Conference on Applied Human Factors and Ergonomics (AHFE 2015) and the Affiliated Conferences (2015)

Abstract
Analysts and software systems are increasingly tasked with making sense of a growing amount of data to help their organizations make decisions involving risk and uncertainty. A key enabler of this work is the ability to quickly discover structure in large amounts of text such as news stories and blogs. Recent work in this area has shown that it is possible to automatically link documents from a corpus together to build a narrative structure, called a story chain, without the need for prior domain knowledge [1]. This approach is unsupervised and discovers large numbers of story chains of variable quality. In this paper, we describe and evaluate methods to identify the most coherent and informative story chains. We explore two types of topic-model-based analytics. The first type is a measure of representativeness that captures how well a story chain represents the corpus from which it was generated. This is done by comparing the topics found over time in a story chain against those expressed in the corpus during the same time period. Our hypothesis is that story chains whose topic expression is similar to that of the corpus will convey narratives that are central to the corpus. This type of analytic could help an analyst quickly focus on the key narratives in a large corpus of documents. The second type is a measure of the quality of a story chain and is composed of topic consistency and topic persistence measures. Our hypothesis is that high-quality chains are composed of sequences of stories that have clearly defined primary topics that persist across significant portions of the story chain. We used these analytics to predict the clarity of story chains, assigning each chain to one of four categories: (1) very clear narrative, (2) somewhat clear narrative, (3) somewhat unclear narrative, or (4) very unclear narrative. We found that we were able to train a model to label story chains with the same label as human coders 77% of the time.
Our dataset was composed of 7,074 English-language news stories released during the Brazil Protests of 2013, from which 5,606 story chains were generated. We randomly selected 60 story chains for hand scoring to serve as our gold standard data set for experimentation. (C) 2015 The Authors. Published by Elsevier B.V.
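
The abstract names three analytics (representativeness, topic consistency, and topic persistence) without giving their formulas. The following is a minimal sketch of one way such measures could be computed, not the authors' implementation. It assumes each story already has a topic distribution from some topic model and a time-period label. The function names, the use of cosine similarity, and the specific definitions of consistency (mean weight of each story's primary topic) and persistence (longest run of stories sharing a primary topic, as a fraction of chain length) are all illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of topic vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def representativeness(chain, corpus):
    """Assumed definition: average, over time periods covered by the chain,
    of the cosine similarity between the chain's mean topic vector and the
    corpus's mean topic vector in the same period.

    chain, corpus: lists of (period, topic_distribution) pairs.
    """
    chain_by_period = defaultdict(list)
    corpus_by_period = defaultdict(list)
    for period, dist in chain:
        chain_by_period[period].append(dist)
    for period, dist in corpus:
        corpus_by_period[period].append(dist)
    sims = [
        cosine(mean_vector(docs), mean_vector(corpus_by_period[period]))
        for period, docs in chain_by_period.items()
        if period in corpus_by_period
    ]
    return sum(sims) / len(sims) if sims else 0.0


def topic_consistency(chain):
    """Assumed definition: mean weight of each story's primary topic.
    Expects a non-empty chain."""
    return sum(max(dist) for _, dist in chain) / len(chain)


def topic_persistence(chain):
    """Assumed definition: longest run of consecutive stories that share
    the same primary topic, divided by chain length. Expects a non-empty
    chain."""
    primary = [max(range(len(d)), key=d.__getitem__) for _, d in chain]
    best = run = 1
    for prev, cur in zip(primary, primary[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best / len(primary)


# Hypothetical four-story chain over two time periods and three topics.
example_chain = [
    (0, [0.8, 0.1, 0.1]),
    (0, [0.7, 0.2, 0.1]),
    (1, [0.6, 0.3, 0.1]),
    (1, [0.1, 0.8, 0.1]),
]
```

Scores like these could then serve as input features for a classifier that predicts the four clarity categories; the abstract does not specify which model was used.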
Keywords
Sensemaking, Data analytics, Text analytics, Narrative, Machine learning, Topic modeling