Novelty and redundancy detection in adaptive filtering

SIGIR, pp. 81-88 (2002)


Abstract

This paper addresses the problem of extending an adaptive information filtering system to make decisions about the novelty and redundancy of relevant documents. It argues that relevance and redundancy should each be modelled explicitly and separately. A set of five redundancy measures is proposed and evaluated in experiments with and wit...

Introduction
  • The decision about whether a document contains new information depends on whether the relevant information in the document is covered by information in documents delivered previously.
  • This complicates the filtering problem.
  • Decisions about redundancy and novelty depend very much on where in the stream a document appears.
Highlights
  • The five redundancy measures described in Section 4 were compared on the two datasets described in Sections 6.1 and 6.2
  • The results are shown in Figures 3, 4 and 5 in the form of average Recall-Precision graphs over the set of redundant documents
  • The research reported here is a first step towards adaptive information filtering systems that learn to identify documents that are novel and redundant in addition to relevant and nonrelevant
  • The experimental results demonstrate that it is possible to identify redundant documents with reasonable accuracy. They also demonstrate the importance of a suitable redundancy-threshold algorithm, analogous to the relevance-threshold algorithm found in many information filtering systems
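
The role of a redundancy threshold can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `overlap_score` is a toy word-overlap score (not one of the paper's five measures), `filter_stream` is a hypothetical helper, and scoring a document against its most similar previously delivered document is one plausible design, not necessarily the authors' exact procedure.

```python
from typing import Callable, List, Sequence


def overlap_score(doc: str, earlier: str) -> float:
    """Toy redundancy score (assumption): the fraction of the distinct
    words in `doc` that also occur in `earlier`."""
    words = set(doc.lower().split())
    if not words:
        return 0.0
    return len(words & set(earlier.lower().split())) / len(words)


def filter_stream(
    docs: Sequence[str],
    threshold: float,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Deliver a document only if its redundancy with respect to every
    previously delivered document stays below `threshold`."""
    delivered: List[str] = []
    for doc in docs:
        # Redundancy is taken against the most similar delivered document.
        redundancy = max((score(doc, prev) for prev in delivered), default=0.0)
        if redundancy < threshold:
            delivered.append(doc)
    return delivered


stream = [
    "bank raises interest rates",
    "bank raises interest rates again",
    "storm hits the coast",
]
delivered = filter_stream(stream, threshold=0.7)
reordered = filter_stream([stream[1], stream[0], stream[2]], threshold=0.7)
```

With this toy data the second story is suppressed because most of its words already appeared in the first. Swapping the first two stories suppresses the other one instead, which shows how the decision depends on where in the stream a document appears.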
Methods
  • The authors created a one gigabyte dataset by combining AP News and Wall Street Journal data from TREC CDs 1, 2, and 3
  • The authors chose these corpora because they are widely available, because information needs and relevance judgements are available from NIST, and because the two newswire corpora cover the same time period (1988 to 1990) and many of the same topics, guaranteeing some redundancy in the document stream.
  • This is the approach the authors adopted when developing algorithms, but that decision was based in part on how the authors intended to collect redundancy judgements
Results
  • The five redundancy measures described in Section 4 were compared on the two datasets described in Sections 6.1 and 6.2.
  • The results are shown in Figures 3, 4 and 5 in the form of average Recall-Precision graphs over the set of redundant documents.
  • On both datasets the Set Difference measure is the least accurate.
  • The cosine similarity metric is symmetric; the authors expected asymmetric measures to be a better model of this task.
  • The authors' results add redundancy detection to the long list of tasks for which cosine similarity is effective
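
The contrast between a symmetric and an asymmetric measure can be sketched as follows. This is a simplified illustration under stated assumptions: `cosine_similarity` uses raw term counts with whitespace tokenization, and `new_word_count` is a hypothetical stand-in that merely counts words in the new document that are absent from the old one. It is not the paper's Set Difference measure, whose exact definition is not given in this summary.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between raw term-count vectors (symmetric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def new_word_count(new_doc: str, old_doc: str) -> int:
    """Number of distinct words in `new_doc` that `old_doc` lacks (asymmetric).
    A higher count suggests more novelty, so it runs opposite to cosine."""
    return len(set(new_doc.lower().split()) - set(old_doc.lower().split()))


short_doc = "bank raises rates"
long_doc = "bank raises rates after inflation report"

sim_forward = cosine_similarity(short_doc, long_doc)
sim_backward = cosine_similarity(long_doc, short_doc)
novel_in_short = new_word_count(short_doc, long_doc)  # short arrives after long
novel_in_long = new_word_count(long_doc, short_doc)   # long arrives after short
```

Cosine returns the same value in both directions, while the word-difference count distinguishes a short follow-up that adds nothing from a longer follow-up that adds three new words.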
Conclusion
  • The research reported here is a first step towards adaptive information filtering systems that learn to identify documents that are novel and redundant in addition to relevant and nonrelevant.
  • It defines a task, an evaluation methodology, ...
  • The extremely small amount of training data makes it a challenging problem
Tables
  • Table 1: Average performance of different redundancy measures with a simple thresholding algorithm, measured on 33 topics with the AP News & Wall Street Journal dataset. Both absolutely redundant and somewhat redundant documents are treated as redundant
  • Table 2: Average performance of different redundancy measures with a simple thresholding algorithm, measured on 33 topics with the AP News & Wall Street Journal dataset. Only absolutely redundant documents are treated as redundant
  • Table 3: Average performance of different redundancy measures with a simple thresholding algorithm, measured on 20 topics with the TREC Interactive dataset
Related work
  • The research most closely related to novelty or redundancy detection in adaptive information filtering is perhaps the First Story Detection task associated with Topic Detection and Tracking (TDT) research [1]. A TDT system monitors a stream of chronologically ordered documents, usually news stories. The First Story Detection (FSD) task is defined as detecting the first story that discusses a previously unknown event. An event is defined as “something that happens at some specific time and place” [14].

    Online clustering approaches have been a common solution to the FSD task [10, 3, 2, 5, 4, 13, 15, 1, 14]. New stories are compared to clusters of stories about previously-known events. If the new story matches an existing cluster, it describes a known event, otherwise it describes a new event.
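
The online clustering idea described above can be sketched in a few lines. This is a minimal single-pass illustration, not any cited system: `detect_first_stories`, the term-count cluster representation, the cosine comparison, and the threshold value are all assumptions chosen for brevity.

```python
import math
from collections import Counter
from typing import List


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def detect_first_stories(stories: List[str], threshold: float) -> List[bool]:
    """Return one flag per story: True if it starts a new cluster
    (a 'first story'), False if it matches an existing cluster."""
    clusters: List[Counter] = []  # each cluster is the summed term counts
    flags: List[bool] = []
    for story in stories:
        vec = Counter(story.lower().split())
        sims = [cosine(vec, cluster) for cluster in clusters]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            clusters[best].update(vec)  # known event: fold story into cluster
            flags.append(False)
        else:
            clusters.append(vec)        # unknown event: open a new cluster
            flags.append(True)
    return flags


stories = [
    "earthquake strikes coastal city",
    "earthquake strikes city overnight",
    "election results announced today",
]
flags = detect_first_stories(stories, threshold=0.5)
```

Here the second story matches the earthquake cluster and is treated as a known event, while the first and third stories each open a new cluster.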
Funding
  • This material is based on work supported by Air Force Research Laboratory contract F30602-98-C0110
Reference
  • J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study. In Topic Detection and Tracking Workshop Report, 2001.
  • J. Allan, V. Lavrenko, and H. Jin. First story detection in TDT is hard. In Proc. of the 9th International Conference on Information and Knowledge Management, 2000.
  • J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.
  • J. Carbonell, Y. Yang, R. Brown, C. Jin, and J. Zhang. CMU TDT report 13-14 Nov 2001. In Topic Detection and Tracking Workshop Report, 2001.
  • M. Franz, A. Ittycheriah, J. S. McCarley, and T. Ward. First story detection: Combining similarity and novelty based approaches. In Topic Detection and Tracking Workshop Report, 2001.
  • W. P. Jones and G. W. Furnas. Pictures of relevance. Journal of the American Society for Information Science, 1987.
  • W. Kraaij, R. Pohlmann, and D. Hiemstra. Twenty-one at TREC-8: using language technology for information retrieval. In Proceedings of the Eighth Text REtrieval Conference (TREC-8), 1999.
  • L. Lee. Measures of distributional similarity. In Proceedings of the 37th ACL, 1999.
  • A. McCallum, R. Rosenfeld, T. Mitchell, and A. Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of The Eighteenth International Conference on Machine Learning, 1998.
  • D. R. H. Miller, T. Leek, and R. Schwartz. A hidden Markov model information retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 214–221, 2001.
  • S. Robertson. Threshold setting in adaptive filtering. Journal of Documentation, 2000.
  • S. Robertson and D. Hull. The TREC-9 Filtering track report. In The Ninth Text REtrieval Conference (TREC-9), 2001.
  • M. Spitters and W. Kraaij. TNO at TDT2001: Language model-based topic detection. In Topic Detection and Tracking Workshop Report, 2001.
  • N. Stokes and J. Carthy. Combining semantic and syntactic document classifiers to improve first story detection. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
  • J. Yamron, S. Knecht, and P. van Mulbregt. Dragon’s tracking and detection systems for the TDT2000 evaluation. In Proceedings of the Broadcast News Transcription and Understanding Workshop, 1998.
  • C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management, 2001.
  • C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proc. of the 24th Annual Int’l ACM SIGIR Conference on Research and Development in Information Retrieval, pages 334–342, 2001.
  • Y. Zhang and J. Callan. Maximum likelihood estimation for filtering thresholds. In Proc. of the 24th Annual Int’l ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.