On the Reusability of "Living Labs" Test Collections: A Case Study of Real-Time Summarization

SIGIR 2017

Abstract
Information retrieval test collections are typically built using data from large-scale evaluations in international forums such as TREC, CLEF, and NTCIR. Previous validation studies on pool-based test collections for ad hoc retrieval have examined their reusability, i.e., their ability to accurately assess the effectiveness of systems that did not participate in the original evaluation. To our knowledge, the reusability of test collections derived from "living labs" evaluations, which are based on logs of user activity, has not been explored. In this paper, we perform a "leave-one-out" analysis of human judgment data derived from the TREC 2016 Real-Time Summarization Track and show that those judgments do not appear to be reusable. While this finding is limited to one specific evaluation, it calls into question the reusability of test collections built from living labs in general, and at the very least suggests the need for additional work in validating such experimental instruments.
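The leave-one-out analysis mentioned in the abstract follows the general pattern of pool-based reusability studies: remove the judged documents that only the held-out system contributed, re-score every system against the reduced judgments, and check how much the system ranking changes. Below is a minimal, hypothetical Python sketch of that pattern; the data structures (`runs`, `qrels`, `contributions`) and the `score_run` callback are illustrative assumptions, not the track's actual formats or the authors' code.

```python
# Hypothetical sketch of a leave-one-out reusability check for a pooled
# test collection. Not the authors' implementation; data layouts are assumed.

from scipy.stats import kendalltau


def leave_one_out_tau(runs, qrels, contributions, score_run):
    """For each run, drop the judgments only it contributed to the pool,
    re-score all runs, and report the Kendall's tau between the original
    and reduced-pool system rankings.

    runs:          {run_name: run_output}                      (assumed)
    qrels:         {topic: {doc_id: relevance}}                (assumed)
    contributions: {(topic, doc_id): set of contributing runs} (assumed)
    score_run:     callable(run_output, qrels) -> float        (assumed)
    """
    original = {name: score_run(run, qrels) for name, run in runs.items()}
    taus = {}
    for held_out in runs:
        # Keep only judgments that at least one *other* run also contributed.
        reduced = {
            topic: {doc: rel for doc, rel in docs.items()
                    if contributions[(topic, doc)] - {held_out}}
            for topic, docs in qrels.items()
        }
        reduced_scores = {name: score_run(run, reduced)
                          for name, run in runs.items()}
        names = sorted(runs)
        tau, _ = kendalltau([original[n] for n in names],
                            [reduced_scores[n] for n in names])
        taus[held_out] = tau
    return taus
```

A low tau for a held-out run (or a large drop in that run's own score) would indicate that the collection does not generalize to systems outside the original pool, which is the kind of non-reusability the paper reports for the RTS judgments.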