Boiling down information retrieval test collections

RIAO(2010)

Abstract
Constructing large-scale test collections is costly and time-consuming, and a few relevance assessment methods have been proposed for constructing "minimal" information retrieval test collections that may still provide reliable experimental results. In contrast to building up such test collections, we take existing test collections constructed through the traditional pooling approach and empirically investigate whether they can be "boiled down." More specifically, we report on experiments with test collections from both NTCIR and TREC to investigate the effect of reducing both the topic set size and the pool depth on the outcome of a statistical significance test between two systems, starting with (approximately) 100 topics and depth-100 pools. We define cost (of manual relevance assessment) as the pool depth multiplied by the topic set size, and error as a system pair whose outcome of statistical significance testing differs from the original result based on the full test collection. Our main findings are: (a) Cost and the number of errors are negatively correlated, and any attempt at substantially reducing cost introduces some errors; (b) The NTCIR-7 IR4QA and the TREC 2004 robust track test collections all yield a comparable and considerable number of errors in response to cost reduction, and this is true despite the fact that the TREC relevance assessments relied on more than twice as many runs as the NTCIR ones; (c) Using 100 topics with depth-30 pools generally yields fewer errors than using 30 topics with depth-100 pools; and (d) Even with depth-100 pools, using fewer than 100 topics results in false alarms, i.e., two systems are declared significantly different even though the full topic set would declare otherwise.
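The cost and error definitions in the abstract can be illustrated with a minimal Python sketch. It assumes the significance outcomes for each system pair have already been computed elsewhere and are supplied as booleans; the function names, the "misses" label for the opposite of a false alarm, and the example outcomes are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # a pair of system (run) identifiers


def assessment_cost(pool_depth: int, num_topics: int) -> int:
    """Cost of manual relevance assessment: pool depth times topic set size."""
    return pool_depth * num_topics


def count_errors(full: Dict[Pair, bool], reduced: Dict[Pair, bool]) -> Dict[str, int]:
    """Compare per-pair significance outcomes (True = significantly different).

    An error is a pair whose outcome on the reduced collection differs from
    its outcome on the full collection. "misses" is an assumed label here;
    the abstract only names false alarms.
    """
    false_alarms = 0
    misses = 0
    for pair, significant_on_full in full.items():
        significant_on_reduced = reduced[pair]
        if significant_on_reduced and not significant_on_full:
            false_alarms += 1
        elif significant_on_full and not significant_on_reduced:
            misses += 1
    return {
        "false_alarms": false_alarms,
        "misses": misses,
        "errors": false_alarms + misses,
    }


# Hypothetical outcomes for three system pairs (not data from the paper).
full_outcomes = {("A", "B"): True, ("A", "C"): False, ("B", "C"): True}
reduced_outcomes = {("A", "B"): True, ("A", "C"): True, ("B", "C"): False}

print(assessment_cost(100, 100))                         # full collection: 10000
print(assessment_cost(30, 100), assessment_cost(100, 30))  # 3000 3000
print(count_errors(full_outcomes, reduced_outcomes))
```

Under this cost definition, the two configurations compared in finding (c), 100 topics with depth-30 pools and 30 topics with depth-100 pools, have the same cost of 3000, so the comparison is between two ways of spending an equal assessment budget.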
Keywords
large-scale test collection,test collection,robust track test collection,depth-100 pool,information retrieval test collection,topic set size,pool depth,topics result,statistical significance test,full test collection,information retrieval,statistical significance