Debugging A Crowdsourced Task With Low Inter-Rater Agreement

JCDL 2015

Abstract
In this paper, we describe the process we used to debug a crowdsourced labeling task with low inter-rater agreement. In the labeling task, the workers' subjective judgment was used to detect high-quality social media content (interesting tweets), with the ultimate aim of building a classifier that would automatically curate Twitter content. We describe the effects of varying the genre and recency of the dataset, of testing the reliability of the workers, and of recruiting workers from different crowdsourcing platforms. We also examined the effect of redesigning the work itself, both to make it easier and to potentially improve inter-rater agreement. As a result of the debugging process, we have developed a framework for diagnosing similar efforts and a technique to evaluate worker reliability. The technique for evaluating worker reliability, Human Intelligence Data-Driven Enquiries (HIDDENs), differs from other such schemes in that it has the potential to produce useful secondary results and enhance performance on the main task. HIDDEN subtasks pivot around the same data as the main task, but ask workers questions with greater expected inter-rater agreement. Both the framework and the HIDDENs are currently in use in a production environment.
Keywords
Crowdsourcing, labeling, inter-rater agreement, relevance judgment, debugging, Captchas, worker reliability
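
Illustrative example

The abstract does not specify which agreement statistic the authors use. As one illustration of the central metric, the following is a minimal Python sketch of Fleiss' kappa, a common chance-corrected measure of inter-rater agreement when several crowd workers label the same items. The label names and the toy data are hypothetical, not taken from the paper.

from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same
    number of workers. `ratings` is a list of per-item label lists,
    e.g. [["interesting", "not interesting", ...], ...]."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})

    # n_ij: how many raters assigned item i to category j
    counts = [Counter(item) for item in ratings]

    # Per-item observed agreement P_i, then its mean
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items

    # Expected chance agreement P_e from marginal category proportions
    p_j = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 tweets, each labeled by 3 workers
labels = [
    ["interesting", "interesting", "interesting"],
    ["interesting", "not interesting", "not interesting"],
    ["not interesting", "not interesting", "not interesting"],
    ["interesting", "interesting", "not interesting"],
]
print(round(fleiss_kappa(labels), 3))  # 0.333: low agreement

A low kappa such as the 0.333 in this toy run is the kind of signal that would trigger the debugging process the paper describes, e.g. reformulating the question as a HIDDEN subtask with greater expected agreement.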