Utilization of synergetic human-machine clouds: a big data cleaning case.


ABSTRACT Cloud computing and crowdsourcing are growing trends in IT. Combining the strengths of machine and human clouds in a hybrid design enables us to overcome certain problems and gain efficiency. In this paper we present a case in which we developed a hybrid, throw-away prototype software system to solve a big data cleaning problem: correcting and normalizing a data set of 53,822 academic publication records. First, we used external DOI query web services to label the records with matching DOIs. Next, we used customized string similarity algorithms based on Levenshtein distance and the Jaccard index to grade the similarity between records. Finally, we used crowdsourcing to identify duplicates among the residual set of similar yet not identical records. We consider this proof of concept successful and report results that we could not have achieved by using either human or machine clouds alone.
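The similarity-grading step can be illustrated with a minimal Python sketch. It is not the customized algorithm from the paper, whose details the abstract does not give. It computes a plain Levenshtein distance and a word-level Jaccard index, averages them, and routes each record pair to one of three outcomes. The function names, the equal-weight average, and the thresholds `low` and `high` are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def jaccard_index(a: str, b: str) -> float:
    """Jaccard index over lowercase word sets: |A & B| / |A | B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def triage(a: str, b: str, low: float = 0.6, high: float = 0.95) -> str:
    """Hypothetical routing rule; thresholds are illustrative assumptions."""
    score = (levenshtein_similarity(a, b) + jaccard_index(a, b)) / 2
    if score >= high:
        return "duplicate"   # machine decides: near-identical
    if score < low:
        return "distinct"    # machine decides: clearly different
    return "crowd"           # similar yet not identical: ask human workers
```

Under these assumptions, only the pairs in the middle band reach the crowd, which mirrors the division of labor described above: the machine cloud settles the clear cases and human workers judge the residual set.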