A Flexible Algorithmic Approach For Identifying Conflicting/Deviating Data On The Web

2018 International Conference on Computer, Information and Telecommunication Systems (IEEE CITS 2018), 2018

Abstract
Information on the Web often contains contradictions and conflicting information, which degrades the quality of data sources and the quality of search and retrieval results. Appropriate techniques therefore need to be developed and integrated into the infrastructure used for retrieving and browsing data sources, so that conflicting data can be detected and then removed, blocked, or highlighted to the user, improving the quality of the content users consume. This paper proposes an approach that detects conflicting data by investigating the deviation between values available from structured data on the Web. The approach consists of multiple phases. First, pre-processing of data from the targeted data sources makes the sources comparable. Second, the Levenshtein distance is computed between data elements to represent the degree of conflict between them. Third, the cosine similarity is computed between vectors of Levenshtein distance values and a user-configurable sensitivity vector, which encodes the characteristics of the specific kind of conflict under investigation; this yields a ranked detection of the conflicting data. The algorithm has been applied and tested on a data collection about movies from the Web, illustrating how the techniques can be applied to detect conflicting information on the Web.
Keywords
Conflicting Data, Levenshtein distance, Cosine similarity
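
The three phases named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the paper builds its distance vectors or sensitivity vectors, so everything here is an assumption: the helper names (normalize, levenshtein, cosine_similarity, rank_conflicts), the choice of one Levenshtein distance per compared field, the lowercase-and-whitespace pre-processing, and the sample movie records are all hypothetical and only show how the pieces might fit together.

```python
import math


def normalize(value: str) -> str:
    """Assumed pre-processing: lowercase and collapse whitespace."""
    return " ".join(value.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_conflicts(records_a, records_b, fields, sensitivity):
    """Rank records present in both sources by how closely their
    per-field distance vector matches the sensitivity vector."""
    ranked = []
    for key in records_a.keys() & records_b.keys():
        distances = [
            levenshtein(normalize(records_a[key][f]),
                        normalize(records_b[key][f]))
            for f in fields
        ]
        score = cosine_similarity(distances, sensitivity)
        ranked.append((score, key, distances))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Hypothetical sample data: two sources describing the same movies.
source_a = {"m1": {"title": "Alien", "year": "1979"},
            "m2": {"title": "Heat", "year": "1995"}}
source_b = {"m1": {"title": "Alien", "year": "1978"},
            "m2": {"title": "Heat ", "year": "1995"}}

# Sensitivity vector that only reacts to conflicts in the "year" field.
ranking = rank_conflicts(source_a, source_b, ["title", "year"], [0, 1])
```

In this sketch the sensitivity vector [0, 1] ranks record m1 first, because only its year values differ, while the trailing space in the second title is removed by the assumed pre-processing and produces no conflict.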