Anonymization of Unstructured Data via Named-Entity Recognition.

Fadi Hassan, Josep Domingo-Ferrer, Jordi Soria-Comas

MDAI (2018)

Abstract

The anonymization of structured data has been widely studied in recent years. However, anonymizing unstructured data (typically text documents) remains a highly manual task and needs more attention from researchers. The main difficulty when dealing with unstructured data is that no database schema is available that can be used to measure privacy risks. In fact, confidential data and quasi-identifier values may be spread throughout the documents to be anonymized. In this work we propose to use a named-entity recognition tagger based on machine learning. The ultimate aim is to build a system capable of detecting all attributes that have privacy implications (identifiers, quasi-identifiers and sensitive attributes). In particular, we present a proof of concept focused on the detection of confidential attributes. We consider a case study in which confidential values to be detected are disease names in medical diagnoses. Once these confidential attribute values are located, one can use standard statistical disclosure control techniques for structured data to control disclosure risk.
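
To make the last step concrete, the following is a minimal sketch of what could happen once a tagger has located the confidential values. It assumes the tagger emits BIO-style labels (`B-DISEASE`, `I-DISEASE`, `O`) per token, which the abstract does not specify, and it uses plain suppression (replacing each detected span with a placeholder) as the simplest stand-in for a statistical disclosure control technique. The function names, the `DISEASE` label, and the example sentence are hypothetical and chosen only for illustration; the tags are written by hand rather than produced by a trained model.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse BIO tags into (start, end_exclusive, label) token spans."""
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if any.
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new span; a stray I- is treated as a span start too.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


def mask_entities(
    tokens: List[str],
    tags: List[str],
    target: str = "DISEASE",          # hypothetical label name
    placeholder: str = "[DISEASE]",   # hypothetical placeholder text
) -> List[str]:
    """Replace every span labelled `target` with a single placeholder token."""
    span_end = {s: e for s, e, lab in bio_to_spans(tags) if lab == target}
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i in span_end:
            out.append(placeholder)
            i = span_end[i]  # skip past the whole detected span
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hand-written example; a real system would obtain `tags` from the tagger.
tokens = ["Patient", "diagnosed", "with", "type", "2", "diabetes",
          "and", "asthma", "."]
tags = ["O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE",
        "O", "B-DISEASE", "O"]
masked = mask_entities(tokens, tags)
print(" ".join(masked))
```

Suppression is used here only because it is the easiest technique to show in a few lines; once the spans are known, they could instead be collected into a structured column and treated with another disclosure control method, such as generalization.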

Keywords

Anonymization, Unstructured data, Named-entity recognition, Conditional random fields
