Automatic Anonymization of Textual Documents: Detecting Sensitive Information via Word Embeddings

2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE) (2019)

Abstract
Data sharing is key in a wide range of activities but raises serious privacy concerns when the data contain personal information. Anonymization mechanisms provide ways to transform the data so that identities and/or sensitive data are not disclosed (i.e., the data are no longer personal). Even though a variety of methods have been proposed for structured data, automatic anonymization of unstructured text is still far from being solved. Textual data anonymization consists of detecting sensitive pieces of text, which are later removed and/or generalized. The detection process is especially challenging, and it is usually based on classifiers pre-trained on large quantities of manually tagged data, which can detect only a fixed set of (sensitive) entity types such as names or locations. This approach is severely limited because sensitive information may appear in text in many forms, and not every occurrence of a given entity type discloses information about the individual to be protected. In this work we propose a more general solution to text anonymization based on the notion of word embedding. The idea is to represent all the entities appearing in the document as word vectors that capture their semantic relationships. A particular entity (e.g., an individual or an organization) can then be protected automatically by removing the other entities co-occurring in the document whose vectors are similar to that entity's vector. Furthermore, our method does not require manually tagged training data and is language-agnostic. We empirically evaluated our proposal on a collection of biographies. Our results show a significant improvement in detection recall compared with classical approaches to text anonymization based on named entity recognition.
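The detection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which similarity measure or cut-off is used, so cosine similarity and the `threshold` value are assumptions here, and the entity names, the three-dimensional vectors in `TOY_EMBEDDINGS`, and the function names are all hypothetical. In practice the vectors would come from a trained word embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def detect_sensitive_entities(target, entities, embeddings, threshold=0.5):
    """Return the entities whose vectors are close to the target's vector.

    `threshold` is a hypothetical cut-off chosen for this toy example.
    Entities without a vector are skipped rather than guessed at.
    """
    target_vec = embeddings[target]
    flagged = []
    for entity in entities:
        if entity == target or entity not in embeddings:
            continue
        if cosine_similarity(target_vec, embeddings[entity]) >= threshold:
            flagged.append(entity)
    return flagged


# Hand-made toy vectors (hypothetical), standing in for real embeddings.
TOY_EMBEDDINGS = {
    "Alice Example": [0.9, 0.1, 0.0],  # the entity to be protected
    "Exampleville": [0.8, 0.2, 0.1],   # place closely tied to the target
    "physicist": [0.7, 0.3, 0.0],      # occupation closely tied to the target
    "Tuesday": [0.0, 0.1, 0.9],        # unrelated term
}

if __name__ == "__main__":
    document_entities = list(TOY_EMBEDDINGS)
    print(detect_sensitive_entities("Alice Example", document_entities, TOY_EMBEDDINGS))
```

With these toy vectors, the two entities pointing in nearly the same direction as the target are flagged for removal or generalization, while the unrelated term is left in place. Note that nothing in the sketch depends on entity types or tagged training data, which is the property the abstract emphasizes.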
Keywords
Document anonymization, Privacy protection, Word embeddings, Named entity recognition