Double Trouble? Impact and Detection of Duplicates in Face Image Datasets
CoRR (2024)
Abstract
Various face image datasets intended for facial biometrics research were
created via web scraping, i.e., the collection of images publicly available on
the internet. This work presents an approach to detect both exactly and nearly
identical face image duplicates, using file and image hashes. The approach is
extended through the use of face image preprocessing. Additional steps based on
face recognition and face image quality assessment models reduce false
positives, and facilitate the deduplication of the face images both for intra-
and inter-subject duplicate sets. The presented approach is applied to five
datasets, namely LFW, TinyFace, Adience, CASIA-WebFace, and C-MS-Celeb (a
cleaned MS-Celeb-1M variant). Duplicates are detected within every dataset,
with hundreds to hundreds of thousands of duplicates for all except LFW. Face
recognition and quality assessment experiments indicate that removing the
duplicates has only a minor impact on the results. The final deduplication data
is publicly available.
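
As a rough illustration of the two hash types mentioned in the abstract, the sketch below groups exact duplicates by a file hash and flags near duplicates by comparing a simple image hash. It is not the authors' implementation: the abstract does not say which hash functions are used, so SHA-256 and a basic average hash are assumed stand-ins here, and the function names and the `max_distance` threshold are hypothetical. Images are assumed to be already decoded, converted to grayscale, and downscaled to a small grid of pixel values.

```python
import hashlib
from collections import defaultdict


def file_hash(data: bytes) -> str:
    """Hash of the raw file bytes; equal hashes mean byte-identical files."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels):
    """Simple average hash (assumed stand-in for an image hash).

    `pixels` is a small 2D list of grayscale values. Each bit records
    whether a pixel is brighter than the mean of the whole grid.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def exact_duplicate_groups(files):
    """Group file names whose contents share the same file hash."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[file_hash(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def near_duplicate_pairs(images, max_distance=1):
    """Return name pairs whose image hashes differ in at most `max_distance` bits."""
    hashes = {name: average_hash(px) for name, px in images.items()}
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming(hashes[a], hashes[b]) <= max_distance
    ]


if __name__ == "__main__":
    files = {"a.jpg": b"abc", "b.jpg": b"abc", "c.jpg": b"xyz"}
    print(exact_duplicate_groups(files))

    images = {
        "x": [[10, 200], [220, 30]],
        "y": [[12, 198], [225, 28]],   # slightly altered copy of "x"
        "z": [[200, 10], [30, 220]],   # different content
    }
    print(near_duplicate_pairs(images))
```

The pairwise comparison is quadratic in the number of images, which is fine for a toy example but would need bucketing or indexing at the scale of the larger datasets named above. Candidate pairs found this way would still have to pass the face recognition and quality assessment steps that the abstract describes for reducing false positives.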