How Does Unlabeled Data Provably Help Out-of-Distribution Detection?
CoRR(2024)
摘要
Using unlabeled data to regularize the machine learning models has
demonstrated promise for improving safety and reliability in detecting
out-of-distribution (OOD) data. Harnessing the power of unlabeled in-the-wild
data is non-trivial due to the heterogeneity of both in-distribution (ID) and
OOD data. This lack of a clean set of OOD samples poses significant challenges
in learning an optimal OOD classifier. Currently, there is a lack of research
on formally understanding how unlabeled data helps OOD detection. This paper
bridges the gap by introducing a new learning framework SAL (Separate And
Learn) that offers both strong theoretical guarantees and empirical
effectiveness. The framework separates candidate outliers from the unlabeled
data and then trains an OOD classifier using the candidate outliers and the
labeled ID data. Theoretically, we provide rigorous error bounds from the lens
of separability and learnability, formally justifying the two components in our
algorithm. Our theory shows that SAL can separate the candidate outliers with
small error rates, which leads to a generalization guarantee for the learned
OOD classifier. Empirically, SAL achieves state-of-the-art performance on
common benchmarks, reinforcing our theoretical insights. Code is publicly
available at https://github.com/deeplearning-wisc/sal.
更多查看译文
关键词
out-of-distribution detection,learnability
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要