What is different between these datasets?
arXiv (2024)
Abstract
The performance of machine learning models heavily depends on the quality of
input data, yet real-world applications often encounter various data-related
challenges. One such challenge arises when curating training data or
deploying a model in the real world: two comparable datasets in the same
domain may have different distributions. While numerous techniques exist for
detecting distribution shifts, the literature lacks comprehensive approaches
for explaining dataset differences in a human-understandable manner. To address
this gap, we propose a suite of interpretable methods (toolbox) for comparing
two datasets. We demonstrate the versatility of our approach across diverse
data modalities, including tabular data, language, images, and signals in both
low and high-dimensional settings. Our methods not only outperform comparable
and related approaches in terms of explanation quality and correctness, but
also provide actionable, complementary insights to understand and mitigate
dataset differences effectively.