Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees
arXiv (2022)
Abstract
Real-world datasets contain incorrectly labeled instances that hamper model
performance and, in particular, the ability to generalize out of
distribution. In addition, each example may contribute differently to
learning. This motivates studies that seek to better understand the role of
individual data instances and how they contribute to a model's metrics. In this
paper we propose a method based on metrics computed from the training dynamics
of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each
training example. We focus on datasets containing mostly tabular or structured
data, for which decision tree ensembles are still the
state of the art in terms of performance. Our methods achieved the best results
overall when compared with confident learning, direct heuristics, and a robust
boosting algorithm. We show results on detecting noisy labels in order to clean
datasets, on improving models' metrics on synthetic and real public datasets,
and on an industry case in which we deployed a model based on the proposed
solution.
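
The abstract does not say which training-dynamics metrics are computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes one plausible metric: the mean probability that the boosted ensemble assigns to each example's given label, averaged over boosting rounds. The tiny stump-based booster, the toy data, and all names (`fit_stump`, `training_dynamics`, `mean_confidence`) are hypothetical and chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature by least squares."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; split is x <= t
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def training_dynamics(xs, ys, n_rounds=10, lr=0.5):
    """Boost stumps on the logistic-loss gradient and record, after every
    round, the probability the ensemble gives to each example's own label."""
    scores = [0.0] * len(xs)
    history = [[] for _ in xs]
    for _ in range(n_rounds):
        probs = [sigmoid(s) for s in scores]
        residuals = [y - p for y, p in zip(ys, probs)]  # negative gradient
        stump = fit_stump(xs, residuals)
        scores = [s + lr * stump(x) for s, x in zip(scores, xs)]
        for i, (s, y) in enumerate(zip(scores, ys)):
            p = sigmoid(s)
            history[i].append(p if y == 1 else 1.0 - p)
    return history

def mean_confidence(history):
    """Assumed metric: average per-round confidence in the given label."""
    return [sum(h) / len(h) for h in history]

# Toy data: the label is 1 when x >= 5, except index 2, which is flipped on purpose.
xs = [float(i) for i in range(10)]
ys = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
conf = mean_confidence(training_dynamics(xs, ys, n_rounds=10))
suspect = min(range(len(conf)), key=conf.__getitem__)
print("lowest mean confidence at index", suspect)
```

In this toy run the deliberately flipped example is the one the ensemble is slowest to fit, so it ends up with the lowest mean confidence. A real pipeline would use a full GBDT library and would need a validated threshold or ranking rule before removing or relabeling anything.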
Keywords
data quality, gradient boosting, training dynamics, improving