EcoVal: An Efficient Data Valuation Framework for Machine Learning
CoRR (2024)
Abstract
Quantifying the value of data within a machine learning workflow can play a
pivotal role in making more strategic decisions in machine learning
initiatives. Existing Shapley-value-based frameworks for data valuation in
machine learning are computationally expensive because they require a
considerable amount of repeated model training to obtain the Shapley value. In
this paper, we introduce EcoVal, an efficient data valuation framework that
estimates the value of data for machine learning models in a fast and practical
manner.
Instead of working directly with individual data samples, we determine the
value of a cluster of similar data points. This value is then propagated among
all the member points of the cluster. We show that the overall data value can
be determined by estimating the intrinsic and extrinsic value of each data
point. This is enabled by formulating the performance of a model as a
production function, a concept widely used to estimate the amount of output
based on factors like labor and capital in a traditional free economic market.
We provide a formal proof of our valuation technique and elucidate the
principles and mechanisms that enable its accelerated performance. We
demonstrate the real-world applicability of our method by showcasing its
effectiveness for both in-distribution and out-of-sample data. This work
addresses one of the core challenges of efficient data valuation at scale in
machine learning models.
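
The cluster-then-propagate idea described above can be illustrated with a minimal sketch. It is not the EcoVal algorithm: the abstract does not specify the production-function form or the propagation rule. As stated assumptions, the sketch computes exact Shapley values over a handful of clusters by enumerating permutations, and it splits each cluster's value uniformly among its members. The names `cluster_shapley`, `propagate_uniform`, and `toy_utility`, the gain numbers, and the cap are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def cluster_shapley(labels, utility):
    """Exact Shapley value per cluster (feasible only for a few clusters).

    `labels` holds one cluster id per data point; `utility` maps a frozenset
    of cluster ids to a score, e.g. validation accuracy after training on
    those clusters.
    """
    ids = sorted(set(labels))
    totals = {c: 0.0 for c in ids}
    n_orders = 0
    for order in permutations(ids):
        n_orders += 1
        coalition = frozenset()
        prev = utility(coalition)
        for c in order:
            coalition = coalition | {c}
            cur = utility(coalition)
            totals[c] += cur - prev  # marginal contribution of cluster c
            prev = cur
    return {c: totals[c] / n_orders for c in ids}


def propagate_uniform(labels, cluster_values):
    """Assumed propagation rule: split a cluster's value equally among members."""
    sizes = Counter(labels)
    return [cluster_values[c] / sizes[c] for c in labels]


# Hypothetical toy setup: six points in three clusters.
labels = [0, 0, 1, 2, 2, 2]
gains = {0: 0.4, 1: 0.1, 2: 0.3}  # assumed per-cluster accuracy gains


def toy_utility(coalition):
    # Stand-in for "train on these clusters and evaluate"; capped at 0.7
    # to mimic diminishing returns from redundant data.
    return min(0.7, sum(gains[c] for c in coalition))


cluster_values = cluster_shapley(labels, toy_utility)
point_values = propagate_uniform(labels, cluster_values)
print(cluster_values)
print(point_values)
```

Valuing k clusters instead of n individual points shrinks the number of players in the Shapley computation from n to k, which is where the savings in a cluster-level scheme come from. In practice `toy_utility` would be replaced by an actual train-and-evaluate routine or by a cheaper surrogate.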