Machine Learning Powered A/B Testing

SIGIR(2017)

Abstract
Online search evaluation, and A/B testing in particular, is an irreplaceable tool for modern search engines. Typically, online experiments last for several days or weeks and require a considerable portion of the search traffic. Despite the increasing need to run more experiments, the amount of that traffic is limited. This situation leads to the problem of finding new key performance metrics with higher sensitivity and lower variance. Recently, we proposed a number of techniques to alleviate this need for larger sample sizes in A/B experiments. One approach formulated the search for a sensitive metric as a data-driven machine learning problem of finding a sensitive combination of metrics \cite{Kharitonov2017}. We assumed that each observation in these experiments is assigned a vector of metrics (features) describing it. We then learned a linear combination of these metrics such that the learned combination can itself be considered a metric and (a) agrees with the preference direction in the seed experiments according to a baseline ground-truth metric, and (b) achieves a higher sensitivity than the baseline ground-truth metric. Another approach addressed the problem of delayed treatment effects, which lower the sensitivity of metrics and force A/B experiments to run longer or to use a larger set of users from limited traffic \cite{Drutsa2017}. We found that a delayed treatment effect of a metric could be revealed through the daily time series of the metric's measurements over the days of an A/B test. We therefore proposed several metrics that learn models of the trend in such time series and use them to quantify changes in user behavior.
Finally, in another study \cite{Poyarkov2016}, we addressed the problem of variance reduction for user engagement metrics and developed a general framework that incorporates both the existing state-of-the-art variance reduction approaches and novel ones based on advanced machine learning techniques. The expected value of the key metric for a given user consists of two components: (1) the expected value for this user irrespective of the treatment assignment and (2) the treatment effect for this user. The expectation of the first component does not depend on the treatment assignment and does not contribute to the actual average treatment effect, but it may increase the variance of the estimate of that effect. If we knew the value of the first component, we could subtract it from the key metric and obtain a new metric with decreased variance. However, since we cannot evaluate the first component exactly, we propose to predict it from attributes of the user that are independent of the treatment exposure. We therefore propose to use, instead of the average value of a key metric, its average deviation from its predicted value. In this way, the problem of variance reduction is reduced to the problem of finding the best predictor for the key metric that is not aware of the treatment exposure. In our general approach, we apply gradient boosted decision trees and achieve a significantly greater variance reduction than the state of the art.
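
The deviation-from-prediction idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the simulated users, the single pre-experiment attribute `pre`, the effect size, and the helper names are hypothetical, and a one-variable least-squares fit stands in for the gradient boosted decision trees mentioned in the abstract. The predictor is fit without access to the treatment flag, and the metric is then replaced by its deviation from the prediction.

```python
import random
import statistics


def fit_linear_predictor(xs, ys):
    """Least-squares fit y ~ a + b*x (a simple stand-in for boosted trees)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def simulate_users(n, effect, rng):
    """Hypothetical users: (pre-experiment attribute, treated flag, key metric)."""
    users = []
    for i in range(n):
        pre = rng.gauss(10.0, 3.0)   # attribute independent of treatment exposure
        treated = i % 2 == 1         # alternate users between control and treatment
        metric = 2.0 * pre + rng.gauss(0.0, 1.0) + (effect if treated else 0.0)
        users.append((pre, treated, metric))
    return users


rng = random.Random(0)
true_effect = 0.5
users = simulate_users(2000, true_effect, rng)

# The predictor sees only the pre-experiment attribute, never the treatment flag.
predict = fit_linear_predictor([u[0] for u in users], [u[2] for u in users])

raw = {True: [], False: []}
dev = {True: [], False: []}
for pre, treated, metric in users:
    raw[treated].append(metric)
    dev[treated].append(metric - predict(pre))  # deviation from predicted value

raw_effect = statistics.fmean(raw[True]) - statistics.fmean(raw[False])
dev_effect = statistics.fmean(dev[True]) - statistics.fmean(dev[False])
raw_var = statistics.variance(raw[False])
dev_var = statistics.variance(dev[False])

print(f"effect estimate, raw metric:       {raw_effect:.3f}")
print(f"effect estimate, deviation metric: {dev_effect:.3f}")
print(f"control variance, raw metric:       {raw_var:.3f}")
print(f"control variance, deviation metric: {dev_var:.3f}")
```

In this toy setting the deviation metric estimates the same average treatment effect as the raw metric but with a much smaller per-user variance, because most of the between-user variation is explained by the pre-experiment attribute. How much variance a real predictor removes depends on how well user attributes explain the key metric.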
Keywords
Online metrics, online evaluation, A/B testing