Leakage in data mining: formulation, detection, and avoidance

ACM Transactions on Knowledge Discovery from Data (TKDD), Special Issue on the Best of SIGKDD 2011, no. 4 (2012), Article No. 15.

Abstract

Deemed “one of the top ten data mining mistakes”, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently…

Introduction
  • Deemed “one of the top ten data mining mistakes” [7], leakage in data mining is essentially the introduction of information about the target of a data mining problem, which should not be legitimately available to mine from.
  • The introduction of this illegitimate information is usually unintentional and is facilitated by the data collection, aggregation, and preparation process.
  • It is usually subtle and indirect, making it very hard to detect and eliminate.
  • Even identifying leakage as the cause of a failure can be highly nontrivial.
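One concrete way such preparation-stage leakage arises is target-dependent feature encoding. The following toy sketch is illustrative only and does not appear in the paper; the data and function names are hypothetical. Encoding a categorical column by the mean of the target over all rows lets each row's own label seep into its feature:

```python
from collections import defaultdict

# Hedged toy example of preparation-stage leakage: target-mean encoding a
# categorical feature. The "full" variant averages over ALL rows, so each
# row's own label leaks into its feature; the leave-one-out variant removes
# that row's label before averaging.

rows = [("a", 1), ("a", 0), ("b", 1), ("b", 1)]  # hypothetical (category, label) pairs

def _totals(rows):
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in rows:
        sums[cat] += y
        counts[cat] += 1
    return sums, counts

def full_mean_encoding(rows):
    """Leaky: each row's encoding includes its own label."""
    sums, counts = _totals(rows)
    return [sums[cat] / counts[cat] for cat, _ in rows]

def leave_one_out_encoding(rows):
    """Safer: exclude the row's own label from its encoding."""
    sums, counts = _totals(rows)
    return [
        (sums[cat] - y) / (counts[cat] - 1) if counts[cat] > 1 else 0.0
        for cat, y in rows
    ]

print(full_mean_encoding(rows))      # → [0.5, 0.5, 1.0, 1.0]
print(leave_one_out_encoding(rows))  # → [0.0, 1.0, 1.0, 1.0]
```

The two encodings differ exactly where a row's own label leaked into its feature; in the leaky version a model can partially read the target back out of the encoded column.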
Highlights
  • Deemed “one of the top ten data mining mistakes” [7], leakage in data mining is essentially the introduction of information about the target of a data mining problem which should not be legitimately available to mine from.
  • It is worth noting that leakage in training examples is not limited to the explicit use of illegitimate examples in the training process.
  • It should be clear that modeling with leakage is undesirable on many levels: it is a source of poor generalization and overestimation of expected performance.
  • A rich set of examples from diverse data mining domains, given throughout this paper, adds to our own experience to suggest that, in the absence of a methodology for handling it, leakage could be the cause of many failures of data mining applications.
  • In this paper we have described leakage as an abstract property of the relationship between observational inputs and target instances, and have shown how it can be made concrete for various problems.
  • Problems with fixing leakage have been discussed as an area where further research is required.
Methods
  • The authors' suggested methodology for avoiding leakage is a two-stage process: tagging every observation with legitimacy tags during collection, and observing what the authors call a learn-predict separation.
  • At the most basic level, suitable for handling the more general case of leakage in training examples, legitimacy tags are ancillary data attached to every pair of observational input instance and target instance, sufficient for answering the question “is this observation legitimate for inferring this target?” under the problem's definition of legitimacy.
  • With this tagged version of the database it is possible, for every example being studied, to roll back the state of the database so that only legitimate observations remain visible.
    [Figure: legitimate vs. illegitimate observations under (a) a general separation, (b) a time separation, and (c) a separation in which only targets are illegitimate.]
  • To completely prevent leakage by design decisions, the modeler has to be careful not even to be exposed to information beyond the separation point; for this, the authors can only prescribe self-control.
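The two-stage methodology above can be sketched in code. This is a minimal illustration under assumptions not in the paper: the record types and fields (`Observation`, `Target`, `observed_at`, `predict_at`) are hypothetical, and legitimacy is taken to be the common time-based case, where an observation is legitimate for a target only if it was recorded no later than the target's prediction point.

```python
from dataclasses import dataclass

# Hedged sketch of the two-stage methodology: (1) every observation carries
# a legitimacy tag recorded at collection time; (2) training examples are
# built only through a function that enforces the learn-predict separation.

@dataclass
class Observation:
    feature: str
    value: float
    observed_at: int   # legitimacy tag: when this value became available

@dataclass
class Target:
    label: int
    predict_at: int    # the point in time the prediction must be made

def is_legitimate(obs: Observation, tgt: Target) -> bool:
    """Time-separation legitimacy: observed no later than prediction time."""
    return obs.observed_at <= tgt.predict_at

def build_example(observations, tgt):
    """Roll back the database state: keep only legitimate observations."""
    return {o.feature: o.value for o in observations if is_legitimate(o, tgt)}

obs = [
    Observation("purchases_so_far", 3.0, observed_at=10),
    Observation("refund_next_month", 1.0, observed_at=40),  # future information
]
tgt = Target(label=1, predict_at=20)

print(build_example(obs, tgt))  # → {'purchases_so_far': 3.0}
```

Making `build_example` the only path from raw observations to training examples turns the learn-predict separation into a property of the pipeline, rather than relying solely on the self-control the authors prescribe.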
Results
  • While clearly taking advantage of information from reviews given to titles during 2006, the final delivered model does not include any illegitimate feature.
Conclusion
  • It is worth noting that leakage in training examples is not limited to the explicit use of illegitimate examples in the training process.
  • Examples could be: (i) selecting or designing features that will have predictive power in deployment but don't show this power on training examples, (ii) algorithm or parametric model selection, and (iii) meta-parameter value choices.
  • This form of leakage is perhaps the most dangerous, as an evaluator may not be able to identify it even when she knows what she is looking for.
  • Problems with fixing leakage have been discussed as an area where further research is required.
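This kind of selection leakage can be made vivid with a deliberately extreme sketch, which is hypothetical and not from the paper: if the "model" is chosen by its score on the evaluation set, even information-free candidates can achieve a perfect reported score.

```python
from itertools import product

# Hedged sketch of leakage through meta-decisions: all 16 possible label
# assignments for a 4-example evaluation set serve as candidate "models".
# None encodes any real knowledge, yet picking the candidate with the best
# evaluation accuracy yields a perfect score -- a pure selection effect.

y_eval = [0, 1, 1, 0]                          # hypothetical evaluation labels
candidates = list(product([0, 1], repeat=4))   # 16 information-free predictors

def accuracy(pred, truth):
    """Fraction of positions where the prediction matches the truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Leaky model selection: the evaluation labels drive the choice itself.
best = max(candidates, key=lambda c: accuracy(c, y_eval))

print(accuracy(best, y_eval))  # → 1.0, although no candidate has any skill
```

Real model or meta-parameter selection is less blatant than this, but the mechanism is the same: once evaluation data influences any modeling choice, the evaluation score stops being an honest estimate of deployment performance.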
References
  • Hastie, T., Tibshirani, R. and Friedman, J. H. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer.
  • Kohavi, R., Brodley, C., Frasca, B., Mason, L., and Zheng, Z. 2000. KDD-Cup 2000 organizers' report: peeling the onion. ACM SIGKDD Explorations Newsletter. 2(2).
  • Kohavi, R. and Parekh, R. 2003. Ten supplementary analyses to improve e-commerce web sites. In Proceedings of the Fifth WEBKDD Workshop.
  • Kohavi, R., Mason, L., Parekh, R. and Zheng, Z. 2004. Lessons and challenges from mining retail e-commerce data. Machine Learning. 57(1-2).
  • Lo, A. W. and MacKinlay, A. C. 1990. Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies. 3(3) 431-467.
  • Narayanan, A., Shi, E., and Rubinstein, B. 2011. Link prediction by de-anonymization: how we won the Kaggle social network challenge. In Proceedings of the 2011 International Joint Conference on Neural Networks (IJCNN).
  • Nisbet, R., Elder, J. and Miner, G. 2009. Handbook of Statistical Analysis and Data Mining Applications. Academic Press.
  • Perlich, C., Melville, P., Liu, Y., Swirszcz, G., Lawrence, R., and Rosset, S. 2008. Breast cancer identification: KDD Cup winner's report. SIGKDD Explorations Newsletter. 10(2) 39-42.
  • Pyle, D. 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers.
  • Pyle, D. 2003. Business Modeling and Data Mining. Morgan Kaufmann Publishers.
  • Pyle, D. 2009. Data Mining: Know It All. Ch. 9. Morgan Kaufmann Publishers.
  • Rosset, S., Perlich, C. and Liu, Y. 2007. Making the most of your data: KDD-Cup 2007 “How Many Ratings” winner's report. ACM SIGKDD Explorations Newsletter. 9(2).
  • Rosset, S., Perlich, C., Swirszcz, G., Liu, Y., and Prem, M. 2010. Medical data mining: lessons from winning two competitions. Data Mining and Knowledge Discovery. 20(3) 439-468.
  • Tukey, J. 1977. Exploratory Data Analysis. Addison-Wesley.
  • Widmer, G. and Kubat, M. 1996. Learning in the presence of concept drift and hidden contexts. Machine Learning. 23(1).
  • Xie, J. and Coggeshall, S. 2010. Prediction of transfers to tertiary care and hospital mortality: a gradient boosting decision tree approach. Statistical Analysis and Data Mining. 3: 253-258.