
Leakage (machine learning)

From Wikipedia, the free encyclopedia
Concept in machine learning
"Data leakage" redirects here. For the unauthorized exposure, disclosure, or loss of personal information, seeData breach.

In statistics and machine learning, leakage (also known as data leakage or target leakage) refers to the use of information during model training that would not be available at prediction time. This results in overly optimistic performance estimates, as the model appears to perform better during evaluation than it actually would in a production environment.[1]

Leakage is often subtle and indirect, making it difficult to detect and eliminate. It can lead a statistician or modeler to select a suboptimal model, which may be outperformed by a leakage-free alternative.[1]

Leakage modes


Leakage can occur at multiple stages of the machine learning workflow. Broadly, its sources can be divided into two categories: those arising from features and those arising from training examples.[1]

Feature leakage


Feature or column-wise leakage is caused by the inclusion of columns that are a duplicate of the label, a proxy for the label, or the label itself. These features, known as anachronisms, will not be available when the model is used for prediction, and including them when the model is trained results in leakage.[2]

For example, including a "MonthlySalary" column when predicting "YearlySalary", or a "MinutesLate" column when predicting "IsLate", leaks the target into the features.
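
The effect can be demonstrated with a small synthetic sketch: a hypothetical delivery dataset in which a "minutes_late" column fully determines the "is_late" label. All column names and data below are illustrative assumptions, not drawn from the literature.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 1000
    minutes_late = rng.exponential(scale=5.0, size=n)
    df = pd.DataFrame({
        "distance_km": rng.uniform(1, 50, size=n),    # legitimate feature
        "minutes_late": minutes_late,                 # anachronism: determines the label
        "is_late": (minutes_late > 5.0).astype(int),  # label
    })

    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns="is_late"), df["is_late"], random_state=0)

    leaky = LogisticRegression().fit(X_train, y_train)
    clean = LogisticRegression().fit(X_train.drop(columns="minutes_late"), y_train)

    # The leaky model scores near 1.0; the clean model scores no better than
    # the majority class, which is the honest estimate for this feature set.
    print(leaky.score(X_test, y_test))
    print(clean.score(X_test.drop(columns="minutes_late"), y_test))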

Training example leakage


Row-wise leakage is caused by improper sharing of information between rows of data. Types of row-wise leakage include:

  • Premature featurization: applying feature transformations (such as min-max scaling or n-gram extraction) before the cross-validation or train/test split. The transformer must be fit on the training split only and then used to transform the test set, as in the sketch after this list.[3]
  • Duplicate rows between train/validation/test sets (for example, oversampling a dataset to pad its size before splitting; different rotations or augmentations of a single image; bootstrap sampling before splitting; or duplicating rows to upsample the minority class)
  • Non-independent and identically distributed (non-IID) data
    • Time leakage (for example, splitting a time-series dataset randomly instead of placing newer data in the test set, using a chronological train/test split or rolling-origin cross-validation)
    • Group leakage: not splitting on a grouping column (for example, Andrew Ng's group had 100k X-rays of 30k patients, i.e. roughly 3 images per patient; the paper used random splitting instead of ensuring that all images of a patient were in the same split, so the model partially memorized the patients instead of learning to recognize pneumonia in chest X-rays[4][5])
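
A minimal scikit-learn sketch of the first item: fitting a scaler on the full dataset before cross-validation leaks test-fold statistics into training, whereas a Pipeline refits the scaler inside each fold. The grouped split in the final comment addresses group leakage; the patient IDs there are a hypothetical placeholder.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_classification(n_samples=500, random_state=0)

    # Wrong: the scaler sees every row, including future test folds.
    X_leaky = MinMaxScaler().fit_transform(X)
    leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

    # Right: the scaler is refit on the training fold of each split.
    pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
    clean_scores = cross_val_score(pipe, X, y, cv=5)

    # For group leakage (e.g. several X-rays per patient), split by group so
    # that no patient appears in both training and test folds:
    # cross_val_score(pipe, X, y, groups=patient_ids, cv=GroupKFold(n_splits=5))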

A 2023 review found data leakage to be "a widespread failure mode in machine-learning (ML)-based science", having affected at least 294 academic publications across 17 disciplines, and causing a potential reproducibility crisis.[6]

Detection


Data leakage in machine learning can be detected through various methods, focusing on performance analysis, feature examination, data auditing, and model behavior analysis. On the performance side, unusually high accuracy or a large discrepancy between training and test results often indicates leakage.[7] Inconsistent cross-validation outcomes may also signal issues.

Feature examination involves scrutinizing feature importance rankings and ensuring temporal integrity in time series data. A thorough audit of the data pipeline is crucial, reviewing pre-processing steps, feature engineering, and data splitting processes.[8] Detecting duplicate entries across dataset splits is also important.
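
One of the audit steps above, detecting duplicate entries across splits, can be sketched by hashing rows with pandas; the train and test DataFrames here are hypothetical placeholders:

    import pandas as pd

    def rows_shared_with_train(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
        """Return the test rows that also appear, byte-for-byte, in train."""
        train_hashes = set(pd.util.hash_pandas_object(train, index=False))
        test_hashes = pd.util.hash_pandas_object(test, index=False)
        return test[test_hashes.isin(train_hashes).to_numpy()]

    # Any non-empty result indicates row duplication across the split.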

For language models, the Min-K% method can detect the presence of data in a pretraining dataset. Given a sentence suspected to be in the pretraining dataset, it computes the log-likelihood of each token and then averages the lowest K% of these values. If this average exceeds a threshold, the sentence is likely present.[9][10] The method is improved by calibrating each token's log-likelihood against the mean and variance of the model's distribution.[11]
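
A sketch of the Min-K% score using Hugging Face transformers, with GPT-2 as a stand-in model; the decision threshold is model-specific and omitted here.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def min_k_percent_score(text: str, k: float = 0.2) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Log-probability the model assigns to each actually observed next token.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
        # Average the lowest k fraction of token log-likelihoods.
        n_lowest = max(1, int(k * token_lp.numel()))
        return token_lp.topk(n_lowest, largest=False).values.mean().item()

    # Higher (less negative) scores suggest the text may have been seen in
    # pretraining; the cutoff must be calibrated per model.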

Analyzing model behavior can reveal leakage. Models relying heavily on counter-intuitive features or showing unexpected prediction patterns warrant investigation. Performance degradation over time when tested on new data may suggest earlier inflated metrics due to leakage.

Advanced techniques include backward feature elimination, where suspicious features are temporarily removed to observe performance changes. Using a separate hold-out dataset for final validation before deployment is advisable.[8]
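
A sketch of that elimination loop, assuming a scikit-learn estimator and a pandas feature matrix; the 0.05 score-drop threshold is an arbitrary assumption:

    from sklearn.base import clone
    from sklearn.model_selection import cross_val_score

    def leakage_suspects(model, X, y, features, threshold=0.05):
        """Return features whose removal drops mean CV accuracy by more than threshold."""
        baseline = cross_val_score(clone(model), X, y, cv=5).mean()
        suspects = {}
        for col in features:
            score = cross_val_score(clone(model), X.drop(columns=col), y, cv=5).mean()
            if baseline - score > threshold:
                # A suspiciously large drop means the model leaned heavily on
                # this column; inspect it for proxy-of-the-label leakage.
                suspects[col] = baseline - score
        return suspects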


References

  1. ^ a b c Shachar Kaufman; Saharon Rosset; Claudia Perlich (January 2011). "Leakage in data mining: Formulation, detection, and avoidance". Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vol. 6. pp. 556–563. doi:10.1145/2020408.2020496. ISBN 978-1-4503-0813-7. S2CID 9168804. Retrieved 13 January 2020.
  2. ^ Soumen Chakrabarti (2008). "9". Data Mining: Know It All. Morgan Kaufmann Publishers. p. 383. ISBN 978-0-12-374629-0. "Anachronistic variables are a pernicious mining problem. However, they aren't any problem at all at deployment time—unless someone expects the model to work! Anachronistic variables are out of place in time. Specifically, at data modeling time, they carry information back from the future to the past."
  3. ^ Moscovich, Amit; Rosset, Saharon (September 2022). "On the Cross-Validation Bias due to Unsupervised Preprocessing". Journal of the Royal Statistical Society Series B: Statistical Methodology. 84 (4): 1474–1502. arXiv:1901.08974. doi:10.1111/rssb.12537.
  4. ^ Guts, Yuriy (30 October 2018). Target Leakage in Machine Learning (Talk). AI Ukraine Conference. Ukraine – via YouTube.
  5. ^ Nick, Roberts (16 November 2017). "Replying to @AndrewYNg @pranavrajpurkar and 2 others". Brooklyn, NY, USA: Twitter. Archived from the original on 10 June 2018. Retrieved 13 January 2020. "Replying to @AndrewYNg @pranavrajpurkar and 2 others ... Were you concerned that the network could memorize patient anatomy since patients cross train and validation? 'ChestX-ray14 dataset contains 112,120 frontal-view X-ray images of 30,805 unique patients. We randomly split the entire dataset into 80% training, and 20% validation.'"
  6. ^ Kapoor, Sayash; Narayanan, Arvind (August 2023). "Leakage and the reproducibility crisis in machine-learning-based science". Patterns. 4 (9): 100804. doi:10.1016/j.patter.2023.100804. ISSN 2666-3899. PMC 10499856. PMID 37720327.
  7. ^ Batutin, Andrew (20 June 2024). "Data Leakage in Machine Learning Models". Shelf. Retrieved 18 October 2024.
  8. ^ a b "What is Data Leakage in Machine Learning? | IBM". www.ibm.com. 30 September 2024. Retrieved 18 October 2024.
  9. ^ Shi, Weijia; Ajith, Anirudh; Xia, Mengzhou; Huang, Yangsibo; Liu, Daogao; Blevins, Terra; Chen, Danqi; Zettlemoyer, Luke (2023). "Detecting Pretraining Data from Large Language Models". arXiv:2310.16789 [cs.CL].
  10. ^ "Detecting Pretraining Data from Large Language Models". swj0419.github.io. Retrieved 10 March 2025.
  11. ^ Zhang, Jingyang; Sun, Jingwei; Yeats, Eric; Ouyang, Yang; Kuo, Martin; Zhang, Jianyi; Yang, Hao Frank; Li, Hai (2024). "Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models". arXiv:2404.02936 [cs.CL].