
Data augmentation

From Wikipedia, the free encyclopedia

Data analysis technique



Data augmentation is a statistical technique that allows maximum likelihood estimation from incomplete data.^[1]^[2] It has important applications in Bayesian analysis,^[3] and the technique is widely used in machine learning to reduce overfitting when training models,^[4] which is achieved by training on several slightly modified copies of existing data.

Synthetic oversampling techniques for traditional machine learning


Main article: Oversampling and undersampling in data analysis § Oversampling techniques for classification problems

Synthetic Minority Over-sampling Technique (SMOTE) is a method used to address imbalanced datasets in machine learning. In such datasets, the number of samples in different classes varies significantly, leading to biased model performance. For example, in a medical diagnosis dataset with 90 samples representing healthy individuals and only 10 samples representing individuals with a particular disease, traditional algorithms may struggle to accurately classify the minority class. SMOTE rebalances the dataset by generating synthetic samples for the minority class: it randomly selects a minority class sample, picks one of its nearest minority-class neighbors, and generates a new sample on the line segment joining the two. Repeating this process increases the representation of the minority class, which can improve model performance.^[5]
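The interpolation step described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `smote`, its parameters, and the toy data are assumptions made for this example, and plain Python lists are used so that the sketch has no dependencies.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Return n_new synthetic points built from minority-class feature vectors.

    Hypothetical helper for illustration: each new point lies on the segment
    between a random minority sample and one of its k nearest minority neighbors.
    """
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbors of x (excluding x itself), Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy minority class with two features per sample (made-up values)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2, rng=random.Random(0))
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the region spanned by the minority class. For the 90/10 example, 80 synthetic samples would be needed to balance the two classes.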

Data augmentation for image classification


When convolutional neural networks grew larger in the mid-1990s, there was a lack of data to train them, especially considering that part of the overall dataset had to be set aside for later testing. It was proposed to perturb existing data with affine transformations to create new examples with the same labels.^[6] These were complemented by so-called elastic distortions in 2003,^[7] and the technique was widely used by the 2010s.^[8] Data augmentation can enhance CNN performance and has also been used in CNN-based profiling attacks to overcome jitter-based countermeasures.^[9]

Data augmentation has become fundamental in image classification, enriching training dataset diversity to improve model generalization and performance. The evolution of this practice has introduced a broad spectrum of techniques, including geometric transformations, color space adjustments, and noise injection.^[10]

Geometric Transformations


Geometric transformations alter the spatial properties of images to simulate different perspectives, orientations, and scales. Common techniques include:

- Affine Transformation
  - Rotation: Rotating images by a specified degree to help models recognize objects at various angles.
  - Reflection: Reflecting images horizontally or vertically to introduce variability in orientation.
  - Translation: Shifting images in different directions to teach models positional invariance.
  - Scaling
  - Shear Mapping
- Cropping: Removing sections of the image to focus on particular features or simulate closer views.
- Elastic Distortion^[7]
- Morphing within the same class: Generating new samples by applying morphing techniques between two images belonging to the same class, thereby increasing intra-class diversity.^[11]
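A few of the transformations listed above can be sketched on a tiny grayscale image stored as a list of rows. The helper names (`hflip`, `rotate90`, `translate`, `crop`) are assumptions made for this illustration. Only axis-aligned cases are shown, while arbitrary-angle rotation, scaling, and shear would also need pixel interpolation, which is normally left to an image library.

```python
def hflip(img):
    """Reflect the image horizontally (mirror left-right)."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def translate(img, dx, dy, fill=0):
    """Shift by dx columns (right) and dy rows (down); vacated pixels get `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def crop(img, top, left, height, width):
    """Keep only the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


# 3x3 toy image; each augmented copy would keep the label of the original
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
flipped = hflip(image)           # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
rotated = rotate90(image)        # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
shifted = translate(image, 1, 0) # [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
window = crop(image, 0, 1, 2, 2) # [[2, 3], [5, 6]]
```

In a training pipeline, these functions would typically be applied with randomly chosen parameters each time an image is loaded, so that the model rarely sees exactly the same input twice.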

Color Space Transformations


Color space transformations modify the color properties of images, addressing variations in lighting, color saturation, and contrast. Techniques include:

- Brightness Adjustment: Varying the image's brightness to simulate different lighting conditions.
- Contrast Adjustment: Changing the contrast to help models recognize objects under various clarity levels.
- Saturation Adjustment: Altering saturation to prepare models for images with diverse color intensities.
- Color Jittering: Randomly adjusting brightness, contrast, saturation, and hue to introduce color variability.
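The adjustments above can be sketched per pixel using Python's standard `colorsys` module. Everything here is an illustrative assumption rather than any library's definition: pixels are `(r, g, b)` tuples of floats in [0, 1], contrast is scaled around a fixed mid-gray of 0.5 (image libraries often use the image's mean intensity instead), and the helper names and the `strength` parameter are made up for this example.

```python
import colorsys
import random


def clamp(v):
    """Keep a channel value inside [0, 1]."""
    return max(0.0, min(1.0, v))


def adjust_brightness(pixel, factor):
    """Scale every channel; factor > 1 brightens, factor < 1 darkens."""
    return tuple(clamp(c * factor) for c in pixel)


def adjust_contrast(pixel, factor, pivot=0.5):
    """Push channels away from (factor > 1) or toward (factor < 1) the pivot."""
    return tuple(clamp(pivot + factor * (c - pivot)) for c in pixel)


def adjust_saturation_hue(pixel, sat_factor=1.0, hue_shift=0.0):
    """Scale saturation and rotate hue in HSV space, then convert back to RGB."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    return colorsys.hsv_to_rgb((h + hue_shift) % 1.0, clamp(s * sat_factor), v)


def color_jitter(img, rng, strength=0.2):
    """Draw one random set of factors and apply it to every pixel of the image."""
    b = 1.0 + rng.uniform(-strength, strength)
    c = 1.0 + rng.uniform(-strength, strength)
    s = 1.0 + rng.uniform(-strength, strength)
    h = rng.uniform(-strength, strength) / 4.0  # keep hue rotation small

    def jitter(pixel):
        pixel = adjust_brightness(pixel, b)
        pixel = adjust_contrast(pixel, c)
        return adjust_saturation_hue(pixel, s, h)

    return [[jitter(p) for p in row] for row in img]


# 2x2 toy RGB image (made-up values)
image = [[(0.2, 0.4, 0.6), (0.9, 0.1, 0.1)],
         [(0.5, 0.5, 0.5), (0.0, 0.8, 0.3)]]
jittered = color_jitter(image, random.Random(0))
```

The factors are drawn once per image rather than once per pixel, so the whole image shifts consistently, as it would under a change of lighting or camera settings.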

Noise Injection


Injecting noise into images simulates real-world imperfections, teaching models to ignore irrelevant variations. Techniques involve:

- Gaussian Noise: Adding Gaussian noise mimics sensor noise or graininess.
- Salt and Pepper Noise: Introducing black or white pixels at random simulates sensor dust or dead pixels.
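Both kinds of noise can be sketched on a grayscale image stored as rows of floats in [0, 1]. The function names and the `sigma` and `amount` parameters are assumptions chosen for this illustration.

```python
import random


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each pixel."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def add_salt_and_pepper(img, amount, rng):
    """Replace roughly a fraction `amount` of pixels with black (0) or white (1)."""
    out = []
    for row in img:
        new_row = []
        for p in row:
            if rng.random() < amount:
                new_row.append(rng.choice((0.0, 1.0)))  # pepper or salt
            else:
                new_row.append(p)
        out.append(new_row)
    return out


rng = random.Random(0)
image = [[0.5] * 8 for _ in range(8)]  # flat mid-gray 8x8 toy image
grainy = add_gaussian_noise(image, sigma=0.1, rng=rng)
speckled = add_salt_and_pepper(image, amount=0.05, rng=rng)
```

Values are clipped back to [0, 1] after adding Gaussian noise, so the result remains a valid image. The noise level is usually kept small enough that the label of the image is still unambiguous.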

Data augmentation for signal processing


Residual or block bootstrap can be used for time series augmentation.
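As a rough illustration of the block variant, the sketch below implements a moving block bootstrap: overlapping blocks of consecutive observations are drawn with replacement and concatenated until the new series is as long as the original. The function name and the `block_len` parameter are assumptions for this example. A residual bootstrap would instead fit a model to the series and resample the model's residuals, which is not shown here.

```python
import random


def moving_block_bootstrap(series, block_len, rng):
    """Build a synthetic series of the same length from randomly chosen blocks."""
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    starts = range(n - block_len + 1)  # every valid block start index
    out = []
    while len(out) < n:
        s = rng.choice(starts)
        out.extend(series[s:s + block_len])  # copy one block of consecutive values
    return out[:n]  # trim the overshoot from the last block


# Toy series with a trend (made-up values)
series = [0.1 * t for t in range(20)]
augmented = moving_block_bootstrap(series, block_len=5, rng=random.Random(0))
```

Resampling whole blocks rather than single points preserves the short-range dependence inside each block, which is why the block length is usually chosen to be at least as long as the correlations one wants to keep.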

Biological signals


Synthetic data augmentation is of paramount importance for machine learning classification, particularly for biological data, which tend to be high-dimensional and scarce. The applications of robotic control and augmentation in disabled and able-bodied subjects still rely mainly on subject-specific analyses. Data scarcity is notable in signal processing problems such as Parkinson's disease electromyography (EMG) signals, which are difficult to source. Zanini et al. noted that it is possible to use a generative adversarial network (in particular, a DCGAN) to perform style transfer in order to generate synthetic electromyographic signals that correspond to those exhibited by people with Parkinson's disease.^[12]

These approaches are also important in electroencephalography (brainwaves). Wang et al. explored the use of deep convolutional neural networks for EEG-based emotion recognition; their results show that emotion recognition improved when data augmentation was used.^[13]

A common approach is to generate synthetic signals by rearranging components of real data. Lotte^[14] proposed a method of "Artificial Trial Generation Based on Analogy", in which three data examples $x_{1},x_{2},x_{3}$ are used to form an artificial example $x_{synthetic}$ that is to $x_{3}$ what $x_{2}$ is to $x_{1}$. A transformation is applied to $x_{1}$ to make it more similar to $x_{2}$; the same transformation is then applied to $x_{3}$, which generates $x_{synthetic}$. This approach was shown to improve the performance of a linear discriminant analysis classifier on three different datasets.

Current research shows that relatively simple techniques can have a considerable impact. For example, Freer^[15] observed that introducing noise into gathered data to form additional data points improved the learning ability of several models that otherwise performed relatively poorly. Tsinganos et al.^[16] studied magnitude warping, wavelet decomposition, and synthetic surface EMG models (generative approaches) for hand gesture recognition, finding classification performance increases of up to 16% when augmented data was introduced during training. More recently, data augmentation studies have begun to focus on deep learning, specifically on the ability of generative models to create artificial data that is then introduced during training of the classification model. In 2018, Luo et al.^[17] observed that useful EEG signal data could be generated by conditional Wasserstein generative adversarial networks (GANs) and then added to the training set in a classical train-test learning framework. The authors found that classification performance improved when such techniques were introduced.

Mechanical signals


The prediction of mechanical signals based on data augmentation enables a new generation of technological innovations in fields such as new energy dispatch, 5G communication, and robotics control engineering.^[18] In 2022, Yang et al.^[18] integrated constraints, optimization, and control into a deep network framework based on data augmentation and data pruning with spatio-temporal data correlation, improving the interpretability, safety, and controllability of deep learning in real industrial projects through explicit mathematical programming equations and analytical solutions.

References


1. Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data Via the EM Algorithm". Journal of the Royal Statistical Society. Series B (Methodological). 39 (1): 1–22. doi:10.1111/j.2517-6161.1977.tb01600.x. Archived from the original on 2022-10-10. Retrieved 2024-08-28.
2. Rubin, Donald (1987). "Comment: The Calculation of Posterior Distributions by Data Augmentation". Journal of the American Statistical Association. 82 (398). doi:10.2307/2289460. JSTOR 2289460. Archived from the original on 2024-08-07. Retrieved 2024-08-28.
3. Jackman, Simon (2009). Bayesian Analysis for the Social Sciences. John Wiley & Sons. p. 236. ISBN 978-0-470-01154-6.
4. Shorten, Connor; Khoshgoftaar, Taghi M. (2019). "A survey on Image Data Augmentation for Deep Learning". Journal of Big Data. 6 (1): 60. doi:10.1186/s40537-019-0197-0.
5. Wang, Shujuan; Dai, Yuntao; Shen, Jihong; Xuan, Jingxue (2021-12-15). "Research on expansion and classification of imbalanced data based on SMOTE algorithm". Scientific Reports. 11 (1): 24039. Bibcode:2021NatSR..1124039W. doi:10.1038/s41598-021-03430-5. ISSN 2045-2322. PMC 8674253. PMID 34912009.
6. Yann LeCun; et al. (1995). Learning algorithms for classification: A comparison on handwritten digit recognition (Conference paper). World Scientific. pp. 261–276. Retrieved 14 May 2023.
7. Simard, P.Y.; Steinkraus, D.; Platt, J.C. (2003). "Best practices for convolutional neural networks applied to visual document analysis". Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. Vol. 1. pp. 958–963. doi:10.1109/ICDAR.2003.1227801. ISBN 0-7695-1960-1. S2CID 4659176.
8. Hinton, Geoffrey E.; Srivastava, Nitish; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan R. (2012). "Improving neural networks by preventing co-adaptation of feature detectors". arXiv:1207.0580 [cs.NE].
9. Cagli, Eleonora; Dumas, Cécile; Prouff, Emmanuel (2017). "Convolutional Neural Networks with Data Augmentation Against Jitter-Based Countermeasures: Profiling Attacks Without Pre-processing". In Fischer, Wieland; Homma, Naofumi (eds.). Cryptographic Hardware and Embedded Systems – CHES 2017. Lecture Notes in Computer Science. Vol. 10529. Cham: Springer International Publishing. pp. 45–68. doi:10.1007/978-3-319-66787-4_3. ISBN 978-3-319-66787-4. S2CID 54088207.
10. Shorten, Connor; Khoshgoftaar, Taghi M. (2019-07-06). "A survey on Image Data Augmentation for Deep Learning". Journal of Big Data. 6 (1): 60. doi:10.1186/s40537-019-0197-0. ISSN 2196-1115.
11. Ghorbel, Emna; Ghorbel, Faouzi (2024-06-01). "Data augmentation based on shape space exploration for low-size datasets: application to 2D shape classification". Neural Computing and Applications. 36 (17): 10031–10054. doi:10.1007/s00521-024-09798-5. ISSN 1433-3058.
12. Anicet Zanini, Rafael; Luna Colombini, Esther (2020). "Parkinson's Disease EMG Data Augmentation and Simulation with DCGANs and Style Transfer". Sensors. 20 (9): 2605. Bibcode:2020Senso..20.2605A. doi:10.3390/s20092605. ISSN 1424-8220. PMC 7248755. PMID 32375217.
13. Wang, Fang; Zhong, Sheng-hua; Peng, Jianfeng; Jiang, Jianmin; Liu, Yan (2018). "Data Augmentation for EEG-Based Emotion Recognition with Deep Convolutional Neural Networks". MultiMedia Modeling. Lecture Notes in Computer Science. Vol. 10705. pp. 82–93. doi:10.1007/978-3-319-73600-6_8. ISBN 978-3-319-73599-3. ISSN 0302-9743.
14. Lotte, Fabien (2015). "Signal Processing Approaches to Minimize or Suppress Calibration Time in Oscillatory Activity-Based Brain–Computer Interfaces" (PDF). Proceedings of the IEEE. 103 (6): 871–890. doi:10.1109/JPROC.2015.2404941. ISSN 0018-9219. S2CID 22472204. Archived (PDF) from the original on 2023-04-03. Retrieved 2022-11-05.
15. Freer, Daniel; Yang, Guang-Zhong (2020). "Data augmentation for self-paced motor imagery classification with C-LSTM". Journal of Neural Engineering. 17 (1): 016041. Bibcode:2020JNEng..17a6041F. doi:10.1088/1741-2552/ab57c0. hdl:10044/1/75376. ISSN 1741-2552. PMID 31726440. S2CID 208034533.
16. Tsinganos, Panagiotis; Cornelis, Bruno; Cornelis, Jan; Jansen, Bart; Skodras, Athanassios (2020). "Data Augmentation of Surface Electromyography for Hand Gesture Recognition". Sensors. 20 (17): 4892. Bibcode:2020Senso..20.4892T. doi:10.3390/s20174892. ISSN 1424-8220. PMC 7506981. PMID 32872508.
17. Luo, Yun; Lu, Bao-Liang (2018). "EEG Data Augmentation for Emotion Recognition Using a Conditional Wasserstein GAN". 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Vol. 2018. pp. 2535–2538. doi:10.1109/EMBC.2018.8512865. ISBN 978-1-5386-3646-6. PMID 30440924. S2CID 53105445.
18. Yang, Yang (2022). "Wind speed forecasting with correlation network pruning and augmentation: A two-phase deep learning method". Renewable Energy. 198 (1): 267–282. arXiv:2306.01986. Bibcode:2022REne..198..267Y. doi:10.1016/j.renene.2022.07.125. ISSN 0960-1481. S2CID 251511199.
