Movatterモバイル変換

[0]ホーム

Jump to content

Boosting (machine learning)

Edit links

From Wikipedia, the free encyclopedia

Ensemble learning method

This articlemay be too technical for most readers to understand. Pleasehelp improve it tomake it understandable to non-experts, without removing the technical details.(September 2023) (Learn how and when to remove this message)

Machine learning anddata mining
Part of a series on
Paradigms Supervised learning Unsupervised learning Semi-supervised learning Self-supervised learning Reinforcement learning Meta-learning Online learning Batch learning Curriculum learning Rule-based learning Neuro-symbolic AI Neuromorphic engineering Quantum machine learning
Problems Classification Generative modeling Regression Clustering Dimensionality reduction Density estimation Anomaly detection Data cleaning AutoML Association rules Semantic analysis Structured prediction Feature engineering Feature learning Learning to rank Grammar induction Ontology learning Multimodal learning
Supervised learning (classification • regression) Apprenticeship learning Decision trees Ensembles Bagging Boosting Random forest k-NN Linear regression Naive Bayes Artificial neural networks Logistic regression Perceptron Relevance vector machine (RVM) Support vector machine (SVM)
Clustering BIRCH CURE Hierarchical k-means Fuzzy Expectation–maximization (EM) DBSCAN OPTICS Mean shift
Dimensionality reduction Factor analysis CCA ICA LDA NMF PCA PGD t-SNE SDL
Structured prediction Graphical models Bayes net Conditional random field Hidden Markov
Anomaly detection RANSAC k-NN Local outlier factor Isolation forest
Neural networks Autoencoder Deep learning Feedforward neural network Recurrent neural network LSTM GRU ESN reservoir computing Boltzmann machine Restricted GAN Diffusion model SOM Convolutional neural network U-Net LeNet AlexNet DeepDream Neural field Neural radiance field Physics-informed neural networks Transformer Vision Mamba Spiking neural network Memtransistor Electrochemical RAM (ECRAM)
Reinforcement learning Q-learning Policy gradient SARSA Temporal difference (TD) Multi-agent Self-play
Learning with humans Active learning Crowdsourcing Human-in-the-loop Mechanistic interpretability RLHF
Model diagnostics Coefficient of determination Confusion matrix Learning curve ROC curve
Mathematical foundations Kernel machines Bias–variance tradeoff Computational learning theory Empirical risk minimization Occam learning PAC learning Statistical learning VC theory Topological deep learning
Journals and conferences AAAI ECML PKDD NeurIPS ICML ICLR IJCAI ML JMLR
Related articles Glossary of artificial intelligence List of datasets for machine-learning research List of datasets in computer vision and image processing Outline of machine learning
v t e

Inmachine learning (ML),boosting is anensemble learning method that combines a set of less accurate models (called "weak learners") to create a single, highly accurate model (a "strong learner"). Unlike other ensemble methods that build models in parallel (such asbagging), boosting algorithms build models sequentially. Each new model in the sequence is trained to correct the errors made by its predecessors. This iterative process allows the overall model to improve its accuracy, particularly by reducingbias.^[1] Boosting is a popular and effective technique used insupervised learning for bothclassification andregression tasks.^[2]

The theoretical foundation for boosting came from a question posed byKearns andValiant (1988, 1989):^[3]^[4] "Can a set of weak learners create a single strong learner?" A weak learner is defined as aclassifier that performs only slightly better than random guessing, whereas a strong learner is a classifier that is highly correlated with the true classification.Robert Schapire's affirmative answer to this question in a 1990 paper led to the development of practical boosting algorithms.^[5]^[6] The first such algorithm was developed by Schapire, withFreund and Schapire later developingAdaBoost, which remains a foundational example of boosting.^[7]

Algorithms

[edit]

While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners' accuracy. After a weak learner is added, the data weights are readjusted, known as "re-weighting". Misclassified input data gain a higher weight and examples that are classified correctly lose weight.^{[note 1]} Thus, future weak learners focus more on the examples that previous weak learners misclassified.

An illustration presenting the intuition behind the boosting algorithm, consisting of the parallel learners and weighted dataset

There are many boosting algorithms. The original ones, proposed byRobert Schapire (arecursive majority gate formulation),^[8] andYoav Freund (boost by majority),^[9] were notadaptive and could not take full advantage of the weak learners. Schapire and Freund then developedAdaBoost, an adaptive boosting algorithm that won the prestigiousGödel Prize.

Only algorithms that are provable boosting algorithms in theprobably approximately correct learning formulation can accurately be calledboosting algorithms. Other algorithms that are similar in spirit^{[clarification needed]} to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms.^[9]

The main variation between many boosting algorithms is their method ofweighting training data points andhypotheses.AdaBoost is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses.^[10] There are many more recent algorithms such asLPBoost, TotalBoost,BrownBoost,xgboost, MadaBoost,LogitBoost,CatBoost and others. Many boosting algorithms fit into the AnyBoost framework,^[9] which shows that boosting performsgradient descent in afunction space using aconvex cost function.

Object categorization in computer vision

[edit]

Main article:Object categorization from image search

Given images containing various known objects in the world, a classifier can be learned from them to automaticallyclassify the objects in future images. Simple classifiers built based on someimage feature of the object tend to be weak in categorization performance. Using boosting methods for object categorization is a way to unify the weak classifiers in a special way to boost the overall ability of categorization.^{[citation needed]}

Problem of object categorization

[edit]

Object categorization is a typical task ofcomputer vision that involves determining whether or not an image contains some specific category of object. The idea is closely related with recognition, identification, and detection. Appearance based object categorization typically containsfeature extraction,learning aclassifier, and applying the classifier to new examples. There are many ways to represent a category of objects, e.g. fromshape analysis,bag of words models, or local descriptors such asSIFT, etc. Examples ofsupervised classifiers areNaive Bayes classifiers,support vector machines,mixtures of Gaussians, andneural networks. However, research^[which?] has shown that object categories and their locations in images can be discovered in anunsupervised manner as well.^[11]

Status quo for object categorization

[edit]

The recognition of object categories in images is a challenging problem incomputer vision, especially when the number of categories is large. This is due to high intra class variability and the need for generalization across variations of objects within the same category. Objects within one category may look quite different. Even the same object may appear unalike under different viewpoint,scale, andillumination. Background clutter and partial occlusion add difficulties to recognition as well.^[12] Humans are able to recognize thousands of object types, whereas most of the existingobject recognition systems are trained to recognize only a few,^[quantify] e.g.human faces,cars, simple objects, etc.^[13]^{[needs update?]} Research has been very active on dealing with more categories and enabling incremental additions of new categories, and although the general problem remains unsolved, several multi-category objects detectors (for up to hundreds or thousands of categories^[14]) have been developed. One means is byfeature sharing and boosting.

Boosting for binary categorization

[edit]

AdaBoost can be used for face detection as an example ofbinary categorization. The two categories are faces versus background. The general algorithm is as follows:

Form a large set of simple features
Initialize weights for training images
For T rounds
1. Normalize the weights
2. For available features from the set, train a classifier using a single feature and evaluate the training error
3. Choose the classifier with the lowest error
4. Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly
Form the final strong classifier as the linear combination of the T classifiers (coefficient larger if training error is small)

After boosting, a classifier constructed from 200 features could yield a 95% detection rate under a $10^{-5}$ false positive rate.^[15]

Another application of boosting for binary categorization is a system that detects pedestrians usingpatterns of motion and appearance.^[16] This work is the first to combine both motion information and appearance information as features to detect a walking person. It takes a similar approach to theViola-Jones object detection framework.

Boosting for multi-class categorization

[edit]

Compared with binary categorization,multi-class categorization looks for common features that can be shared across the categories at the same time. They turn to be more genericedge like features. During learning, the detectors for each category can be trained jointly. Compared with training separately, itgeneralizes better, needs less training data, and requires fewer features to achieve the same performance.

The main flow of the algorithm is similar to the binary case. What is different is that a measure of the joint training error shall be defined in advance. During each iteration the algorithm chooses a classifier of a single feature (features that can be shared by more categories shall be encouraged). This can be done via convertingmulti-class classification into a binary one (a set of categories versus the rest),^[17] or by introducing a penalty error from the categories that do not have the feature of the classifier.^[18]

In the paper "Sharing visual features for multiclass and multiview object detection", A. Torralba et al. usedGentleBoost for boosting and showed that when training data is limited, learning via sharing features does a much better job than no sharing, given same boosting rounds. Also, for a given performance level, the total number of features required (and therefore the run time cost of the classifier) for the feature sharing detectors, is observed to scale approximatelylogarithmically with the number of class, i.e., slower thanlinear growth in the non-sharing case. Similar results are shown in the paper "Incremental learning of object detectors using a visual shape alphabet", yet the authors usedAdaBoost for boosting.

Convex vs. non-convex boosting algorithms

[edit]

Boosting algorithms can be based onconvex or non-convex optimization algorithms. Convex algorithms, such asAdaBoost andLogitBoost, can be "defeated" by random noise such that they can't learn basic and learnable combinations of weak hypotheses.^[19]^[20] This limitation was pointed out by Long & Servedio in 2008. However, by 2009, multiple authors demonstrated that boosting algorithms based on non-convex optimization, such asBrownBoost, can learn from noisy datasets and can specifically learn the underlying classifier of the Long–Servedio dataset.

Implementations

[edit]

scikit-learn, an open source machine learning library forPython
Orange, a free data mining software suite, moduleOrange.ensemble
Weka is a machine learning set of tools that offers variate implementations of boosting algorithms like AdaBoost and LogitBoost
R packageGBM (Generalized Boosted Regression Models) implements extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine.
jboost; AdaBoost, LogitBoost, RobustBoost, Boostexter and alternating decision trees
R packageadabag: Applies Multiclass AdaBoost.M1, AdaBoost-SAMME and Bagging
R packagexgboost: An implementation of gradient boosting for linear and tree-based models.

Notes

[edit]

^Some boosting-based classification algorithms actually decrease the weight of repeatedly misclassified examples; for example boost by majority andBrownBoost.

References

[edit]

^Leo Breiman (1996)."BIAS, VARIANCE, AND ARCING CLASSIFIERS"(PDF). TECHNICAL REPORT. Archived fromthe original(PDF) on 2015-01-19. Retrieved19 January 2015.Arcing [Boosting] is more successful than bagging in variance reduction
^Zhou Zhi-Hua (2012).Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC. p. 23.ISBN 978-1439830031.The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners
^Michael Kearns(1988);Thoughts on Hypothesis Boosting, Unpublished manuscript (Machine Learning class project, December 1988)
^Michael Kearns;Leslie Valiant (1989). "Cryptographic limitations on learning Boolean formulae and finite automata".Proceedings of the twenty-first annual ACM symposium on Theory of computing - STOC '89. Vol. 21. ACM. pp. 433–444.doi:10.1145/73007.73049.ISBN 978-0897913072.S2CID 536357.
^Schapire, Robert E. (1990)."The Strength of Weak Learnability"(PDF).Machine Learning.5 (2):197–227.CiteSeerX 10.1.1.20.723.doi:10.1007/bf00116037.S2CID 53304535. Archived fromthe original(PDF) on 2012-10-10. Retrieved2012-08-23.
^Leo Breiman (1998)."Arcing classifier (with discussion and a rejoinder by the author)".Ann. Stat.26 (3):801–849.doi:10.1214/aos/1024691079.Schapire (1990) proved that boosting is possible. (Page 823)
^Yoav Freund and Robert E. Schapire (1997);A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences, 55(1):119-139
^Schapire, Robert E. (1990)."The Strength of Weak Learnability"(PDF).Machine Learning.5 (2):197–227.CiteSeerX 10.1.1.20.723.doi:10.1007/bf00116037.S2CID 53304535. Archived fromthe original(PDF) on 2012-10-10. Retrieved2012-08-23.
^^a ^b ^cLlew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean (2000);Boosting Algorithms as Gradient Descent, inS. A. Solla, T. K. Leen, and K.-R. Muller, editors,Advances in Neural Information Processing Systems 12, pp. 512-518, MIT Press
^Emer, Eric."Boosting (AdaBoost algorithm)"(PDF).MIT.Archived(PDF) from the original on 2022-10-09. Retrieved2018-10-10.
^Sivic, Russell, Efros, Freeman & Zisserman, "Discovering objects and their location in images", ICCV 2005
^A. Opelt, A. Pinz, et al., "Generic Object Recognition with Boosting", IEEE Transactions on PAMI 2006
^M. Marszalek, "Semantic Hierarchies for Visual Object Recognition", 2007
^"Large Scale Visual Recognition Challenge". December 2017.
^P. Viola, M. Jones, "Robust Real-time Object Detection", 2001
^Viola, P.; Jones, M.; Snow, D. (2003).Detecting Pedestrians Using Patterns of Motion and Appearance(PDF). ICCV.Archived(PDF) from the original on 2022-10-09.
^A. Torralba, K. P. Murphy, et al., "Sharing visual features for multiclass and multiview object detection", IEEE Transactions on PAMI 2006
^A. Opelt, et al., "Incremental learning of object detectors using a visual shape alphabet", CVPR 2006
^P. Long and R. Servedio. 25th International Conference on Machine Learning (ICML), 2008, pp. 608--615.
^Long, Philip M.; Servedio, Rocco A. (March 2010)."Random classification noise defeats all convex potential boosters"(PDF).Machine Learning.78 (3):287–304.doi:10.1007/s10994-009-5165-z.S2CID 53861.Archived(PDF) from the original on 2022-10-09. Retrieved2015-11-17.