AdaBoost

From Wikipedia, the free encyclopedia
Adaptive boosting based classification algorithm

AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values.[1][2]

AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner.

Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model.[3]

Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier.[4][5] When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples.
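As a minimal sketch of this setup, the following uses scikit-learn's AdaBoostClassifier with depth-1 decision trees (decision stumps) as the weak learners; the synthetic dataset and hyperparameter values are arbitrary choices for illustration:

```python
# Minimal sketch: AdaBoost with decision-stump weak learners in scikit-learn.
# The dataset and hyperparameters are arbitrary, chosen only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth=1 gives decision stumps, the classic weak learner for AdaBoost.
# (In scikit-learn versions before 1.2 the argument is named base_estimator.)
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```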

Training


AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form

$$F_T(x) = \sum_{t=1}^{T} f_t(x)$$

where each $f_t$ is a weak learner that takes an object $x$ as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification.

Each weak learner produces an output hypothesis $h$ which fixes a prediction $h(x_i)$ for each sample in the training set. At each iteration $t$, a weak learner is selected and assigned a coefficient $\alpha_t$ such that the total training error $E_t$ of the resulting $t$-stage boosted classifier is minimized:

$$E_t = \sum_i E\left[F_{t-1}(x_i) + \alpha_t h(x_i)\right]$$

Here $F_{t-1}(x)$ is the boosted classifier that has been built up to the previous stage of training and $f_t(x) = \alpha_t h(x)$ is the weak learner that is being considered for addition to the final classifier.
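For concreteness, a small illustrative sketch of evaluating such a weighted sum for a two-class problem, with two hand-made weak learners standing in for the $f_t$:

```python
import numpy as np

# Illustrative only: two hand-made "weak learners" on 1-D inputs and their coefficients.
weak_learners = [lambda x: np.sign(x - 0.3), lambda x: np.sign(0.8 - x)]
alphas = [0.7, 0.4]

def boosted_F(x):
    """F_T(x) = sum_t alpha_t * h_t(x); the sign is the predicted class,
    the absolute value is the confidence."""
    return sum(a * h(x) for a, h in zip(alphas, weak_learners))

x = np.array([0.1, 0.5, 0.9])
F = boosted_F(x)
print("prediction:", np.sign(F), "confidence:", np.abs(F))
```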

Weighting


At each iteration of the training process, a weight $w_{i,t}$ is assigned to each sample in the training set equal to the current error $E(F_{t-1}(x_i))$ on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights.
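A brief sketch of this weighting step, assuming labels in $\{-1, 1\}$ and using scikit-learn's sample_weight argument to pass the weights to a decision stump; the arrays are placeholder values:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# y in {-1, +1}; F_prev holds the current boosted output F_{t-1}(x_i) for each sample.
# These arrays are placeholders for illustration.
y = np.array([1, -1, 1, 1, -1])
F_prev = np.array([0.8, 0.2, -0.5, 1.3, -0.9])

# w_{i,t} = exp(-y_i * F_{t-1}(x_i)): large where the current ensemble is wrong or unsure.
w = np.exp(-y * F_prev)

X = np.array([[0.1], [0.4], [0.35], [0.8], [0.7]])
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=w)  # tree growing favours splits on heavily weighted samples
```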

Derivation


This derivation follows Rojas (2009):[6]

Suppose we have a data set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ where each item $x_i$ has an associated class $y_i \in \{-1, 1\}$, and a set of weak classifiers $\{k_1, \ldots, k_L\}$, each of which outputs a classification $k_j(x_i) \in \{-1, 1\}$ for each item. After the $(m-1)$-th iteration our boosted classifier is a linear combination of the weak classifiers of the form

$$C_{(m-1)}(x_i) = \alpha_1 k_1(x_i) + \cdots + \alpha_{m-1} k_{m-1}(x_i),$$

where the class will be the sign of $C_{(m-1)}(x_i)$. At the $m$-th iteration we want to extend this to a better boosted classifier by adding another weak classifier $k_m$, with another weight $\alpha_m$:

$$C_m(x_i) = C_{(m-1)}(x_i) + \alpha_m k_m(x_i)$$

So it remains to determine which weak classifier is the best choice for $k_m$, and what its weight $\alpha_m$ should be. We define the total error $E$ of $C_m$ as the sum of its exponential loss on each data point, given as follows:

$$E = \sum_{i=1}^{N} e^{-y_i C_m(x_i)} = \sum_{i=1}^{N} e^{-y_i C_{(m-1)}(x_i)} e^{-y_i \alpha_m k_m(x_i)}$$

Letting $w_i^{(1)} = 1$ and $w_i^{(m)} = e^{-y_i C_{m-1}(x_i)}$ for $m > 1$, we have:

$$E = \sum_{i=1}^{N} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)}$$

We can split this summation between those data points that are correctly classified by $k_m$ (so $y_i k_m(x_i) = 1$) and those that are misclassified (so $y_i k_m(x_i) = -1$):

$$\begin{aligned}
E &= \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m} \\
  &= \sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \left(e^{\alpha_m} - e^{-\alpha_m}\right)
\end{aligned}$$

Since the only part of the right-hand side of this equation that depends on $k_m$ is $\sum_{y_i \neq k_m(x_i)} w_i^{(m)}$, we see that the $k_m$ that minimizes $E$ is the one in the set $\{k_1, \ldots, k_L\}$ that minimizes $\sum_{y_i \neq k_m(x_i)} w_i^{(m)}$ (assuming that $\alpha_m > 0$), i.e. the weak classifier with the lowest weighted error (with weights $w_i^{(m)} = e^{-y_i C_{m-1}(x_i)}$).

To determine the desired weight $\alpha_m$ that minimizes $E$ with the $k_m$ that we just determined, we differentiate:

$$\frac{dE}{d\alpha_m} = \frac{d\left(\sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}\right)}{d\alpha_m}$$

The value of $\alpha_m$ that minimizes the above expression is:

$$\alpha_m = \frac{1}{2} \ln\left(\frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}\right)$$

Proof

$$\frac{dE}{d\alpha_m} = -\sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m} = 0$$

Because $e^{\pm\alpha_m}$ does not depend on $i$, it can be factored out of each sum, giving

$$e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} = e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$

Taking the logarithm of both sides,

$$-\alpha_m + \ln\left(\sum_{y_i = k_m(x_i)} w_i^{(m)}\right) = \alpha_m + \ln\left(\sum_{y_i \neq k_m(x_i)} w_i^{(m)}\right)$$

$$-2\alpha_m = \ln\left(\frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{y_i = k_m(x_i)} w_i^{(m)}}\right)$$

$$\alpha_m = -\frac{1}{2} \ln\left(\frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{y_i = k_m(x_i)} w_i^{(m)}}\right) = \frac{1}{2} \ln\left(\frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}\right)$$

We calculate the weighted error rate of the weak classifier to be

$$\epsilon_m = \frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}},$$

so it follows that

$$\alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right),$$

which is the negative logit function multiplied by 0.5. Due to the convexity of $E$ as a function of $\alpha_m$, this new expression for $\alpha_m$ gives the global minimum of the loss function.

Note: This derivation only applies when $k_m(x_i) \in \{-1, 1\}$, though it can be a good starting guess in other cases, such as when the weak learner is biased ($k_m(x) \in \{a, b\}$, $a \neq -b$), has multiple leaves ($k_m(x) \in \{a, b, \dots, n\}$) or is some other function $k_m(x) \in \mathbb{R}$.

Thus we have derived the AdaBoost algorithm: at each iteration, choose the classifier $k_m$ that minimizes the total weighted error $\sum_{y_i \neq k_m(x_i)} w_i^{(m)}$, use this to calculate the error rate $\epsilon_m = \frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}}$, use this to calculate the weight $\alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$, and finally use this to improve the boosted classifier $C_{m-1}$ to $C_m = C_{(m-1)} + \alpha_m k_m$.
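As a quick numerical check of these formulas (with a made-up error rate):

```python
import numpy as np

# Made-up example: a weak classifier with weighted error rate eps_m = 0.3.
eps_m = 0.3
alpha_m = 0.5 * np.log((1 - eps_m) / eps_m)   # = 0.5 * ln(0.7 / 0.3) ~= 0.424

# Weight multipliers for the next round's sample weights:
print(np.exp(alpha_m))   # ~1.53 for misclassified samples (weight grows)
print(np.exp(-alpha_m))  # ~0.65 for correctly classified samples (weight shrinks)
```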

Statistical understanding of boosting


Boosting is a form of linear regression in which the features of each sample $x_i$ are the outputs of some weak learner $h$ applied to $x_i$.

Regression tries to fit $F(x)$ to $y(x)$ as precisely as possible without loss of generalization, typically using least squares error $E(f) = (y(x) - f(x))^2$, whereas the AdaBoost error function $E(f) = e^{-y(x)f(x)}$ takes into account the fact that only the sign of the final result is used, so $|F(x)|$ can be far larger than 1 without increasing the error. However, the error for sample $x_i$ increases exponentially as $-y(x_i)f(x_i)$ increases, which results in excessive weights being assigned to outliers.

One feature of the choice of exponential error function is that the error of the final additive model is the product of the error of each stage, that is, $e^{\sum_i -y_i f(x_i)} = \prod_i e^{-y_i f(x_i)}$. Thus it can be seen that the weight update in the AdaBoost algorithm is equivalent to recalculating the error on $F_t(x)$ after each stage.

There is a lot of flexibility allowed in the choice of loss function. As long as the loss function is monotonic and continuously differentiable, the classifier is always driven toward purer solutions.[7] Zhang (2004) provides a loss function based on least squares, a modified Huber loss function:

$$\phi(y, f(x)) = \begin{cases} -4 y f(x) & \text{if } y f(x) < -1, \\ (y f(x) - 1)^2 & \text{if } -1 \le y f(x) \le 1, \\ 0 & \text{if } y f(x) > 1. \end{cases}$$

This function is better behaved than LogitBoost for $f(x)$ close to 1 or −1, does not penalise 'overconfident' predictions ($y f(x) > 1$), unlike unmodified least squares, and only penalises samples misclassified with confidence greater than 1 linearly, as opposed to quadratically or exponentially, making it less susceptible to the effects of outliers.
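A direct transcription of this piecewise loss into code, for illustration:

```python
import numpy as np

def modified_huber_loss(y, fx):
    """Zhang's least-squares-based modified Huber loss phi(y, f(x)) as given above.
    y is the true label in {-1, +1}; fx is the real-valued classifier output."""
    margin = y * fx
    return np.where(margin < -1, -4.0 * margin,
           np.where(margin <= 1, (margin - 1.0) ** 2, 0.0))

margins = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])
print(modified_huber_loss(1, margins))
# -> [8.   4.   1.   0.25 0.   0.  ]
```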

Boosting as gradient descent

Main article: Gradient boosting

Boosting can be seen as minimization of a convex loss function over a convex set of functions.[8] Specifically, the loss being minimized by AdaBoost is the exponential loss

$$\sum_i \phi(i, y, f) = \sum_i e^{-y_i f(x_i)},$$

whereas LogitBoost performs logistic regression, minimizing

$$\sum_i \phi(i, y, f) = \sum_i \ln\left(1 + e^{-y_i f(x_i)}\right).$$
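To make the comparison concrete, a small sketch evaluating both losses on the same margins $y_i f(x_i)$:

```python
import numpy as np

margins = np.linspace(-2, 2, 5)             # values of y_i * f(x_i)

exp_loss = np.exp(-margins)                 # AdaBoost's exponential loss
logistic_loss = np.log1p(np.exp(-margins))  # LogitBoost's logistic loss

# The exponential loss grows much faster for badly misclassified points
# (large negative margins), which is why AdaBoost weights outliers more heavily.
for m, e, l in zip(margins, exp_loss, logistic_loss):
    print(f"margin {m:+.1f}: exp {e:7.3f}  logistic {l:6.3f}")
```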

In the gradient descent analogy, the output of the classifier for each training point is considered a point $\left(F_t(x_1), \dots, F_t(x_n)\right)$ in $n$-dimensional space, where each axis corresponds to a training sample, each weak learner $h(x)$ corresponds to a vector of fixed orientation and length, and the goal is to reach the target point $(y_1, \dots, y_n)$ (or any region where the value of the loss function $E_T(x_1, \dots, x_n)$ is less than the value at that point) in the fewest steps. Thus AdaBoost algorithms perform either Cauchy (find $h(x)$ with the steepest gradient, choose $\alpha$ to minimize test error) or Newton (choose some target point, find $\alpha h(x)$ that brings $F_t$ closest to that point) optimization of training error.

Example algorithm (Discrete AdaBoost)


With:

- Samples $x_1 \dots x_n$
- Desired outputs $y_1 \dots y_n$, $y \in \{-1, 1\}$
- Initial weights $w_{1,1} \dots w_{n,1}$ set to $\frac{1}{n}$
- Error function $E(f(x), y_i) = e^{-y_i f(x_i)}$
- Weak learners $h \colon x \rightarrow \{-1, 1\}$

For $t$ in $1 \dots T$:

- Choose $h_t(x)$:
  - Find the weak learner $h_t(x)$ that minimizes $\epsilon_t$, the weighted sum error for misclassified points: $\epsilon_t = \sum_{h_t(x_i) \neq y_i} w_{i,t}$
  - Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
- Add to ensemble:
  - $F_t(x) = F_{t-1}(x) + \alpha_t h_t(x)$
- Update weights:
  - $w_{i,t+1} = w_{i,t} e^{-y_i \alpha_t h_t(x_i)}$ for $i$ in $1 \dots n$
  - Renormalize such that $\sum_i w_{i,t+1} = 1$
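A compact NumPy sketch of this loop, using scikit-learn decision stumps as the weak learners; it follows the pseudocode above but is only an illustrative implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, T=50):
    """Discrete AdaBoost following the pseudocode above. y must be in {-1, +1}.
    Returns the weak learners h_t and their coefficients alpha_t."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # initial weights w_{i,1} = 1/n
    learners, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # weak learner trained on weighted data
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])                 # weighted error (weights sum to 1)
        if eps <= 0 or eps >= 0.5:                 # perfect or useless learner: stop
            if eps <= 0:
                learners.append(stump)
                alphas.append(1.0)                 # arbitrary finite coefficient for a perfect stump
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
        learners.append(stump)
        alphas.append(alpha)
        w *= np.exp(-y * alpha * pred)             # w_{i,t+1} = w_{i,t} exp(-y_i alpha_t h_t(x_i))
        w /= w.sum()                               # renormalize
    return learners, alphas

def predict(learners, alphas, X):
    """Sign of the weighted sum F_T(x) gives the predicted class."""
    F = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(F)
```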

Variants


Real AdaBoost


The output of decision trees is a class probability estimate $p(x) = P(y = 1 \mid x)$, the probability that $x$ is in the positive class.[7] Friedman, Hastie and Tibshirani derive an analytical minimizer for $e^{-y\left(F_{t-1}(x) + f_t(p(x))\right)}$ for some fixed $p(x)$ (typically chosen using weighted least squares error):

$$f_t(x) = \frac{1}{2} \ln\left(\frac{x}{1 - x}\right).$$

Thus, rather than multiplying the output of the entire tree by some fixed value, each leaf node is changed to output half the logit transform of its previous value.
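Written as code, this leaf value is half the logit of the leaf's probability estimate; the clamping below is a practical guard against $p \in \{0, 1\}$ and is not part of the formula above:

```python
import numpy as np

def real_adaboost_leaf_value(p, eps=1e-6):
    """Half the logit transform of a leaf's probability estimate p(x) = P(y=1|x).
    Clamping by eps is a practical guard against p = 0 or 1, not part of the formula."""
    p = np.clip(p, eps, 1.0 - eps)
    return 0.5 * np.log(p / (1.0 - p))

print(real_adaboost_leaf_value(np.array([0.1, 0.5, 0.9])))
# -> approximately [-1.0986  0.      1.0986]
```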

LogitBoost

Main article: LogitBoost

LogitBoost represents an application of established logistic regression techniques to the AdaBoost method. Rather than minimizing error with respect to $y$, weak learners are chosen to minimize the (weighted least-squares) error of $f_t(x)$ with respect to

$$z_t = \frac{y^* - p_t(x)}{2 p_t(x)\left(1 - p_t(x)\right)},$$

where

$$p_t(x) = \frac{e^{F_{t-1}(x)}}{e^{F_{t-1}(x)} + e^{-F_{t-1}(x)}}, \qquad w_t = p_t(x)\left(1 - p_t(x)\right), \qquad y^* = \frac{y + 1}{2}.$$

That is, $z_t$ is the Newton–Raphson approximation of the minimizer of the log-likelihood error at stage $t$, and the weak learner $f_t$ is chosen as the learner that best approximates $z_t$ by weighted least squares.

As $p_t(x_i)$ approaches either 1 or 0, the value of $p_t(x_i)(1 - p_t(x_i))$ becomes very small and the $z$ term, which is large for misclassified samples, can become numerically unstable due to machine precision rounding errors. This can be overcome by enforcing some limit on the absolute value of $z$ and the minimum value of $w$.
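A small sketch of this working response and its stabilisation; the caps z_max and w_min are arbitrary illustrative choices:

```python
import numpy as np

def logitboost_targets(y, F_prev, z_max=4.0, w_min=1e-4):
    """Newton-step working response z_t and weights w_t for LogitBoost, as above.
    y in {-1, +1}; F_prev is the current ensemble output F_{t-1}(x).
    z_max and w_min are arbitrary caps guarding against numerical instability."""
    p = 1.0 / (1.0 + np.exp(-2.0 * F_prev))   # p_t(x) = e^F / (e^F + e^-F)
    y_star = (y + 1) / 2                      # map {-1, +1} -> {0, 1}
    w = np.maximum(p * (1 - p), w_min)        # floor on the weights
    z = (y_star - p) / (2 * w)
    z = np.clip(z, -z_max, z_max)             # cap |z| when p is nearly pure
    return z, w

z, w = logitboost_targets(np.array([1, -1, 1]), np.array([0.2, 1.5, -3.0]))
```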

Gentle AdaBoost


While previous boosting algorithms choose $f_t$ greedily, minimizing the overall test error as much as possible at each step, GentleBoost features a bounded step size. $f_t$ is chosen to minimize $\sum_i w_{t,i} (y_i - f_t(x_i))^2$, and no further coefficient is applied. Thus, in the case where a weak learner exhibits perfect classification performance, GentleBoost chooses $f_t(x) = \alpha_t h_t(x)$ exactly equal to $y$, while steepest descent algorithms try to set $\alpha_t = \infty$. Empirical observations about the good performance of GentleBoost appear to back up Schapire and Singer's remark that allowing excessively large values of $\alpha$ can lead to poor generalization performance.[9][10]
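As an illustrative sketch, one GentleBoost step fits a weighted least-squares regressor to the labels and adds it with no extra coefficient; a regression stump stands in for $f_t$, and the exponential-loss weight update is an assumption carried over from the earlier sections rather than something specified here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentleboost_step(X, y, w, F_prev):
    """One GentleBoost step: fit f_t by weighted least squares on y and add it directly,
    with no alpha_t coefficient. Illustrative sketch using a regression stump for f_t;
    the exponential weight update below is an assumption, as in the earlier sections."""
    f_t = DecisionTreeRegressor(max_depth=1)
    f_t.fit(X, y, sample_weight=w)            # minimizes sum_i w_i (y_i - f_t(x_i))^2
    step = f_t.predict(X)
    F_new = F_prev + step                     # bounded step: no further coefficient applied
    w_new = w * np.exp(-y * step)             # reweight toward samples still classified badly
    return f_t, F_new, w_new / w_new.sum()
```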

Early termination


A technique for speeding up processing of boosted classifiers, early termination refers to only testing each potential object with as many layers of the final classifier as are necessary to meet some confidence threshold, speeding up computation for cases where the class of the object can easily be determined. One such scheme is the object detection framework introduced by Viola and Jones:[11] in an application with significantly more negative samples than positive, a cascade of separate boost classifiers is trained, with the output of each stage biased such that some acceptably small fraction of positive samples is mislabeled as negative, and all samples marked as negative after each stage are discarded. If 50% of negative samples are filtered out by each stage, only a very small number of objects will pass through the entire classifier, reducing computation effort. This method has since been generalized, with a formula provided for choosing optimal thresholds at each stage to achieve some desired false positive and false negative rate.[12]
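A sketch of how such a cascade is evaluated at test time; the stage scorers and thresholds are placeholders:

```python
def cascade_predict(x, stages, thresholds):
    """Evaluate a Viola-Jones-style cascade. stages is a list of boosted scorers
    F_k(x) -> float and thresholds the per-stage rejection thresholds (both placeholders).
    An object is rejected as soon as any stage scores it below that stage's threshold."""
    for F_k, theta_k in zip(stages, thresholds):
        if F_k(x) < theta_k:
            return -1          # rejected early: most negatives never reach later stages
    return +1                  # survived every stage: labelled positive
```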

In the field of statistics, where AdaBoost is more commonly applied to problems of moderate dimensionality, early stopping is used as a strategy to reduce overfitting.[13] A validation set of samples is separated from the training set, performance of the classifier on the samples used for training is compared to performance on the validation samples, and training is terminated if performance on the validation sample is seen to decrease even as performance on the training set continues to improve.

Totally corrective algorithms


For steepest descent versions of AdaBoost, where $\alpha_t$ is chosen at each layer $t$ to minimize test error, the next layer added is said to be maximally independent of layer $t$:[14] it is unlikely to choose a weak learner $t+1$ that is similar to learner $t$. However, there remains the possibility that $t+1$ produces similar information to some other earlier layer. Totally corrective algorithms, such as LPBoost, optimize the value of every coefficient after each step, such that new layers added are always maximally independent of every previous layer. This can be accomplished by backfitting, linear programming or some other method.

Pruning


Pruning is the process of removing poorly performing weak classifiers in order to reduce the memory and execution-time cost of the boosted classifier. The simplest methods, which can be particularly effective in conjunction with totally corrective training, are weight- or margin-trimming: when the coefficient, or the contribution to the total test error, of some weak classifier falls below a certain threshold, that classifier is dropped. Margineantu & Dietterich[15] suggested an alternative criterion for trimming: weak classifiers should be selected such that the diversity of the ensemble is maximized. If two weak learners produce very similar outputs, efficiency can be improved by removing one of them and increasing the coefficient of the remaining weak learner.[16]
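A minimal sketch of coefficient-based trimming; the threshold value is an arbitrary illustrative choice:

```python
def trim_ensemble(learners, alphas, threshold=0.01):
    """Drop weak classifiers whose coefficient alpha_t falls below an (arbitrary) threshold.
    Returns the pruned ensemble; trimming by contribution to total test error would
    simply replace the comparison below."""
    kept = [(h, a) for h, a in zip(learners, alphas) if a >= threshold]
    pruned_learners = [h for h, _ in kept]
    pruned_alphas = [a for _, a in kept]
    return pruned_learners, pruned_alphas
```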


References

  1. Freund, Yoav; Schapire, Robert E. (1995), "A desicion-theoretic [sic] generalization of on-line learning and an application to boosting", Lecture Notes in Computer Science, Berlin, Heidelberg: Springer, pp. 23–37, doi:10.1007/3-540-59119-2_166, ISBN 978-3-540-59119-1, retrieved 2022-06-24.
  2. Hastie, Trevor; Rosset, Saharon; Zhu, Ji; Zou, Hui (2009). "Multi-class AdaBoost". Statistics and Its Interface. 2 (3): 349–360. doi:10.4310/sii.2009.v2.n3.a8. ISSN 1938-7989.
  3. Wyner, Abraham J.; Olson, Matthew; Bleich, Justin; Mease, David (2017). "Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers". Journal of Machine Learning Research. 18 (48): 1–33. Retrieved 17 March 2022.
  4. Kégl, Balázs (20 December 2013). "The return of AdaBoost.MH: multi-class Hamming trees". arXiv:1312.6086 [cs.LG].
  5. Joglekar, Sachin (6 March 2016). "adaboost – Sachin Joglekar's blog". codesachin.wordpress.com. Retrieved 3 August 2016.
  6. Rojas, Raúl (2009). "AdaBoost and the Super Bowl of Classifiers: A Tutorial Introduction to Adaptive Boosting" (Tech. Rep.). Freie Universität Berlin.
  7. Friedman, Jerome; Hastie, Trevor; Tibshirani, Robert (1998). "Additive Logistic Regression: A Statistical View of Boosting". Annals of Statistics. 28: 2000. CiteSeerX 10.1.1.51.9525.
  8. Zhang, T. (2004). "Statistical behavior and consistency of classification methods based on convex risk minimization". Annals of Statistics. 32 (1): 56–85. doi:10.1214/aos/1079120130. JSTOR 3448494.
  9. Schapire, Robert; Singer, Yoram (1999). "Improved Boosting Algorithms Using Confidence-rated Predictions": 80–91. CiteSeerX 10.1.1.33.4002.
  10. Freund, Yoav; Schapire, Robert (1999). "A Short Introduction to Boosting" (PDF).
  11. Viola, Paul; Jones, Michael (2001). "Rapid Object Detection Using a Boosted Cascade of Simple Features". CiteSeerX 10.1.1.10.6807.
  12. McCane, Brendan; Novins, Kevin; Albert, Michael (2005). "Optimizing cascade classifiers".
  13. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer. ISBN 978-0-387-84858-7. Archived from the original on 2015-03-13. Retrieved 2014-03-13.
  14. Šochman, Jan; Matas, Jiří (2004). AdaBoost with Totally Corrective Updates for Fast Face Detection. IEEE Computer Society. ISBN 978-0-7695-2122-0.
  15. Margineantu, Dragos; Dietterich, Thomas (1997). "Pruning Adaptive Boosting". CiteSeerX 10.1.1.38.7017.
  16. Tamon, Christino; Xiang, Jie (2000). "On the Boosting Pruning Problem".
