Online machine learning

From Wikipedia, the free encyclopedia
(Redirected from Batch learning)
Not to be confused with online and offline.

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, necessitating out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in international financial markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

Introduction


In the setting of supervised learning, a function $f : X \to Y$ is to be learned, where $X$ is thought of as a space of inputs and $Y$ as a space of outputs, that predicts well on instances drawn from a joint probability distribution $p(x,y)$ on $X \times Y$. In reality, the learner never knows the true distribution $p(x,y)$ over instances. Instead, the learner usually has access to a training set of examples $(x_1, y_1), \ldots, (x_n, y_n)$. In this setting, the loss function is given as $V : Y \times Y \to \mathbb{R}$, such that $V(f(x), y)$ measures the difference between the predicted value $f(x)$ and the true value $y$. The ideal goal is to select a function $f \in \mathcal{H}$, where $\mathcal{H}$ is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.

Statistical view of online learning


In statistical learning models, the training samples $(x_i, y_i)$ are assumed to have been drawn from the true distribution $p(x,y)$, and the objective is to minimize the expected "risk"
$$I[f] = \mathbb{E}[V(f(x), y)] = \int V(f(x), y)\, dp(x, y).$$
A common paradigm in this situation is to estimate a function $\hat{f}$ through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input $(x_{t+1}, y_{t+1})$, the current best predictor $f_t$ and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used, where $f_{t+1}$ is permitted to depend on $f_t$ and all previous data points $(x_1, y_1), \ldots, (x_t, y_t)$. In this case, the space requirements are no longer guaranteed to be constant, since all previous data points must be stored, but the solution may take less time to compute with the addition of a new data point than with batch learning techniques.

A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of $b \geq 1$ data points at a time; this can be considered pseudo-online learning for $b$ much smaller than the total number of training points. Mini-batch techniques are used with repeated passes over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example stochastic gradient descent. When combined with backpropagation, this is currently the de facto method for training artificial neural networks.
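
To make the procedure concrete, here is a minimal sketch of mini-batch SGD for the squared loss, not taken from the cited sources; the batch size `b`, learning rate `lr`, epoch count, and the synthetic data are arbitrary example choices.

```python
import numpy as np

def minibatch_sgd(X, y, b=32, lr=0.01, epochs=5):
    """Mini-batch SGD for the squared loss V(<w,x>, y) = (<w,x> - y)^2."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):                    # repeated passes over the data
        order = rng.permutation(n)             # reshuffle once per epoch
        for start in range(0, n, b):
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # mean gradient over the mini-batch
            w -= lr * grad
    return w

# toy usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = np.arange(5.0)
y = X @ w_true + 0.1 * rng.normal(size=1000)
print(minibatch_sgd(X, y))   # approaches w_true
```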

Example: linear least squares

Main article: Linear least squares (mathematics)

The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions.

Batch learning


Consider the setting of supervised learning with $f$ being a linear function to be learned:
$$f(x_j) = \langle w, x_j \rangle = w \cdot x_j,$$
where $x_j \in \mathbb{R}^d$ is a vector of inputs (data points) and $w \in \mathbb{R}^d$ is a linear filter vector. The goal is to compute the filter vector $w$. To this end, a square loss function
$$V(f(x_j), y_j) = (f(x_j) - y_j)^2 = (\langle w, x_j \rangle - y_j)^2$$
is used to compute the vector $w$ that minimizes the empirical loss
$$I_n[w] = \sum_{j=1}^{n} V(\langle w, x_j \rangle, y_j) = \sum_{j=1}^{n} (x_j^{\mathsf{T}} w - y_j)^2,$$
where $y_j \in \mathbb{R}$.

Let $X$ be the $i \times d$ data matrix and let $y \in \mathbb{R}^i$ be the column vector of target values after the arrival of the first $i$ data points. Assuming that the covariance matrix $\Sigma_i = X^{\mathsf{T}} X$ is invertible (otherwise it is preferable to proceed in a similar fashion with Tikhonov regularization), the best solution $f^*(x) = \langle w^*, x \rangle$ to the linear least squares problem is given by
$$w^* = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y = \Sigma_i^{-1} \sum_{j=1}^{i} x_j y_j.$$
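
As a concrete sketch of this batch solution, the following assumes $\Sigma_i$ is invertible and uses a linear solve rather than an explicit inverse (numerically preferable, same $w^*$):

```python
import numpy as np

def batch_least_squares(X, y):
    """Closed-form batch solution w* = (X^T X)^{-1} X^T y."""
    Sigma = X.T @ X                         # covariance matrix, assumed invertible
    return np.linalg.solve(Sigma, X.T @ y)  # solve instead of forming the inverse
```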

Now, calculating the covariance matrix $\Sigma_i = \sum_{j=1}^{i} x_j x_j^{\mathsf{T}}$ takes time $O(id^2)$, inverting the $d \times d$ matrix takes time $O(d^3)$, and the remaining multiplications take time $O(d^2)$, giving a total time of $O(id^2 + d^3)$. When there are $n$ total points in the dataset, recomputing the solution after the arrival of every data point $i = 1, \ldots, n$ gives the naive approach a total complexity of $O(n^2 d^2 + n d^3)$. Note that when the matrix $\Sigma_i$ is stored, updating it at each step requires only adding $x_{i+1} x_{i+1}^{\mathsf{T}}$, which takes $O(d^2)$ time, reducing the total time to $O(nd^2 + nd^3) = O(nd^3)$, at the cost of an additional $O(d^2)$ of storage space for $\Sigma_i$.[1]
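
A sketch of the incremental variant just described: $\Sigma_i$ and $\sum_j y_j x_j$ are kept in memory and updated with a rank-one term per point, while each new solution still pays the $O(d^3)$ solve. The small ridge `lam` is an added assumption of this sketch, to keep the early solves well-posed.

```python
import numpy as np

def incremental_refit(stream, d, lam=1e-8):
    """Re-solve least squares after each arriving point, keeping Sigma in memory."""
    Sigma = lam * np.eye(d)             # running sum of x x^T: O(d^2) extra storage
    b = np.zeros(d)                     # running sum of y * x
    for x, y in stream:
        Sigma += np.outer(x, x)         # O(d^2) rank-one update instead of a full recompute
        b += y * x
        yield np.linalg.solve(Sigma, b) # O(d^3) solve at every step
```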

Online learning: recursive least squares


The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising $w_0 = 0 \in \mathbb{R}^d$ and $\Gamma_0 = I \in \mathbb{R}^{d \times d}$, the solution of the linear least squares problem given in the previous section can be computed by the following iteration:
$$\Gamma_i = \Gamma_{i-1} - \frac{\Gamma_{i-1} x_i x_i^{\mathsf{T}} \Gamma_{i-1}}{1 + x_i^{\mathsf{T}} \Gamma_{i-1} x_i}$$
$$w_i = w_{i-1} - \Gamma_i x_i \left( x_i^{\mathsf{T}} w_{i-1} - y_i \right)$$
The above iteration algorithm can be proved by induction on $i$.[2] The proof also shows that $\Gamma_i = \Sigma_i^{-1}$. One can also view RLS in the context of adaptive filters (see RLS).

The complexity for $n$ steps of this algorithm is $O(nd^2)$, which is an order of magnitude faster than the corresponding batch learning complexity. The storage requirement at every step $i$ is the matrix $\Gamma_i$, which is constant at $O(d^2)$. For the case when $\Sigma_i$ is not invertible, consider the regularised version of the problem with loss function $\sum_{j=1}^{n} (x_j^{\mathsf{T}} w - y_j)^2 + \lambda \|w\|_2^2$. Then, it is easy to show that the same algorithm works with $\Gamma_0 = (I + \lambda I)^{-1}$, and the iterations proceed to give $\Gamma_i = (\Sigma_i + \lambda I)^{-1}$.[1]
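
A direct transcription of the RLS recursion into code; the optional `lam` implements the regularised initialisation $\Gamma_0 = (I + \lambda I)^{-1}$ mentioned above.

```python
import numpy as np

def rls(stream, d, lam=0.0):
    """Recursive least squares: one O(d^2) update per arriving pair (x, y)."""
    w = np.zeros(d)
    Gamma = np.eye(d) / (1.0 + lam)  # Gamma_0 = (I + lam*I)^{-1}; lam = 0 gives Gamma_0 = I
    for x, y in stream:
        Gx = Gamma @ x
        Gamma -= np.outer(Gx, Gx) / (1.0 + x @ Gx)  # rank-one (Sherman-Morrison) update
        w -= (Gamma @ x) * (x @ w - y)  # w_i = w_{i-1} - Gamma_i x_i (x_i^T w_{i-1} - y_i)
        yield w
```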

Stochastic gradient descent

Main article: Stochastic gradient descent

When the RLS update
$$w_i = w_{i-1} - \Gamma_i x_i \left( x_i^{\mathsf{T}} w_{i-1} - y_i \right)$$
is replaced by
$$w_i = w_{i-1} - \gamma_i x_i \left( x_i^{\mathsf{T}} w_{i-1} - y_i \right) = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_i \rangle, y_i),$$
i.e. when the matrix $\Gamma_i \in \mathbb{R}^{d \times d}$ is replaced by a scalar step size $\gamma_i \in \mathbb{R}$, this becomes the stochastic gradient descent algorithm. In this case, the complexity for $n$ steps of this algorithm reduces to $O(nd)$. The storage requirement at every step $i$ is constant at $O(d)$.

However, the step size $\gamma_i$ needs to be chosen carefully to solve the expected risk minimization problem, as detailed above. By choosing a decaying step size $\gamma_i \approx \frac{1}{\sqrt{i}}$, one can prove the convergence of the average iterate $\overline{w}_n = \frac{1}{n} \sum_{i=1}^{n} w_i$. This setting is a special case of stochastic optimization, a well-known problem in optimization.[1]
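
A minimal sketch combining the scalar update with the decaying schedule $\gamma_i \approx 1/\sqrt{i}$ and the averaged iterate $\overline{w}_n$; the constant `c` in the schedule is an example choice.

```python
import numpy as np

def sgd_least_squares(stream, d, c=0.1):
    """SGD with decaying step size gamma_i = c / sqrt(i); returns the averaged iterate."""
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for i, (x, y) in enumerate(stream, start=1):
        gamma = c / np.sqrt(i)        # decaying step size
        w -= gamma * x * (x @ w - y)  # O(d) update, O(d) storage
        w_bar += (w - w_bar) / i      # running mean of the iterates
    return w_bar
```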

Incremental stochastic gradient descent


In practice, one can perform multiple stochastic gradient passes (also called cycles or epochs) over the data. The algorithm thus obtained is called the incremental gradient method and corresponds to the iteration
$$w_i = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_{t_i} \rangle, y_{t_i}).$$
The main difference with the stochastic gradient method is that here a sequence $t_i$ is chosen to decide which training point is visited in the $i$-th step. Such a sequence can be stochastic or deterministic. The number of iterations is thus decoupled from the number of points (each point can be considered more than once). The incremental gradient method can be shown to provide a minimizer of the empirical risk.[3] Incremental techniques can be advantageous when considering objective functions made up of a sum of many terms, e.g. an empirical error corresponding to a very large dataset.[1]
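
A sketch of the incremental gradient method for the squared loss: the index sequence $t_i$ is realised either as a fresh random permutation per epoch (stochastic) or as a fixed cyclic order (deterministic), decoupling the number of updates from the number of points.

```python
import numpy as np

def incremental_gradient(X, y, epochs=10, c=0.1, shuffle=True):
    """Multiple passes over a finite training set, one point per update."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    i = 0
    for _ in range(epochs):
        # the sequence t_i: stochastic (reshuffled) or deterministic (cyclic)
        order = rng.permutation(n) if shuffle else np.arange(n)
        for t in order:
            i += 1
            w -= (c / np.sqrt(i)) * X[t] * (X[t] @ w - y[t])
    return w
```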

Kernel methods

See also: Kernel method

Kernels can be used to extend the above algorithms to non-parametric models (or models where the parameters form an infinite-dimensional space). The corresponding procedure will no longer be truly online and instead involves storing all the data points, but is still faster than the brute-force method. This discussion is restricted to the case of the square loss, though it can be extended to any convex loss. It can be shown by an easy induction[1] that if $X_i$ is the data matrix and $w_i$ is the output after $i$ steps of the SGD algorithm, then
$$w_i = X_i^{\mathsf{T}} c_i,$$
where $c_i = ((c_i)_1, (c_i)_2, \ldots, (c_i)_i) \in \mathbb{R}^i$ and the sequence $c_i$ satisfies the recursion
$$c_0 = 0,$$
$$(c_i)_j = (c_{i-1})_j, \quad j = 1, 2, \ldots, i-1,$$
$$(c_i)_i = \gamma_i \Big( y_i - \sum_{j=1}^{i-1} (c_{i-1})_j \langle x_j, x_i \rangle \Big).$$
Notice that here $\langle x_j, x_i \rangle$ is just the standard kernel on $\mathbb{R}^d$, and the predictor is of the form
$$f_i(x) = \langle w_{i-1}, x \rangle = \sum_{j=1}^{i-1} (c_{i-1})_j \langle x_j, x \rangle.$$


Now, if a general kernel $K$ is introduced instead and the predictor is
$$f_i(x) = \sum_{j=1}^{i-1} (c_{i-1})_j K(x_j, x),$$
then the same proof will also show that the predictor minimising the least squares loss is obtained by changing the above recursion to
$$(c_i)_i = \gamma_i \Big( y_i - \sum_{j=1}^{i-1} (c_{i-1})_j K(x_j, x_i) \Big).$$
The above expression requires storing all the data for updating $c_i$. The total time complexity for the recursion when evaluating for the $n$-th data point is $O(n^2 d k)$, where $k$ is the cost of evaluating the kernel on a single pair of points.[1] Thus, the use of the kernel has allowed the movement from a finite-dimensional parameter space $w_i \in \mathbb{R}^d$ to a possibly infinite-dimensional feature space represented by a kernel $K$, by instead performing the recursion on the space of parameters $c_i \in \mathbb{R}^i$, whose dimension is the same as the size of the training dataset. In general, this is a consequence of the representer theorem.[1]
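
A sketch of the kernelised recursion, with a Gaussian kernel as an example choice of $K$; note that every past point must be retained, in line with the storage cost discussed above.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Example kernel choice; any positive-definite kernel K works."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_online_ls(stream, kernel=gaussian_kernel, c0=0.1):
    """Kernelised recursion: (c_i)_i = gamma_i * (y_i - sum_j (c_{i-1})_j K(x_j, x_i))."""
    xs, cs = [], []                     # all past points must be kept
    for i, (x, y) in enumerate(stream, start=1):
        pred = sum(c_j * kernel(x_j, x) for c_j, x_j in zip(cs, xs))  # f_i(x_i)
        cs.append((c0 / np.sqrt(i)) * (y - pred))  # new coefficient; old ones unchanged
        xs.append(x)
    return lambda x: sum(c_j * kernel(x_j, x) for c_j, x_j in zip(cs, xs))
```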

Online convex optimization


Online convex optimization (OCO)[4] is a general framework for decision making which leverages convex optimization to allow for efficient algorithms. The framework is that of repeated game playing, as follows:

For $t = 1, 2, \ldots, T$:
- The learner receives an input $x_t$
- The learner outputs $w_t$ from a fixed convex set $S$
- Nature sends back a convex loss function $v_t : S \to \mathbb{R}$
- The learner suffers the loss $v_t(w_t)$ and updates its model

The goal is to minimize regret, or the difference between the cumulative loss and the loss of the best fixed point $u \in S$ in hindsight. As an example, consider the case of online least squares linear regression. Here, the weight vectors come from the convex set $S = \mathbb{R}^d$, and nature sends back the convex loss function $v_t(w) = (\langle w, x_t \rangle - y_t)^2$. Note here that $y_t$ is implicitly sent with $v_t$.
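
Written out, the regret after $T$ rounds is
$$\mathrm{Regret}_T = \sum_{t=1}^{T} v_t(w_t) - \min_{u \in S} \sum_{t=1}^{T} v_t(u).$$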

Some online prediction problems, however, cannot fit into the framework of OCO. For example, in online classification, the prediction domain and the loss functions are not convex. In such scenarios, two simple techniques for convexification are used: randomisation and surrogate loss functions.[citation needed]

Some simple online convex optimisation algorithms are:

Follow the leader (FTL)


The simplest learning rule to try is to select (at the current step) the hypothesis that has the least loss over all past rounds. This algorithm is called Follow the Leader, and round $t$ is simply given by:
$$w_t = \operatorname{arg\,min}_{w \in S} \sum_{i=1}^{t-1} v_i(w).$$
This method can thus be viewed as a greedy algorithm. For the case of online quadratic optimization (where the loss function is $v_t(w) = \|w - x_t\|_2^2$), one can show a regret bound that grows as $\log(T)$. However, similar bounds cannot be obtained for the FTL algorithm for other important families of models like online linear optimization. To do so, one modifies FTL by adding regularisation.
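
For the quadratic losses above, the leader has a closed form: the arg min of $\sum_{i<t} \|w - x_i\|_2^2$ over $S = \mathbb{R}^d$ is the mean of the past points, so FTL reduces to the following sketch.

```python
import numpy as np

def follow_the_leader(stream, d):
    """FTL for v_t(w) = ||w - x_t||^2: the leader is the mean of all past points."""
    s = np.zeros(d)      # running sum of past x_i
    t = 0
    for x in stream:
        w = s / t if t > 0 else np.zeros(d)  # arg min of the cumulative past loss
        yield w                              # play w_t before seeing x_t
        s += x
        t += 1
```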

Follow the regularised leader (FTRL)


This is a natural modification of FTL that is used to stabilise the FTL solutions and obtain better regret bounds. A regularisation function $R : S \to \mathbb{R}$ is chosen, and learning is performed in round $t$ as follows:
$$w_t = \operatorname{arg\,min}_{w \in S} \sum_{i=1}^{t-1} v_i(w) + R(w).$$
As a special example, consider the case of online linear optimisation, i.e. where nature sends back loss functions of the form $v_t(w) = \langle w, z_t \rangle$. Also, let $S = \mathbb{R}^d$. Suppose the regularisation function $R(w) = \frac{1}{2\eta} \|w\|_2^2$ is chosen for some positive number $\eta$. Then, one can show that the regret-minimising iteration becomes
$$w_{t+1} = -\eta \sum_{i=1}^{t} z_i = w_t - \eta z_t.$$
Note that this can be rewritten as $w_{t+1} = w_t - \eta \nabla v_t(w_t)$, which looks exactly like online gradient descent.

If $S$ is instead some convex subspace of $\mathbb{R}^d$, $S$ would need to be projected onto, leading to the modified update rule
$$w_{t+1} = \Pi_S \Big( -\eta \sum_{i=1}^{t} z_i \Big) = \Pi_S(\eta \theta_{t+1}).$$
This algorithm is known as lazy projection, as the vector $\theta_{t+1}$ accumulates the gradients. It is also known as Nesterov's dual averaging algorithm. In this scenario of linear loss functions and quadratic regularisation, the regret is bounded by $O(\sqrt{T})$, and thus the average regret goes to 0, as desired.
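
A sketch of lazy projection for linear losses, with projection onto an $L_2$ ball of radius `r` as an example choice of $S$; `project_l2_ball` is a helper introduced here for illustration.

```python
import numpy as np

def project_l2_ball(v, r):
    """Euclidean projection onto {w : ||w||_2 <= r}."""
    norm = np.linalg.norm(v)
    return v if norm <= r else v * (r / norm)

def lazy_projection(gradients, d, eta=0.1, r=1.0):
    """FTRL with quadratic regularisation: w_{t+1} = Pi_S(eta * theta_{t+1})."""
    theta = np.zeros(d)
    for z in gradients:                        # z_t defines the linear loss <w, z_t>
        theta -= z                             # accumulate (negated) gradients, unprojected
        yield project_l2_ball(eta * theta, r)  # project only when a play is needed
```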

Online subgradient descent (OSD)

See also: Subgradient method

The above proved a regret bound for linear loss functions $v_t(w) = \langle w, z_t \rangle$. To generalise the algorithm to any convex loss function, the subgradient $\partial v_t(w_t)$ of $v_t$ is used as a linear approximation to $v_t$ near $w_t$, leading to the online subgradient descent algorithm:

Initialise parameter $\eta$, $w_1 = 0$

For $t = 1, 2, \ldots, T$:
- Predict using $w_t$; receive $v_t$ from nature and suffer the loss $v_t(w_t)$
- Choose $z_t \in \partial v_t(w_t)$
- If $S = \mathbb{R}^d$, update as $w_{t+1} = w_t - \eta z_t$
- If $S \subset \mathbb{R}^d$, project the accumulated gradients onto $S$: $w_{t+1} = \Pi_S(\eta \theta_{t+1})$, where $\theta_{t+1} = \theta_t - z_t$

One can use the OSD algorithm to derive $O(\sqrt{T})$ regret bounds for the online version of SVMs for classification, which use the hinge loss
$$v_t(w) = \max\{0, 1 - y_t(w \cdot x_t)\}.$$
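
A sketch of this online SVM: when the margin is violated, $-y_t x_t$ is a subgradient of the hinge loss at $w_t$; otherwise the zero vector is. The fixed step size `eta` and $S = \mathbb{R}^d$ (no projection) are example choices.

```python
import numpy as np

def online_svm(stream, d, eta=0.1):
    """OSD with hinge loss v_t(w) = max(0, 1 - y_t * <w, x_t>), labels y in {-1, +1}."""
    w = np.zeros(d)
    for x, y in stream:
        yield np.sign(w @ x)  # predict with the current w_t
        if y * (w @ x) < 1:   # margin violated: -y * x is a subgradient of v_t at w_t
            w += eta * y * x  # w_{t+1} = w_t - eta * z_t with z_t = -y * x
        # otherwise the zero vector is a subgradient and w is unchanged
```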

Other algorithms


Quadratically regularised FTRL algorithms lead to lazily projected gradient algorithms as described above. To use the above for arbitrary convex functions and regularisers, one uses online mirror descent. The optimal regularization in hindsight can be derived for linear loss functions; this leads to the AdaGrad algorithm. For the Euclidean regularisation, one can show a regret bound of $O(\sqrt{T})$, which can be improved further to $O(\log T)$ for strongly convex and exp-concave loss functions.
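
As an illustration of AdaGrad, here is a minimal sketch of its common diagonal variant, in which each coordinate receives its own step size; `eps` is a standard numerical safeguard, and this specific form is an assumption of the sketch rather than taken from the sources above.

```python
import numpy as np

def adagrad(gradients, d, eta=0.1, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps scaled by accumulated squared gradients."""
    w = np.zeros(d)
    G = np.zeros(d)                        # running sum of squared gradients, per coordinate
    for g in gradients:                    # g_t: (sub)gradient of v_t at w_t
        G += g ** 2
        w -= eta * g / (np.sqrt(G) + eps)  # eps avoids division by zero
        yield w
```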

Continual learning


Continual learning means constantly improving the learned model by processing continuous streams of information.[5] Continual learning capabilities are essential for software systems and autonomous agents interacting in an ever-changing real world. However, continual learning is a challenge for machine learning and neural network models, since the continual acquisition of incrementally available information from non-stationary data distributions generally leads to catastrophic forgetting.

Interpretations of online learning


The paradigm of online learning has different interpretations depending on the choice of the learning model, each of which has distinct implications about the predictive quality of the sequence of functions $f_1, f_2, \ldots, f_n$. The prototypical stochastic gradient descent algorithm is used for this discussion. As noted above, its recursion is given by
$$w_t = w_{t-1} - \gamma_t \nabla V(\langle w_{t-1}, x_t \rangle, y_t).$$

The first interpretation considers the stochastic gradient descent method as applied to the problem of minimizing the expected risk $I[w]$ defined above.[6] Indeed, in the case of an infinite stream of data, since the examples $(x_1, y_1), (x_2, y_2), \ldots$ are assumed to be drawn i.i.d. from the distribution $p(x, y)$, the sequence of gradients of $V(\cdot, \cdot)$ in the above iteration is an i.i.d. sample of stochastic estimates of the gradient of the expected risk $I[w]$, and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation $I[w_t] - I[w^*]$, where $w^*$ is the minimizer of $I[w]$.[7] This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, complexity results can still be obtained in special cases.

The second interpretation applies to the case of a finite training set and considers the SGD algorithm as an instance of the incremental gradient descent method.[3] In this case, one instead looks at the empirical risk:
$$I_n[w] = \frac{1}{n} \sum_{i=1}^{n} V(\langle w, x_i \rangle, y_i).$$
Since the gradients of $V(\cdot, \cdot)$ in the incremental gradient descent iterations are also stochastic estimates of the gradient of $I_n[w]$, this interpretation is also related to the stochastic gradient descent method, but applied to minimize the empirical risk as opposed to the expected risk. Since this interpretation concerns the empirical risk and not the expected risk, multiple passes through the data are readily allowed and actually lead to tighter bounds on the deviations $I_n[w_t] - I_n[w_n^*]$, where $w_n^*$ is the minimizer of $I_n[w]$.

References

  1. L. Rosasco, T. Poggio, Machine Learning: a Regularization Approach, MIT-9.520 Lecture Notes, Manuscript, Dec. 2015. Chapter 7 - Online Learning.
  2. Kushner, Harold J.; Yin, G. George (2003). Stochastic Approximation and Recursive Algorithms with Applications (Second ed.). New York: Springer. pp. 8–12. ISBN 978-0-387-21769-7.
  3. Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85.
  4. Hazan, Elad (2015). Introduction to Online Convex Optimization (PDF). Foundations and Trends in Optimization.
  5. Parisi, German I.; Kemker, Ronald; Part, Jose L.; Kanan, Christopher; Wermter, Stefan (2019). "Continual lifelong learning with neural networks: A review". Neural Networks. 113: 54–71. arXiv:1802.07569. doi:10.1016/j.neunet.2019.01.012. ISSN 0893-6080. PMID 30780045.
  6. Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
  7. Kushner, Harold J.; Yin, G. George (1997). Stochastic Approximation Algorithms and Applications. New York: Springer-Verlag. ISBN 0-387-94916-X; 2nd ed., titled Stochastic Approximation and Recursive Algorithms and Applications, 2003, ISBN 0-387-00894-2.
