Learning rate

From Wikipedia, the free encyclopedia
Tuning parameter (hyperparameter) in optimization

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.[1] Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.[2]

In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. Too high a learning rate will make the learning jump over minima, while too low a learning rate will either take too long to converge or get stuck in an undesirable local minimum.[3]
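As an illustration (not taken from the cited sources), the sketch below runs plain gradient descent on a one-dimensional quadratic loss with two illustrative learning rates: a very small rate converges slowly, while a rate that is too large overshoots and diverges.

```python
# Minimal sketch: effect of the learning rate on plain gradient descent
# for the toy loss L(w) = (w - 3)^2; the rates 0.01 and 1.05 are illustrative.

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of L(w) = (w - 3)^2

for eta in (0.01, 1.05):
    w = 0.0
    for _ in range(50):
        w -= eta * grad(w)   # step of size eta along the negative gradient
    # eta = 0.01 is still far from the minimum at w = 3 after 50 steps,
    # while eta = 1.05 overshoots further on every step and diverges.
    print(eta, w)
```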

To achieve faster convergence, prevent oscillations, and avoid getting stuck in undesirable local minima, the learning rate is often varied during training, either in accordance with a learning rate schedule or by using an adaptive learning rate.[4] The learning rate and its adjustments may also differ per parameter, in which case it is a diagonal matrix that can be interpreted as an approximation to the inverse of the Hessian matrix in Newton's method.[5] The learning rate is related to the step length determined by inexact line search in quasi-Newton methods and related optimization algorithms.[6][7]
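As a hedged illustration of a per-parameter learning rate (again not from the cited sources), the Adagrad-style sketch below divides a global rate by a running estimate of each coordinate's gradient magnitude, which amounts to multiplying the gradient by a diagonal matrix of individual step sizes.

```python
import numpy as np

# Minimal sketch of a per-parameter (diagonal) learning rate, Adagrad-style;
# the toy loss L(w) = w[0]**2 + 10*w[1]**2 and all constants are illustrative.

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])

eta, eps = 0.5, 1e-8
w = np.array([1.0, 1.0])
accum = np.zeros_like(w)                   # running sum of squared gradients

for _ in range(200):
    g = grad(w)
    accum += g ** 2
    w -= eta * g / (np.sqrt(accum) + eps)  # per-coordinate effective step size

print(w)  # both coordinates shrink toward 0 despite very different curvatures
```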

Learning rate schedule


The initial rate can be left as a system default or can be selected using a range of techniques.[8] A learning rate schedule changes the learning rate during learning, most often between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules, but the most common are time-based, step-based and exponential.[4]

Decay serves to settle the learning in a good place and avoid oscillations, a situation that may arise when a constant learning rate is too high and the learning jumps back and forth over a minimum; it is controlled by a hyperparameter.

Momentum is analogous to a ball rolling down a hill; we want the ball to settle at the lowest point of the hill (corresponding to the lowest error). Momentum both speeds up the learning (increasing the learning rate) when the error cost gradient is heading in the same direction for a long time and also avoids local minima by 'rolling over' small bumps. Momentum is controlled by a hyperparameter analogous to a ball's mass, which must be chosen manually: too high and the ball will roll over minima which we wish to find, too low and it will not fulfil its purpose. The formula for factoring in the momentum is more complex than for decay but is most often built into deep learning libraries such as Keras; a rough sketch of the basic update is given below.
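The following is a minimal sketch of classical momentum (the exact formulation varies between libraries): a velocity term accumulates past gradients, so steps grow while the gradient keeps pointing in the same direction. The loss and the hyperparameter values are illustrative.

```python
# Minimal sketch of gradient descent with classical momentum on the toy loss
# L(w) = (w - 3)^2; eta and mu are illustrative hyperparameter values.

def grad(w):
    return 2.0 * (w - 3.0)

eta, mu = 0.1, 0.9          # learning rate and momentum coefficient
w, velocity = 0.0, 0.0

for _ in range(100):
    velocity = mu * velocity - eta * grad(w)  # accumulate past gradients
    w += velocity                             # move by the accumulated velocity

print(w)  # approaches the minimum at w = 3
```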

Time-based learning schedules alter the learning rate depending on the learning rate of the previous time iteration. Factoring in the decay, the mathematical formula for the learning rate is:

$$\eta_{n+1} = \frac{\eta_n}{1 + dn}$$

where $\eta$ is the learning rate, $d$ is a decay parameter and $n$ is the iteration step.
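A minimal sketch of this recurrence, with illustrative values for the initial rate and the decay parameter:

```python
# Minimal sketch of the time-based schedule eta_{n+1} = eta_n / (1 + d*n);
# eta0 = 0.1 and d = 0.01 are illustrative values.

def time_based_schedule(eta0, d, num_iterations):
    etas = [eta0]
    for n in range(num_iterations):
        etas.append(etas[-1] / (1.0 + d * n))
    return etas

print(time_based_schedule(eta0=0.1, d=0.01, num_iterations=5))
```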

Step-based learning schedules change the learning rate according to some predefined steps. The decay application formula is here defined as:

$$\eta_n = \eta_0 \, d^{\left\lfloor \frac{1+n}{r} \right\rfloor}$$

where $\eta_n$ is the learning rate at iteration $n$, $\eta_0$ is the initial learning rate, $d$ is how much the learning rate should change at each drop (0.5 corresponds to a halving) and $r$ corresponds to the drop rate, or how often the rate should be dropped (10 corresponds to a drop every 10 iterations). The floor function ($\lfloor \cdot \rfloor$) rounds its input down to the nearest integer, so it maps all values smaller than 1 to 0.
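A minimal sketch of this formula, here halving the rate every 10 iterations (illustrative values):

```python
import math

# Minimal sketch of the step-based schedule eta_n = eta0 * d**floor((1 + n) / r);
# eta0 = 0.1, d = 0.5 and r = 10 are illustrative values.

def step_based_schedule(eta0, d, r, n):
    return eta0 * d ** math.floor((1 + n) / r)

# The rate is halved every 10 iterations:
print([step_based_schedule(0.1, 0.5, 10, n) for n in (0, 9, 10, 19, 20)])
# -> [0.1, 0.05, 0.05, 0.025, 0.025]
```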

Exponential learning schedules are similar to step-based, but instead of steps, a decreasing exponential function is used. The mathematical formula for factoring in the decay is:

$$\eta_n = \eta_0 e^{-dn}$$

where $d$ is a decay parameter.
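A minimal sketch of the exponential schedule, again with illustrative values:

```python
import math

# Minimal sketch of the exponential schedule eta_n = eta0 * exp(-d * n);
# eta0 = 0.1 and d = 0.1 are illustrative values.

def exponential_schedule(eta0, d, n):
    return eta0 * math.exp(-d * n)

print([round(exponential_schedule(0.1, 0.1, n), 4) for n in range(5)])
# -> [0.1, 0.0905, 0.0819, 0.0741, 0.067]
```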

Adaptive learning rate


The issue with learning rate schedules is that they all depend on hyperparameters that must be manually chosen for each given learning session and may vary greatly depending on the problem at hand or the model used. To combat this, there are many different types of adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop, and Adam,[9] which are generally built into deep learning libraries such as Keras.[10]
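As a hedged example, assuming TensorFlow 2.x/Keras is installed and using an illustrative toy model, selecting an adaptive optimizer such as Adam only requires choosing a base learning rate; the per-parameter adaptation is handled internally:

```python
# Minimal sketch: using an adaptive optimizer (Adam) in Keras, assuming
# TensorFlow 2.x is installed; the model and the base rate are illustrative.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Adam adapts an effective per-parameter step size from running estimates of
# the gradient's first and second moments; only the base rate is set here.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss="mse")
```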


References

  1. ^ Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press. p. 247. ISBN 978-0-262-01802-9.
  2. ^ Delyon, Bernard (2000). "Stochastic Approximation with Decreasing Gain: Convergence and Asymptotic Theory". Unpublished Lecture Notes. Université de Rennes. CiteSeerX 10.1.1.29.4428.
  3. ^ Buduma, Nikhil; Locascio, Nicholas (2017). Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. O'Reilly. p. 21. ISBN 978-1-4919-2558-4.
  4. ^ a b Patterson, Josh; Gibson, Adam (2017). "Understanding Learning Rates". Deep Learning: A Practitioner's Approach. O'Reilly. pp. 258–263. ISBN 978-1-4919-1425-0.
  5. ^ Ruder, Sebastian (2017). "An Overview of Gradient Descent Optimization Algorithms". arXiv:1609.04747 [cs.LG].
  6. ^ Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer. p. 25. ISBN 1-4020-7553-7.
  7. ^ Dixon, L. C. W. (1972). "The Choice of Step Length, a Crucial Factor in the Performance of Variable Metric Algorithms". Numerical Methods for Non-linear Optimization. London: Academic Press. pp. 149–170. ISBN 0-12-455650-7.
  8. ^ Smith, Leslie N. (4 April 2017). "Cyclical Learning Rates for Training Neural Networks". arXiv:1506.01186 [cs.CV].
  9. ^ Murphy, Kevin (2021). Probabilistic Machine Learning: An Introduction. MIT Press. Retrieved 10 April 2021.
  10. ^ Brownlee, Jason (22 January 2019). "How to Configure the Learning Rate When Training Deep Learning Neural Networks". Machine Learning Mastery. Retrieved 4 January 2021.


External links

  • de Freitas, Nando (February 12, 2015). "Optimization". Deep Learning Lecture 6. University of Oxford – via YouTube.