For the formalism used to approximate the influence of an extracellular electrical field on neurons, see activating function. For a linear system’s transfer function, see transfer function.
In artificial neural networks, the activation function of a node is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear.[1]
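As a minimal sketch of this definition, the snippet below computes a node's output as an activation function applied to the weighted sum of its inputs. The helper name `node_output`, the bias term, and the example numbers are illustrative assumptions, not part of the definition above.

```python
import math

def node_output(inputs, weights, bias, activation):
    """Apply `activation` to the weighted sum of the inputs plus a bias.

    The bias term is an assumption made for illustration; the text above
    only mentions inputs and weights.
    """
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)

def identity(z):
    # Linear (identity) activation: the node outputs the weighted sum itself.
    return z

inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.1

linear_out = node_output(inputs, weights, bias, identity)
tanh_out = node_output(inputs, weights, bias, math.tanh)  # a nonlinear choice
```

Swapping the `activation` argument is the only change needed to move from a linear node to a nonlinear one.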
Modern activation functions include the logistic (sigmoid) function used in the 2012 speech recognition model developed by Hinton et al.;[2] the ReLU used in the 2012 AlexNet computer vision model[3][4] and in the 2015 ResNet model; and the smooth version of the ReLU, the GELU, which was used in the 2018 BERT model.[5]
Aside from their empirical performance, activation functions also have different mathematical properties:
Nonlinear
When the activation function is nonlinear, a two-layer neural network can be proven to be a universal function approximator.[6] This is known as the Universal Approximation Theorem. The identity activation function does not satisfy this property: when multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
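The collapse of identity-activated layers into a single layer can be checked numerically. The sketch below is a small illustration under stated assumptions: the weight matrices `W1` and `W2` and the input `x` are made-up values, biases are omitted for brevity, and the helpers `matvec` and `matmul` are hypothetical names.

```python
def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Multiply matrix A by matrix B, both given as lists of rows.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def relu(v):
    # Elementwise ReLU, used here as an example of a nonlinear activation.
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # first layer: 2 inputs -> 3 units
W2 = [[2.0, 0.0, 1.0], [1.0, -1.0, 0.0]]    # second layer: 3 units -> 2 outputs
x = [1.0, -2.0]

# Two layers with the identity activation ...
two_layer = matvec(W2, matvec(W1, x))
# ... give the same output as one layer whose weights are the product W2 * W1.
one_layer = matvec(matmul(W2, W1), x)

# With a nonlinear activation between the layers, the result differs.
with_relu = matvec(W2, relu(matvec(W1, x)))
```

Here `two_layer` and `one_layer` agree exactly, while `with_relu` does not, which is why a nonlinearity is needed for depth to add expressive power.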
Range
When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.[citation needed]
Continuously differentiable
This property is desirable for enabling gradient-based optimization methods. ReLU is not continuously differentiable (its derivative jumps at 0), which causes some issues for gradient-based optimization, but such optimization is still possible. The binary step activation function is not differentiable at 0, and its derivative is 0 at all other values, so gradient-based methods can make no progress with it.[7]
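A quick numerical check illustrates the difference. The sketch below estimates derivatives with a central finite difference; the helper `numerical_derivative`, the step size `h`, and the sample points are assumptions chosen for illustration.

```python
def binary_step(x):
    # 1 for non-negative input, 0 otherwise.
    return 1.0 if x >= 0 else 0.0

def relu(x):
    return max(0.0, x)

def numerical_derivative(f, x, h=1e-6):
    # Central finite-difference estimate of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

# Sample points chosen away from 0, where both functions are differentiable.
points = [-2.0, -0.5, 0.5, 2.0]

step_grads = [numerical_derivative(binary_step, x) for x in points]
relu_grads = [numerical_derivative(relu, x) for x in points]
```

Every entry of `step_grads` is zero, so a gradient step would leave the weights unchanged, whereas `relu_grads` is close to 1 for the positive inputs and 0 for the negative ones.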
These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in variational autoencoders.
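As an illustration of the softplus example, the sketch below evaluates softplus(x) = log(1 + e^x) in a numerically stable form and applies it to some made-up raw network outputs, treated here as hypothetical pre-activation values for a variance.

```python
import math

def softplus(x):
    # Stable form of log(1 + exp(x)): max(x, 0) + log1p(exp(-|x|)).
    # Mathematically the result is strictly positive; in floating point it
    # can underflow to 0.0 for extremely negative x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical raw outputs of a network head, of either sign.
raw_outputs = [-20.0, -1.0, 0.0, 1.0, 20.0]

# Passing them through softplus gives positive values usable as variances.
variances = [softplus(r) for r in raw_outputs]
```

For large positive inputs softplus is close to the identity, and for large negative inputs it approaches zero from above.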
An activation function $f$ is saturating if $\lim_{|v|\to\infty} |\nabla f(v)| = 0$. It is nonsaturating if $\lim_{|v|\to\infty} |\nabla f(v)| \neq 0$. Nonsaturating activation functions, such as ReLU, may be better than saturating activation functions, because they are less likely to suffer from the vanishing gradient problem.[8]
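Saturation can be illustrated numerically. The sketch below compares the derivative of the logistic sigmoid, taken here as a representative saturating function, with the derivative of ReLU at inputs of growing magnitude; the function names and sample inputs are illustrative assumptions.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function (avoids overflow in exp).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    # Derivative of the logistic function: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU away from 0 (taken as 0 at x == 0 by convention here).
    return 1.0 if x > 0 else 0.0

inputs = [0.0, 5.0, 20.0]
sigmoid_grads = [sigmoid_grad(x) for x in inputs]  # shrinks toward 0
relu_grads = [relu_grad(x) for x in inputs[1:]]    # stays at 1 for x > 0
```

The sigmoid's derivative peaks at 0.25 and decays rapidly in both directions, while ReLU's derivative stays at 1 for every positive input, which is the sense in which it does not saturate on that side.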
In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell.[9] In its simplest form, this function is binary: the neuron is either firing or not. Neurons also cannot fire faster than a certain rate, motivating sigmoid activation functions whose range is a finite interval.
A line with a positive slope, on the other hand, may reflect the increase in firing rate that occurs as input current increases. Such a function would be of the form $\phi(\mathbf{v}) = a + \mathbf{v}'\mathbf{b}$.
Rectified linear unit and Gaussian error linear unit activation functions
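The two functions named in the caption above can be sketched in a few lines. ReLU is max(0, x); GELU is written here in its exact form x·Φ(x), where Φ is the standard normal cumulative distribution function, computed with `math.erf`. The sample inputs are arbitrary.

```python
import math

def relu(x):
    # Rectified linear unit: passes positive inputs, zeroes out the rest.
    return max(0.0, x)

def gelu(x):
    # Gaussian error linear unit, x * Phi(x), with Phi expressed via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

samples = [-3.0, -1.0, 0.0, 1.0, 3.0]

# (input, ReLU output, GELU output) for each sample point.
table = [(x, relu(x), gelu(x)) for x in samples]
```

Unlike ReLU, GELU is smooth at 0 and slightly negative for small negative inputs, while both functions approach the identity for large positive inputs.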
A special class of activation functions known as radial basis functions (RBFs) is used in RBF networks. These activation functions can take many forms, but they are usually one of a few standard choices, such as the Gaussian, multiquadric, inverse multiquadric, or polyharmonic spline functions.
Periodic functions can serve as activation functions. Usually the sinusoid is used, as any periodic function is decomposable into sinusoids by the Fourier transform.[10]
In quantum neural networks programmed on gate-model quantum computers, based on quantum perceptrons instead of variational quantum circuits, the nonlinearity of the activation function can be implemented without measuring the output of each perceptron at each layer. The quantum properties loaded within the circuit, such as superposition, can be preserved by creating the Taylor series of the argument computed by the perceptron itself, with suitable quantum circuits computing the powers up to the desired approximation degree. Because of the flexibility of such quantum circuits, they can be designed to approximate any arbitrary classical activation function.[25]
^ Hinkelmann, Knut. "Neural Networks, p. 7" (PDF). University of Applied Sciences Northwestern Switzerland. Archived from the original (PDF) on 6 October 2018. Retrieved 6 October 2018.
^ Hinton, Geoffrey; Deng, Li; Yu, Dong; Dahl, George; Mohamed, Abdel-rahman; Jaitly, Navdeep; Senior, Andrew; Vanhoucke, Vincent; Nguyen, Patrick; Sainath, Tara; Kingsbury, Brian (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition". IEEE Signal Processing Magazine. 29 (6): 82–97. doi:10.1109/MSP.2012.2205597. S2CID 206485943.
^ Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011). "Deep sparse rectifier neural networks" (PDF). International Conference on Artificial Intelligence and Statistics.
^ Clevert, Djork-Arné; Unterthiner, Thomas; Hochreiter, Sepp (23 November 2015). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)". arXiv:1511.07289 [cs.LG].
^ Klambauer, Günter; Unterthiner, Thomas; Mayr, Andreas; Hochreiter, Sepp (8 June 2017). "Self-Normalizing Neural Networks". Advances in Neural Information Processing Systems. 30 (2017). arXiv:1706.02515.
^ Maas, Andrew L.; Hannun, Awni Y.; Ng, Andrew Y. (June 2013). "Rectifier nonlinearities improve neural network acoustic models". Proc. ICML. 30 (1). S2CID 16489696.
^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (6 February 2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
Kunc, Vladimír; Kléma, Jiří (14 February 2024). "Three Decades of Activations: A Comprehensive Survey of 400 Activation Functions for Neural Networks". arXiv:2402.09092.
Nwankpa, Chigozie; Ijomah, Winifred; Gachagan, Anthony; Marshall, Stephen (8 November 2018). "Activation Functions: Comparison of trends in Practice and Research for Deep Learning". arXiv:1811.03378 [cs.LG].