Softmax function

From Wikipedia, the free encyclopedia
Smooth approximation of one-hot arg max
This article is about the smooth approximation of one-hot arg max. For the smooth approximation of max, see LogSumExp.
"Softmax" redirects here. For the Korean video game company, see ESA (company).

The softmax function, also known as softargmax[1]: 184 or normalized exponential function,[2]: 198 converts a vector of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

Definition


The softmax function takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval $(0, 1)$, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

Formally, the standard (unit) softmax function $\sigma \colon \mathbb{R}^{K} \to (0,1)^{K}$, where $K > 1$, takes a vector $\mathbf{z} = (z_{1}, \dotsc, z_{K}) \in \mathbb{R}^{K}$ and computes each component of vector $\sigma(\mathbf{z}) \in (0,1)^{K}$ with

$$\sigma(\mathbf{z})_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}\,.$$

In words, the softmax applies the standard exponential function to each element $z_{i}$ of the input vector $\mathbf{z}$ (consisting of $K$ real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector $\sigma(\mathbf{z})$ is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of $(1, 2, 8)$ is approximately $(0.001, 0.002, 0.997)$, which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).

In general, instead of $e$ a different base $b > 0$ can be used. As above, if $b > 1$ then larger input components will result in larger output probabilities, and increasing the value of $b$ will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if $0 < b < 1$ then smaller input components will result in larger output probabilities, and decreasing the value of $b$ will create probability distributions that are more concentrated around the positions of the smallest input values. Writing $b = e^{\beta}$ or $b = e^{-\beta}$[a] (for real $\beta$)[b] yields the expressions:[c]

$$\sigma(\mathbf{z})_{i} = \frac{e^{\beta z_{i}}}{\sum_{j=1}^{K} e^{\beta z_{j}}} \text{ or } \sigma(\mathbf{z})_{i} = \frac{e^{-\beta z_{i}}}{\sum_{j=1}^{K} e^{-\beta z_{j}}} \text{ for } i = 1, \dotsc, K.$$

A value proportional to the reciprocal of $\beta$ is sometimes referred to as the temperature: $\beta = 1/kT$, where $k$ is typically 1 or the Boltzmann constant and $T$ is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
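For illustration, a minimal NumPy sketch (in the style of the example later in the article) of how the inverse-temperature parameter β sharpens or flattens the output; the function name and sample values are chosen here purely for illustration:

import numpy as np

def softmax(z, beta=1.0):
    # Softmax with inverse temperature beta; the max is subtracted for numerical
    # stability, which does not change the output (translation invariance).
    z = beta * np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

z = [1.0, 2.0, 8.0]
print(softmax(z, beta=1.0))   # sharply peaked: approx. (0.001, 0.002, 0.997)
print(softmax(z, beta=0.1))   # high temperature (T = 10): closer to uniform
print(softmax(z, beta=10.0))  # low temperature (T = 0.1): essentially one-hot arg max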

In some fields, the base is fixed, corresponding to a fixed scale,[d] while in others the parameter β (or T) is varied.

Interpretations


Smooth arg max

See also: Arg max

The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a vector's largest element. The name "softmax" may be misleading. Softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.[3][4] This section uses the term "softargmax" for clarity.

Formally, instead of considering the arg max as a function with categorical output $1, \dots, n$ (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg):

$$\operatorname{arg\,max}(z_{1}, \dots, z_{n}) = (y_{1}, \dots, y_{n}) = (0, \dots, 0, 1, 0, \dots, 0),$$

where the output coordinate $y_{i} = 1$ if and only if $i$ is the arg max of $(z_{1}, \dots, z_{n})$, meaning $z_{i}$ is the unique maximum value of $(z_{1}, \dots, z_{n})$. For example, in this encoding $\operatorname{arg\,max}(1, 5, 10) = (0, 0, 1),$ since the third argument is the maximum.

This can be generalized to multiple arg max values (multiple equal $z_{i}$ being the maximum) by dividing the 1 between all max args; formally $1/k$ where $k$ is the number of arguments assuming the maximum. For example, $\operatorname{arg\,max}(1, 5, 5) = (0, 1/2, 1/2),$ since the second and third arguments are both the maximum. In case all arguments are equal, this is simply $\operatorname{arg\,max}(z, \dots, z) = (1/n, \dots, 1/n).$ Points $\mathbf{z}$ with multiple arg max values are singular points (or singularities, and form the singular set) – these are the points where arg max is discontinuous (with a jump discontinuity) – while points with a single arg max are known as non-singular or regular points.

With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as $\beta \to \infty$, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input $\mathbf{z}$, as $\beta \to \infty$, $\sigma_{\beta}(\mathbf{z}) \to \operatorname{arg\,max}(\mathbf{z})$. However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, $\sigma_{\beta}(1, 1.0001) \to (0, 1),$ but $\sigma_{\beta}(1, 0.9999) \to (1, 0),$ and $\sigma_{\beta}(1, 1) = (1/2, 1/2)$ for all $\beta$: the closer the points are to the singular set $(x, x)$, the slower they converge. However, softargmax does converge compactly on the non-singular set.
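The pointwise but non-uniform convergence can be seen numerically; a small sketch, reusing a stabilized softmax with an explicit β:

import numpy as np

def softargmax(z, beta):
    z = beta * np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

for beta in (1.0, 10.0, 100.0, 1000.0, 100000.0):
    print(beta, softargmax([1.0, 1.0001], beta), softargmax([1.0, 0.9999], beta))
# The outputs drift toward (0, 1) and (1, 0) respectively, but only very slowly,
# because the inputs lie close to the singular set; at exactly (1, 1) the output
# stays (0.5, 0.5) for every beta.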

Conversely, as $\beta \to -\infty$, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".

It is also the case that, for any fixed $\beta$, if one input $z_{i}$ is much larger than the others relative to the temperature, $T = 1/\beta$, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

$$\sigma(0, 10) := \sigma_{1}(0, 10) = \left(1/\left(1 + e^{10}\right),\; e^{10}/\left(1 + e^{10}\right)\right) \approx (0.00005,\; 0.99995).$$

However, if the difference is small relative to the temperature, the value is not close to the arg max. For example, a difference of 10 is small relative to a temperature of 100:

$$\sigma_{1/100}(0, 10) = \left(1/\left(1 + e^{1/10}\right),\; e^{1/10}/\left(1 + e^{1/10}\right)\right) \approx (0.475,\; 0.525).$$

As $\beta \to \infty$, temperature goes to zero, $T = 1/\beta \to 0$, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.

Statistical mechanics


In statistical mechanics, the softargmax function is known as the Boltzmann distribution (or Gibbs distribution):[5]: 7 the index set $\{1, \dots, k\}$ are the microstates of the system; the inputs $z_{i}$ are the energies of that state; the denominator is known as the partition function, often denoted by $Z$; and the factor $\beta$ is called the coldness (or thermodynamic beta, or inverse temperature).

Applications


The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression),[2]: 206–209 [6] multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.[7] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

$$P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^{\mathsf{T}}\mathbf{w}_{j}}}{\sum_{k=1}^{K} e^{\mathbf{x}^{\mathsf{T}}\mathbf{w}_{k}}}$$

This can be seen as the composition of K linear functions $\mathbf{x} \mapsto \mathbf{x}^{\mathsf{T}}\mathbf{w}_{1}, \ldots, \mathbf{x} \mapsto \mathbf{x}^{\mathsf{T}}\mathbf{w}_{K}$ and the softmax function (where $\mathbf{x}^{\mathsf{T}}\mathbf{w}$ denotes the inner product of $\mathbf{x}$ and $\mathbf{w}$). The operation is equivalent to applying a linear operator defined by $\mathbf{w}$ to vectors $\mathbf{x}$, thus transforming the original, probably highly-dimensional, input to vectors in a K-dimensional space $\mathbb{R}^{K}$.
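As a brief sketch of this use in multinomial logistic regression (the weights and sample vector are randomly generated and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 5                      # 3 classes, 5 features (arbitrary)
W = rng.normal(size=(d, K))      # one weight vector w_j per class, stored as columns
x = rng.normal(size=d)           # a sample vector

scores = x @ W                   # the K linear functions x^T w_j
scores -= scores.max()           # stabilize before exponentiating
probs = np.exp(scores) / np.exp(scores).sum()
print(probs, probs.sum())        # predicted P(y = j | x); the probabilities sum to 1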

Neural networks


The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index $i$ to a real value, the derivative needs to take the index into account:

$$\frac{\partial}{\partial q_{k}}\sigma(\mathbf{q}, i) = \sigma(\mathbf{q}, i)\,(\delta_{ik} - \sigma(\mathbf{q}, k)).$$

This expression is symmetrical in the indexes $i, k$ and thus may also be expressed as

$$\frac{\partial}{\partial q_{k}}\sigma(\mathbf{q}, i) = \sigma(\mathbf{q}, k)\,(\delta_{ik} - \sigma(\mathbf{q}, i)).$$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
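A minimal sketch assembling the full Jacobian from this expression (the helper names are illustrative, not from the cited sources):

import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def softmax_jacobian(q):
    # J[k, i] = d sigma_i / d q_k = sigma_i * (delta_ik - sigma_k)
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(q)
print(np.allclose(J, J.T))              # symmetric in the indexes i, k
print(np.allclose(J.sum(axis=0), 0.0))  # each column sums to 0: the outputs always sum to 1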

To ensure numerically stable computations, it is common to subtract the maximum value from the input vector. While this theoretically alters neither the output nor the derivative, it enhances stability by directly controlling the largest exponent that is computed.

If the function is scaled with the parameter $\beta$, then these expressions must be multiplied by $\beta$.

See multinomial logit for a probability model which uses the softmax activation function.

Reinforcement learning


In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[8]

$$P_{t}(a) = \frac{\exp(q_{t}(a)/\tau)}{\sum_{i=1}^{n} \exp(q_{t}(i)/\tau)},$$

where the action value $q_{t}(a)$ corresponds to the expected reward of following action $a$ and $\tau$ is called a temperature parameter (in allusion to statistical mechanics). For high temperatures ($\tau \to \infty$), all actions have nearly the same probability, and the lower the temperature, the more the expected rewards affect the probability. For a low temperature ($\tau \to 0^{+}$), the probability of the action with the highest expected reward tends to 1.
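A small sketch of this action-selection rule (the action-value estimates are arbitrary placeholders):

import numpy as np

def softmax_action_probs(q_values, tau):
    # Boltzmann/softmax exploration: convert estimated action values into probabilities.
    z = np.asarray(q_values, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

q = [1.0, 2.0, 1.5]                 # hypothetical action-value estimates q_t(a)
p = softmax_action_probs(q, tau=0.5)
rng = np.random.default_rng(0)
action = rng.choice(len(q), p=p)    # sample an action according to these probabilities
print(p, action)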

Computational complexity and remedies


In neural network applications, the number K of possible outcomes is often large, e.g. in case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words.[9] This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the $z_{i}$, followed by the application of the softmax function itself) computationally expensive.[9][10] What's more, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.[9][10]

Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax.[9] The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables.[10][11] The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.[10] Ideally, when the tree is balanced, this would reduce the computational complexity from $O(K)$ to $O(\log_{2} K)$.[11] In practice, results depend on choosing a good strategy for clustering the outcomes into classes.[10][11] A Huffman tree was used for this in Google's word2vec models (introduced in 2013) to achieve scalability.[9]
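A toy sketch of the path-product idea behind the hierarchical softmax, using a complete binary tree with a sigmoid decision at each internal node (the tree layout, indexing, and parameter names here are illustrative, not the construction of the cited papers):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_prob(leaf, h, node_vecs, K):
    # P(leaf | h) as a product of binary decisions along the root-to-leaf path
    # of a complete binary tree with K leaves (K a power of two), heap-indexed from 1.
    prob, node = 1.0, 1
    depth = int(np.log2(K))
    for d in reversed(range(depth)):
        go_right = (leaf >> d) & 1            # d-th bit of the leaf index picks the branch
        score = node_vecs[node] @ h
        prob *= sigmoid(score) if go_right else sigmoid(-score)
        node = 2 * node + go_right
    return prob

K, dim = 8, 4
rng = np.random.default_rng(0)
node_vecs = rng.normal(size=(K, dim))         # vectors for internal nodes 1..K-1
h = rng.normal(size=dim)
probs = [hierarchical_prob(i, h, node_vecs, K) for i in range(K)]
print(np.isclose(sum(probs), 1.0))            # the K leaf probabilities sum to 1
# Only log2(K) sigmoid evaluations are needed per leaf, versus K exponentials for the full softmax.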

A second kind of remedy is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.[9] These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).[9][10]

Numerical algorithms


The standard softmax is numerically unstable because of large exponentiations. The safe softmax method instead calculates

$$\sigma(\mathbf{z})_{i} = \frac{e^{\beta(z_{i}-m)}}{\sum_{j=1}^{K} e^{\beta(z_{j}-m)}}$$

where $m = \max_{i} z_{i}$ is the largest input involved. Subtracting it guarantees that each exponentiation results in a value of at most 1.
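A short demonstration of the instability and of the fix (the large input values are arbitrary):

import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

naive = np.exp(z) / np.exp(z).sum()          # exp(1000) overflows to inf, giving nan
m = z.max()
safe = np.exp(z - m) / np.exp(z - m).sum()   # every exponent is <= 0, so each term is <= 1

print(naive)   # [nan nan nan], with an overflow warning
print(safe)    # [0.09003057 0.24472847 0.66524096]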

The attention mechanism in Transformers takes three arguments: a "query vector" $q$, a list of "key vectors" $k_{1}, \dots, k_{N}$, and a list of "value vectors" $v_{1}, \dots, v_{N}$, and outputs a softmax-weighted sum over the value vectors:

$$o = \sum_{i=1}^{N} \frac{e^{q^{T}k_{i} - m}}{\sum_{j=1}^{N} e^{q^{T}k_{j} - m}} v_{i}$$

The standard softmax method involves several loops over the inputs, which would be bottlenecked by memory bandwidth. The FlashAttention method is a communication-avoiding algorithm that fuses these operations into a single loop, increasing the arithmetic intensity. It is an online algorithm that computes the following quantities:[12][13]

$$\begin{aligned} z_{i} &= q^{T}k_{i} \\ m_{i} &= \max(z_{1}, \dots, z_{i}) = \max(m_{i-1}, z_{i}) \\ l_{i} &= e^{z_{1}-m_{i}} + \dots + e^{z_{i}-m_{i}} = e^{m_{i-1}-m_{i}} l_{i-1} + e^{z_{i}-m_{i}} \\ o_{i} &= e^{z_{1}-m_{i}} v_{1} + \dots + e^{z_{i}-m_{i}} v_{i} = e^{m_{i-1}-m_{i}} o_{i-1} + e^{z_{i}-m_{i}} v_{i} \end{aligned}$$

and returns $o_{N}/l_{N}$. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as blocked matrix multiplication. If backpropagation is needed, then the output vectors and the intermediate arrays $[m_{1}, \dots, m_{N}], [l_{1}, \dots, l_{N}]$ are cached, and during the backward pass, attention matrices are rematerialized from these, making it a form of gradient checkpointing.
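A minimal NumPy sketch of the single-pass recurrence above, for one query and unbatched keys/values (this is the scalar version of the online algorithm, not the blocked GPU kernel):

import numpy as np

def online_attention(q, K, V):
    # One pass over keys/values, maintaining the running max m, normalizer l,
    # and unnormalized output o, as in the recurrences above.
    m, l = -np.inf, 0.0
    o = np.zeros(V.shape[1])
    for k_i, v_i in zip(K, V):
        z_i = q @ k_i
        m_new = max(m, z_i)
        scale = np.exp(m - m_new)            # rescales the old l and o to the new max
        w = np.exp(z_i - m_new)
        l = scale * l + w
        o = scale * o + w * v_i
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
z = K @ q
w = np.exp(z - z.max()); w /= w.sum()
print(np.allclose(online_attention(q, K, V), w @ V))   # matches the two-pass result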

Mathematical properties


Geometrically the softmax function maps the vector space $\mathbb{R}^{K}$ to the interior of the standard $(K-1)$-simplex, cutting the dimension by one (the range is a $(K-1)$-dimensional simplex in $K$-dimensional space), due to the linear constraint that all outputs sum to 1, meaning the range lies on a hyperplane.

Along the main diagonal $(x, x, \dots, x),$ softmax is just the uniform distribution on outputs, $(1/n, \dots, 1/n)$: equal scores yield equal probabilities.

More generally, softmax is invariant under translation by the same value in each coordinate: adding $\mathbf{c} = (c, \dots, c)$ to the inputs $\mathbf{z}$ yields $\sigma(\mathbf{z} + \mathbf{c}) = \sigma(\mathbf{z})$, because it multiplies each exponent by the same factor, $e^{c}$ (because $e^{z_{i}+c} = e^{z_{i}} \cdot e^{c}$), so the ratios do not change:

$$\sigma(\mathbf{z} + \mathbf{c})_{j} = \frac{e^{z_{j}+c}}{\sum_{k=1}^{K} e^{z_{k}+c}} = \frac{e^{z_{j}} \cdot e^{c}}{\sum_{k=1}^{K} e^{z_{k}} \cdot e^{c}} = \sigma(\mathbf{z})_{j}.$$

Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: $\mathbf{c}$ where $c = \frac{1}{n}\sum z_{i}$), and then the softmax takes the hyperplane of points that sum to zero, $\sum z_{i} = 0$, to the open simplex of positive values that sum to 1, $\sum \sigma(\mathbf{z})_{i} = 1$, analogously to how the exponential function takes 0 to 1, $e^{0} = 1$, and is positive.

By contrast, softmax is not invariant under scaling. For instance, $\sigma\bigl((0, 1)\bigr) = \bigl(1/(1+e),\, e/(1+e)\bigr)$ but $\sigma\bigl((0, 2)\bigr) = \bigl(1/(1+e^{2}),\, e^{2}/(1+e^{2})\bigr).$
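These two properties can be checked directly; a quick sketch:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.0, 1.0])
print(np.allclose(softmax(z + 5.0), softmax(z)))  # True: invariant under translation
print(np.allclose(softmax(2.0 * z), softmax(z)))  # False: not invariant under scaling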

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane. One variable is fixed at 0 (say $z_{2} = 0$), so $e^{0} = 1$, and the other variable can vary, denote it $z_{1} = x$, so $e^{z_{1}}/\sum_{k=1}^{2} e^{z_{k}} = e^{x}/\left(e^{x} + 1\right),$ the standard logistic function, and $e^{z_{2}}/\sum_{k=1}^{2} e^{z_{k}} = 1/\left(e^{x} + 1\right),$ its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line $(x/2, -x/2)$, with outputs $e^{x/2}/\left(e^{x/2} + e^{-x/2}\right) = e^{x}/\left(e^{x} + 1\right)$ and $e^{-x/2}/\left(e^{x/2} + e^{-x/2}\right) = 1/\left(e^{x} + 1\right).$

Gradients


The softmax function is also the gradient of the LogSumExp function:

$$\frac{\partial}{\partial z_{i}} \operatorname{LSE}(\mathbf{z}) = \frac{\exp z_{i}}{\sum_{j=1}^{K} \exp z_{j}} = \sigma(\mathbf{z})_{i}, \quad \text{for } i = 1, \dotsc, K, \quad \mathbf{z} = (z_{1}, \dotsc, z_{K}) \in \mathbb{R}^{K},$$

where the LogSumExp function is defined as $\operatorname{LSE}(z_{1}, \dots, z_{n}) = \log\left(\exp(z_{1}) + \cdots + \exp(z_{n})\right)$.

The gradient of softmax is thus $\partial_{z_{j}}\sigma_{i} = \sigma_{i}(\delta_{ij} - \sigma_{j})$.
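A quick finite-difference check of this identity (the step size and inputs are arbitrary):

import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
eps = 1e-6
numeric_grad = np.array([
    (logsumexp(z + eps * np.eye(3)[i]) - logsumexp(z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(numeric_grad, softmax(z)))   # True: the gradient of LSE is the softmax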

History


The softmax function was used in statistical mechanics as the Boltzmann distribution in the foundational paper Boltzmann (1868),[14] formalized and popularized in the influential textbook Gibbs (1902).[15]

The use of the softmax in decision theory is credited to R. Duncan Luce,[16]: 1 who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences.[citation needed]

In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, Bridle (1990a)[16]: 1 and Bridle (1990b):[3]

We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (e.g. weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.[17]: 227 

For any input, the outputs must all be positive and they must sum to unity. ...

Given a set of unconstrained values, $V_{j}(x)$, we can ensure both conditions by using a Normalised Exponential transformation:

$$Q_{j}(x) = \left. e^{V_{j}(x)} \right/ \sum_{k} e^{V_{k}(x)}$$

This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as softmax.[18]: 213

Example


With an input of (1, 2, 3, 4, 1, 2, 3), the softmax is approximately (0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175). The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: a change of temperature changes the output. When the temperature is multiplied by 10, the inputs are effectively (0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3) and the softmax is approximately (0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153). This shows that high temperatures de-emphasize the maximum value.

Computation of this example using Python code:

>>> import numpy as np
>>> z = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
>>> beta = 1.0
>>> np.exp(beta * z) / np.sum(np.exp(beta * z))
array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054,
       0.06426166, 0.1746813 ])

Alternatives


The softmax function generates probability predictions densely distributed over its support. Other functions like sparsemax or α-entmax can be used when sparse probability predictions are desired.[19] Also, the Gumbel-softmax reparametrization trick can be used when sampling from a discrete distribution needs to be mimicked in a differentiable manner.
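A minimal sketch of the Gumbel-softmax relaxation (the function name, temperature value, and logits are illustrative):

import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    # Differentiable relaxation of sampling from Categorical(softmax(logits)).
    u = rng.uniform(size=len(logits))
    g = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (np.asarray(logits) + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()                      # nearly one-hot for small tau

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.3, 0.6]))
print(gumbel_softmax_sample(logits, tau=0.1, rng=rng))   # close to a one-hot sample
print(gumbel_softmax_sample(logits, tau=5.0, rng=rng))   # smoother, nearer to uniform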


Notes

  1. ^ Positive β corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative −β corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the Gibbs distribution, interpreting β as coldness.
  2. ^ The notation β is for the thermodynamic beta, which is inverse temperature: $\beta = 1/T$, $T = 1/\beta$.
  3. ^ For $\beta = 0$ (coldness zero, infinite temperature), $b = e^{\beta} = e^{0} = 1$, and this becomes the constant function $(1/n, \dots, 1/n)$, corresponding to the discrete uniform distribution.
  4. ^ In statistical mechanics, fixing β is interpreted as having coldness and temperature of 1.

References

  1. ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "6.2.2.3 Softmax Units for Multinoulli Output Distributions". Deep Learning. MIT Press. pp. 180–184. ISBN 978-0-26203561-3.
  2. ^ a b Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
  3. ^ a b Sako, Yusaku (2018-06-02). "Is the term "softmax" driving you nuts?". Medium.
  4. ^ Goodfellow, Bengio & Courville 2016, pp. 183–184: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is $\operatorname{softmax}(\mathbf{z})^{\top}\mathbf{z}$. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.
  5. ^ LeCun, Yann; Chopra, Sumit; Hadsell, Raia; Ranzato, Marc'Aurelio; Huang, Fu Jie (2006). "A Tutorial on Energy-Based Learning" (PDF). In Gökhan Bakır; Thomas Hofmann; Bernhard Schölkopf; Alexander J. Smola; Ben Taskar; S.V.N Vishwanathan (eds.). Predicting Structured Data. Neural Information Processing series. MIT Press. ISBN 978-0-26202617-8.
  6. ^ "Unsupervised Feature Learning and Deep Learning Tutorial". ufldl.stanford.edu. Retrieved 2024-03-25.
  7. ^ ai-faq: What is a softmax activation function?
  8. ^ Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. Softmax Action Selection.
  9. ^ a b c d e f g Onal, Kezban Dilek; Zhang, Ye; Altingovde, Ismail Sengor; Rahman, Md Mustafizur; Karagoz, Pinar; Braylan, Alex; Dang, Brandon; Chang, Heng-Lu; Kim, Henna; McNamara, Quinten; Angert, Aaron (2018-06-01). "Neural information retrieval: at the end of the early years". Information Retrieval Journal. 21 (2): 111–182. doi:10.1007/s10791-017-9321-y. hdl:11245.1/008d6e8f-df13-4abf-8ae9-6ff2e17377f3. ISSN 1573-7659. S2CID 21684923.
  10. ^ a b c d e f Chen, Wenlin; Grangier, David; Auli, Michael (August 2016). "Strategies for Training Large Vocabulary Neural Language Models". Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics: 1975–1985. arXiv:1512.04906. doi:10.18653/v1/P16-1186. S2CID 6035643.
  11. ^ a b c Morin, Frederic; Bengio, Yoshua (2005-01-06). "Hierarchical Probabilistic Neural Network Language Model" (PDF). International Workshop on Artificial Intelligence and Statistics. PMLR: 246–252.
  12. ^ Milakov, Maxim; Gimelshein, Natalia (2018-07-28). Online normalizer calculation for softmax. arXiv:1805.02867. doi:10.48550/arXiv.1805.02867.
  13. ^Dao, Tri; Fu, Dan; Ermon, Stefano; Rudra, Atri; Ré, Christopher (2022-12-06)."FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".Advances in Neural Information Processing Systems.35:16344–16359.
  14. ^Boltzmann, Ludwig (1868). "Studien über das Gleichgewicht der lebendigen Kraft zwischen bewegten materiellen Punkten" [Studies on the balance of living force between moving material points].Wiener Berichte.58:517–560.
  15. ^Gibbs, Josiah Willard (1902).Elementary Principles in Statistical Mechanics.
  16. ^ a b Gao, Bolin; Pavel, Lacra (2017). "On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning". arXiv:1704.00805 [math.OC].
  17. ^Bridle, John S. (1990a). Soulié F.F.; Hérault J. (eds.).Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. Neurocomputing: Algorithms, Architectures and Applications (1989). NATO ASI Series (Series F: Computer and Systems Sciences). Vol. 68. Berlin, Heidelberg: Springer. pp. 227–236.doi:10.1007/978-3-642-76153-9_28.
  18. ^Bridle, John S. (1990b). D. S. Touretzky (ed.).Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters.Advances in Neural Information Processing Systems 2 (1989). Morgan-Kaufmann.
  19. ^"Speeding Up Entmax" by Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov,https://arxiv.org/abs/2111.06832v3