Normalization (machine learning)

From Wikipedia, the free encyclopedia

In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely data normalization and activation normalization. Data normalization (or feature scaling) includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is min-max normalization, where each feature is transformed to have the same range (typically [0, 1] or [−1, 1]). This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers.
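As an illustration, the following is a minimal numpy sketch of min-max feature scaling applied column-wise to a data matrix; the function name and the target range [0, 1] are assumptions for illustration, and constant features would need special handling to avoid division by zero:

import numpy as np

def min_max_normalize(X):
    # Rescale each feature (column) of X to the range [0, 1].
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)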

Activation normalization, on the other hand, is specific to deep learning, and includes methods that rescale the activation of hidden neurons inside neural networks.

Normalization is often used to:

  • increase the speed of training convergence,
  • reduce sensitivity to variations and feature scales in input data,
  • reduceoverfitting,
  • and produce better model generalization to unseen data.

Normalization techniques are often theoretically justified as reducing covariate shift, smoothing optimization landscapes, and increasing regularization, though they are mainly justified by empirical success.[1]

Batch normalization

Main article: Batch normalization

Batch normalization (BatchNorm)[2] operates on the activations of a layer for each mini-batch.

Consider a simple feedforward network, defined by chaining together modules:

x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots

where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. Here x^{(0)} is the input vector, x^{(1)} is the output vector from the first module, and so on.

BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after x^{(l)}; then the network would operate accordingly:

\cdots \mapsto x^{(l)} \mapsto \mathrm{BN}(x^{(l)}) \mapsto x^{(l+1)} \mapsto \cdots

The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time.

Concretely, suppose we have a batch of inputs x_{(1)}^{(0)}, x_{(2)}^{(0)}, \dots, x_{(B)}^{(0)}, fed all at once into the network. We would obtain in the middle of the network some vectors:

x_{(1)}^{(l)}, x_{(2)}^{(l)}, \dots, x_{(B)}^{(l)}

The BatchNorm module computes the coordinate-wise mean and variance of these vectors:

\begin{aligned}
\mu_i^{(l)} &= \frac{1}{B} \sum_{b=1}^{B} x_{(b),i}^{(l)} \\
(\sigma_i^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^{B} \left( x_{(b),i}^{(l)} - \mu_i^{(l)} \right)^2
\end{aligned}

where i indexes the coordinates of the vectors, and b indexes the elements of the batch. In other words, we are considering the i-th coordinate of each vector in the batch, and computing the mean and variance of these numbers.

It then normalizes each coordinate to have zero mean and unit variance:

\hat{x}_{(b),i}^{(l)} = \frac{x_{(b),i}^{(l)} - \mu_i^{(l)}}{\sqrt{(\sigma_i^{(l)})^2 + \epsilon}}

Here ε is a small positive constant such as 10^{-9} added to the variance for numerical stability, to avoid division by zero.

Finally, it applies a linear transformation:

y_{(b),i}^{(l)} = \gamma_i \hat{x}_{(b),i}^{(l)} + \beta_i

Here, γ and β are parameters inside the BatchNorm module. They are learnable parameters, typically trained by gradient descent.

The following is a Python implementation of BatchNorm:

import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    # Mean and variance of each feature
    mu = np.mean(x, axis=0)    # shape (N,)
    var = np.var(x, axis=0)    # shape (N,)

    # Normalize the activations
    x_hat = (x - mu) / np.sqrt(var + epsilon)  # shape (B, N)

    # Apply the linear transform
    y = gamma * x_hat + beta   # shape (B, N)
    return y
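A hypothetical usage example (the array shapes and initial parameter values are assumptions for illustration):

x = np.random.randn(32, 10)    # a batch of 32 samples with 10 features
gamma = np.ones(10)            # initial scale parameters
beta = np.zeros(10)            # initial shift parameters
y = batchnorm(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))  # approximately 0 and 1 per feature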

Interpretation


γ and β allow the network to learn to undo the normalization, if this is beneficial.[3] BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top.[4][3]

It is claimed in the original publication that BatchNorm works by reducing internal covariate shift, though the claim has both supporters[5][6] and detractors.[7][8]

Special cases


The original paper[2] recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, φ(BN(Wx + b)), not BN(φ(Wx + b)). Also, the bias b does not matter, since it would be canceled by the subsequent mean subtraction, so the form BN(Wx) is used. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to zero.[2]

For convolutional neural networks (CNNs), BatchNorm must preserve the translation invariance of these models, meaning that it must treat all outputs of the same kernel as if they are different data points within a batch.[2] This is sometimes called Spatial BatchNorm, BatchNorm2D, or per-channel BatchNorm.[9][10]

Concretely, suppose we have a 2-dimensional convolutional layer defined by:

x_{h,w,c}^{(l)} = \sum_{h',w',c'} K_{h'-h,\,w'-w,\,c,\,c'}^{(l)} \, x_{h',w',c'}^{(l-1)} + b_c^{(l)}

where:

  • x_{h,w,c}^{(l)} is the activation at height h, width w, and channel c of layer l,
  • K^{(l)} is the convolution kernel of layer l,
  • and b_c^{(l)} is the bias for channel c.

In order to preserve translation invariance, BatchNorm treats all outputs from the same kernel in the same batch as more data in a batch. That is, it is applied once per kernel c (equivalently, once per channel c), not per activation x_{h,w,c}^{(l)}:

\begin{aligned}
\mu_c^{(l)} &= \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{(b),h,w,c}^{(l)} \\
(\sigma_c^{(l)})^2 &= \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{(b),h,w,c}^{(l)} - \mu_c^{(l)} \right)^2
\end{aligned}

where B is the batch size, H is the height of the feature map, and W is the width of the feature map.

That is, even though there are only B data points in a batch, all BHW outputs from the kernel in this batch are treated equally.[2]

Subsequently, the normalization and the linear transform are also done per kernel:

\begin{aligned}
\hat{x}_{(b),h,w,c}^{(l)} &= \frac{x_{(b),h,w,c}^{(l)} - \mu_c^{(l)}}{\sqrt{(\sigma_c^{(l)})^2 + \epsilon}} \\
y_{(b),h,w,c}^{(l)} &= \gamma_c \hat{x}_{(b),h,w,c}^{(l)} + \beta_c
\end{aligned}

Similar considerations apply for BatchNorm for n-dimensional convolutions.

The following is a Python implementation of BatchNorm for 2D convolutions:

import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-9):
    # Calculate the mean and variance for each channel.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)
    var = np.var(x, axis=(0, 1, 2), keepdims=True)

    # Normalize the input tensor.
    x_hat = (x - mean) / np.sqrt(var + epsilon)

    # Scale and shift the normalized tensor.
    y = gamma * x_hat + beta
    return y

For multilayered recurrent neural networks (RNNs), BatchNorm is usually applied only to the input-to-hidden part, not the hidden-to-hidden part.[11] Let the hidden state of the l-th layer at time t be h_t^{(l)}. The standard RNN, without normalization, satisfies

h_t^{(l)} = \phi\!\left( W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)} \right)

where W^{(l)}, U^{(l)}, b^{(l)} are weights and biases, and φ is the activation function. Applying BatchNorm, this becomes

h_t^{(l)} = \phi\!\left( \mathrm{BN}\!\left( W^{(l)} h_t^{(l-1)} \right) + U^{(l)} h_{t-1}^{(l)} \right)

There are two possible ways to define what a "batch" is in BatchNorm for RNNs: frame-wise and sequence-wise. Concretely, consider applying an RNN to process a batch of sentences. Let h_{b,t}^{(l)} be the hidden state of the l-th layer for the t-th token of the b-th input sentence. Then frame-wise BatchNorm means normalizing over b:

\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^{B} h_{b,t}^{(l)} \\
(\sigma_t^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^{B} \left( h_{b,t}^{(l)} - \mu_t^{(l)} \right)^2
\end{aligned}

and sequence-wise means normalizing over (b, t):

\begin{aligned}
\mu^{(l)} &= \frac{1}{BT} \sum_{b=1}^{B} \sum_{t=1}^{T} h_{b,t}^{(l)} \\
(\sigma^{(l)})^2 &= \frac{1}{BT} \sum_{b=1}^{B} \sum_{t=1}^{T} \left( h_{b,t}^{(l)} - \mu^{(l)} \right)^2
\end{aligned}

Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where the entire sequences are available, but with variable lengths. In a batch, the shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise normalization is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poorer statistics estimates.[11]
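As a minimal numpy sketch of the difference (the array layout of shape (B, T, D) for batch, time, and hidden dimension, and the function names, are assumptions for illustration), frame-wise and sequence-wise statistics differ only in which axes are averaged over:

import numpy as np

def framewise_stats(h):
    # h has shape (B, T, D); frame-wise: separate statistics per time step.
    mu = h.mean(axis=0)        # shape (T, D)
    var = h.var(axis=0)        # shape (T, D)
    return mu, var

def sequencewise_stats(h):
    # Sequence-wise: statistics pooled over both batch and time.
    mu = h.mean(axis=(0, 1))   # shape (D,)
    var = h.var(axis=(0, 1))   # shape (D,)
    return mu, var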

It is also possible to apply BatchNorm to LSTMs.[12]

Improvements


BatchNorm has been very popular, and many improvements to it have been attempted.[13]

A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, but during inference, the mean and variance are frozen at values calculated during training (usually as an exponential moving average over the training batches). This train-test disparity degrades performance. The disparity can be decreased by simulating the moving average during inference:[13]: Eq. 3 

\begin{aligned}
\mu &= \alpha E[x] + (1 - \alpha)\, \mu_{x,\text{train}} \\
\sigma^2 &= \left( \alpha E[x]^2 + (1 - \alpha)\, \mu_{x^2,\text{train}} \right) - \mu^2
\end{aligned}

where α is a hyperparameter to be optimized on a validation set.
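A minimal sketch of this correction at inference time, assuming running statistics mu_train (the running mean of x) and mu_sq_train (the running mean of x²) were accumulated during training; the names and shapes are assumptions for illustration:

import numpy as np

def batchnorm_inference_blend(x, mu_train, mu_sq_train, alpha):
    # Blend statistics of the current inference batch with the frozen
    # training statistics, following the equations above.
    e_x = np.mean(x, axis=0)          # E[x] on the inference batch
    e_x_sq = np.mean(x**2, axis=0)    # E[x^2] on the inference batch
    mu = alpha * e_x + (1 - alpha) * mu_train
    var = (alpha * e_x_sq + (1 - alpha) * mu_sq_train) - mu**2
    return mu, var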

Other works attempt to eliminate BatchNorm, such as the Normalizer-Free ResNet.[14]

Layer normalization


Layer normalization (LayerNorm)[15] is a popular alternative to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of transformer models.

For a given data input and layer, LayerNorm computes the mean μ and variance σ² over all the neurons in the layer. Similar to BatchNorm, learnable parameters γ (scale) and β (shift) are applied. It is defined by:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x}_i + \beta_i

where:

\mu = \frac{1}{D} \sum_{i=1}^{D} x_i, \quad \sigma^2 = \frac{1}{D} \sum_{i=1}^{D} (x_i - \mu)^2

and the index i ranges over the neurons in that layer.
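A minimal numpy sketch of LayerNorm over the feature axis, in the same style as the BatchNorm implementations above; the function name and the choice of the last axis as the feature axis are assumptions for illustration:

import numpy as np

def layernorm(x, gamma, beta, epsilon=1e-9):
    # Mean and variance over the features of each individual sample.
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)

    # Normalize each sample independently of the rest of the batch.
    x_hat = (x - mu) / np.sqrt(var + epsilon)

    # Apply the learnable scale and shift.
    return gamma * x_hat + beta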

Examples


For example, in CNNs, a LayerNorm applies to all activations in a layer. In the previous notation, we have:

\begin{aligned}
\mu^{(l)} &= \frac{1}{HWC} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} x_{h,w,c}^{(l)} \\
(\sigma^{(l)})^2 &= \frac{1}{HWC} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \left( x_{h,w,c}^{(l)} - \mu^{(l)} \right)^2 \\
\hat{x}_{h,w,c}^{(l)} &= \frac{x_{h,w,c}^{(l)} - \mu^{(l)}}{\sqrt{(\sigma^{(l)})^2 + \epsilon}} \\
y_{h,w,c}^{(l)} &= \gamma^{(l)} \hat{x}_{h,w,c}^{(l)} + \beta^{(l)}
\end{aligned}

Notice that the batch index b is removed, while the channel index c is added.

In recurrent neural networks[15] and transformers,[16] LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep t is x^{(t)} ∈ R^D, where D is the dimension of the hidden vector, then LayerNorm will be applied with:

\hat{x}_i^{(t)} = \frac{x_i^{(t)} - \mu^{(t)}}{\sqrt{(\sigma^{(t)})^2 + \epsilon}}, \quad y_i^{(t)} = \gamma_i \hat{x}_i^{(t)} + \beta_i

where:

\mu^{(t)} = \frac{1}{D} \sum_{i=1}^{D} x_i^{(t)}, \quad (\sigma^{(t)})^2 = \frac{1}{D} \sum_{i=1}^{D} \left( x_i^{(t)} - \mu^{(t)} \right)^2

Root mean square layer normalization


Root mean square layer normalization (RMSNorm):[17]

\hat{x}_i = \frac{x_i}{\sqrt{\frac{1}{D} \sum_{i=1}^{D} x_i^2}}, \quad y_i = \gamma \hat{x}_i + \beta

Essentially, it is LayerNorm where we enforce μ, ε = 0. It is also called L2 normalization. It is a special case of Lp normalization, or power normalization:

\hat{x}_i = \frac{x_i}{\left( \frac{1}{D} \sum_{i=1}^{D} |x_i|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x}_i + \beta

where p > 0 is a constant.
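A minimal numpy sketch of RMSNorm in the same style; the function name is an assumption for illustration, and a small epsilon may be added in practice even though the definition above sets it to zero:

import numpy as np

def rmsnorm(x, gamma, beta, epsilon=0.0):
    # Root mean square over the features of each sample; no mean subtraction.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + epsilon)
    return gamma * (x / rms) + beta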

Adaptive


Adaptive layer norm (adaLN) computes the γ, β in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs,[18] and has been used effectively in diffusion transformers (DiTs).[19] For example, in a DiT, the conditioning information (such as a text encoding vector) is processed by a multilayer perceptron into γ, β, which is then applied in the LayerNorm module of a transformer.
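A minimal sketch of the idea, in the same style as the layernorm sketch above; the single linear map from the conditioning vector to γ and β (standing in for the multilayer perceptron) and all names are assumptions for illustration, not the architecture of any particular model:

import numpy as np

def adaptive_layernorm(x, cond, W_gamma, W_beta, epsilon=1e-9):
    # Predict the scale and shift from the conditioning vector.
    gamma = cond @ W_gamma    # shape (..., D)
    beta = cond @ W_beta      # shape (..., D)

    # Normalize the activations as in plain LayerNorm...
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + epsilon)

    # ...but scale and shift with the predicted parameters.
    return gamma * x_hat + beta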

Weight normalization


Weight normalization (WeightNorm)[20] is a technique inspired by BatchNorm that normalizes weight matrices in a neural network, rather than its activations.

One example is spectral normalization, which divides weight matrices by their spectral norm. Spectral normalization is used in generative adversarial networks (GANs) such as the Wasserstein GAN.[21] The spectral norm can be efficiently computed by the following algorithm:

INPUT: matrix W and initial guess x

Iterate x \mapsto \frac{1}{\|Wx\|_2} Wx to convergence x^*. This is the eigenvector of W with eigenvalue \|W\|_s.

RETURN: x^*, \|Wx^*\|_2

By reassigning W_i \leftarrow \frac{W_i}{\|W_i\|_s} after each update of the discriminator, we can upper-bound \|W_i\|_s \leq 1, and thus upper-bound the Lipschitz norm \|D\|_L of the discriminator.

The algorithm can be further accelerated by memoization: at step t, store x_i^*(t). Then, at step t+1, use x_i^*(t) as the initial guess for the algorithm. Since W_i(t+1) is very close to W_i(t), so is x_i^*(t) to x_i^*(t+1), thus allowing rapid convergence.
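A minimal numpy sketch of the iteration described above, assuming a square matrix W and a fixed iteration count in place of a convergence test; the function name is an assumption for illustration:

import numpy as np

def power_iteration(W, x, num_iters=50):
    # Iterate x -> Wx / ||Wx||_2, as described above.
    for _ in range(num_iters):
        x = W @ x
        x = x / np.linalg.norm(x)
    # Return the converged vector and the corresponding norm estimate.
    return x, np.linalg.norm(W @ x)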

CNN-specific normalization


There are some activation normalization techniques that are only used for CNNs.

Response normalization


Local response normalization[22] was used in AlexNet. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by:

b_{x,y}^i = \frac{a_{x,y}^i}{\left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a_{x,y}^j \right)^2 \right)^{\beta}}

where a_{x,y}^i is the activation of the neuron at location (x, y) and channel i. That is, each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels.

The constants k, n, α, β are hyperparameters picked using a validation set.
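A minimal numpy sketch of local response normalization over the channel axis, assuming activations of shape (H, W, N) with N channels; the function name and the default hyperparameter values are illustrative assumptions:

import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (H, W, N): height, width, channels.
    N = a.shape[-1]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        # Sum of squared activations over the adjacent channels.
        denom = (k + alpha * np.sum(a[..., lo:hi + 1] ** 2, axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b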

It was a variant of the earlier local contrast normalization:[23]

b_{x,y}^i = \frac{a_{x,y}^i}{\left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a_{x,y}^j - \bar{a}_{x,y}^j \right)^2 \right)^{\beta}}

where \bar{a}_{x,y}^j is the average activation in a small window centered on location (x, y) in channel j. The hyperparameters k, n, α, β, and the size of the small window, are picked using a validation set.

Similar methods were called divisive normalization, as they divide activations by a number depending on the activations. They were originally inspired by biology, where divisive normalization was used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.[24]

Both kinds of local normalization were obviated by batch normalization, which is a more global form of normalization.[25]

Response normalization reappeared in ConvNeXt V2 as global response normalization.[26]

Group normalization


Group normalization (GroupNorm)[27] is a technique also solely used for CNNs. It can be understood as LayerNorm for CNNs applied once per channel group.

Suppose at a layer l there are channels 1, 2, \dots, C; these are partitioned into groups g_1, g_2, \dots, g_G. Then, LayerNorm is applied to each group.
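A minimal numpy sketch of GroupNorm for activations of shape (B, H, W, C), assuming C is divisible by the number of groups; the function name is an assumption for illustration:

import numpy as np

def groupnorm(x, gamma, beta, num_groups, epsilon=1e-9):
    # x has shape (B, H, W, C); split the channels into groups.
    B, H, W, C = x.shape
    x_g = x.reshape(B, H, W, num_groups, C // num_groups)

    # Statistics per sample and per group, over space and the group's channels.
    mu = np.mean(x_g, axis=(1, 2, 4), keepdims=True)
    var = np.var(x_g, axis=(1, 2, 4), keepdims=True)

    # Normalize and restore the original shape.
    x_hat = ((x_g - mu) / np.sqrt(var + epsilon)).reshape(B, H, W, C)

    # Per-channel learnable scale and shift.
    return gamma * x_hat + beta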

Instance normalization


Instance normalization (InstanceNorm), or contrast normalization, is a technique first developed for neural style transfer, and is also only used for CNNs.[28] It can be understood as LayerNorm for CNNs applied once per channel, or equivalently, as group normalization where each group consists of a single channel:

\begin{aligned}
\mu_c^{(l)} &= \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{h,w,c}^{(l)} \\
(\sigma_c^{(l)})^2 &= \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{h,w,c}^{(l)} - \mu_c^{(l)} \right)^2 \\
\hat{x}_{h,w,c}^{(l)} &= \frac{x_{h,w,c}^{(l)} - \mu_c^{(l)}}{\sqrt{(\sigma_c^{(l)})^2 + \epsilon}} \\
y_{h,w,c}^{(l)} &= \gamma_c^{(l)} \hat{x}_{h,w,c}^{(l)} + \beta_c^{(l)}
\end{aligned}
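A minimal numpy sketch of InstanceNorm for a single sample of shape (H, W, C), matching the per-channel statistics above; the function name is an assumption for illustration:

import numpy as np

def instancenorm(x, gamma, beta, epsilon=1e-9):
    # x has shape (H, W, C); statistics are per channel of this one sample.
    mu = np.mean(x, axis=(0, 1), keepdims=True)
    var = np.var(x, axis=(0, 1), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta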

Adaptive instance normalization


Adaptive instance normalization (AdaIN) is a variant of instance normalization, designed specifically for neural style transfer with CNNs, rather than just CNNs in general.[29]

In the AdaIN method of style transfer, we take a CNN and two input images, one for content and one for style. Each image is processed through the same CNN, and at a certain layer l, AdaIN is applied.

Let x^{(l), content} be the activation in the content image, and x^{(l), style} be the activation in the style image. AdaIN first computes the per-channel mean and variance of the style activations x^{(l), style}, then uses those as the γ, β for InstanceNorm applied to x^{(l), content}. Note that x^{(l), style} itself remains unchanged. Explicitly, we have:

y_{h,w,c}^{(l),\text{content}} = \sigma_c^{(l),\text{style}} \left( \frac{x_{h,w,c}^{(l),\text{content}} - \mu_c^{(l),\text{content}}}{\sqrt{(\sigma_c^{(l),\text{content}})^2 + \epsilon}} \right) + \mu_c^{(l),\text{style}}
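A minimal numpy sketch of AdaIN for content and style activations of shape (H, W, C), following the equation above; the function name is an assumption for illustration:

import numpy as np

def adain(x_content, x_style, epsilon=1e-9):
    # Per-channel statistics of the content and style activations.
    mu_c = np.mean(x_content, axis=(0, 1), keepdims=True)
    var_c = np.var(x_content, axis=(0, 1), keepdims=True)
    mu_s = np.mean(x_style, axis=(0, 1), keepdims=True)
    sigma_s = np.std(x_style, axis=(0, 1), keepdims=True)

    # Normalize the content activations, then re-scale and re-shift
    # with the style statistics.
    x_hat = (x_content - mu_c) / np.sqrt(var_c + epsilon)
    return sigma_s * x_hat + mu_s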

Transformers


Some normalization methods were designed for use in transformers.

The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018,[30] was found to be easier to train, requiring no warm-up and leading to faster convergence.[31]

FixNorm[32] and ScaleNorm[33] both normalize activation vectors in a transformer. The FixNorm method divides the output vectors from a transformer by their L2 norms, then multiplies by a learned parameter g. ScaleNorm replaces all LayerNorms inside a transformer with division by the L2 norm, followed by multiplication by a learned parameter g' (shared by all ScaleNorm modules of a transformer). Query-Key normalization (QKNorm)[34] normalizes query and key vectors to have unit L2 norm.
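A minimal numpy sketch of the shared idea of dividing vectors by their L2 norm, as in ScaleNorm and QKNorm; the function names and the learned scalar g are illustrative assumptions:

import numpy as np

def l2_normalize(x, epsilon=1e-9):
    # Divide each vector (last axis) by its L2 norm.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + epsilon)

def scalenorm(x, g):
    # ScaleNorm-style: L2-normalize, then multiply by a learned scalar g.
    return g * l2_normalize(x)

def qknorm(q, k):
    # QKNorm-style: give query and key vectors unit L2 norm before attention.
    return l2_normalize(q), l2_normalize(k)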

In nGPT, many vectors are normalized to have unit L2 norm:[35] hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.

Miscellaneous


Gradient normalization (GradNorm)[36] normalizes gradient vectors during backpropagation.


References

  1. Huang, Lei (2022). Normalization Techniques in Deep Learning. Synthesis Lectures on Computer Vision. Cham: Springer International Publishing. doi:10.1007/978-3-031-14595-7. ISBN 978-3-031-14594-0.
  2. Ioffe, Sergey; Szegedy, Christian (2015-06-01). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 448–456. arXiv:1502.03167.
  3. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "8.7.1. Batch Normalization". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03561-3.
  4. Desjardins, Guillaume; Simonyan, Karen; Pascanu, Razvan; Kavukcuoglu, Koray (2015). "Natural Neural Networks". Advances in Neural Information Processing Systems. 28. Curran Associates, Inc.
  5. Xu, Jingjing; Sun, Xu; Zhang, Zhiyuan; Zhao, Guangxiang; Lin, Junyang (2019). "Understanding and Improving Layer Normalization". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc. arXiv:1911.07013.
  6. Awais, Muhammad; Bin Iqbal, Md. Tauhid; Bae, Sung-Ho (November 2021). "Revisiting Internal Covariate Shift for Batch Normalization". IEEE Transactions on Neural Networks and Learning Systems. 32 (11): 5082–5092. Bibcode:2021ITNNL..32.5082A. doi:10.1109/TNNLS.2020.3026784. ISSN 2162-237X. PMID 33095717.
  7. Bjorck, Nils; Gomes, Carla P.; Selman, Bart; Weinberger, Kilian Q. (2018). "Understanding Batch Normalization". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc. arXiv:1806.02375.
  8. Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (2018). "How Does Batch Normalization Help Optimization?". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
  9. "BatchNorm2d — PyTorch 2.4 documentation". pytorch.org. Retrieved 2024-09-26.
  10. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "8.5. Batch Normalization". Dive into Deep Learning. Cambridge: Cambridge University Press. ISBN 978-1-009-38943-3.
  11. Laurent, Cesar; Pereyra, Gabriel; Brakel, Philemon; Zhang, Ying; Bengio, Yoshua (March 2016). "Batch Normalized Recurrent Neural Networks". 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. pp. 2657–2661. arXiv:1510.01378. doi:10.1109/ICASSP.2016.7472159. ISBN 978-1-4799-9988-0.
  12. Cooijmans, Tim; Ballas, Nicolas; Laurent, César; Gülçehre, Çağlar; Courville, Aaron (2016). "Recurrent Batch Normalization". arXiv:1603.09025 [cs.LG].
  13. Summers, Cecilia; Dinneen, Michael J. (2019). "Four Things Everyone Should Know to Improve Batch Normalization". arXiv:1906.03548 [cs.LG].
  14. Brock, Andrew; De, Soham; Smith, Samuel L.; Simonyan, Karen (2021). "High-Performance Large-Scale Image Recognition Without Normalization". arXiv:2102.06171 [cs.CV].
  15. Ba, Jimmy Lei; Kiros, Jamie Ryan; Hinton, Geoffrey E. (2016). "Layer Normalization". arXiv:1607.06450 [stat.ML].
  16. Phuong, Mary; Hutter, Marcus (2022-07-19). "Formal Algorithms for Transformers". arXiv:2207.09238 [cs.LG].
  17. Zhang, Biao; Sennrich, Rico (2019-10-16). "Root Mean Square Layer Normalization". arXiv:1910.07467 [cs.LG].
  18. Perez, Ethan; Strub, Florian; De Vries, Harm; Dumoulin, Vincent; Courville, Aaron (2018-04-29). "FiLM: Visual Reasoning with a General Conditioning Layer". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). arXiv:1709.07871. doi:10.1609/aaai.v32i1.11671. ISSN 2374-3468.
  19. Peebles, William; Xie, Saining (2023). "Scalable Diffusion Models with Transformers". pp. 4195–4205. arXiv:2212.09748.
  20. Salimans, Tim; Kingma, Diederik P. (2016-06-03). "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks". arXiv:1602.07868 [cs.LG].
  21. Miyato, Takeru; Kataoka, Toshiki; Koyama, Masanori; Yoshida, Yuichi (2018-02-16). "Spectral Normalization for Generative Adversarial Networks". arXiv:1802.05957 [cs.LG].
  22. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc.
  23. Jarrett, Kevin; Kavukcuoglu, Koray; Ranzato, Marc'Aurelio; LeCun, Yann (September 2009). "What is the best multi-stage architecture for object recognition?". 2009 IEEE 12th International Conference on Computer Vision. IEEE. pp. 2146–2153. doi:10.1109/iccv.2009.5459469. ISBN 978-1-4244-4420-5.
  24. Lyu, Siwei; Simoncelli, Eero P. (2008). "Nonlinear image representation using divisive normalization". 2008 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–8. doi:10.1109/CVPR.2008.4587821. ISBN 978-1-4244-2242-5. ISSN 1063-6919. PMC 4207373. PMID 25346590.
  25. Ortiz, Anthony; Robinson, Caleb; Morris, Dan; Fuentes, Olac; Kiekintveld, Christopher; Hassan, Md Mahmudulla; Jojic, Nebojsa (2020). "Local Context Normalization: Revisiting Local Normalization". pp. 11276–11285. arXiv:1912.05845.
  26. Woo, Sanghyun; Debnath, Shoubhik; Hu, Ronghang; Chen, Xinlei; Liu, Zhuang; Kweon, In So; Xie, Saining (2023). "ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders". pp. 16133–16142. arXiv:2301.00808.
  27. Wu, Yuxin; He, Kaiming (2018). "Group Normalization". pp. 3–19.
  28. Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (2017-11-06). "Instance Normalization: The Missing Ingredient for Fast Stylization". arXiv:1607.08022 [cs.CV].
  29. Huang, Xun; Belongie, Serge (2017). "Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization". pp. 1501–1510. arXiv:1703.06868.
  30. Wang, Qiang; Li, Bei; Xiao, Tong; Zhu, Jingbo; Li, Changliang; Wong, Derek F.; Chao, Lidia S. (2019). "Learning Deep Transformer Models for Machine Translation". arXiv:1906.01787 [cs.CL].
  31. Xiong, Ruibin; Yang, Yunchang; He, Di; Zheng, Kai; Zheng, Shuxin; Xing, Chen; Zhang, Huishuai; Lan, Yanyan; Wang, Liwei; Liu, Tie-Yan (2020-06-29). "On Layer Normalization in the Transformer Architecture". arXiv:2002.04745 [cs.LG].
  32. Nguyen, Toan Q.; Chiang, David (2017). "Improving Lexical Choice in Neural Machine Translation". arXiv:1710.01329 [cs.CL].
  33. Nguyen, Toan Q.; Salazar, Julian (2019-11-02). "Transformers without Tears: Improving the Normalization of Self-Attention". arXiv:1910.05895. doi:10.5281/zenodo.3525484.
  34. Henry, Alex; Dachapally, Prudhvi Raj; Pawar, Shubham Shantaram; Chen, Yuxuan (November 2020). "Query-Key Normalization for Transformers". Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics: 4246–4253. arXiv:2010.04245. doi:10.18653/v1/2020.findings-emnlp.379.
  35. Loshchilov, Ilya; Hsieh, Cheng-Ping; Sun, Simeng; Ginsburg, Boris (2024). "nGPT: Normalized Transformer with Representation Learning on the Hypersphere". arXiv:2410.01131 [cs.LG].
  36. Chen, Zhao; Badrinarayanan, Vijay; Lee, Chen-Yu; Rabinovich, Andrew (2018-07-03). "GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks". Proceedings of the 35th International Conference on Machine Learning. PMLR: 794–803. arXiv:1711.02257.
