Gating mechanism

From Wikipedia, the free encyclopedia

Regulator for flow of signals in neural networks

In neural networks, the gating mechanism is an architectural motif for controlling the flow of activation and gradient signals. Gating mechanisms are most prominently used in recurrent neural networks (RNNs), but have also found applications in other architectures.

RNNs

Gating mechanisms are the centerpiece of long short-term memory (LSTM).^[1] They were proposed to mitigate the vanishing gradient problem often encountered by regular RNNs.

An LSTM unit contains three gates:

An input gate, which controls the flow of new information into the memory cell.
A forget gate, which controls how much information is retained from the previous time step.
An output gate, which controls how much information is passed to the next layer.

The equations for LSTM are:^[2]

${\begin{aligned}\mathbf {I} _{t}&=\sigma (\mathbf {X} _{t}\mathbf {W} _{xi}+\mathbf {H} _{t-1}\mathbf {W} _{hi}+\mathbf {b} _{i})\\\mathbf {F} _{t}&=\sigma (\mathbf {X} _{t}\mathbf {W} _{xf}+\mathbf {H} _{t-1}\mathbf {W} _{hf}+\mathbf {b} _{f})\\\mathbf {O} _{t}&=\sigma (\mathbf {X} _{t}\mathbf {W} _{xo}+\mathbf {H} _{t-1}\mathbf {W} _{ho}+\mathbf {b} _{o})\\{\tilde {\mathbf {C} }}_{t}&=\tanh(\mathbf {X} _{t}\mathbf {W} _{xc}+\mathbf {H} _{t-1}\mathbf {W} _{hc}+\mathbf {b} _{c})\\\mathbf {C} _{t}&=\mathbf {F} _{t}\odot \mathbf {C} _{t-1}+\mathbf {I} _{t}\odot {\tilde {\mathbf {C} }}_{t}\\\mathbf {H} _{t}&=\mathbf {O} _{t}\odot \tanh(\mathbf {C} _{t})\end{aligned}}$

Here, $\odot$ represents elementwise multiplication.
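The LSTM equations translate almost line for line into array code. The following is a minimal NumPy sketch, not a reference implementation: it assumes row-vector inputs of shape (batch, features), matching the $\mathbf{X}_t \mathbf{W}$ ordering of the equations, and the helper `init_lstm_params` with its small random initialization is a hypothetical convenience that does not come from the cited sources.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm_params(d_in, d_hidden, scale=0.1, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "ifoc":  # input, forget, output gates and candidate cell
        p["W_x" + g] = scale * rng.standard_normal((d_in, d_hidden))
        p["W_h" + g] = scale * rng.standard_normal((d_hidden, d_hidden))
        p["b_" + g] = np.zeros(d_hidden)
    return p


def lstm_step(X_t, H_prev, C_prev, p):
    """One time step; X_t is (batch, d_in), H_prev and C_prev are (batch, d_hidden)."""
    I_t = sigmoid(X_t @ p["W_xi"] + H_prev @ p["W_hi"] + p["b_i"])      # input gate
    F_t = sigmoid(X_t @ p["W_xf"] + H_prev @ p["W_hf"] + p["b_f"])      # forget gate
    O_t = sigmoid(X_t @ p["W_xo"] + H_prev @ p["W_ho"] + p["b_o"])      # output gate
    C_tilde = np.tanh(X_t @ p["W_xc"] + H_prev @ p["W_hc"] + p["b_c"])  # candidate cell
    C_t = F_t * C_prev + I_t * C_tilde  # gated mix of old memory and new candidate
    H_t = O_t * np.tanh(C_t)            # gated read-out of the memory cell
    return H_t, C_t


# Run 5 time steps on a batch of 2 inputs of width 3.
p = init_lstm_params(d_in=3, d_hidden=4)
H = np.zeros((2, 4))
C = np.zeros((2, 4))
for X_t in np.ones((5, 2, 3)):
    H, C = lstm_step(X_t, H, C, p)
print(H.shape, C.shape)  # (2, 4) (2, 4)
```

Because the cell update is additive and scaled only by the forget gate, a forget gate close to 1 lets the previous cell state pass through nearly unchanged, which is how the gates help gradients survive over many time steps.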

LSTM architecture, with gates

The gated recurrent unit (GRU) simplifies the LSTM.^[3] Compared to the LSTM, the GRU has just two gates: a reset gate and an update gate. The GRU also merges the cell state and hidden state. The reset gate roughly corresponds to the forget gate, and the update gate roughly corresponds to the input gate. The output gate is removed.

There are several variants of GRU. One particular variant has these equations:^[4]

${\begin{aligned}\mathbf {R} _{t}&=\sigma (\mathbf {X} _{t}\mathbf {W} _{xr}+\mathbf {H} _{t-1}\mathbf {W} _{hr}+\mathbf {b} _{r})\\\mathbf {Z} _{t}&=\sigma (\mathbf {X} _{t}\mathbf {W} _{xz}+\mathbf {H} _{t-1}\mathbf {W} _{hz}+\mathbf {b} _{z})\\{\tilde {\mathbf {H} }}_{t}&=\tanh(\mathbf {X} _{t}\mathbf {W} _{xh}+(\mathbf {R} _{t}\odot \mathbf {H} _{t-1})\mathbf {W} _{hh}+\mathbf {b} _{h})\\\mathbf {H} _{t}&=\mathbf {Z} _{t}\odot \mathbf {H} _{t-1}+(1-\mathbf {Z} _{t})\odot {\tilde {\mathbf {H} }}_{t}\end{aligned}}$
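The GRU variant above can be sketched in the same way. This is an illustrative NumPy version under the same row-vector assumption; the all-zero parameter dictionary is a contrived setup, chosen only so that the gate values are easy to read off by hand.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(X_t, H_prev, p):
    """One GRU time step; X_t is (batch, d_in), H_prev is (batch, d_hidden)."""
    R_t = sigmoid(X_t @ p["W_xr"] + H_prev @ p["W_hr"] + p["b_r"])  # reset gate
    Z_t = sigmoid(X_t @ p["W_xz"] + H_prev @ p["W_hz"] + p["b_z"])  # update gate
    # The reset gate scales the previous state before it enters the candidate.
    H_tilde = np.tanh(X_t @ p["W_xh"] + (R_t * H_prev) @ p["W_hh"] + p["b_h"])
    # The update gate interpolates between the old state and the candidate.
    return Z_t * H_prev + (1.0 - Z_t) * H_tilde


# Contrived demo: all weights and biases zero, so every gate equals 0.5
# and the candidate state is tanh(0) = 0.
d_in, d_hidden = 3, 4
p = {name: np.zeros((d_in, d_hidden)) for name in ("W_xr", "W_xz", "W_xh")}
p.update({name: np.zeros((d_hidden, d_hidden)) for name in ("W_hr", "W_hz", "W_hh")})
p.update({name: np.zeros(d_hidden) for name in ("b_r", "b_z", "b_h")})

X_t = np.ones((2, d_in))
H_prev = np.full((2, d_hidden), 0.8)

H_half = gru_step(X_t, H_prev, p)   # Z_t = 0.5: halfway between 0.8 and 0
p["b_z"] = np.full(d_hidden, 50.0)  # saturate the update gate, Z_t close to 1
H_keep = gru_step(X_t, H_prev, p)   # previous state carried over almost unchanged
```

In this variant an update gate near 1 copies the previous hidden state forward, while a value near 0 replaces it with the candidate state.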

Gated Recurrent Unit architecture, with gates

Gated Linear Unit

Gated Linear Units (GLUs)^[5] adapt the gating mechanism for use in feedforward neural networks, often within transformer-based architectures. They are defined as:

$\mathrm {GLU} (a,b)=a\odot \sigma (b)$

where $a$ and $b$ are the first and second inputs, respectively, and $\sigma$ represents the sigmoid activation function.

Replacing $\sigma$ with other activation functions leads to variants of GLU:

${\begin{aligned}\mathrm {ReGLU} (a,b)&=a\odot {\text{ReLU}}(b)\\\mathrm {GEGLU} (a,b)&=a\odot {\text{GELU}}(b)\\\mathrm {SwiGLU} (a,b,\beta )&=a\odot {\text{Swish}}_{\beta }(b)\end{aligned}}$

where ReLU, GELU, and Swish are different activation functions.
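As a rough illustration, the GLU family can be written as one-line elementwise functions. The sketch below assumes the exact (erf-based) form of GELU and the definition $\text{Swish}_\beta(x) = x \cdot \sigma(\beta x)$; both are the standard definitions of those activations but are not spelled out in this article.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)  # NumPy has no built-in erf, so wrap the stdlib one


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + _erf(x / np.sqrt(2.0)))


def swish(x, beta=1.0):
    return x * sigmoid(beta * x)


def glu(a, b):
    return a * sigmoid(b)


def reglu(a, b):
    return a * relu(b)


def geglu(a, b):
    return a * gelu(b)


def swiglu(a, b, beta=1.0):
    return a * swish(b, beta)


a = np.array([1.0, 2.0, -3.0])
b = np.array([0.0, -1.0, 2.0])

glu_out = glu(a, b)      # first entry is 1.0 * sigmoid(0.0) = 0.5
reglu_out = reglu(a, b)  # entries with b <= 0 are gated to zero
```

In every variant the second input only modulates the first: where the activation of `b` is close to zero, the corresponding entry of `a` is suppressed.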

In transformer models, such gating units are often used in the feedforward modules. For a single vector input, this results in:^[6]

${\begin{aligned}\operatorname {GLU} (x,W,V,b,c)&=\sigma (xW+b)\odot (xV+c)\\\operatorname {Bilinear} (x,W,V,b,c)&=(xW+b)\odot (xV+c)\\\operatorname {ReGLU} (x,W,V,b,c)&=\max(0,xW+b)\odot (xV+c)\\\operatorname {GEGLU} (x,W,V,b,c)&=\operatorname {GELU} (xW+b)\odot (xV+c)\\\operatorname {SwiGLU} (x,W,V,b,c,\beta )&=\operatorname {Swish} _{\beta }(xW+b)\odot (xV+c)\end{aligned}}$
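All five layers share the form "activation of one affine projection, times a second affine projection", so they can be sketched with a single function that takes the activation as a parameter. In this sketch `gated_linear` and `identity` are illustrative names, $W$ and $V$ are taken to be weight matrices and $b$ and $c$ bias vectors, and the tiny hand-picked weights exist only to make the arithmetic checkable.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(0.0, z)


def identity(z):
    return z


def gated_linear(x, W, V, b, c, activation):
    """Compute activation(xW + b) * (xV + c) for a row vector x."""
    gate = activation(x @ W + b)  # gating branch
    value = x @ V + c             # linear branch that gets gated
    return gate * value


x = np.array([-1.0, 2.0])
W = np.eye(2)        # gate pre-activation: xW + b = [-1, 2]
V = 2.0 * np.eye(2)  # value branch:        xV + c = [-1, 5]
b = np.zeros(2)
c = np.ones(2)

glu_out = gated_linear(x, W, V, b, c, sigmoid)        # GLU
bilinear_out = gated_linear(x, W, V, b, c, identity)  # Bilinear: [1, 10]
reglu_out = gated_linear(x, W, V, b, c, relu)         # ReGLU: first entry gated to 0
```

Passing a GELU or Swish function as `activation` would give GEGLU and SwiGLU in the same way.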

Other architectures

Gating mechanisms are used in highway networks, which were designed by unrolling an LSTM.
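For illustration, a single highway layer can be sketched as a gated mix of a transformed input and the untouched input. The formula used here, $y = T(x) \odot H(x) + (1 - T(x)) \odot x$ with a sigmoid transform gate $T$, is the commonly used highway formulation and is an assumption of this sketch rather than something stated in this article; the choice of tanh for $H$ and all parameter names are likewise illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def highway_layer(x, W_h, b_h, W_t, b_t):
    """Gated mix of a transformed input and the input itself (shapes must match)."""
    H = np.tanh(x @ W_h + b_h)    # candidate transform (tanh chosen for illustration)
    T = sigmoid(x @ W_t + b_t)    # transform gate, values in (0, 1)
    return T * H + (1.0 - T) * x  # (1 - T) acts as the carry gate


rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((2, d))
W_h = rng.standard_normal((d, d))
W_t = rng.standard_normal((d, d))
b_h = np.zeros(d)

# A strongly negative gate bias closes the gate: the layer passes x through.
y_carry = highway_layer(x, W_h, b_h, W_t, np.full(d, -50.0))
# A strongly positive gate bias opens it: the layer behaves like a plain tanh layer.
y_transform = highway_layer(x, W_h, b_h, W_t, np.full(d, 50.0))
```

The closed-gate case mirrors an LSTM cell with its forget gate near 1: information, and therefore gradient, can flow across the layer unchanged.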

Channel gating^[7] uses a gate to control the flow of information through different channels inside a convolutional neural network (CNN).

References

[1] Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
[2] Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "10.1. Long Short-Term Memory (LSTM)". Dive into deep learning. Cambridge, New York, Port Melbourne, New Delhi, Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
[3] Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". Association for Computational Linguistics. arXiv:1406.1078.
[4] Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "10.2. Gated Recurrent Units (GRU)". Dive into deep learning. Cambridge, New York, Port Melbourne, New Delhi, Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
[5] Dauphin, Yann N.; Fan, Angela; Auli, Michael; Grangier, David (2017-07-17). "Language Modeling with Gated Convolutional Networks". Proceedings of the 34th International Conference on Machine Learning. PMLR: 933–941. arXiv:1612.08083.
[6] Shazeer, Noam (2020-02-14). "GLU Variants Improve Transformer". arXiv:2002.05202 [cs.LG].
[7] Hua, Weizhe; Zhou, Yuan; De Sa, Christopher M; Zhang, Zhiru; Suh, G. Edward (2019). "Channel Gating Neural Networks". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc. arXiv:1805.12549.
