Highway network

From Wikipedia, the free encyclopedia
Type of artificial neural network
This article is about a technique used in machine learning. For other uses, see Highway.

In machine learning, the Highway Network was the first working very deep feedforward neural network with hundreds of layers, much deeper than previous neural networks.[1][2][3] It uses skip connections modulated by learned gating mechanisms to regulate information flow, inspired by long short-term memory (LSTM) recurrent neural networks.[4][5] The advantage of the Highway Network over other deep learning architectures is its ability to overcome or partially prevent the vanishing gradient problem,[6] thus improving its optimization. Gating mechanisms are used to facilitate information flow across the many layers ("information highways").[1][2]

Highway Networks have found use in text sequence labeling and speech recognition tasks.[7][8]

In 2014, the state of the art was training deep neural networks with 20 to 30 layers.[9] Stacking too many layers led to a steep reduction in training accuracy,[10] known as the "degradation" problem.[11] In 2015, two techniques were developed to train such networks: the Highway Network (published in May), and the residual neural network, or ResNet[12] (December). ResNet behaves like an open-gated Highway Net.

Model

The model has two gates in addition to the H(W_H, x) gate: the transform gate T(W_T, x) and the carry gate C(W_C, x). The latter two gates are non-linear transfer functions (specifically sigmoid by convention). The function H can be any desired transfer function.

The carry gate is defined as:

C(W_C, x) = 1 − T(W_T, x)

while the transform gate is just a gate with a sigmoid transfer function.

Structure

The structure of a hidden layer in the Highway Network follows the equation:

y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
  = H(x, W_H) · T(x, W_T) + x · (1 − T(x, W_T))
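
The layer can be written out as a short, self-contained sketch. The following NumPy code only illustrates the formula above; the names, the tanh choice for H, and the initialisation are assumptions, not taken from the original paper (which allows H to be any transfer function):

```python
# Illustrative sketch of a single highway layer (assumed names and setup,
# not the authors' reference implementation).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x)*T(x) + x*(1 - T(x)), with the carry gate coupled as C = 1 - T."""
    H = np.tanh(x @ W_H + b_H)      # transform block H(x, W_H); any nonlinearity could be used here
    T = sigmoid(x @ W_T + b_T)      # transform gate T(x, W_T), values in (0, 1)
    return H * T + x * (1.0 - T)    # carry gate C = 1 - T passes the input through unchanged

# Toy usage: input and output share the same dimension so that x can be carried.
rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W_H = rng.standard_normal((d, d)) * 0.1
W_T = rng.standard_normal((d, d)) * 0.1
b_H = np.zeros(d)
b_T = np.full(d, -2.0)              # a negative gate bias makes the layer mostly carry x at first
y = highway_layer(x, W_H, b_H, W_T, b_T)
```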

Related work

Sepp Hochreiter analyzed the vanishing gradient problem in 1991 and identified it as a reason why deep learning did not work well.[6] To overcome this problem, Long Short-Term Memory (LSTM) recurrent neural networks[4] have residual connections with a weight of 1.0 in every LSTM cell (called the constant error carrousel) to compute y_{t+1} = F(x_t) + x_t. During backpropagation through time, this becomes the residual formula y = F(x) + x for feedforward neural networks. This enables training very deep recurrent neural networks over a very long time span t. A later LSTM version published in 2000[5] modulates the identity LSTM connections by so-called "forget gates" such that their weights are not fixed to 1.0 but can be learned. In experiments, the forget gates were initialized with positive bias weights,[5] thus being opened, which addresses the vanishing gradient problem. As long as the forget gates of the 2000 LSTM are open, it behaves like the 1997 LSTM.
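
The role of the forget gate can be illustrated on the cell-state recurrence alone. The sketch below is a simplification (it assumes the gates are fixed constants and ignores the input, output, and hidden-state paths of a real LSTM); it only shows that the factor linking the final state back to the initial state is the product of the forget gates:

```python
import numpy as np

def cell_path(c0, forget_gates, updates):
    """Simplified cell-state recurrence c_t = f_t * c_{t-1} + u_t.
    With every f_t = 1 this is the 1997 constant error carrousel;
    the sensitivity of c_T to c_0 is the product of the f_t."""
    c = c0
    for f, u in zip(forget_gates, updates):
        c = f * c + u
    return c

T = 100
u = np.zeros(T)
print(cell_path(1.0, np.ones(T), u))        # open gates: c_0 = 1.0 survives 100 steps unchanged
print(cell_path(1.0, np.full(T, 0.9), u))   # gates at 0.9: about 2.7e-5 remains, the signal vanishes
```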

The Highway Network of May 2015[1] applies these principles to feedforward neural networks. It was reported to be "the first very deep feedforward network with hundreds of layers".[13] It is like a 2000 LSTM with forget gates unfolded in time,[5] while the later Residual Nets have no equivalent of forget gates and are like the unfolded original 1997 LSTM.[4] If the skip connections in Highway Networks are "without gates," or if their gates are kept open (activation 1.0), they become Residual Networks.
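
Concretely, writing the layer in its uncoupled form y = H(x, W_H) · T(x, W_T) + x · C(x, W_C) and holding both gates open, T = C = 1, reduces it to

y = H(x, W_H) + x,

which is exactly the residual formula; this is one simple algebraic reading of the equivalence stated above.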

The residual connection is a special case of the "short-cut connection" or "skip connection" by Rosenblatt (1961)[14] and Lang & Witbrock (1988),[15] which has the form x ↦ F(x) + Ax. Here the randomly initialized weight matrix A does not have to be the identity mapping. Every residual connection is a skip connection, but almost all skip connections are not residual connections.
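
A minimal sketch of the distinction (illustrative, assumed setup; A is a random matrix here, whereas a residual connection fixes A to the identity):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x = rng.standard_normal(d)
W = rng.standard_normal((d, d)) * 0.1
F = lambda v: np.tanh(v @ W)            # stands in for any learned block F

A = rng.standard_normal((d, d)) * 0.1   # general skip connection: x -> F(x) + A x
skip_out = F(x) + x @ A
residual_out = F(x) + x                 # residual connection: A is the identity mapping
```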

The original Highway Network paper[16] not only introduced the basic principle for very deep feedforward networks, but also included experimental results with networks of 20, 50, and 100 layers, and mentioned ongoing experiments with up to 900 layers. Networks with 50 or 100 layers had lower training error than their plain-network counterparts, but no lower training error than their 20-layer counterpart (on the MNIST dataset; Figure 1 in [16]). No improvement in test accuracy was reported for networks deeper than 19 layers (on the CIFAR-10 dataset; Table 1 in [16]). The ResNet paper,[17] however, provided strong experimental evidence of the benefits of going deeper than 20 layers. It argued that the identity mapping without modulation is crucial and mentioned that modulation in the skip connection can still lead to vanishing signals in forward and backward propagation (Section 3 in [17]). This is also why the forget gates of the 2000 LSTM[18] were initially opened through positive bias weights: as long as the gates are open, it behaves like the 1997 LSTM. Similarly, a Highway Net whose gates are opened through strongly positive bias weights behaves like a ResNet. The skip connections used in modern neural networks (e.g., Transformers) are predominantly identity mappings.
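
As an illustration of why largely open carry paths matter at depth, the following toy sketch (an assumed setup with random, untrained weights; not an experiment from the cited papers) pushes one input through 100 stacked highway layers and through 100 plain tanh layers of the same width. A strongly negative transform-gate bias plays the role that a positive forget-gate bias plays in the 2000 LSTM: it keeps the coupled carry gate C = 1 − T mostly open:

```python
# Toy forward-propagation comparison; illustrative only, not from the cited papers.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
d, depth = 32, 100
x = rng.standard_normal(d)

h, p = x.copy(), x.copy()
for _ in range(depth):
    W_H = rng.standard_normal((d, d)) * 0.1
    W_T = rng.standard_normal((d, d)) * 0.1
    T = sigmoid(h @ W_T - 4.0)                  # strongly negative bias keeps C = 1 - T close to 1
    h = np.tanh(h @ W_H) * T + h * (1.0 - T)    # highway layer
    p = np.tanh(p @ W_H)                        # plain layer of the same width, for comparison

print(np.linalg.norm(h) / np.linalg.norm(x), cos(h, x))  # highway stack: a clear trace of x survives
print(np.linalg.norm(p) / np.linalg.norm(x))             # plain stack: the signal collapses toward zero
```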

References

  1. Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
  2. Srivastava, Rupesh K; Greff, Klaus; Schmidhuber, Juergen (2015). "Training Very Deep Networks". Advances in Neural Information Processing Systems. 28. Curran Associates, Inc.: 2377–2385.
  3. Schmidhuber, Jürgen (2021). "The most cited neural networks all build on work done in my labs". AI Blog. IDSIA, Switzerland. Retrieved 2022-04-30.
  4. Sepp Hochreiter; Jürgen Schmidhuber (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
  5. Felix A. Gers; Jürgen Schmidhuber; Fred Cummins (2000). "Learning to Forget: Continual Prediction with LSTM". Neural Computation. 12 (10): 2451–2471. CiteSeerX 10.1.1.55.5709. doi:10.1162/089976600300015015. PMID 11032042. S2CID 11598600.
  6. Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (diploma thesis). Technical University Munich, Institute of Computer Science, advisor: J. Schmidhuber.
  7. Liu, Liyuan; Shang, Jingbo; Xu, Frank F.; Ren, Xiang; Gui, Huan; Peng, Jian; Han, Jiawei (12 September 2017). "Empower Sequence Labeling with Task-Aware Neural Language Model". arXiv:1709.04109 [cs.CL].
  8. Kurata, Gakuto; Ramabhadran, Bhuvana; Saon, George; Sethy, Abhinav (19 September 2017). "Language Modeling with Highway LSTM". arXiv:1709.06436 [cs.CL].
  9. Simonyan, Karen; Zisserman, Andrew (2015-04-10). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556.
  10. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
  11. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 Dec 2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.
  12. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
  13. Schmidhuber, Jürgen (2015). "Highway Networks (May 2015): First Working Really Deep Feedforward Neural Networks With Over 100 Layers".
  14. Rosenblatt, Frank (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (PDF).
  15. Lang, Kevin; Witbrock, Michael (1988). "Learning to tell two spirals apart" (PDF). Proceedings of the 1988 Connectionist Models Summer School: 52–59.
  16. Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (3 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
  17. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Identity Mappings in Deep Residual Networks". arXiv:1603.05027 [cs.CV].
  18. Felix A. Gers; Jürgen Schmidhuber; Fred Cummins (2000). "Learning to Forget: Continual Prediction with LSTM". Neural Computation. 12 (10): 2451–2471. CiteSeerX 10.1.1.55.5709. doi:10.1162/089976600300015015. PMID 11032042. S2CID 11598600.