Residual neural network

Type of artificial neural network
"ResNet" redirects here. For other uses, seeResNet (disambiguation).
A residual block in a deep residual network. Here, the residual connection skips two layers.

A residual neural network (also referred to as a residual network or ResNet)[1] is a deep learning architecture in which the layers learn residual functions with reference to the layer inputs. It was developed in 2015 for image recognition, and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) of that year.[2][3]

As a point of terminology, "residual connection" refers to the specific architectural motif of $x \mapsto f(x) + x$, where $f$ is an arbitrary neural network module. The motif had been used previously (see § History for details). However, the publication of ResNet made it widely popular for feedforward networks, appearing in neural networks that are seemingly unrelated to ResNet.

The residual connection stabilizes the training and convergence of deep neural networks with hundreds of layers, and is a common motif in deep neural networks, such as transformer models (e.g., BERT, and GPT models such as ChatGPT), the AlphaGo Zero system, the AlphaStar system, and the AlphaFold system.

Mathematics

Residual connection

In a multilayer neural network model, consider a subnetwork with a certain number of stacked layers (e.g., 2 or 3). Denote the underlying function performed by this subnetwork as $H(x)$, where $x$ is the input to the subnetwork. Residual learning re-parameterizes this subnetwork and lets the parameter layers represent a "residual function" $F(x) = H(x) - x$. The output $y$ of this subnetwork is then represented as:

y = F(x) + x

The operation of "$+\,x$" is implemented via a "skip connection" that performs an identity mapping to connect the input of the subnetwork with its output. This connection is referred to as a "residual connection" in later work. The function $F(x)$ is often represented by matrix multiplication interlaced with activation functions and normalization operations (e.g., batch normalization or layer normalization). As a whole, one of these subnetworks is referred to as a "residual block".[1] A deep residual network is constructed by simply stacking these blocks.
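
As a concrete illustration, the following is a minimal sketch of a residual block in PyTorch, assuming a two-layer fully connected residual function with layer normalization; the class and parameter names are illustrative and not taken from the original paper:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Computes y = F(x) + x, where F is a small stack of parameterized layers."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        # The residual function F: normalization, affine layers, and a nonlinearity.
        self.f = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + x  # the skip connection adds the identity mapping

# A deep residual network is simply a stack of such blocks.
net = nn.Sequential(*[ResidualBlock(dim=64, hidden=128) for _ in range(10)])
y = net(torch.randn(8, 64))
```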

Long short-term memory (LSTM) has a memory mechanism that serves as a residual connection.[4] In an LSTM without a forget gate, an input $x_t$ is processed by a function $F$ and added to a memory cell $c_t$, resulting in $c_{t+1} = c_t + F(x_t)$. An LSTM with a forget gate essentially functions as a highway network.

To stabilize the variance of the layers' inputs, it is recommended to replace the residual connections $x + f(x)$ with $x/L + f(x)$, where $L$ is the total number of residual layers.[5]

Projection connection

If the function $F$ is of type $F : \mathbb{R}^{n} \to \mathbb{R}^{m}$ where $n \neq m$, then $F(x) + x$ is undefined. To handle this special case, a projection connection is used:

y = F(x) + P(x)

where $P$ is typically a linear projection, defined by $P(x) = Mx$ where $M$ is an $m \times n$ matrix. The matrix is trained via backpropagation, as is any other parameter of the model.
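
A minimal sketch of a projection connection under the same conventions, assuming $F$ changes the feature dimension and $P$ is a trainable linear map; all names are illustrative:

```python
import torch
from torch import nn

class ProjectionResidualBlock(nn.Module):
    """Computes y = F(x) + P(x), used when F maps R^n to R^m with n != m."""
    def __init__(self, n_in: int, m_out: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(n_in, m_out), nn.ReLU(), nn.Linear(m_out, m_out))
        self.p = nn.Linear(n_in, m_out, bias=False)  # P(x) = Mx, an m-by-n matrix learned by backpropagation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + self.p(x)

block = ProjectionResidualBlock(n_in=64, m_out=128)
y = block(torch.randn(8, 64))  # output dimension is 128
```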

Signal propagation

The introduction of identity mappings facilitates signal propagation in both forward and backward paths.[6]

Forward propagation

If the output of the $\ell$-th residual block is the input to the $(\ell+1)$-th residual block (assuming no activation function between blocks), then the $(\ell+1)$-th input is:

x_{\ell+1} = F(x_{\ell}) + x_{\ell}

Applying this formulation recursively, e.g.:

\begin{aligned}
x_{\ell+2} &= F(x_{\ell+1}) + x_{\ell+1} \\
&= F(x_{\ell+1}) + F(x_{\ell}) + x_{\ell}
\end{aligned}

yields the general relationship:

x_{L} = x_{\ell} + \sum_{i=\ell}^{L-1} F(x_{i})

where $L$ is the index of a residual block and $\ell$ is the index of some earlier block. This formulation suggests that there is always a signal that is directly sent from a shallower block $\ell$ to a deeper block $L$.
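
This unrolled identity can be checked numerically; the sketch below uses toy linear residual functions purely for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)
F = [nn.Linear(4, 4) for _ in range(5)]   # one toy residual function per block

x0 = torch.randn(4)
xs = [x0]
for f in F:                                # forward pass: x_{l+1} = F(x_l) + x_l
    xs.append(f(xs[-1]) + xs[-1])

# Unrolled form: x_L = x_0 + sum_{i=0}^{L-1} F(x_i)
unrolled = xs[0] + sum(f(xi) for f, xi in zip(F, xs[:-1]))
print(torch.allclose(xs[-1], unrolled))    # True (up to floating-point rounding)
```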

Backward propagation

The residual learning formulation provides the added benefit of mitigating the vanishing gradient problem to some extent. However, the vanishing gradient issue is not the root cause of the degradation problem, which is instead tackled through the use of normalization. To observe the effect of residual blocks on backpropagation, consider the partial derivative of a loss function $\mathcal{E}$ with respect to some residual block input $x_{\ell}$. Using the equation above from forward propagation for a later residual block $L > \ell$:[6]

\begin{aligned}
\frac{\partial \mathcal{E}}{\partial x_{\ell}}
&= \frac{\partial \mathcal{E}}{\partial x_{L}} \frac{\partial x_{L}}{\partial x_{\ell}} \\
&= \frac{\partial \mathcal{E}}{\partial x_{L}} \left( 1 + \frac{\partial}{\partial x_{\ell}} \sum_{i=\ell}^{L-1} F(x_{i}) \right) \\
&= \frac{\partial \mathcal{E}}{\partial x_{L}} + \frac{\partial \mathcal{E}}{\partial x_{L}} \frac{\partial}{\partial x_{\ell}} \sum_{i=\ell}^{L-1} F(x_{i})
\end{aligned}

This formulation suggests that the gradient computation of a shallower layer, $\partial \mathcal{E} / \partial x_{\ell}$, always has a later term $\partial \mathcal{E} / \partial x_{L}$ that is directly added. Even if the gradients of the $F(x_{i})$ terms are small, the total gradient $\partial \mathcal{E} / \partial x_{\ell}$ resists vanishing due to the added term $\partial \mathcal{E} / \partial x_{L}$.
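
The effect can also be observed with automatic differentiation. The sketch below uses residual functions with deliberately tiny weights, so that the gradients of the $F(x_{i})$ terms are nearly zero; the setup is illustrative only:

```python
import torch
from torch import nn

torch.manual_seed(0)
depth, dim = 50, 16
blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
for b in blocks:                      # tiny weights: dF/dx is close to zero everywhere
    nn.init.normal_(b.weight, std=1e-3)
    nn.init.zeros_(b.bias)

x = torch.randn(dim, requires_grad=True)
h = x
for b in blocks:
    h = b(h) + h                      # with residual connections
h.sum().backward()
print(x.grad.norm())                  # stays close to sqrt(dim); replacing `b(h) + h` with `b(h)` makes it collapse toward zero
```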

Variants of residual blocks
Two variants of convolutional residual blocks.[1] Left: a basic block that has two 3×3 convolutional layers. Right: a bottleneck block that has a 1×1 convolutional layer for dimension reduction, a 3×3 convolutional layer, and another 1×1 convolutional layer for dimension restoration.

Basic block

A basic block is the simplest building block studied in the original ResNet.[1] This block consists of two sequential 3×3 convolutional layers and a residual connection. The input and output dimensions of both layers are equal.
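
A minimal sketch of a basic block, assuming the post-activation arrangement (convolution, batch normalization, ReLU) of the original ResNet; class and argument names are illustrative:

```python
import torch
from torch import nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a residual connection; input and output shapes are equal."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # add the identity, then apply the final activation

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))
```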

Block diagram of ResNet (2015). It shows a ResNet block with and without the 1×1 convolution. The 1×1 convolution (with stride) can be used to change the shape of the array, which is necessary for a residual connection through an upsampling/downsampling layer.

Bottleneck block

A bottleneck block[1] consists of three sequential convolutional layers and a residual connection. The first layer in this block is a 1×1 convolution for dimension reduction (e.g., to 1/4 of the input dimension); the second layer performs a 3×3 convolution; the last layer is another 1×1 convolution for dimension restoration. The models of ResNet-50, ResNet-101, and ResNet-152 are all based on bottleneck blocks.[1]
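
A minimal sketch of a bottleneck block with a reduction to 1/4 of the input channels, following the description above; the layer arrangement is one common convention and the names are illustrative:

```python
import torch
from torch import nn

class BottleneckBlock(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore) convolutions with a residual connection."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.f = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f(x) + x)

y = BottleneckBlock(256)(torch.randn(1, 256, 56, 56))
```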

Pre-activation block

The pre-activation residual block[6] applies activation functions before applying the residual function $F$. Formally, the computation of a pre-activation residual block can be written as:

x_{\ell+1} = F(\phi(x_{\ell})) + x_{\ell}

where $\phi$ can be any activation (e.g., ReLU) or normalization (e.g., LayerNorm) operation. This design reduces the number of non-identity mappings between residual blocks, and allows an identity mapping directly from the input to the output. This design was used to train models with 200 to over 1000 layers, and was found to consistently outperform variants where the residual path is not an identity function. The pre-activation ResNet with 200 layers took 3 weeks to train for ImageNet on 8 GPUs in 2016.[6]
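
A minimal sketch of a pre-activation residual block, assuming $\phi$ is batch normalization followed by ReLU applied before each convolution; names are illustrative:

```python
import torch
from torch import nn

class PreActBlock(nn.Module):
    """Computes x_{l+1} = F(phi(x_l)) + x_l; the skip path stays a pure identity mapping."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + x  # no activation after the addition

y = PreActBlock(64)(torch.randn(1, 64, 32, 32))
```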

Since GPT-2, transformer blocks have been mostly implemented as pre-activation blocks. This is often referred to as "pre-normalization" in the literature of transformer models.[7]

The original ResNet-18 architecture. Up to 152 layers were trained in the original publication (as "ResNet-152").[8]

Applications

Originally, ResNet was designed for computer vision.[1][8][9]

The Transformer architecture includes residual connections.

All transformer architectures include residual connections. Indeed, very deep transformers cannot be trained without them.[10]

The original ResNet paper made no claim of being inspired by biological systems. However, later research has related ResNet to biologically plausible algorithms.[11][12]

A study published in Science in 2023[13] disclosed the complete connectome of an insect brain (specifically that of a fruit fly larva). This study discovered "multilayer shortcuts" that resemble the skip connections in artificial neural networks, including ResNets.

History

Previous work

Residual connections were noticed in neuroanatomy, such as Lorente de No (1938).[14]: Fig 3  McCulloch and Pitts (1943) proposed artificial neural networks and considered those with residual connections.[15]: Fig 1.h 

In 1961, Frank Rosenblatt described a three-layer multilayer perceptron (MLP) model with skip connections.[16]: 313, Chapter 15  The model was referred to as a "cross-coupled system", and the skip connections were forms of cross-coupled connections.

During the late 1980s, "skip-layer" connections were sometimes used in neural networks.[17][18] For example, Lang and Witbrock (1988)[19] trained a fully connected feedforward network where each layer skip-connects to all subsequent layers, like the later DenseNet (2016). In this work, the residual connection had the form $x \mapsto F(x) + P(x)$, where $P$ is a randomly-initialized projection connection. They termed it a "short-cut connection". An early neural language model used residual connections and named them "direct connections".[20]

The long short-term memory (LSTM) cell can process data sequentially and keep its hidden state through time. The cell state $c_t$ can function as a generalized residual connection.

Degradation problem

Sepp Hochreiter discovered the vanishing gradient problem in 1991[21] and argued that it explained why the then-prevalent forms of recurrent neural networks did not work for long sequences. He and Schmidhuber later designed the LSTM architecture to solve this problem,[4][22] which has a "cell state" $c_t$ that can function as a generalized residual connection. The highway network (2015)[23][24] applied the idea of an LSTM unfolded in time to feedforward neural networks. ResNet is equivalent to an open-gated highway network.

Standard (left) and unfolded (right) basic recurrent neural network

During the early days of deep learning, there were attempts to train increasingly deep models. Notable examples included AlexNet (2012), which had 8 layers, and VGG-19 (2014), which had 19 layers.[25] However, stacking too many layers led to a steep reduction in training accuracy,[26] known as the "degradation" problem.[1] In theory, adding additional layers to deepen a network should not result in a higher training loss, but this is what happened with VGGNet.[1] If the extra layers can be set as identity mappings, however, then the deeper network would represent the same function as its shallower counterpart. There is some evidence that the optimizer is not able to approach identity mappings for the parameterized layers, and the benefit of residual connections was to allow identity mappings by default.[6]

In 2014, the state of the art was training deep neural networks with 20 to 30 layers.[25] The ResNet research team attempted to train deeper networks by empirically testing various training methods, until they came upon the ResNet architecture.[27]

Subsequent work

Wide Residual Network (2016) found that using more channels and fewer layers than the original ResNet improves performance and GPU-computational efficiency, and that a block with two 3×3 convolutions is superior to other configurations of convolution blocks.[28]

DenseNet (2016)[29] connects the output of each layer to the input to each subsequent layer:

x_{\ell+1} = F(x_{1}, x_{2}, \dots, x_{\ell-1}, x_{\ell})
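
A minimal sketch of this dense connectivity, assuming each layer operates on the channel-wise concatenation of all earlier outputs (an illustrative reduction of the DenseNet design, not the full block):

```python
import torch
from torch import nn

class DenseLayer(nn.Module):
    """Consumes the concatenation of all previous feature maps and emits `growth` new channels."""
    def __init__(self, in_channels: int, growth: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, features: list) -> torch.Tensor:
        return self.f(torch.cat(features, dim=1))  # x_{l+1} = F(x_1, ..., x_l)

channels, growth = 64, 32
layers = [DenseLayer(channels + i * growth, growth) for i in range(4)]
features = [torch.randn(1, channels, 32, 32)]
for layer in layers:
    features.append(layer(features))  # every earlier output feeds every later layer
```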

Stochastic depth[30] is a regularization method that randomly drops a subset of layers and lets the signal propagate through the identity skip connections. Also known as DropPath, this regularizes training for deep models, such as vision transformers.[31]
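
A minimal sketch of stochastic depth / DropPath applied to a residual branch; the per-sample Bernoulli mask with rescaling is one common formulation, and the function name is illustrative:

```python
import torch

def drop_path(residual: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """Randomly zero the residual branch per sample; the identity skip is always kept."""
    if not training or drop_prob == 0.0:
        return residual
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining dimensions.
    mask_shape = (residual.shape[0],) + (1,) * (residual.ndim - 1)
    mask = torch.bernoulli(torch.full(mask_shape, keep_prob, device=residual.device))
    return residual * mask / keep_prob  # rescale so the expected value is unchanged

# Usage inside a block's forward pass:
#   return x + drop_path(self.f(x), drop_prob=0.1, training=self.training)
```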

ResNeXt block diagram

ResNeXt (2017) combines the Inception module with ResNet.[32][8]

Squeeze-and-Excitation Networks (2018) added squeeze-and-excitation (SE) modules to ResNet.[33] An SE module is applied after a convolution and takes a tensor of shape $\mathbb{R}^{H \times W \times C}$ (height, width, channels) as input. Each channel is averaged, resulting in a vector of shape $\mathbb{R}^{C}$. This is then passed through a multilayer perceptron (with an architecture such as linear-ReLU-linear-sigmoid) before it is multiplied with the original tensor. The SE-equipped model (SENet) won the ILSVRC in 2017.[34]
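
A minimal sketch of a squeeze-and-excitation module following this description, using the channels-first layout of PyTorch; the reduction ratio is an assumed hyperparameter and the names are illustrative:

```python
import torch
from torch import nn

class SEModule(nn.Module):
    """Squeeze: average each channel. Excite: linear-ReLU-linear-sigmoid gate per channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        squeezed = x.mean(dim=(2, 3))         # (N, C, H, W) -> (N, C): one average per channel
        weights = self.mlp(squeezed)          # per-channel gates in (0, 1)
        return x * weights[:, :, None, None]  # rescale the original tensor channel-wise

y = SEModule(256)(torch.randn(2, 256, 14, 14))
```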

References
1. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". Conference on Computer Vision and Pattern Recognition. arXiv:1512.03385. doi:10.1109/CVPR.2016.90.
2. "ILSVRC2015 Results". image-net.org.
3. Deng, Jia; Dong, Wei; Socher, Richard; Li, Li-Jia; Li, Kai; Li, Fei-Fei (2009). "ImageNet: A large-scale hierarchical image database". Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2009.5206848.
4. Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
5. Hanin, Boris; Rolnick, David (2018). "How to Start Training: The Effect of Initialization and Architecture". Conference on Neural Information Processing Systems. Vol. 31. Curran Associates, Inc. arXiv:1803.01719.
6. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Identity Mappings in Deep Residual Networks". European Conference on Computer Vision. arXiv:1603.05027. doi:10.1007/978-3-319-46493-0_38.
7. Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (14 February 2019). "Language models are unsupervised multitask learners". Archived from the original on 6 February 2021. Retrieved 19 December 2020.
8. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "8.6. Residual Networks (ResNet) and ResNeXt". Dive into Deep Learning. Cambridge University Press. ISBN 978-1-009-38943-3.
9. Szegedy, Christian; Ioffe, Sergey; Vanhoucke, Vincent; Alemi, Alex (2017). "Inception-v4, Inception-ResNet and the impact of residual connections on learning". AAAI Conference on Artificial Intelligence. arXiv:1602.07261. doi:10.1609/aaai.v31i1.11231.
10. Dong, Yihe; Cordonnier, Jean-Baptiste; Loukas, Andreas (2021). "Attention is not all you need: pure attention loses rank doubly exponentially with depth". International Conference on Machine Learning. PMLR. pp. 2793–2803. arXiv:2103.03404.
11. Liao, Qianli; Poggio, Tomaso (2016). "Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex". arXiv:1604.03640 [cs.LG].
12. Xiao, Will; Chen, Honglin; Liao, Qianli; Poggio, Tomaso (2019). "Biologically-Plausible Learning Algorithms Can Scale to Large Datasets". International Conference on Learning Representations. arXiv:1811.03567.
13. Winding, Michael; Pedigo, Benjamin; Barnes, Christopher; Patsolic, Heather; Park, Youngser; Kazimiers, Tom; Fushiki, Akira; Andrade, Ingrid; Khandelwal, Avinash; Valdes-Aleman, Javier; Li, Feng; Randel, Nadine; Barsotti, Elizabeth; Correia, Ana; Fetter, Fetter; Hartenstein, Volker; Priebe, Carey; Vogelstein, Joshua; Cardona, Albert; Zlatic, Marta (10 March 2023). "The connectome of an insect brain". Science. 379 (6636): eadd9330. bioRxiv 10.1101/2022.11.28.516756v1. doi:10.1126/science.add9330. PMC 7614541. PMID 36893230. S2CID 254070919.
14. De Nó, Rafael Lorente (1938-05-01). "Analysis of the Activity of the Chains of Internuncial Neurons". Journal of Neurophysiology. 1 (3): 207–244. doi:10.1152/jn.1938.1.3.207. ISSN 0022-3077.
15. McCulloch, Warren S.; Pitts, Walter (1943-12-01). "A logical calculus of the ideas immanent in nervous activity". The Bulletin of Mathematical Biophysics. 5 (4): 115–133. doi:10.1007/BF02478259. ISSN 1522-9602.
16. Rosenblatt, Frank (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
17. Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). "Learning internal representations by error propagation". Parallel Distributed Processing. Vol. 1.
18. Venables, W. N.; Ripley, Brian D. (1994). Modern Applied Statistics with S-Plus. Springer. pp. 261–262. ISBN 978-3-540-94350-1.
19. Lang, Kevin; Witbrock, Michael (1988). "Learning to tell two spirals apart". Proceedings of the 1988 Connectionist Models Summer School: 52–59.
20. Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Jauvin, Christian (2003). "A Neural Probabilistic Language Model". Journal of Machine Learning Research. 3 (Feb): 1137–1155. ISSN 1533-7928.
21. Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (diploma thesis). Technical University Munich, Institute of Computer Science, advisor: J. Schmidhuber.
22. Gers, Felix A.; Schmidhuber, Jürgen; Cummins, Fred (2000). "Learning to Forget: Continual Prediction with LSTM". Neural Computation. 12 (10): 2451–2471. CiteSeerX 10.1.1.55.5709. doi:10.1162/089976600300015015. PMID 11032042. S2CID 11598600.
23. Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (3 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
24. Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2015). "Training Very Deep Networks". Conference on Neural Information Processing Systems. arXiv:1507.06228.
25. Simonyan, Karen; Zisserman, Andrew (2015-04-10). "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv:1409.1556 [cs.CV].
26. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". International Conference on Computer Vision. arXiv:1502.01852. doi:10.1109/ICCV.2015.123.
27. Linn, Allison (2015-12-10). "Microsoft researchers win ImageNet computer vision challenge". The AI Blog. Retrieved 2024-06-29.
28. Zagoruyko, Sergey; Komodakis, Nikos (2016-05-23). "Wide Residual Networks". arXiv:1605.07146 [cs.CV].
29. Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Weinberger, Kilian (2017). "Densely Connected Convolutional Networks". Conference on Computer Vision and Pattern Recognition. arXiv:1608.06993. doi:10.1109/CVPR.2017.243.
30. Huang, Gao; Sun, Yu; Liu, Zhuang; Weinberger, Kilian (2016). "Deep Networks with Stochastic Depth". European Conference on Computer Vision. arXiv:1603.09382. doi:10.1007/978-3-319-46493-0_39.
31. Lee, Youngwan; Kim, Jonghee; Willette, Jeffrey; Hwang, Sung Ju (2022). "MPViT: Multi-Path Vision Transformer for Dense Prediction". Conference on Computer Vision and Pattern Recognition. pp. 7287–7296. arXiv:2112.11010. doi:10.1109/CVPR52688.2022.00714.
32. Xie, Saining; Girshick, Ross; Dollar, Piotr; Tu, Zhuowen; He, Kaiming (2017). "Aggregated Residual Transformations for Deep Neural Networks". Conference on Computer Vision and Pattern Recognition. pp. 1492–1500. arXiv:1611.05431. doi:10.1109/CVPR.2017.634.
33. Hu, Jie; Shen, Li; Sun, Gang (2018). "Squeeze-and-Excitation Networks". Conference on Computer Vision and Pattern Recognition. pp. 7132–7141. arXiv:1709.01507. doi:10.1109/CVPR.2018.00745.
34. Hu, Jie (2017). Squeeze-and-Excitation Networks (presentation). Beyond ImageNet Large Scale Visual Recognition Challenge, Workshop at CVPR 2017.