Differentiable neural computer

From Wikipedia, the free encyclopedia
Artificial neural network architecture
A differentiable neural computer being trained to store and recall dense binary numbers. Performance on a reference task during training is shown. Upper left: the input (red) and target (blue), as 5-bit words and a 1-bit interrupt signal. Upper right: the model's output.

In artificial intelligence, a differentiable neural computer (DNC) is a memory-augmented neural network architecture (MANN), which is typically (but not by definition) recurrent in its implementation. The model was published in 2016 by Alex Graves et al. of DeepMind.[1]

Applications


The DNC indirectly takes inspiration from the Von Neumann architecture, making it likely to outperform conventional architectures in tasks that are fundamentally algorithmic and cannot be learned by finding a decision boundary.

So far, DNCs have been demonstrated to handle only relatively simple tasks, ones that could also be solved with conventional programming. Unlike conventional programs, however, DNCs do not need to be programmed for each problem and can instead be trained. Their attention span allows the user to feed complex data structures such as graphs sequentially and recall them for later use. Furthermore, they can learn aspects of symbolic reasoning and apply them to working memory. The researchers who published the method see promise that DNCs can be trained to perform complex, structured tasks[1][2] and address big-data applications that require some sort of reasoning, such as generating video commentaries or semantic text analysis.[3][4]
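As an illustration of feeding a graph sequentially, the sketch below shows one hypothetical encoding, loosely modeled on the triple-based graph tasks in the paper; the station names, line names and one-hot layout are invented for the example and are not DeepMind's actual input format. Each edge of a small transit map becomes one fixed-width input vector per time step, and queries about the graph could later be posed in the same format.

# Hypothetical example: serialising a small transit graph into a sequence of
# (source, line, destination) triples that could be fed to a memory-augmented
# model one time step at a time. Names and encoding are illustrative only.
edges = [
    ("KingsCross", "Victoria", "Euston"),
    ("Euston", "Northern", "TottenhamCourtRoad"),
    ("TottenhamCourtRoad", "Central", "Bank"),
]

stations = sorted({s for e in edges for s in (e[0], e[2])})
lines = sorted({e[1] for e in edges})
vocab = {name: i for i, name in enumerate(stations + lines)}

def one_hot(name, size=len(vocab)):
    v = [0.0] * size
    v[vocab[name]] = 1.0
    return v

# One input vector per time step: concatenated one-hot codes of the triple.
sequence = [one_hot(s) + one_hot(l) + one_hot(d) for s, l, d in edges]
print(len(sequence), len(sequence[0]))   # 3 time steps, each 3 * len(vocab) wide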

A DNC can be trained to navigate rapid transit systems and then apply what it has learned to a different system. A neural network without memory would typically have to learn about each transit system from scratch. On graph traversal and sequence-processing tasks with supervised learning, DNCs performed better than alternatives such as long short-term memory or a neural Turing machine.[5] With a reinforcement learning approach to a block puzzle problem inspired by SHRDLU, a DNC was trained via curriculum learning and learned to make a plan. It performed better than a traditional recurrent neural network.[5]

Architecture

DNC system diagram

DNC networks were introduced as an extension of the Neural Turing Machine (NTM), with the addition of memory attention mechanisms that control where the memory is stored, and temporal attention that records the order of events. This structure allows DNCs to be more robust and abstract than an NTM, and still perform tasks that have longer-term dependencies than some predecessors such as long short-term memory (LSTM). The memory, which is simply a matrix, can be allocated dynamically and accessed indefinitely. The DNC is differentiable end-to-end (each subcomponent of the model is differentiable, therefore so is the whole model). This makes it possible to optimize it efficiently using gradient descent.[3][6][7]

The DNC model is similar to the Von Neumann architecture, and because its memory is resizable, it is Turing complete.[8]
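The end-to-end differentiability can be illustrated with a minimal sketch (a toy with assumed shapes and a made-up key-producing layer, not the published model): because a read is a softmax-weighted sum over memory rows rather than a discrete lookup, gradients flow from the output through the memory access and back into the parameters that produced the read key. The example assumes PyTorch.

# Minimal sketch (not the published implementation): a soft, content-based read
# from a memory matrix is differentiable, so gradient descent can train the
# parameters that produced the read key.
import torch
import torch.nn.functional as F

N, W = 16, 8                       # memory slots x word size (arbitrary)
memory = torch.randn(N, W)         # memory matrix M
key_layer = torch.nn.Linear(W, W)  # toy "controller" emitting a read key

x = torch.randn(W)                 # toy input
k = key_layer(x)                   # read key
beta = 5.0                         # key strength (fixed here for brevity)

# Content-based addressing: softmax over cosine similarities.
sim = F.cosine_similarity(k.unsqueeze(0), memory, dim=1)
w_read = torch.softmax(beta * sim, dim=0)   # read weighting (sums to 1)
r = w_read @ memory                          # read vector

loss = r.sum()
loss.backward()                              # gradients reach key_layer
print(key_layer.weight.grad.shape)           # torch.Size([8, 8])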

Traditional DNC


DNC, as originally published[1]

Independent variables
    Input vector: $\mathbf{x}_t$
    Target vector: $\mathbf{z}_t$

Controller
    Controller input vector: $\boldsymbol{\chi}_t = [\mathbf{x}_t; \mathbf{r}_{t-1}^{1}; \cdots; \mathbf{r}_{t-1}^{R}]$

    Deep (layered) LSTM, for all $0 \leq l \leq L$:
    Input gate vector: $\mathbf{i}_t^{l} = \sigma(W_i^{l}[\boldsymbol{\chi}_t; \mathbf{h}_{t-1}^{l}; \mathbf{h}_t^{l-1}] + \mathbf{b}_i^{l})$
    Output gate vector: $\mathbf{o}_t^{l} = \sigma(W_o^{l}[\boldsymbol{\chi}_t; \mathbf{h}_{t-1}^{l}; \mathbf{h}_t^{l-1}] + \mathbf{b}_o^{l})$
    Forget gate vector: $\mathbf{f}_t^{l} = \sigma(W_f^{l}[\boldsymbol{\chi}_t; \mathbf{h}_{t-1}^{l}; \mathbf{h}_t^{l-1}] + \mathbf{b}_f^{l})$
    State gate vector ($\mathbf{s}_0 = \mathbf{0}$): $\mathbf{s}_t^{l} = \mathbf{f}_t^{l} \circ \mathbf{s}_{t-1}^{l} + \mathbf{i}_t^{l} \circ \tanh(W_s^{l}[\boldsymbol{\chi}_t; \mathbf{h}_{t-1}^{l}; \mathbf{h}_t^{l-1}] + \mathbf{b}_s^{l})$
    Hidden gate vector ($\mathbf{h}_0 = \mathbf{0}$ and $\mathbf{h}_t^{0} = \mathbf{0}$ for all $t$): $\mathbf{h}_t^{l} = \mathbf{o}_t^{l} \circ \tanh(\mathbf{s}_t^{l})$

    DNC output vector: $\mathbf{y}_t = W_y[\mathbf{h}_t^{1}; \cdots; \mathbf{h}_t^{L}] + W_r[\mathbf{r}_t^{1}; \cdots; \mathbf{r}_t^{R}]$

Read & Write heads
    Interface parameters: $\boldsymbol{\xi}_t = W_{\xi}[\mathbf{h}_t^{1}; \cdots; \mathbf{h}_t^{L}] = [\mathbf{k}_t^{r,1}; \cdots; \mathbf{k}_t^{r,R}; \hat{\beta}_t^{r,1}; \cdots; \hat{\beta}_t^{r,R}; \mathbf{k}_t^{w}; \hat{\beta}_t^{w}; \hat{\mathbf{e}}_t; \mathbf{v}_t; \hat{f}_t^{1}; \cdots; \hat{f}_t^{R}; \hat{g}_t^{a}; \hat{g}_t^{w}; \hat{\boldsymbol{\pi}}_t^{1}; \cdots; \hat{\boldsymbol{\pi}}_t^{R}]$

    Read heads, for all $1 \leq i \leq R$:
    Read keys: $\mathbf{k}_t^{r,i}$
    Read strengths: $\beta_t^{r,i} = \text{oneplus}(\hat{\beta}_t^{r,i})$
    Free gates: $f_t^{i} = \sigma(\hat{f}_t^{i})$
    Read modes ($\boldsymbol{\pi}_t^{i} \in \mathbb{R}^{3}$): $\boldsymbol{\pi}_t^{i} = \text{softmax}(\hat{\boldsymbol{\pi}}_t^{i})$

    Write head:
    Write key: $\mathbf{k}_t^{w}$
    Write strength: $\beta_t^{w} = \text{oneplus}(\hat{\beta}_t^{w})$
    Erase vector: $\mathbf{e}_t = \sigma(\hat{\mathbf{e}}_t)$
    Write vector: $\mathbf{v}_t$
    Allocation gate: $g_t^{a} = \sigma(\hat{g}_t^{a})$
    Write gate: $g_t^{w} = \sigma(\hat{g}_t^{w})$

Memory
    Memory matrix ($E \in \mathbb{R}^{N \times W}$ is a matrix of ones): $M_t = M_{t-1} \circ (E - \mathbf{w}_t^{w} \mathbf{e}_t^{\intercal}) + \mathbf{w}_t^{w} \mathbf{v}_t^{\intercal}$
    Usage vector: $\mathbf{u}_t = (\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^{w} - \mathbf{u}_{t-1} \circ \mathbf{w}_{t-1}^{w}) \circ \boldsymbol{\psi}_t$
    Precedence weighting ($\mathbf{p}_0 = \mathbf{0}$): $\mathbf{p}_t = \left(1 - \sum_i \mathbf{w}_t^{w}[i]\right) \mathbf{p}_{t-1} + \mathbf{w}_t^{w}$
    Temporal link matrix ($L_0 = \mathbf{0}$, diagonal held at zero): $L_t[i,j] = (1 - \mathbf{w}_t^{w}[i] - \mathbf{w}_t^{w}[j])\, L_{t-1}[i,j] + \mathbf{w}_t^{w}[i]\, \mathbf{p}_{t-1}[j]$
    Write weighting: $\mathbf{w}_t^{w} = g_t^{w}\,[g_t^{a} \mathbf{a}_t + (1 - g_t^{a})\, \mathbf{c}_t^{w}]$
    Read weightings: $\mathbf{w}_t^{r,i} = \boldsymbol{\pi}_t^{i}[1]\, \mathbf{b}_t^{i} + \boldsymbol{\pi}_t^{i}[2]\, \mathbf{c}_t^{r,i} + \boldsymbol{\pi}_t^{i}[3]\, \mathbf{f}_t^{i}$
    Read vectors: $\mathbf{r}_t^{i} = M_t^{\intercal} \mathbf{w}_t^{r,i}$

    Content-based addressing (lookup key $\mathbf{k}$, key strength $\beta$): $\mathcal{C}(M, \mathbf{k}, \beta)[i] = \dfrac{\exp\{\mathcal{D}(\mathbf{k}, M[i,\cdot])\, \beta\}}{\sum_j \exp\{\mathcal{D}(\mathbf{k}, M[j,\cdot])\, \beta\}}$
    Usage-sorted indices: $\phi_t$, the indices of $\mathbf{u}_t$ sorted in ascending order of usage
    Allocation weighting: $\mathbf{a}_t[\phi_t[j]] = (1 - \mathbf{u}_t[\phi_t[j]]) \prod_{i=1}^{j-1} \mathbf{u}_t[\phi_t[i]]$
    Write content weighting: $\mathbf{c}_t^{w} = \mathcal{C}(M_{t-1}, \mathbf{k}_t^{w}, \beta_t^{w})$
    Read content weighting: $\mathbf{c}_t^{r,i} = \mathcal{C}(M_{t-1}, \mathbf{k}_t^{r,i}, \beta_t^{r,i})$
    Forward weighting: $\mathbf{f}_t^{i} = L_t \mathbf{w}_{t-1}^{r,i}$
    Backward weighting: $\mathbf{b}_t^{i} = L_t^{\intercal} \mathbf{w}_{t-1}^{r,i}$
    Memory retention vector: $\boldsymbol{\psi}_t = \prod_{i=1}^{R} \left(\mathbf{1} - f_t^{i} \mathbf{w}_{t-1}^{r,i}\right)$

Definitions
    Weight matrix, bias vector: $W$, $\mathbf{b}$
    Zeros matrix, ones matrix, identity matrix: $\mathbf{0}$, $\mathbf{1}$, $I$
    Element-wise multiplication: $\circ$
    Cosine similarity: $\mathcal{D}(\mathbf{u}, \mathbf{v}) = \dfrac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\, \|\mathbf{v}\|}$
    Sigmoid (logistic) function: $\sigma(x) = 1/(1 + e^{-x})$
    Oneplus function: $\text{oneplus}(x) = 1 + \log(1 + e^{x})$
    Softmax function: $\text{softmax}(\mathbf{x})_j = \dfrac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}$ for $j = 1, \ldots, K$
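The update rules above can be condensed into a short sketch. The following NumPy code illustrates the published equations for a single write head and R read heads, with the interface values (keys, strengths, gates, read modes) supplied as ready-made arrays already mapped to their ranges; the function and variable names and the dictionary layout are assumptions made for readability, not part of the original model.

# Minimal NumPy sketch of the memory update above, for one write head and R
# read heads. Interface outputs are passed in as plain arrays; this is an
# illustration of the published equations, not a reference implementation.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def content_addressing(M, k, beta):
    # C(M, k, beta)[i] is a softmax over cosine similarities D(k, M[i, :]).
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    return softmax(beta * sim)

def memory_step(M, u, p, L, w_r_prev, w_w_prev, interface):
    """One DNC memory update.

    M: (N, W) memory, u: (N,) usage, p: (N,) precedence, L: (N, N) link
    matrix, w_r_prev: (R, N) previous read weightings, w_w_prev: (N,)
    previous write weighting. `interface` holds k_w, beta_w, e, v,
    k_r (R, W), beta_r (R,), f (R,), g_a, g_w, pi (R, 3).
    """
    N, _ = M.shape
    R = w_r_prev.shape[0]

    # Retention and usage: free gates f[i] release locations that were read.
    psi = np.prod(1.0 - interface["f"][:, None] * w_r_prev, axis=0)
    u = (u + w_w_prev - u * w_w_prev) * psi

    # Allocation weighting over locations sorted by ascending usage.
    phi = np.argsort(u)
    a = np.zeros(N)
    a[phi] = (1.0 - u[phi]) * np.cumprod(np.concatenate(([1.0], u[phi][:-1])))

    # Content weightings are computed against the previous memory M_{t-1}.
    c_w = content_addressing(M, interface["k_w"], interface["beta_w"])
    c_r = [content_addressing(M, interface["k_r"][i], interface["beta_r"][i])
           for i in range(R)]

    # Write weighting, then erase and add.
    w_w = interface["g_w"] * (interface["g_a"] * a
                              + (1.0 - interface["g_a"]) * c_w)
    M = M * (1.0 - np.outer(w_w, interface["e"])) + np.outer(w_w, interface["v"])

    # Temporal link matrix (diagonal kept at zero) and precedence weighting.
    L = (1.0 - w_w[:, None] - w_w[None, :]) * L + np.outer(w_w, p)
    np.fill_diagonal(L, 0.0)
    p = (1.0 - w_w.sum()) * p + w_w

    # Reads mix backward, content and forward modes with pi[i].
    w_r = np.zeros_like(w_r_prev)
    for i in range(R):
        back, fwd = L.T @ w_r_prev[i], L @ w_r_prev[i]
        w_r[i] = (interface["pi"][i, 0] * back
                  + interface["pi"][i, 1] * c_r[i]
                  + interface["pi"][i, 2] * fwd)
    reads = w_r @ M                              # r_t^i = M_t^T w_t^{r,i}
    return M, u, p, L, w_r, w_w, reads

One call to memory_step corresponds to a single time step t of the memory block: the controller (the deep LSTM above) would produce the interface values from $\boldsymbol{\chi}_t$ and receive the returned read vectors as part of its input at step t+1.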

Extensions


Refinements include sparse memory addressing, which reduces time and space complexity by thousands of times. This can be achieved by using an approximate nearest neighbor algorithm, such as locality-sensitive hashing, or a random k-d tree like the Fast Library for Approximate Nearest Neighbors from UBC.[9] Adding adaptive computation time (ACT) decouples computation time from data time, exploiting the fact that problem length and problem difficulty are not always the same.[10] Training using synthetic gradients performs considerably better than backpropagation through time (BPTT).[11] Robustness can be improved with the use of layer normalization and bypass dropout as regularization.[12]
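A minimal sketch of the sparse-addressing idea follows (illustrative only: the cited work retrieves candidates with an approximate nearest-neighbour index such as locality-sensitive hashing or randomized k-d trees, whereas this toy performs an exact top-K search). Restricting content addressing to the K most similar memory rows makes the cost of a lookup scale with K rather than with the total number of memory locations.

# Sketch of sparse content addressing: attend only to the top-K rows most
# similar to the key, instead of a softmax over all N rows. A real sparse DNC
# would find the candidates with an approximate nearest-neighbour index; here
# an exact top-K search stands in.
import numpy as np

def sparse_content_addressing(M, k, beta, K=8):
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    top = np.argpartition(-sim, K)[:K]          # candidate locations
    e = np.exp(beta * sim[top])
    w = np.zeros(M.shape[0])
    w[top] = e / e.sum()                        # weights are zero elsewhere
    return w

M = np.random.randn(4096, 64)                   # large memory
k = np.random.randn(64)
w = sparse_content_addressing(M, k, beta=2.0)
print(np.count_nonzero(w))                      # 8 nonzero read weights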


References

  1. Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574. S2CID 205251479.
  2. "Differentiable neural computers | DeepMind". DeepMind. 12 October 2016. Retrieved 2016-10-19.
  3. Burgess, Matt. "DeepMind's AI learned to ride the London Underground using human-like reason and memory". WIRED UK. Retrieved 2016-10-19.
  4. Jaeger, Herbert (2016-10-12). "Artificial intelligence: Deep neural reasoning". Nature. 538 (7626): 467–468. Bibcode:2016Natur.538..467J. doi:10.1038/nature19477. ISSN 1476-4687. PMID 27732576.
  5. James, Mike. "DeepMind's Differentiable Neural Network Thinks Deeply". www.i-programmer.info. Retrieved 2016-10-20.
  6. "DeepMind AI 'Learns' to Navigate London Tube". PCMAG. Retrieved 2016-10-19.
  7. Mannes, John (13 October 2016). "DeepMind's differentiable neural computer helps you navigate the subway with its memory". TechCrunch. Retrieved 2016-10-19.
  8. "RNN Symposium 2016: Alex Graves - Differentiable Neural Computer". YouTube. 22 March 2017.
  9. Rae, Jack W.; Hunt, Jonathan J.; Harley, Tim; Danihelka, Ivo; Senior, Andrew; Wayne, Greg; Graves, Alex; Lillicrap, Timothy P. (2016). "Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes". arXiv:1610.09027 [cs.LG].
  10. Graves, Alex (2016). "Adaptive Computation Time for Recurrent Neural Networks". arXiv:1603.08983 [cs.NE].
  11. Jaderberg, Max; Czarnecki, Wojciech Marian; Osindero, Simon; Vinyals, Oriol; Graves, Alex; Silver, David; Kavukcuoglu, Koray (2016). "Decoupled Neural Interfaces using Synthetic Gradients". arXiv:1608.05343 [cs.LG].
  12. Franke, Jörg; Niehues, Jan; Waibel, Alex (2018). "Robust and Scalable Differentiable Neural Computer for Question Answering". arXiv:1807.02658 [cs.CL].
