In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.
Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serialrecurrent neural network (RNN) language translation system, but a more recent design, namely thetransformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.
Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of using information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.
Fast weight controllers introduced dynamic links between neurons, anticipating the key-value mechanisms of attention.[5][6][7][8]
1998
The bilateral filter was introduced in image processing. It uses pairwise affinity matrices to propagate relevance across elements.[9]
2005
Non-local means extended affinity-based filtering in image denoising, using Gaussian similarity kernels as fixed attention-like weights.[10]
2014
seq2seq with RNN + Attention.[11] Attention was introduced to enhance RNN encoder-decoder translation, particularly for long sentences. See Overview section.
Attentional Neural Networks introduced a learned feature selection mechanism using top-down cognitive modulation, showing how attention weights can highlight relevant inputs.[12]
2015
Attention was extended to vision for image captioning tasks.[13][14]
2016
Self-attention was integrated into RNN-based models to capture intra-sequence dependencies.[15][16]
Self-attention was explored in decomposable attention models for natural language inference[17] and structured self-attentive sentence embeddings.[18]
Relation networks[20] and set Transformers[21] applied attention to unordered sets and relational reasoning, generalizing pairwise interaction models.
2018
Non-local neural networks[22] extended attention to computer vision by capturing long-range dependencies in space and time. Graph attention networks[23] applied attention mechanisms to graph-structured data.
2019–2020
Efficient Transformers, including Reformer,[24] Linformer,[25] and Performer,[26] introduced scalable approximations of attention for long sequences.
2019+
Hopfield networks were reinterpreted as associative memory-based attention systems,[27] and vision transformers (ViTs) achieved competitive results in image classification.[28]
Transformers were adopted across scientific domains, including AlphaFold for protein folding,[29] CLIP for vision-language pretraining,[30] and attention-based dense segmentation models like CCNet[31] and DANet.[32]
Additional surveys of the attention mechanism in deep learning are provided by Niu et al.[33] and Soydaner.[34]
The major breakthrough came with self-attention, where each element in the input sequence attends to all others, enabling the model to capture global dependencies. This idea was central to the Transformer architecture, which replaced recurrence with attention mechanisms. As a result, Transformers became the foundation for models like BERT, T5 and generative pre-trained transformers (GPT).[19]
The modern era of machine attention was revitalized by grafting an attention mechanism (Fig 1, orange) onto an encoder-decoder.[citation needed]
Animated sequence of language translation
Fig 1. Encoder-decoder with attention.[35] Numerical subscripts (100, 300, 500, 9k, 10k) indicate vector sizes while lettered subscripts i and i − 1 indicate time steps. Pinkish regions in H matrix and w vector are zero values. See Legend for details.
9k, 10k: Dictionary size of input & output languages respectively.
x, Y: 9k and 10k 1-hot dictionary vectors. x → x implemented as a lookup table rather than vector multiplication. Y is the 1-hot maximizer of the linear decoder layer D; that is, it takes the argmax of D's linear layer output.
x: 300-long word embedding vector. The vectors are usually pre-calculated from other projects such as GloVe or Word2Vec.
h: 500-long encoder hidden vector. At each point in time, this vector summarizes all the preceding words before it. The final h can be viewed as a "sentence" vector, or a thought vector, as Hinton calls it.
s: 500-long decoder hidden state vector.
E: 500-neuron recurrent neural network encoder. 500 outputs. Input count is 800: 300 from the source embedding + 500 from the recurrent connections. The encoder feeds directly into the decoder only to initialize it, but not thereafter; hence, that direct connection is shown very faintly.
D: 2-layer decoder. The recurrent layer has 500 neurons and the fully-connected linear layer has 10k neurons (the size of the target vocabulary).[36] The linear layer alone has 5 million (500 × 10k) weights – about 10 times more weights than the recurrent layer.
score: 100-long alignment score.
w: 100-long attention weight vector. These are "soft" weights which change during the forward pass, in contrast to "hard" neuronal weights that change during the learning phase.
A: Attention module – this can be a dot product of recurrent states, or the query-key-value fully-connected layers. The output is a 100-long vector w.
H: 500×100 matrix: 100 hidden vectors h concatenated into a matrix.
c: 500-long context vector = H * w. c is a linear combination of h vectors weighted by w (this data flow is sketched in code after the legend).
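As a rough numerical sketch of this data flow (not the original system), the following NumPy snippet computes the alignment scores, the attention weights w, and the context vector c from a decoder state s and the matrix H of encoder hidden states, using the dimensions in the legend. A plain dot-product scoring function stands in for the attention module A, and the random arrays stand in for trained states.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Legend dimensions: 100 words per sentence, 500-long hidden vectors.
H = rng.normal(size=(500, 100))      # encoder hidden states, one 500-long column per word
s = rng.normal(size=(500,))          # current decoder hidden state s_{i-1}

# Attention module A: here a simple dot product of s with every column of H.
score = H.T @ s                      # 100-long alignment score
w = softmax(score)                   # 100-long "soft" attention weights, summing to 1

c = H @ w                            # 500-long context vector: a weighted mix of the columns of H
```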
Figure 2 shows the internal step-by-step operation of the attention block (A) in Fig 1.
Figure 2. The diagram shows the attention forward pass calculating correlations of the word "that" with other words in "See that girl run." Given the right weights from training, the network should be able to identify "girl" as a highly correlated word. Some things to note:
This example focuses on the attention of a single word "that". In practice, the attention of each word is calculated in parallel to speed up calculations. Simply changing the lowercase "x" vector to the uppercase "X" matrix will yield the formula for this (see the sketch after these notes).
Softmax scaling qWk^T/√100 prevents a high variance in qWk^T that would allow a single word to excessively dominate the softmax, resulting in attention to only one word, as a discrete hard max would do.
Notation: the commonly written row-wise softmax formula above assumes that vectors are rows, which runs contrary to the standard math notation of column vectors. More correctly, we should take the transpose of the context vector and use the column-wise softmax, resulting in the more correct form.
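A minimal sketch of the calculation described for Figure 2, assuming 300-long embeddings, 100-wide queries and keys, and randomly initialized projection matrices Wq and Wk standing in for trained weights. Replacing the single vector x with the matrix X of all word embeddings gives the parallel form mentioned in the first note above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 100                               # width of queries and keys

X = rng.normal(size=(4, 300))         # embeddings of "See that girl run", one row per word
Wq = rng.normal(size=(300, d))        # query projection (learned in practice)
Wk = rng.normal(size=(300, d))        # key projection (learned in practice)

# Attention of the single word "that" (row 1 of X):
x = X[1]
q = x @ Wq                            # query for "that"
K = X @ Wk                            # keys for every word in the sentence
w = softmax(q @ K.T / np.sqrt(d))     # dividing by sqrt(100) keeps the score variance moderate

# Parallel form: one row of attention weights per word, computed in a single matrix product.
W_all = softmax((X @ Wq) @ K.T / np.sqrt(d), axis=1)
```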
In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced.
Consider an example of translating I love you to French. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime.
In the I love you example, the second English word love is aligned with the third French word aime. Stacking the soft row vectors together for je, t', and aime yields an alignment matrix:
      I     love  you
je    0.94  0.02  0.04
t'    0.11  0.01  0.88
aime  0.03  0.95  0.02
Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.
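The distinction can be made concrete with a small sketch (the numbers are illustrative only): a soft context vector is a weighted sum of all hidden vectors, while a hard choice commits to a single one.

```python
import numpy as np

H = np.array([[1.0, 0.0],               # hidden vector for "look"
              [0.0, 1.0],               # hidden vector for "it"
              [1.0, 1.0]])              # hidden vector for "up"
w_soft = np.array([0.5, 0.1, 0.4])      # soft weights: every word contributes

c_soft = w_soft @ H                     # weighted sum of all hidden vectors
c_hard = H[np.argmax(w_soft)]           # "hard" choice: only the single best hidden vector

print(c_soft)   # [0.9 0.5] -> mixes "look" and "up", as needed for "cherchez-le"
print(c_hard)   # [1. 0.]   -> commits to one word only
```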
Comparison of the data flow in CNN, RNN, and self-attention
Many variants of attention implement soft weights, such as
fast weight programmers, or fast weight controllers (1992).[5] A "slow" neural network outputs the "fast" weights of another neural network through outer products. The slow network learns by gradient descent. It was later renamed as "linearized self-attention".[37]
Bahdanau-style attention,[11] also referred to as additive attention,
Luong-style attention,[38] which is known as multiplicative attention (the two scoring functions are contrasted in the sketch after this list),
Early attention mechanisms similar to modern self-attention were proposed using recurrent neural networks. However, the highly parallelizable self-attention was introduced in 2017 and successfully used in the Transformer model,
For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention,[40] channel attention,[41] or combinations.[42][43]
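The additive (Bahdanau-style) and multiplicative (Luong-style) scoring functions can be contrasted in a short sketch. The parameter matrices Wa, Ua, Wm and the vector va below are illustrative stand-ins for learned weights; either score vector would then be passed through a softmax to obtain the attention weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500                                   # hidden size
H = rng.normal(size=(10, d))              # encoder hidden states, one row per source word
s = rng.normal(size=(d,))                 # current decoder state

# Bahdanau-style (additive): a small feed-forward network scores each pair (s, h_j).
Wa = rng.normal(size=(d, d)) * 0.01       # small random values standing in for learned weights
Ua = rng.normal(size=(d, d)) * 0.01
va = rng.normal(size=(d,)) * 0.01
additive_scores = np.tanh(s @ Wa + H @ Ua) @ va          # one score per source word, shape (10,)

# Luong-style (multiplicative): a bilinear form, reducing to a plain dot product when Wm = I.
Wm = rng.normal(size=(d, d)) * 0.01
multiplicative_scores = H @ (Wm @ s)                      # shape (10,)
```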
These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in the Overview section above.
1. encoder-decoder dot product. Both encoder & decoder are needed to calculate attention.[38]
2. encoder-decoder QKV. Both encoder & decoder are needed to calculate attention.[44]
3. encoder-only dot product. The decoder is not used to calculate attention. With only one input into corr, W is an auto-correlation of dot products, w_ij = x_i · x_j.[45]
4. encoder-only QKV
5. Pytorch tutorial. A fully-connected layer is used to calculate attention instead of dot-product correlation.[47]
Legend
Variables X, H, S, T: Upper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state, one word per column.
S, T: S, decoder hidden state; T, target word embedding. In the Pytorch Tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word; i.e. embedding(argmax(FC output)). Alternatively, with teacher forcing, T could be the embedding of the known correct word, which can occur with a constant forcing probability, say 1/2.
X, H: H, encoder hidden state; X, input word embeddings.
W: Attention coefficients.
Qw, Kw, Vw, FC: Weight matrices for query, key, value respectively. FC is a fully-connected weight matrix.
corr: Column-wise softmax(matrix of all combinations of dot products). The dot products are x_i * x_j in variant #3, h_i * s_j in variant 1, column i (Kw * H) * column j (Qw * S) in variant 2, and column i (Kw * X) * column j (Qw * X) in variant 4. Variant 5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, the dot products are normalized by √d, where d is the height of the QKV matrices (see the sketch after this legend).
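A sketch of the two encoder-only variants under the column-vector convention of this legend, with randomly initialized Qw, Kw, Vw standing in for trained projections: variant 3 builds W from the auto-correlation of the inputs, while variant 4 projects the inputs into queries, keys and values first.

```python
import numpy as np

def softmax_cols(Z):
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)      # column-wise softmax, as in the legend

rng = np.random.default_rng(0)
d, n = 300, 5
X = rng.normal(size=(d, n))                      # input word embeddings, one column per word

# Variant 3 (encoder-only dot product): W is an auto-correlation of the inputs, w_ij = x_i . x_j.
W3 = softmax_cols(X.T @ X)

# Variant 4 (encoder-only QKV): project X into queries, keys and values first.
dk = 64
Qw = rng.normal(size=(dk, d))
Kw = rng.normal(size=(dk, d))
Vw = rng.normal(size=(dk, d))
W4 = softmax_cols((Kw @ X).T @ (Qw @ X) / np.sqrt(dk))   # normalized by sqrt(d), d = height of K/Q

attended3 = X @ W3                               # each output column is a mixture of input columns
attended4 = (Vw @ X) @ W4                        # each output column is a mixture of value vectors
```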
The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency.[48]
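A simplified NumPy sketch of the blockwise, online-softmax idea behind such kernels (not the actual fused GPU implementation): the key/value sequence is processed one block at a time, so the full score matrix is never materialized, yet the result matches the straightforward computation.

```python
import numpy as np

def blocked_attention(Q, K, V, block=64):
    """Attention computed key-block by key-block with an online softmax,
    so the full (n_q, n_k) score matrix is never stored at once."""
    n_q, d = Q.shape
    m = np.full(n_q, -np.inf)             # running row-wise maximum of the scores
    l = np.zeros(n_q)                     # running softmax denominator
    acc = np.zeros((n_q, V.shape[1]))     # running un-normalized output

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                         # scores for this key block only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                         # rescale previously accumulated results
        P = np.exp(S - m_new[:, None])
        acc = acc * scale[:, None] + P @ Vb
        l = l * scale + P.sum(axis=1)
        m = m_new
    return acc / l[:, None]

# Matches the straightforward implementation up to floating-point error:
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 128, 32))
scores = Q @ K.T / np.sqrt(32)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)
assert np.allclose(blocked_attention(Q, K, V), weights @ V)
```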
FlexAttention[49] is an attention kernel developed by Meta that allows users to modify attention scores prior to softmax and dynamically chooses the optimal attention algorithm.
Attention is widely used in natural language processing, computer vision, and speech recognition. In NLP, it improves context understanding in tasks like question answering and summarization. In vision, visual attention helps models focus on relevant image regions, enhancing object detection and image captioning.
Attention maps as explanations for vision transformers
Since the original paper on vision transformers (ViT), visualizing attention scores as a heat map (called a saliency map or attention map) has become an important and routine way to inspect the decision-making process of ViT models.[50] One can compute the attention maps with respect to any attention head at any layer, while the deeper layers tend to show more semantically meaningful visualizations. Attention rollout is a recursive algorithm to combine attention scores across all layers, by computing the dot product of successive attention maps.[51]
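A minimal sketch of attention rollout as commonly formulated: each layer's head-averaged attention map is mixed with the identity matrix to account for residual connections, re-normalized, and the per-layer maps are multiplied together from the first layer to the last.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer attention maps, each of shape
    (num_heads, num_tokens, num_tokens), ordered from the first layer to the last."""
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attn in attentions:
        A = layer_attn.mean(axis=0)                 # average over heads
        A = A + np.eye(num_tokens)                  # add identity for the residual connection
        A = A / A.sum(axis=-1, keepdims=True)       # keep rows normalized
        rollout = A @ rollout                       # propagate attention through the layers
    return rollout    # the row for the [CLS] token gives one saliency score per input patch
```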
Because vision transformers are typically trained in a self-supervised manner, attention maps are generally not class-sensitive. When a classification head is attached to the ViT backbone, class-discriminative attention maps (CDAM) combine attention maps and gradients with respect to the class [CLS] token.[52] Some class-sensitive interpretability methods originally developed for convolutional neural networks can also be applied to ViT, such as Grad-CAM, which back-propagates the gradients to the outputs of the final attention layer.[53]
Using attention as the basis of explanation for transformers in language and vision is not without debate. While some pioneering papers analyzed and framed attention scores as explanations,[54][55] higher attention scores do not always correlate with greater impact on model performance.[56]
For matrices $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$, the scaled dot-product, or QKV attention, is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \in \mathbb{R}^{m \times d_v}$$

where $^T$ denotes transpose and the softmax function is applied independently to every row of its argument. The matrix $Q$ contains $m$ queries, while matrices $K, V$ jointly contain an unordered set of $n$ key-value pairs. Value vectors in matrix $V$ are weighted using the weights resulting from the softmax operation, so that the rows of the $m$-by-$d_v$ output matrix are confined to the convex hull of the points in $\mathbb{R}^{d_v}$ given by the rows of $V$.
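The definition translates directly into code; a minimal NumPy sketch, with the softmax applied independently to every row of its argument:

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)        # subtract row maxima for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def qkv_attention(Q, K, V):
    """Q: (m, d_k) queries; K: (n, d_k) keys; V: (n, d_v) values.
    Returns an (m, d_v) matrix whose rows are convex combinations of the rows of V."""
    d_k = Q.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V
```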
To understand the permutation invariance and permutation equivariance properties of QKV attention,[57] let $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{n \times n}$ be permutation matrices, and $D \in \mathbb{R}^{m \times n}$ an arbitrary matrix. The softmax function is permutation equivariant in the sense that:

$$\text{softmax}(A D B) = A\,\text{softmax}(D)\,B$$

By noting that the transpose of a permutation matrix is also its inverse, it follows that:

$$\text{Attention}(A Q, B K, B V) = A\,\text{Attention}(Q, K, V)$$

which shows that QKV attention is equivariant with respect to re-ordering the queries (rows of $Q$), and invariant to re-ordering of the key-value pairs in $K$ and $V$. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as:

$$X \mapsto \text{Attention}(X W_Q, X W_K, X W_V)$$

is permutation equivariant with respect to re-ordering the rows of the input matrix $X$ in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for multi-head attention, which is defined below.
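These properties can be checked numerically; a small self-contained sketch with random matrices and random permutations:

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def qkv_attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(7, 4))
A = np.eye(5)[rng.permutation(5)]      # permutation of the queries
B = np.eye(7)[rng.permutation(7)]      # permutation of the key-value pairs

# Equivariant in the queries, invariant in the key-value pairs:
assert np.allclose(qkv_attention(A @ Q, B @ K, B @ V), A @ qkv_attention(Q, K, V))
```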
When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have $n$ rows, a masked attention variant is used:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where the mask $M \in \mathbb{R}^{n \times n}$ is a strictly upper triangular matrix, with zeros on and below the diagonal and $-\infty$ in every element above the diagonal. The softmax output, also in $\mathbb{R}^{n \times n}$, is then lower triangular, with zeros in all elements above the diagonal. The masking ensures that for all $1 \le i < j \le n$, row $i$ of the attention output is independent of row $j$ of any of the three input matrices. The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.
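A minimal sketch of the masked variant, adding the mask M of −∞ values above the diagonal before the softmax (the function name is illustrative):

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V):
    """Causal (masked) QKV attention for equal-length Q, K, V of n rows."""
    n, d_k = Q.shape
    M = np.triu(np.full((n, n), -np.inf), k=1)   # -inf strictly above the diagonal, 0 elsewhere
    return softmax_rows(Q @ K.T / np.sqrt(d_k) + M) @ V
```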
Multi-head attention is defined as:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$

where each head is computed with QKV attention as:

$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are parameter matrices.
The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices $A$, $B$:

$$\text{MultiHead}(A Q, B K, B V) = A\,\text{MultiHead}(Q, K, V)$$

from which we also see that multi-head self-attention:

$$X \mapsto \text{MultiHead}(X W^Q, X W^K, X W^V)$$

is equivariant with respect to re-ordering of the rows of input matrix $X$.
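A compact sketch of multi-head attention, reusing the single-head function from above; the per-head projection matrices and the output projection are stand-ins for learned parameters.

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def qkv_attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """WQ, WK, WV: lists with one (d_model, d_k) or (d_model, d_v) projection per head;
    WO: (num_heads * d_v, d_model) output projection."""
    heads = [qkv_attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO   # concatenate the heads, then project back
```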
Self-attention is essentially the same as cross-attention, except that the query, key, and value vectors are all derived from the same sequence. Both the encoder and the decoder can use self-attention, but with subtle differences.
For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed lookup table. This gives a sequence of hidden vectors $h_0, h_1, \dots$, which can be stacked into a matrix $H$. These can then be applied to a dot-product attention mechanism, to obtain $h_i^{(1)} = \text{Attention}(h_i W^Q, H W^K, H W^V)$ or, more succinctly, $H^{(1)} = \text{Attention}(H W^Q, H W^K, H W^V)$. This can be applied repeatedly, to obtain a multilayered encoder. This is the "encoder self-attention", sometimes called the "all-to-all attention", as the vector at every position can attend to every other.
Decoder self-attention with causal masking, detailed diagram
For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process, the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights $w_{ij} = 0$ for all $i < j$, called "causal masking". This attention mechanism is the "causally masked self-attention".
^Broadbent, Donald E. (1958). Perception and Communication. Pergamon Press.
^Kowler, Eileen (1995). "The control of saccadic eye movements". Reviews of Oculomotor Research. 5: 1–70.
^Rumelhart, David E.; Hinton, G. E.; McClelland, James L. (1987-07-29). "A General Framework for Parallel Distributed Processing" (PDF). In Rumelhart, David E.; Hinton, G. E.; PDP Research Group (eds.). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations. Cambridge, Massachusetts: MIT Press. ISBN 978-0-262-68053-0.
^Xu, Kelvin; Ba, Jimmy; Kiros, Ryan (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044.
^Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2015). "Show and Tell: A Neural Image Caption Generator". 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3156–3164. doi:10.1109/CVPR.2015.7298935. ISBN 978-1-4673-6964-0.
^Cheng, Jianpeng (2016). "Long Short-Term Memory-Networks for Machine Reading". arXiv:1601.06733 [cs.CL].
^Paulus, Romain (2017). "A Deep Reinforced Model for Abstractive Summarization". arXiv:1705.04304 [cs.CL].
^Parikh, Ankur (2016). A Decomposable Attention Model for Natural Language Inference. EMNLP. arXiv:1606.01933.
^Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
^Zhu, Xizhou; Cheng, Dazhi; Zhang, Zheng; Lin, Stephen; Dai, Jifeng (2019). "An Empirical Study of Spatial Attention Mechanisms in Deep Networks". 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6687–6696. arXiv:1904.05873. doi:10.1109/ICCV.2019.00679. ISBN 978-1-7281-4803-8. S2CID 118673006.
^Woo, Sanghyun; Park, Jongchan; Lee, Joon-Young; Kweon, In So (2018-07-18). "CBAM: Convolutional Block Attention Module". arXiv:1807.06521 [cs.CV].
^Georgescu, Mariana-Iuliana; Ionescu, Radu Tudor; Miron, Andreea-Iuliana; Savencu, Olivian; Ristea, Nicolae-Catalin; Verga, Nicolae; Khan, Fahad Shahbaz (2022-10-12). "Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution". arXiv:2204.04218 [eess.IV].
^Mullenbach, James; Wiegreffe, Sarah; Duke, Jon; Sun, Jimeng; Eisenstein, Jacob (2018-04-16). Explainable Prediction of Medical Codes from Clinical Text. arXiv:1802.05695.
^Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
^Serrano, Sofia; Smith, Noah A. (2019-06-09). Is Attention Interpretable?. arXiv:1906.03731.
^Lee, Juho; Lee, Yoonho; Kim, Jungtaek; Kosiorek, Adam R.; Choi, Seungjin; Teh, Yee Whye (2018). "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks". arXiv:1810.00825 [cs.LG].