Shambhavi Mishra

Let's pay some Attention!

Before discussing a new technology or methodology, we should try to understand the need for it. So, let us see what paved the way for Transformer networks.

Challenges with Recurrent Neural Networks

(Image source: mc.ai)

Gradients are simply vectors pointing in the direction of the steepest increase of a function. During backpropagation, gradients flow through repeated matrix multiplications via the chain rule. Small gradients keep getting smaller until they effectively vanish, which makes it hard to train the weights. This is called the vanishing gradient problem.
Conversely, if the gradients are large, the same repeated multiplication makes them grow without bound and results in very large updates to our network. This is known as the exploding gradient problem.
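A quick numerical sketch (not from the paper, just a toy illustration) shows both effects: backpropagating through many time steps is roughly like multiplying the gradient by the same factor over and over. Here a single scalar factor stands in for the repeated Jacobian.

```python
# Toy illustration of vanishing vs. exploding gradients.
for factor, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    g = 1.0
    for _ in range(50):          # 50 "time steps"
        g *= factor              # repeated multiplication during backprop
    print(f"{label}: gradient after 50 steps = {g:.3e}")
# vanishing: ~8.9e-16 (too small to update the weights meaningfully)
# exploding: ~6.4e+08 (huge, destabilising updates)
```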

Another challenge one faces with RNNs is recurrence: each hidden state depends on the previous one, which prevents parallel computation across time steps (see the sketch below).
Also, a large number of training steps is required to train an RNN.
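To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass (my own toy code, with made-up dimensions), where the loop over time steps cannot be parallelised:

```python
import numpy as np

T, d = 5, 4                              # 5 time steps, hidden size 4 (arbitrary)
W = np.random.randn(d, d) * 0.1          # recurrent weights
U = np.random.randn(d, d) * 0.1          # input weights
x = np.random.randn(T, d)                # input sequence

h = np.zeros(d)
for t in range(T):                       # must run step by step:
    h = np.tanh(W @ h + U @ x[t])        # h_t needs h_{t-1} before it can be computed
```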

The solution to all these problems is Transformers!
As the title says, Attention Is All You Need by Vaswani et al. (2017) is the paper that introduced the concept of Transformers.
Let us first understand the Attention Mechanism.
Below is an image from my notes on Prof. Pascal Poupart's lecture on Transformers.
(Attention mechanism diagram from the lecture notes)

The Attention Mechanism mimics the retrieval of a value (v) for a query (q) based on a key (k) in a database.
Given a query and some keys (k1, k2, k3, k4), we aim to produce an output that is a linear combination of the values, where the weights come from the similarity between our query and the keys.
In the diagram above, the first layer consists of the keys (vectors). We generate another layer by comparing each key with the query (q), so the second layer consists of similarities (s).

We take the softmax of these similarities to yield another layer (a). The weighted sum of the values (v), using the weights in (a), gives us the attention value.
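The whole retrieval fits in a few lines. Below is a minimal NumPy sketch of the idea, assuming dot-product similarity between the query and the keys (the lecture diagram leaves the similarity function abstract); the names q, K and V are my own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4
q = np.random.randn(d)       # query
K = np.random.randn(4, d)    # keys k1..k4 stacked as rows
V = np.random.randn(4, d)    # values v1..v4 stacked as rows

s = K @ q                    # layer of similarities between query and keys
a = softmax(s)               # softmax layer: the attention weights
attention_value = a @ V      # linear combination of the values
```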

So far we have understood what gave rise to the need for Attention and what exactly the Attention Mechanism is.
What more will we cover?

  • Multihead Attention
  • Masked Multihead Attention
  • Layer Normalisation
  • Positional Embedding
  • Comparison of Self Attention and Recurrent Layers

Let's cover all this in the next blog!
You can follow me on Twitter, where I share all my blogs and other good content!
