# rnn-from-scratch
I’m assuming that you are somewhat familiar with basic neural networks. If you’re not, you may want to head over to [Implementing A Neural Network From Scratch](https://github.com/pangolulu/neural-network-from-scratch), which guides you through the ideas and implementation behind non-recurrent networks.
This post is inspired by the [recurrent neural networks tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) from WildML. You can read it for the basic knowledge about RNNs, which I will not repeat in this tutorial.
In this tutorial, we will focus on how to train an RNN with Backpropagation Through Time (BPTT), based on the computation graph of the RNN, using automatic differentiation. You will find that computing the gradient this way is simpler and more reliable than doing it by hand.
This post takes an RNN language model (rnnlm) as its example. More about the fancy applications of RNNs can be found here.
The architecture of an RNN is shown in the following figure.
Note that the parameters (W, U, V) are shared across time steps, and that the output at each time step is a softmax. So we can use the cross-entropy loss as the error function and an optimization method (e.g. gradient descent) to learn the parameters (W, U, V).
Let's recap the equations of our RNN:

    s_t = tanh(U x_t + W s_{t-1})
    y^_t = softmax(V s_t)

We also defined our loss, or error, to be the cross-entropy loss, given by:

    E_t(y_t, y^_t) = -y_t log y^_t
    E(y, y^) = Σ_t E_t(y_t, y^_t)
Here y_t is the correct word at time step t, and y^_t is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).
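As a quick standalone check (function names here are my own, not from the repo), the cross-entropy at one time step with a one-hot target reduces to minus the log probability the model assigns to the correct word:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; probabilities are unchanged
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)

def cross_entropy(scores, target_index):
    # with a one-hot target, cross entropy is -log(prob of the target word)
    return -np.log(softmax(scores)[target_index])

scores = np.array([2.0, 1.0, 0.1])
loss = cross_entropy(scores, 0)  # target word has the highest score, so loss is small
```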
Remember that our goal is to calculate the gradients of the error with respect to our parameters U, V and W, and then learn good parameters using an optimization method (in this post we use Stochastic Gradient Descent). Just like we sum up the errors, we also sum up the gradients at each time step for one training example:

    dE/dW = Σ_t dE_t/dW

That is, we calculate dE_t/dW, dE_t/dU and dE_t/dV at each time step, then sum over all time steps.
It is simple to calculate dE_t/dV, because it only depends on the values at the current time step. But the story is different for dE_t/dW and dE_t/dU. Note that s_3 = tanh(Ux_3 + Ws_2) depends on s_2, which in turn depends on W, U and s_1, and so on. So if we take the derivative with respect to W we can't treat s_2 as a constant! We need to apply the chain rule again. You can see this in the following figure.
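To see the chain rule through time in action, here is a tiny scalar sketch (my own illustration, not from the repo; the loss is simply taken to be the final state s_T): the recursive derivative ds_t/dw = (1 - s_t^2)(s_{t-1} + w · ds_{t-1}/dw) matches a numerical finite-difference estimate.

```python
import numpy as np

# Scalar RNN: s_t = tanh(u*x_t + w*s_{t-1}), with s_0 = 0.
def run(u, w, xs):
    s = 0.0
    states = [s]
    for x in xs:
        s = np.tanh(u * x + w * s)
        states.append(s)
    return s, states

def analytic_dw(u, w, xs):
    # chain rule through time: ds_t/dw = (1 - s_t^2) * (s_{t-1} + w * ds_{t-1}/dw)
    _, states = run(u, w, xs)
    dsdw = 0.0
    for t in range(len(xs)):
        s_t = states[t + 1]
        dsdw = (1.0 - s_t ** 2) * (states[t] + w * dsdw)
    return dsdw

u, w, xs = 0.5, 0.8, [1.0, -0.3, 0.7]
eps = 1e-6
numeric = (run(u, w + eps, xs)[0] - run(u, w - eps, xs)[0]) / (2 * eps)
analytic = analytic_dw(u, w, xs)  # agrees with the finite-difference estimate
```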
Now let's use the computation graph to represent E_1 as an example and calculate dE_1/dW; dE_1/dU follows the same idea.
Note that this is exactly the same as the standard backpropagation algorithm that we use in deep Feedforward Neural Networks. The key difference is that we sum up the gradients for W at each time step. In a traditional NN we don’t share parameters across layers, so we don’t need to sum anything. In my opinion, BPTT is just a fancy name for standard backpropagation on an unrolled RNN.
To simplify the computation graph and make it efficient, we can merge several small operation units into one larger operation unit. Have a look at the following figure. Note that each operation unit must implement a forward function and a backward function.
The implementations of all operation units and the softmax output are as follows:
```python
mulGate = MultiplyGate()
addGate = AddGate()
activation = Tanh()

class RNNLayer:
    def forward(self, x, prev_s, U, W, V):
        self.mulu = mulGate.forward(U, x)
        self.mulw = mulGate.forward(W, prev_s)
        self.add = addGate.forward(self.mulw, self.mulu)
        self.s = activation.forward(self.add)
        self.mulv = mulGate.forward(V, self.s)

    def backward(self, x, prev_s, U, W, V, diff_s, dmulv):
        self.forward(x, prev_s, U, W, V)
        dV, dsv = mulGate.backward(V, self.s, dmulv)
        ds = dsv + diff_s
        dadd = activation.backward(self.add, ds)
        dmulw, dmulu = addGate.backward(self.mulw, self.mulu, dadd)
        dW, dprev_s = mulGate.backward(W, prev_s, dmulw)
        dU, dx = mulGate.backward(U, x, dmulu)
        return (dprev_s, dU, dW, dV)
```
```python
import numpy as np

class MultiplyGate:
    def forward(self, W, x):
        return np.dot(W, x)

    def backward(self, W, x, dz):
        # dW is the outer product of the incoming gradient and the input
        dW = np.outer(dz, x)
        dx = np.dot(np.transpose(W), dz)
        return dW, dx

class AddGate:
    def forward(self, x1, x2):
        return x1 + x2

    def backward(self, x1, x2, dz):
        dx1 = dz * np.ones_like(x1)
        dx2 = dz * np.ones_like(x2)
        return dx1, dx2
```
```python
class Sigmoid:
    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def backward(self, x, top_diff):
        output = self.forward(x)
        return (1.0 - output) * output * top_diff

class Tanh:
    def forward(self, x):
        return np.tanh(x)

    def backward(self, x, top_diff):
        output = self.forward(x)
        return (1.0 - np.square(output)) * top_diff
```
```python
class Softmax:
    def predict(self, x):
        exp_scores = np.exp(x)
        return exp_scores / np.sum(exp_scores)

    def loss(self, x, y):
        probs = self.predict(x)
        return -np.log(probs[y])

    def diff(self, x, y):
        probs = self.predict(x)
        probs[y] -= 1.0
        return probs
```
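A useful sanity check on the softmax unit (restated standalone here so it runs on its own) is that its analytic gradient, `probs` with 1 subtracted at the target index, matches a numerical finite-difference gradient of the loss:

```python
import numpy as np

# Minimal standalone restatement of the Softmax unit's predict/loss/diff.
def predict(x):
    e = np.exp(x)
    return e / np.sum(e)

def loss(x, y):
    return -np.log(predict(x)[y])

def diff(x, y):
    probs = predict(x)
    probs[y] -= 1.0  # analytic gradient of -log(probs[y]) w.r.t. the scores x
    return probs

x = np.array([0.3, -1.2, 0.8, 0.05])
y = 2
analytic = diff(x.copy(), y)

# Central finite differences, one coordinate at a time.
numeric = np.zeros_like(x)
eps = 1e-6
for i in range(len(x)):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    numeric[i] = (loss(xp, y) - loss(xm, y)) / (2 * eps)
```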
These implementations are much the same as in Implementing A Neural Network From Scratch, except that here the input x or s is a 1-D array, whereas in the previous post the input X was a batch of data represented as a matrix (each row an example).
Now that we are able to calculate the gradients for our parameters we can use SGD to train the model.
Initializing the parameters U, V and W is a bit tricky. We can’t just initialize them to 0’s because that would result in symmetric calculations in all our layers. We must initialize them randomly. Because proper initialization seems to have an impact on training results, there has been a lot of research in this area. It turns out that the best initialization depends on the activation function (tanh in our case), and one recommended approach is to initialize the weights randomly in the interval [-1/sqrt(n), 1/sqrt(n)], where n is the number of incoming connections from the previous layer.
```python
class Model:
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        self.U = np.random.uniform(-np.sqrt(1. / word_dim), np.sqrt(1. / word_dim), (hidden_dim, word_dim))
        self.W = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (hidden_dim, hidden_dim))
        self.V = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (word_dim, hidden_dim))
```
Above, word_dim is the size of our vocabulary, and hidden_dim is the size of our hidden layer (we can pick it). Don’t worry about the bptt_truncate parameter for now; we’ll explain what it is later.
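For concreteness, here is a sketch of how a sentence becomes a training pair (x, y); the tiny vocabulary and indices below are made up for illustration and are not the repo's actual preprocessing:

```python
# Hypothetical 6-word vocabulary; 0 = SENTENCE_START, 1 = SENTENCE_END.
index_to_word = ["SENTENCE_START", "SENTENCE_END", "the", "cat", "sat", "down"]
word_to_index = {w: i for i, w in enumerate(index_to_word)}

tokens = ["SENTENCE_START", "the", "cat", "sat", "down", "SENTENCE_END"]
indices = [word_to_index[w] for w in tokens]

# x is the sentence minus the last token; y is the same sentence shifted left by one,
# so at each time step the target is simply the next word.
x = indices[:-1]
y = indices[1:]
```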
Next, let’s implement the forward propagation (predicting word probabilities) defined by our equations above:
```python
def forward_propagation(self, x):
    '''
    Forward propagation (predicting word probabilities).
    x is one training example: a list of word indices,
    e.g. x = [0, 179, 341, 416], with y = [179, 341, 416, 1].
    '''
    T = len(x)  # the total number of time steps
    layers = []
    prev_s = np.zeros(self.hidden_dim)
    # For each time step...
    for t in range(T):
        layer = RNNLayer()
        input = np.zeros(self.word_dim)
        input[x[t]] = 1  # one-hot encoding of the t-th word
        layer.forward(input, prev_s, self.U, self.W, self.V)
        prev_s = layer.s
        layers.append(layer)
    return layers
```
We also implement a predict function to generate the results.
```python
def predict(self, x):
    output = Softmax()
    layers = self.forward_propagation(x)
    return [np.argmax(output.predict(layer.mulv)) for layer in layers]
```
To train our network we need a way to measure the errors it makes. We call this the loss function L, and our goal is to find the parameters U, V and W that minimize the loss function for our training data. A common choice for the loss function is the cross-entropy loss.
```python
def calculate_loss(self, x, y):
    assert len(x) == len(y)
    output = Softmax()
    layers = self.forward_propagation(x)
    loss = 0.0
    for i, layer in enumerate(layers):
        loss += output.loss(layer.mulv, y[i])
    return loss / float(len(y))

def calculate_total_loss(self, X, Y):
    loss = 0.0
    for i in range(len(Y)):
        loss += self.calculate_loss(X[i], Y[i])
    return loss / float(len(Y))
```
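A quick sanity check we can do on this loss (my own aside, not from the repo): with C vocabulary words and random initialization, each word is predicted with probability roughly 1/C, so the initial per-word loss should be close to log C:

```python
import numpy as np

word_dim = 8000  # the vocabulary size used later in this post
# With uniform random predictions, loss per word = -log(1/C) = log(C)
expected_random_loss = np.log(word_dim)
```

If the loss printed after initialization is far from this value, something is likely wrong with the implementation.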
As introduced above, we now implement the BPTT algorithm. It takes a training example (x, y) as input and returns the gradients dL/dW, dL/dU and dL/dV.
```python
def bptt(self, x, y):
    assert len(x) == len(y)
    output = Softmax()
    layers = self.forward_propagation(x)
    dU = np.zeros(self.U.shape)
    dV = np.zeros(self.V.shape)
    dW = np.zeros(self.W.shape)

    T = len(layers)
    prev_s_t = np.zeros(self.hidden_dim)
    diff_s = np.zeros(self.hidden_dim)
    for t in range(0, T):
        dmulv = output.diff(layers[t].mulv, y[t])
        input = np.zeros(self.word_dim)
        input[x[t]] = 1
        dprev_s, dU_t, dW_t, dV_t = layers[t].backward(input, prev_s_t, self.U, self.W, self.V, diff_s, dmulv)
        prev_s_t = layers[t].s
        dmulv = np.zeros(self.word_dim)
        # Propagate back through at most bptt_truncate earlier time steps
        for i in range(t - 1, max(-1, t - self.bptt_truncate - 1), -1):
            input = np.zeros(self.word_dim)
            input[x[i]] = 1
            prev_s_i = np.zeros(self.hidden_dim) if i == 0 else layers[i - 1].s
            dprev_s, dU_i, dW_i, dV_i = layers[i].backward(input, prev_s_i, self.U, self.W, self.V, dprev_s, dmulv)
            dU_t += dU_i
            dW_t += dW_i
        dV += dV_t
        dU += dU_t
        dW += dW_t
    return (dU, dW, dV)
```
Now that we are able to calculate the gradients for our parameters we can implement SGD. I like to do this in two steps:
- A function `sgd_step` that calculates the gradients and performs the updates for one training example.
- An outer loop that iterates through the training set and adjusts the learning rate.
```python
import sys
from datetime import datetime

def sgd_step(self, x, y, learning_rate):
    dU, dW, dV = self.bptt(x, y)
    self.U -= learning_rate * dU
    self.V -= learning_rate * dV
    self.W -= learning_rate * dW

def train(self, X, Y, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    num_examples_seen = 0
    losses = []
    for epoch in range(nepoch):
        if epoch % evaluate_loss_after == 0:
            loss = self.calculate_total_loss(X, Y)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            # Adjust the learning rate if the loss increases
            if len(losses) > 1 and losses[-1][1] > losses[-2][1]:
                learning_rate = learning_rate * 0.5
                print("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
        # For each training example...
        for i in range(len(Y)):
            self.sgd_step(X[i], Y[i], learning_rate)
            num_examples_seen += 1
    return losses
```
Here, we anneal the learning rate by a factor of 0.5 whenever the loss increases in an epoch. More about learning rate decay can be found here.
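The annealing schedule boils down to a few lines; in this sketch the loss values are made up purely to illustrate the halving rule used in `train` above:

```python
learning_rate = 0.005
losses = [10.0, 8.0, 8.5, 7.0, 7.2]  # hypothetical per-epoch losses

# Halve the rate whenever the loss goes up relative to the previous epoch.
for prev, cur in zip(losses, losses[1:]):
    if cur > prev:
        learning_rate *= 0.5
```

With two increases in this made-up history, the rate ends at 0.005 * 0.25 = 0.00125.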
Done! Let’s try to get a sense of how long it would take to train our network:
```python
word_dim = 8000
hidden_dim = 100
X_train, y_train = getSentenceData('data/reddit-comments-2015-08.csv', word_dim)

np.random.seed(10)
rnn = Model(word_dim, hidden_dim)
rnn.sgd_step(X_train[10], y_train[10], 0.005)
```
The bad news is that one step of SGD takes a few seconds on my laptop. We have about 80,000 examples in our training data, so one epoch (iteration over the whole data set) would take several hours. Multiple epochs would take days, or even weeks!
There are many ways to speed up our code. One is to run it on a GPU with a library like Theano. But in this tutorial, let’s just try to run SGD on a small dataset and check that the loss actually decreases:
```python
word_dim = 8000
hidden_dim = 100
X_train, y_train = getSentenceData('data/reddit-comments-2015-08.csv', word_dim)

np.random.seed(10)
rnn = Model(word_dim, hidden_dim)
losses = rnn.train(X_train[:100], y_train[:100], learning_rate=0.005, nepoch=10, evaluate_loss_after=1)
```
Good, it seems like our implementation is at least doing something useful and decreasing the loss, just like we wanted.
RNNs suffer from what is called the vanishing gradient problem. This is why a traditional RNN cannot capture long-term dependencies, and why we use the bptt_truncate parameter to constrain the length of the dependency. It also motivates the move to more sophisticated RNN models, such as LSTMs, which are the current state of the art for many tasks in NLP.
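A rough numerical illustration of the vanishing gradient (my own sketch; the dimensions, seed, and number of steps are chosen arbitrarily): backpropagating an error vector through many identical tanh steps with a small random W shrinks its norm dramatically.

```python
import numpy as np

np.random.seed(0)
hidden_dim = 100
# Same initialization scale as the Model class above
W = np.random.uniform(-np.sqrt(1.0 / hidden_dim), np.sqrt(1.0 / hidden_dim),
                      (hidden_dim, hidden_dim))
s = np.random.uniform(-1, 1, hidden_dim)

# Backpropagate a unit-norm error vector through 50 time steps.
grad = np.ones(hidden_dim) / np.sqrt(hidden_dim)
norms = []
for _ in range(50):
    s = np.tanh(np.dot(W, s))
    grad = np.dot(W.T, grad) * (1.0 - s ** 2)  # one step of the chain rule
    norms.append(np.linalg.norm(grad))
```

The gradient norm collapses toward zero, so the contribution of far-away time steps to the parameter updates is effectively lost, and truncating BPTT sacrifices little.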
More about the vanishing gradient problem and LSTM can be found here, here and here.
- https://github.com/pangolulu/neural-network-from-scratch
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
- http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
About
Implementing Recurrent Neural Network from Scratch