# rnn-from-scratch
I’m assuming that you are somewhat familiar with basic neural networks. If you’re not, you may want to head over to [Implementing A Neural Network From Scratch](https://github.com/pangolulu/neural-network-from-scratch), which guides you through the ideas and implementation behind non-recurrent networks.
This post is inspired by the [recurrent neural networks tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) from WildML. You can read it for the basic knowledge about RNNs, which I will not repeat in this tutorial.
In this tutorial, we will focus on how to train an RNN with Backpropagation Through Time (BPTT), based on the computation graph of the RNN, using automatic differentiation. You will find that computing the gradient this way is simpler and more reliable than doing it by hand.
This post takes an RNN language model (rnnlm) as its example. More about the fancy applications of RNNs can be found here.
The architecture of an RNN is shown in the following figure.
Note that the parameters (W, U, V) are shared across time steps, and that the output at each time step is a softmax. So we can use the cross-entropy loss as the error function and an optimization method (e.g. gradient descent) to learn the parameters (W, U, V).
Let's recap the equations of our RNN:

    s_t = tanh(U x_t + W s_{t-1})
    y^_t = softmax(V s_t)

We also defined our loss, or error, to be the cross-entropy loss, given by:

    E_t(y_t, y^_t) = -y_t log y^_t
    E(y, y^) = Σ_t E_t(y_t, y^_t)
Here y_t is the correct word at time step t, and y^_t is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).
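As a quick standalone check (function names here are my own, not from the repo), the cross-entropy at one time step with a one-hot target reduces to minus the log probability the model assigns to the correct word:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; probabilities are unchanged
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)

def cross_entropy(scores, target_index):
    # with a one-hot target, cross entropy is -log(prob of the target word)
    return -np.log(softmax(scores)[target_index])

scores = np.array([2.0, 1.0, 0.1])
loss = cross_entropy(scores, 0)  # target word has the highest score, so loss is small
```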
Remember that our goal is to calculate the gradients of the error with respect to our parameters U, V and W, and then learn good parameters using an optimization method (in this post we use Stochastic Gradient Descent). Just like we sum up the errors, we also sum up the gradients at each time step for one training example:

    dE/dW = Σ_t dE_t/dW

That is, we calculate dE_t/dW, dE_t/dU and dE_t/dV at each time step, then sum over all time steps.
It is simple to calculate dE_t/dV, because it only depends on the values at the current time step. But the story is different for dE_t/dW and dE_t/dU. Note that s_3 = tanh(Ux_3 + Ws_2) depends on s_2, which in turn depends on W, U and s_1, and so on. So if we take the derivative with respect to W we can't treat s_2 as a constant! We need to apply the chain rule again. You can see this in the following figure.
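To see the chain rule through time in action, here is a tiny scalar sketch (my own illustration, not from the repo; the loss is simply taken to be the final state s_T): the recursive derivative ds_t/dw = (1 - s_t^2)(s_{t-1} + w · ds_{t-1}/dw) matches a numerical finite-difference estimate.

```python
import numpy as np

# Scalar RNN: s_t = tanh(u*x_t + w*s_{t-1}), with s_0 = 0.
def run(u, w, xs):
    s = 0.0
    states = [s]
    for x in xs:
        s = np.tanh(u * x + w * s)
        states.append(s)
    return s, states

def analytic_dw(u, w, xs):
    # chain rule through time: ds_t/dw = (1 - s_t^2) * (s_{t-1} + w * ds_{t-1}/dw)
    _, states = run(u, w, xs)
    dsdw = 0.0
    for t in range(len(xs)):
        s_t = states[t + 1]
        dsdw = (1.0 - s_t ** 2) * (states[t] + w * dsdw)
    return dsdw

u, w, xs = 0.5, 0.8, [1.0, -0.3, 0.7]
eps = 1e-6
numeric = (run(u, w + eps, xs)[0] - run(u, w - eps, xs)[0]) / (2 * eps)
analytic = analytic_dw(u, w, xs)  # agrees with the finite-difference estimate
```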
Now let's use the computation graph to represent E_1 as an example and calculate dE_1/dW; dE_1/dU follows the same idea.
Note that this is exactly the same as the standard backpropagation algorithm that we use in deep Feedforward Neural Networks. The key difference is that we sum up the gradients for W at each time step. In a traditional NN we don’t share parameters across layers, so we don’t need to sum anything. In my opinion, BPTT is just a fancy name for standard backpropagation on an unrolled RNN.
To simplify the computation graph and make it efficient, we can merge several small operation units into one larger operation unit. Have a look at the following figure. Note that each operation unit must implement a forward function and a backward function.
The implementations of all operation units and the softmax output are as follows:
```python
mulGate = MultiplyGate()
addGate = AddGate()
activation = Tanh()

class RNNLayer:
    def forward(self, x, prev_s, U, W, V):
        self.mulu = mulGate.forward(U, x)
        self.mulw = mulGate.forward(W, prev_s)
        self.add = addGate.forward(self.mulw, self.mulu)
        self.s = activation.forward(self.add)
        self.mulv = mulGate.forward(V, self.s)

    def backward(self, x, prev_s, U, W, V, diff_s, dmulv):
        self.forward(x, prev_s, U, W, V)
        dV, dsv = mulGate.backward(V, self.s, dmulv)
        ds = dsv + diff_s
        dadd = activation.backward(self.add, ds)
        dmulw, dmulu = addGate.backward(self.mulw, self.mulu, dadd)
        dW, dprev_s = mulGate.backward(W, prev_s, dmulw)
        dU, dx = mulGate.backward(U, x, dmulu)
        return (dprev_s, dU, dW, dV)
```
```python
import numpy as np

class MultiplyGate:
    def forward(self, W, x):
        return np.dot(W, x)

    def backward(self, W, x, dz):
        # dW is the outer product of the incoming gradient and the input
        dW = np.outer(dz, x)
        dx = np.dot(np.transpose(W), dz)
        return dW, dx

class AddGate:
    def forward(self, x1, x2):
        return x1 + x2

    def backward(self, x1, x2, dz):
        dx1 = dz * np.ones_like(x1)
        dx2 = dz * np.ones_like(x2)
        return dx1, dx2
```
```python
class Sigmoid:
    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def backward(self, x, top_diff):
        output = self.forward(x)
        return (1.0 - output) * output * top_diff

class Tanh:
    def forward(self, x):
        return np.tanh(x)

    def backward(self, x, top_diff):
        output = self.forward(x)
        return (1.0 - np.square(output)) * top_diff
```
```python
class Softmax:
    def predict(self, x):
        exp_scores = np.exp(x)
        return exp_scores / np.sum(exp_scores)

    def loss(self, x, y):
        probs = self.predict(x)
        return -np.log(probs[y])

    def diff(self, x, y):
        probs = self.predict(x)
        probs[y] -= 1.0
        return probs
```
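A useful sanity check on the softmax unit (restated standalone here so it runs on its own) is that its analytic gradient, `probs` with 1 subtracted at the target index, matches a numerical finite-difference gradient of the loss:

```python
import numpy as np

# Minimal standalone restatement of the Softmax unit's predict/loss/diff.
def predict(x):
    e = np.exp(x)
    return e / np.sum(e)

def loss(x, y):
    return -np.log(predict(x)[y])

def diff(x, y):
    probs = predict(x)
    probs[y] -= 1.0  # analytic gradient of -log(probs[y]) w.r.t. the scores x
    return probs

x = np.array([0.3, -1.2, 0.8, 0.05])
y = 2
analytic = diff(x.copy(), y)

# Central finite differences, one coordinate at a time.
numeric = np.zeros_like(x)
eps = 1e-6
for i in range(len(x)):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    numeric[i] = (loss(xp, y) - loss(xm, y)) / (2 * eps)
```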
These implementations are much the same as in Implementing A Neural Network From Scratch, except that here the input x or s is a 1-D array, whereas in the previous post the input X was a batch of data represented as a matrix (each row an example).
Now that we are able to calculate the gradients for our parameters we can use SGD to train the model.
Initializing the parameters U, V and W is a bit tricky. We can’t just initialize them to 0’s because that would result in symmetric calculations in all our layers. We must initialize them randomly. Because proper initialization seems to have an impact on training results, there has been a lot of research in this area. It turns out that the best initialization depends on the activation function (tanh in our case), and one recommended approach is to initialize the weights randomly in the interval [-1/sqrt(n), 1/sqrt(n)], where n is the number of incoming connections from the previous layer.
```python
class Model:
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        self.U = np.random.uniform(-np.sqrt(1. / word_dim), np.sqrt(1. / word_dim), (hidden_dim, word_dim))
        self.W = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (hidden_dim, hidden_dim))
        self.V = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (word_dim, hidden_dim))
```
Above, word_dim is the size of our vocabulary, and hidden_dim is the size of our hidden layer (we can pick it). Don’t worry about the bptt_truncate parameter for now; we’ll explain what it is later.
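For concreteness, here is a sketch of how a sentence becomes a training pair (x, y); the tiny vocabulary and indices below are made up for illustration and are not the repo's actual preprocessing:

```python
# Hypothetical 6-word vocabulary; 0 = SENTENCE_START, 1 = SENTENCE_END.
index_to_word = ["SENTENCE_START", "SENTENCE_END", "the", "cat", "sat", "down"]
word_to_index = {w: i for i, w in enumerate(index_to_word)}

tokens = ["SENTENCE_START", "the", "cat", "sat", "down", "SENTENCE_END"]
indices = [word_to_index[w] for w in tokens]

# x is the sentence minus the last token; y is the same sentence shifted left by one,
# so at each time step the target is simply the next word.
x = indices[:-1]
y = indices[1:]
```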
Next, let’s implement the forward propagation (predicting word probabilities) defined by our equations above:
```python
def forward_propagation(self, x):
    '''
    Forward propagation (predicting word probabilities).
    x is one training example: a list of word indices,
    e.g. x = [0, 179, 341, 416], with y = [179, 341, 416, 1].
    '''
    T = len(x)  # the total number of time steps
    layers = []
    prev_s = np.zeros(self.hidden_dim)
    # For each time step...
    for t in range(T):
        layer = RNNLayer()
        input = np.zeros(self.word_dim)
        input[x[t]] = 1  # one-hot encoding of the t-th word
        layer.forward(input, prev_s, self.U, self.W, self.V)
        prev_s = layer.s
        layers.append(layer)
    return layers
```
We also implement a predict function to generate the results.
```python
def predict(self, x):
    output = Softmax()
    layers = self.forward_propagation(x)
    return [np.argmax(output.predict(layer.mulv)) for layer in layers]
```
To train our network we need a way to measure the errors it makes. We call this the loss function L, and our goal is to find the parameters U, V and W that minimize the loss function for our training data. A common choice for the loss function is the cross-entropy loss.
```python
def calculate_loss(self, x, y):
    assert len(x) == len(y)
    output = Softmax()
    layers = self.forward_propagation(x)
    loss = 0.0
    for i, layer in enumerate(layers):
        loss += output.loss(layer.mulv, y[i])
    return loss / float(len(y))

def calculate_total_loss(self, X, Y):
    loss = 0.0
    for i in range(len(Y)):
        loss += self.calculate_loss(X[i], Y[i])
    return loss / float(len(Y))
```
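A quick sanity check we can do on this loss (my own aside, not from the repo): with C vocabulary words and random initialization, each word is predicted with probability roughly 1/C, so the initial per-word loss should be close to log C:

```python
import numpy as np

word_dim = 8000  # the vocabulary size used later in this post
# With uniform random predictions, loss per word = -log(1/C) = log(C)
expected_random_loss = np.log(word_dim)
```

If the loss printed after initialization is far from this value, something is likely wrong with the implementation.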
As introduced above, we now implement the BPTT algorithm. It takes a training example (x, y) as input and returns the gradients dL/dW, dL/dU and dL/dV.
```python
def bptt(self, x, y):
    assert len(x) == len(y)
    output = Softmax()
    layers = self.forward_propagation(x)
    dU = np.zeros(self.U.shape)
    dV = np.zeros(self.V.shape)
    dW = np.zeros(self.W.shape)

    T = len(layers)
    prev_s_t = np.zeros(self.hidden_dim)
    diff_s = np.zeros(self.hidden_dim)
    for t in range(0, T):
        dmulv = output.diff(layers[t].mulv, y[t])
        input = np.zeros(self.word_dim)
        input[x[t]] = 1
        dprev_s, dU_t, dW_t, dV_t = layers[t].backward(input, prev_s_t, self.U, self.W, self.V, diff_s, dmulv)
        prev_s_t = layers[t].s
        dmulv = np.zeros(self.word_dim)
        # Propagate back through at most bptt_truncate earlier time steps
        for i in range(t - 1, max(-1, t - self.bptt_truncate - 1), -1):
            input = np.zeros(self.word_dim)
            input[x[i]] = 1
            prev_s_i = np.zeros(self.hidden_dim) if i == 0 else layers[i - 1].s
            dprev_s, dU_i, dW_i, dV_i = layers[i].backward(input, prev_s_i, self.U, self.W, self.V, dprev_s, dmulv)
            dU_t += dU_i
            dW_t += dW_i
        dV += dV_t
        dU += dU_t
        dW += dW_t
    return (dU, dW, dV)
```
Now that we are able to calculate the gradients for our parameters we can implement SGD. I like to do this in two steps:
- A function `sgd_step` that calculates the gradients and performs the updates for one training example.
- An outer loop that iterates through the training set and adjusts the learning rate.
```python
import sys
from datetime import datetime

def sgd_step(self, x, y, learning_rate):
    dU, dW, dV = self.bptt(x, y)
    self.U -= learning_rate * dU
    self.V -= learning_rate * dV
    self.W -= learning_rate * dW

def train(self, X, Y, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    num_examples_seen = 0
    losses = []
    for epoch in range(nepoch):
        if epoch % evaluate_loss_after == 0:
            loss = self.calculate_total_loss(X, Y)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            # Adjust the learning rate if the loss increases
            if len(losses) > 1 and losses[-1][1] > losses[-2][1]:
                learning_rate = learning_rate * 0.5
                print("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
        # For each training example...
        for i in range(len(Y)):
            self.sgd_step(X[i], Y[i], learning_rate)
            num_examples_seen += 1
    return losses
```
Here, we anneal the learning rate by a factor of 0.5 whenever the loss increases in an epoch. More about learning rate decay can be found here.
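The annealing schedule boils down to a few lines; in this sketch the loss values are made up purely to illustrate the halving rule used in `train` above:

```python
learning_rate = 0.005
losses = [10.0, 8.0, 8.5, 7.0, 7.2]  # hypothetical per-epoch losses

# Halve the rate whenever the loss goes up relative to the previous epoch.
for prev, cur in zip(losses, losses[1:]):
    if cur > prev:
        learning_rate *= 0.5
```

With two increases in this made-up history, the rate ends at 0.005 * 0.25 = 0.00125.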
Done! Let’s try to get a sense of how long it would take to train our network:
```python
word_dim = 8000
hidden_dim = 100
X_train, y_train = getSentenceData('data/reddit-comments-2015-08.csv', word_dim)

np.random.seed(10)
rnn = Model(word_dim, hidden_dim)
rnn.sgd_step(X_train[10], y_train[10], 0.005)
```
The bad news is that one step of SGD takes a few seconds on my laptop. We have about 80,000 examples in our training data, so one epoch (iteration over the whole data set) would take several hours. Multiple epochs would take days, or even weeks!
There are many ways to speed up our code. One is to run it on a GPU with a library like Theano. But in this tutorial, let’s just try to run SGD on a small dataset and check that the loss actually decreases:
```python
word_dim = 8000
hidden_dim = 100
X_train, y_train = getSentenceData('data/reddit-comments-2015-08.csv', word_dim)

np.random.seed(10)
rnn = Model(word_dim, hidden_dim)
losses = rnn.train(X_train[:100], y_train[:100], learning_rate=0.005, nepoch=10, evaluate_loss_after=1)
```
Good, it seems like our implementation is at least doing something useful and decreasing the loss, just like we wanted.
RNNs suffer from what is called the vanishing gradient problem. This is why a traditional RNN cannot capture long-term dependencies, and why we use the bptt_truncate parameter to constrain the length of the dependency. It also motivates the move to more sophisticated RNN models, such as LSTMs, which are the current state of the art for many tasks in NLP.
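A rough numerical illustration of the vanishing gradient (my own sketch; the dimensions, seed, and number of steps are chosen arbitrarily): backpropagating an error vector through many identical tanh steps with a small random W shrinks its norm dramatically.

```python
import numpy as np

np.random.seed(0)
hidden_dim = 100
# Same initialization scale as the Model class above
W = np.random.uniform(-np.sqrt(1.0 / hidden_dim), np.sqrt(1.0 / hidden_dim),
                      (hidden_dim, hidden_dim))
s = np.random.uniform(-1, 1, hidden_dim)

# Backpropagate a unit-norm error vector through 50 time steps.
grad = np.ones(hidden_dim) / np.sqrt(hidden_dim)
norms = []
for _ in range(50):
    s = np.tanh(np.dot(W, s))
    grad = np.dot(W.T, grad) * (1.0 - s ** 2)  # one step of the chain rule
    norms.append(np.linalg.norm(grad))
```

The gradient norm collapses toward zero, so the contribution of far-away time steps to the parameter updates is effectively lost, and truncating BPTT sacrifices little.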
More about the vanishing gradient problem and LSTM can be found here, here and here.
- https://github.com/pangolulu/neural-network-from-scratch
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
- http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
About
Implementing Recurrent Neural Network from Scratch