60

In MNIST LSTM examples, I don't understand what "hidden layer" means. Is it the imaginary layer formed when you represent an unrolled RNN over time?

Why is num_units = 128 in most cases?

asked Jun 18, 2016 at 19:51 by Subrat
  • I'd like to note that the authors of that tutorial (that is, the one the OP is linking to) have changed the name of the variables, including num_units to num_hidden. There's now a comment in front of that variable saying hidden layer num of features. — Commented Jan 1, 2018 at 19:01
  • Sure, I've modified it accordingly. — Commented Jan 1, 2018 at 22:47

11 Answers

49

From this brilliant article:

num_units can be interpreted as an analogy to the hidden layer of a feed-forward neural network. The number of nodes in the hidden layer of a feed-forward network is equivalent to the num_units number of LSTM units in an LSTM cell at every time step of the network.

See the image there too!

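To make the analogy concrete, here is a minimal NumPy sketch (not the tutorial's code; the weights are random and illustrative): a feed-forward hidden layer of width 128 and a single LSTM step with num_units = 128 both produce a 128-dimensional activation.

```python
import numpy as np

num_units = 128   # width of the "hidden layer" in both cases
n_input = 28      # e.g. one row of a 28x28 MNIST image

rng = np.random.default_rng(0)
x = rng.standard_normal(n_input)

# Feed-forward hidden layer: one weight matrix, one activation.
W_ff = rng.standard_normal((n_input, num_units))
h_ff = np.tanh(x @ W_ff)

# One LSTM time step: four gate matrices, but the hidden state h
# (and cell state c) still have exactly num_units entries.
h_prev = np.zeros(num_units)
c_prev = np.zeros(num_units)
concat = np.concatenate([x, h_prev])
W = rng.standard_normal((4, n_input + num_units, num_units))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

i = sigmoid(concat @ W[0])   # input gate
f = sigmoid(concat @ W[1])   # forget gate
o = sigmoid(concat @ W[2])   # output gate
g = np.tanh(concat @ W[3])   # candidate cell state
c = f * c_prev + i * g
h = o * np.tanh(c)

print(h_ff.shape, h.shape)   # both (128,)
```

The LSTM needs four times the weights of the plain hidden layer, but the width of its state is the same num_units.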

answered Nov 22, 2017 at 15:41 by Arraval

3 Comments

Excellent block diagram for LSTM. Can you explain with a diagram what exactly is inside the units in the num_units of each LSTM cell, since each LSTM cell contains an input gate, an output gate, and a forget gate?
@Biranchi, inside the LSTM cell are LSTM units. In the article cited, each of the num_units in each LSTM cell receives one pixel of a certain row of an image. The size of the image is 28x28 pixels. In the example, they used 28 num_units and 28 LSTM cells. Basically each cell works on a given row of the image.
This figure perfectly summarizes everything
46

The number of hidden units is a direct representation of the learning capacity of a neural network -- it reflects the number of learned parameters. The value 128 was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).

The key thing to understand, which is somewhat subtle in the famous Colah's blog post (find "each line carries an entire vector"), is that X is an array of data (nowadays often called a tensor) -- it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop) -- and not simply performed once per time-step.

As such, the hidden units represent tangible storage within the network, which is manifest primarily in the size of the weights array. And because an LSTM actually does have a bit of its own internal storage separate from the learned model parameters, it has to know how many units there are -- which ultimately needs to agree with the size of the weights. In the simplest case, an RNN has no internal storage -- so it doesn't even need to know in advance how many "hidden units" it is being applied to.
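As a rough sketch of how the number of hidden units drives the size of the weights array: a standard LSTM has four gate matrices, each mapping the concatenated [input; hidden] vector to n_hidden outputs, plus four bias vectors (assuming the common textbook formulation; TensorFlow's internal packing may differ).

```python
def lstm_param_count(n_input, n_hidden):
    """Learned parameters in one standard LSTM cell:
    4 gates, each with a weight matrix over [input; hidden]
    plus a bias vector of size n_hidden."""
    per_gate = (n_input + n_hidden) * n_hidden + n_hidden
    return 4 * per_gate

# 28 inputs per step (one MNIST row), 128 hidden units:
print(lstm_param_count(28, 128))   # 80384 -- grows roughly with n_hidden**2
```

The quadratic growth in n_hidden is why doubling the units more than doubles training time and memorization capacity.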


  • A good answer to a similar question here.
  • You can look at the source for BasicLSTMCell in TensorFlow to see exactly how this is used.

Side note: This notation is very common in statistics and machine-learning, and other fields that process large batches of data with a common formula (3D graphics is another example). It takes a bit of getting used to for people who expect to see their for loops written out explicitly.

answered Sep 11, 2016 at 20:13 by Brent Bradburn

9 Comments

Further questions: How much total memory is involved? How are the weights connected to the LSTM units? Note: See TensorBoard graph visualizations.
I recommend LSTM: A Search Space Odyssey, sections 1-3.
Looks like there was a followup in the comments here: RNNS IN TENSORFLOW, A PRACTICAL GUIDE AND UNDOCUMENTED FEATURES
Did I get it right: "a simple RNN doesn't need to know in advance how many hidden units"? Doesn't it need to know that to construct the weights that map between the units -- which grow in count exponentially based on the number of units (even in the simplest RNN). I think that I didn't understand that aspect of the architecture when I wrote this answer (see my first comment). But note that graph visualizations don't tend to help due to the array-based notation.
...Kind of funny that, using an array-based notation, a data path with an exponential signal count can be represented by a single dark line.
32

The argument n_hidden of BasicLSTMCell is the number of hidden units of the LSTM.

As you said, you should really read Colah's blog post to understand LSTM, but here is a little heads up.


If you have an input x of shape [T, 10], you will feed the LSTM with the sequence of values from t=0 to t=T-1, each of size 10.

At each timestep, you multiply the input with a matrix of shape [10, n_hidden], and get an n_hidden vector.

Your LSTM gets at each timestep t:

  • the previous hidden state h_{t-1}, of size n_hidden (at t=0, the previous state is [0., 0., ...])
  • the input, transformed to size n_hidden
  • it will sum these inputs and produce the next hidden state h_t of size n_hidden

From Colah's blog post: LSTM


If you just want to have code working, just keep n_hidden = 128 and you will be fine.
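The shape bookkeeping above can be sketched in NumPy (a simplified recurrence that skips the gates, just to show the sizes; T and the random matrices are illustrative):

```python
import numpy as np

T, n_input, n_hidden = 5, 10, 128
rng = np.random.default_rng(0)

x = rng.standard_normal((T, n_input))           # sequence of T vectors, each of size 10
W_x = rng.standard_normal((n_input, n_hidden))  # transforms the input to size n_hidden
W_h = rng.standard_normal((n_hidden, n_hidden)) # mixes in the previous hidden state

h = np.zeros(n_hidden)                          # at t=0 the previous state is all zeros
for t in range(T):
    # combine the transformed input with h_{t-1} to get h_t, of size n_hidden
    h = np.tanh(x[t] @ W_x + h @ W_h)

print(h.shape)   # (128,)
```

A real LSTM replaces the single tanh update with the gated update, but every intermediate vector still has size n_hidden.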

answered Jun 18, 2016 at 21:25 by Olivier Moindrot

5 Comments

"the input, transformed to size n_hidden" is totally cool when done like you say, with matrix multiplication. But in the mnist code example I mentioned, he seems to be juggling all the vector values in the batch at: x = tf.transpose(x, [1, 0, 2]) ..., to get a 28 x 128 x 28 shape. I don't get that.
The RNN iterates over each row of the image. In the code of the RNN function, they want to get a list of length 128 (the number of steps, or number of rows of the image), with each element of shape [batch_size, row_size] where row_size=28 (size of a row of the image).
Is there an upper limit to the input layer size in tf? I get a segfault when increasing the dimension to a thousand plus, and it's fine with less. Also, shouldn't it be "...they want to get a list of length 28..." there ^
Yes, you are right, it should be 28. The only limit to the size of the input is the memory of your GPU. If you want to use a higher input dimension, you should adapt your batch size so that it fits into your memory.
and tf.nn.dynamic_rnn will feed the rnn with data for each time step.
13

Since I had some problems combining the information from the different sources, I created the graphic below, which shows a combination of the blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and (https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/). I think the graphics there are very helpful, but an error in explaining number_units is present.

Several LSTM cells form one LSTM layer. This is shown in the figure below. Since you are mostly dealing with data that is very extensive, it is not possible to incorporate everything in one piece into the model. Therefore, data is divided into small pieces as batches, which are processed one after the other until the batch containing the last part is read in. In the lower part of the figure you can see the input (dark grey) where the batches are read in one after the other, from batch 1 to batch batch_size. The cells LSTM cell 1 to LSTM cell time_step above represent the described cells of the LSTM model (http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The number of cells is equal to the number of fixed time steps.

For example, if you take a text sequence with a total of 150 characters, you could divide it into 3 (batch_size) and have a sequence of length 50 per batch (the number of time_steps and thus of LSTM cells). If you then encoded each character one-hot, each element (dark grey boxes of the input) would represent a vector that would have the length of the vocabulary (number of features). These vectors would flow into the neural networks (green elements in the cells) in the respective cells and would change their dimension to the length of the number of hidden units (number_units). So the input has the dimension (batch_size x time_step x features).

The long-term memory (cell state) and short-term memory (hidden state) have the same dimensions (batch_size x number_units). The light grey blocks that arise from the cells have a different dimension because the transformations in the neural networks (green elements) took place with the help of the hidden units (batch_size x time_step x number_units). The output can be returned from any cell, but mostly only the information from the last block (black border) is relevant (not in all problems) because it contains all information from the previous time steps.
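The dimensions in the text example can be checked with a quick NumPy sketch (shapes only; the vocabulary size of 64 is an arbitrary stand-in for "number of features"):

```python
import numpy as np

batch_size, time_steps, n_features, number_units = 3, 50, 64, 128

# Input: one sequence per batch element, one one-hot vector per character.
x = np.zeros((batch_size, time_steps, n_features))

# Cell state (long-term memory) and hidden state (short-term memory)
# share the same shape: (batch_size, number_units).
c = np.zeros((batch_size, number_units))
h = np.zeros((batch_size, number_units))

# The per-timestep outputs (the light grey blocks in the figure).
outputs = np.zeros((batch_size, time_steps, number_units))

print(x.shape, h.shape, outputs.shape)
```

Note how the feature dimension (64) appears only in the input; after the cell's transformation, everything is of width number_units.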

LSTM architecture_new

answered Dec 13, 2018 at 11:21 by Henryk Borzymowski

1 Comment

Good answer. You usually have embeddings for your input data, so assume one for every word for simplicity. Let's say each word has a distributed representation of 150 dimensions, which are the features in the above diagram. Then num_units will act as the dimensionality of the RNN/LSTM cell (say 128). So 150 -> 128, and hence the output dimensions will be 128. Batch size and time_steps remain as they are.
9

An LSTM keeps two pieces of information as it propagates through time:

  • A hidden state, which is the memory the LSTM accumulates using its (forget, input, and output) gates through time, and
  • the previous time-step output.

Tensorflow's num_units is the size of the LSTM's hidden state (which is also the size of the output if no projection is used).

To make the name num_units more intuitive, you can think of it as the number of hidden units in the LSTM cell, or the number of memory units in the cell.

Look at this awesome post for more clarity.

answered Dec 25, 2017 at 16:22 by 4rshdeep

Comments

7

I think the term "num_hidden" is confusing for TF users. Actually it has nothing to do with the unrolled LSTM cells; it is just the dimension of the tensor which is transformed from the time-step input tensor and fed into the LSTM cell.

answered Nov 1, 2017 at 10:34 by P. Li

Comments

7

This term num_units or num_hidden_units, sometimes noted using the variable name nhid in implementations, means that the input to the LSTM cell is a vector of dimension nhid (or, for a batched implementation, a matrix of shape batch_size x nhid). As a result, the output (from the LSTM cell) would also be of the same dimensionality, since the RNN/LSTM/GRU cell doesn't alter the dimensionality of the input vector or matrix.

As pointed out earlier, this term was borrowed from Feed-Forward Neural Networks (FFNs) literature and has caused confusion when used in the context of RNNs. But the idea is that even RNNs can be viewed as FFNs at each time step. In this view, the hidden layer would indeed contain num_hidden units, as depicted in this figure:

rnn-hidden-units

Source: Understanding LSTM


More concretely, in the below example the num_hidden_units or nhid would be 3, since the size of the hidden state (middle layer) is a 3D vector.

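A minimal sketch of the nhid = 3 case (random weights, shapes are what matter; note, as discussed in the comment below, that the input itself can have any dimensionality -- it is the hidden state that has size nhid):

```python
import numpy as np

nhid, n_input = 3, 7    # n_input is arbitrary here
rng = np.random.default_rng(0)

x = rng.standard_normal(n_input)
h_prev = np.zeros(nhid)
W = rng.standard_normal((n_input + nhid, nhid))

# Simplified RNN-style update: whatever size x has,
# the hidden state that comes out has size nhid.
h = np.tanh(np.concatenate([x, h_prev]) @ W)
print(h.shape)   # (3,) -- the 3D hidden state
```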

answered Sep 28, 2018 at 20:01 by kmario23

1 Comment

You say "the input to the LSTM cell is a vector of dimension nhid". But the input is generally of shape [batch, T, input] where input can be of any size. So, when the input is dynamically unrolled we would have an input of [b, t, input]. The RNN would transform it to [b, t, nhid]. So, it is the output that has shape nhid, not the input.
3

I think this correctly answers your question. LSTMs always cause confusion.

You can refer to this blog for more detail: Animated RNN, LSTM and GRU.

answered Apr 28, 2020 at 11:09 by SangLe

2 Comments

Amazing illustrations, thanks for sharing. It finally explains what these units that confuse everybody are. I never understood why RNNs are not explained like this.
This answer contradicts the other answers in this post.
1

Most LSTM/RNN diagrams just show the hidden cells but never the units of those cells, hence the confusion. Each hidden layer has hidden cells, as many as the number of time steps. Further, each hidden cell is made up of multiple hidden units, like in the diagram below. Therefore, the dimensionality of a hidden layer matrix in an RNN is (number of time steps, number of hidden units).


answered Jan 30, 2019 at 10:05 by Garima Jain

1 Comment

If you had the sentence "the dog ate the food" and each word corresponds to a single input, is the full sentence being input at an individual timestep (t = 0, for example), as opposed to each word being input into a unit at the next timestep, i.e. "the" (t = 0), "dog" (t = 1), etc.? I'm really confused, to be honest.
0

The concept of a hidden unit is illustrated in this image: https://i.sstatic.net/QBh06.jpg

answered Sep 19, 2019 at 5:35 by Ebrahim Nasr Esfahani

Comments

0

Following @SangLe's answer, I made a picture (see sources for the original pictures) showing cells as classically represented in tutorials (Source 1: Colah's blog) and an equivalent cell with 2 units (Source 2: Raimi Karim's post). Hope it will clarify the confusion between cells/units and what the network architecture really is.


answered Oct 22, 2020 at 2:08 by Alexis

1 Comment

This answer contradicts the other answers in this post.
