
In this blog post we’ll use a recurrent neural network (RNN) to teach the iPhone to play the drums. It will sound something like this:
The timing still needs a little work but it definitely sounds like someone playing the drums!
We’ll teach the computer to play drums without explaining what makes a good rhythm, or what even a kick drum or a hi-hat is. The RNN will learn how to drum purely from examples of existing drum patterns.
The reason we’re using a recurrent network for this task is that this type of neural network is very good at understanding sequences of things, in this case sequences of MIDI notes.
Apple’s BNNS and Metal CNN libraries don’t support recurrent neural networks at the moment, but no worries: we can get pretty far already with just a few matrix multiplications.
As usual we train the neural network on the Mac (using TensorFlow and Python), and then copy what it has learned into the iOS app. In the iOS app we’ll use the Accelerate framework to handle the math.
In this post I’m only going to show the relevant bits of the code. The full source is on GitHub, so look there to follow along.
A regular neural network, also known as a feed-forward network, is a simple pipeline: the input data goes into one end and comes out the other end as a prediction of some kind, often in the form of a probability distribution.
The interesting thing about a recurrent neural network is that it has an additional input and output, and these two are connected. The new input gets its data from the RNN’s output, so the network feeds back into itself, which is where the name “recurrent” comes from.

I said that RNNs are good at understanding sequences. For this purpose, the RNN keeps track of some internal state. This state is what the RNN has remembered of the sequence it has seen so far. The extra input/output is for sending this internal state from the previous timestep into the next timestep.
To make the iPhone play drums, we train the RNN on a sequence of MIDI notes that represent different drum patterns. We look at just one element from this sequence at a time; this is called a timestep. At each timestep, we teach the RNN to predict the next note from the sequence.
Essentially, we’re training the RNN to remember all the drum patterns that are in the sequence. It remembers this data in its internal state, but also in the weights that connect the input x and the predicted output y to this state.

Of course, we don’t want the RNN to just remember existing drum patterns; we want it to come up with new drums on its own.
To do that, we will mess a little with the RNN’s memory: we reset the internal state by filling it up with random numbers — but we don’t change the weights. From then on the model will no longer correctly predict the next note in the sequence because we “erased” its memory of where it was.
Now when we ask the RNN to “predict” the next notes in the sequence, it will come up with new, original drum patterns. These are still based on its knowledge of what “good drums” are (because we did not erase the learned weights), but they are no longer verbatim replications of the training patterns.
I mentioned we’re training on drum patterns. The dataset I used consists of a large number of MIDI files. When you open such a MIDI file in GarageBand or Logic Pro it looks like this:

The green bars represent the notes that are being played. The note C1 is a kick drum, D1 is a snare drum, G#1 is a hi-hat, and so on. The drum patterns in the dataset are all 1 measure (or 4 beats) long.
In a MIDI file the notes are stored as a series of events:

    NOTE ON   time: 0    channel: 0  note: 36  velocity: 80
    NOTE ON   time: 0    channel: 0  note: 46  velocity: 80
    NOTE OFF  time: 120  channel: 0  note: 36  velocity: 64
    NOTE OFF  time: 0    channel: 0  note: 46  velocity: 64
    NOTE ON   time: 120  channel: 0  note: 44  velocity: 80
    NOTE OFF  time: 120  channel: 0  note: 44  velocity: 64
    NOTE ON   time: 120  channel: 0  note: 38  velocity: 80
    NOTE OFF  time: 120  channel: 0  note: 38  velocity: 64
    NOTE ON   time: 0    channel: 0  note: 44  velocity: 80
    . . . and so on . . .

To begin playing a note there is a NOTE ON event, to stop playing there is a NOTE OFF event. The duration of the note is determined by the amount of time between NOTE ON and NOTE OFF. For us, the duration of the notes isn’t really important because drum sounds are short; they aren’t sustained like a flute or violin. All we care about is the NOTE ON events, which tell us when new drum sounds begin.
Each NOTE ON event includes a few different bits of data, but for our purposes we only need to know the timestamp and the note number.
The note number is an integer that represents the drum sound. For example, 36 is the number for note C in octave 1, which is the kick drum. (The General MIDI standard defines which note number is mapped to which percussion instrument.)
The timestamp for an event is a “delta” time, which means it is the number of ticks we should wait before processing this event. For the MIDI files in our dataset, there are 480 ticks per beat. So if we play the drums at 120 beats-per-minute, then one second has 960 ticks in it. This is not really important to remember; just know that for each note in the drum pattern there’s also a delay measured in ticks.
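If you want to double-check the tick arithmetic, here is a small Python sketch. The function names are made up for this example and do not appear in the project; only the 480 ticks per beat comes from the dataset.

```python
TICKS_PER_BEAT = 480  # resolution of the MIDI files in the dataset


def ticks_per_second(bpm, ticks_per_beat=TICKS_PER_BEAT):
    """Number of MIDI ticks that elapse in one second at the given tempo."""
    return bpm / 60 * ticks_per_beat


def ticks_to_seconds(ticks, bpm, ticks_per_beat=TICKS_PER_BEAT):
    """Convert a delta time in ticks to seconds at the given tempo."""
    return ticks / ticks_per_second(bpm, ticks_per_beat)


print(ticks_per_second(120))       # 960.0
print(ticks_to_seconds(240, 120))  # 0.25
```

So a delay of 240 ticks at 120 beats-per-minute is a quarter of a second, or half a beat.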
Our input sequence to the RNN then has the following form:

    (note, ticks) (note, ticks) (note, ticks) . . .

At every timestep we insert a (note, ticks) pair into the RNN and it will try to predict the next (note, ticks) pair from the same sequence. For the example above, the sequence is:

    (36, 0) (46, 0) (44, 240) (38, 240) (44, 0) . . .

That’s a kick drum (36) and an open hi-hat (46) on the first beat, followed by a pedal hi-hat (44) after 240 ticks, followed by a snare drum (38) and a pedal hi-hat (44) after another 240 ticks, and so on.
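Here is a rough Python sketch of how a list of events could be reduced to (note, ticks) pairs. The tuple format and the function name are assumptions for this example; the actual conversion script may go about it differently. The idea is that the delta times of the NOTE OFF events we skip still count toward the delay of the next NOTE ON.

```python
def note_on_pairs(events):
    """Turn (kind, delta_ticks, note) events into (note, ticks) pairs.

    Only NOTE ON events produce a pair. The delta times of the NOTE OFF
    events in between are added to the delay of the next NOTE ON.
    """
    pairs = []
    waited = 0
    for kind, delta, note in events:
        waited += delta
        if kind == "on":
            pairs.append((note, waited))
            waited = 0
    return pairs


# The first few events from the listing above.
events = [
    ("on", 0, 36), ("on", 0, 46),
    ("off", 120, 36), ("off", 0, 46),
    ("on", 120, 44), ("off", 120, 44),
    ("on", 120, 38),
]
print(note_on_pairs(events))  # [(36, 0), (46, 0), (44, 240), (38, 240)]
```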
The dataset I used for training has 2700 of these MIDI files. I glued them together into one big sequence of 52260 (note, ticks) pairs. Just think of this sequence as a ginormous drum solo. This is the sequence we’ll try to make the RNN remember.
Note: This dataset of drum patterns comes from a commercial drum kit plug-in for use in audio production tools such as Logic Pro. I was looking for a fun dataset for training an RNN when I realized I had a large library of drum patterns in MIDI format sitting in a folder on my computer… and so the RNN drummer was born. Unfortunately, it also means this dataset is copyrighted and I can’t distribute it with the GitHub project. If you want to train the RNN yourself, you’ll need to find your own collection of drum patterns in MIDI format — I can’t give you mine.
You’ve seen that the MIDI note numbers are regular integers. We’ll be using the note numbers between 35 and 60, which is the range reserved in the General MIDI standard for percussion instruments.
The ticks are also integers, between 0 and 1920. (That’s how many ticks go into one measure and each MIDI file in the dataset is only one measure long.)
However, we can’t just feed integers into our neural network. In machine learning when you encode something using an integer (or a floating-point value), you imply there is an order to it: the number 55 is bigger than the number 36.
But this is not true for our MIDI notes: the drum sound represented by MIDI note number 55 is not “bigger” than the drum sound with number 36. These numbers represent completely different things — one is a kick drum, the other a cymbal.
Instead of truly being numbers on some continuous scale, our MIDI notes are examples of what’s called categorical variables. It’s better to encode that kind of data using one-hot encoding rather than integers (or floats).
For the sake of giving an example, let’s say that our entire dataset only uses five unique note numbers:

    36  kick drum
    38  snare drum
    42  closed hi-hat
    48  tom
    55  cymbal

We can then encode any given note number using a 5-element vector. Each index in this vector corresponds to one of those five drum sounds. A kick drum (note 36) would be encoded as:

    [ 1, 0, 0, 0, 0 ]

while a snare drum would be encoded as:

    [ 0, 1, 0, 0, 0 ]

and so on… It’s called “one-hot” because the vector is all zeros except for a one at the index that represents the thing you’re encoding. Now all these vectors have the same “length” and there is no longer an ordering relationship between them.
We do the same thing for the ticks, and then combine these two one-hot encoded vectors into one big vector called x:

In the full dataset there are 17 unique note numbers and 209 unique tick values, so this vector consists of 226 elements. (Of those elements, 224 are 0 and two are 1.)
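To make the encoding concrete, here is a minimal Python sketch. It uses toy lookup tables (five notes and three tick values that I picked for illustration) instead of the real 17 notes and 209 tick values, and the helper names are hypothetical.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode(note, ticks, note2index, tick2index):
    """Concatenate the one-hot note vector and the one-hot ticks vector."""
    return (one_hot(note2index[note], len(note2index))
            + one_hot(tick2index[ticks], len(tick2index)))


# Toy lookup tables, not the real ones from the dataset.
unique_notes = [36, 38, 42, 48, 55]
unique_ticks = [0, 120, 240]
note2index = {n: i for i, n in enumerate(unique_notes)}
tick2index = {t: i for i, t in enumerate(unique_ticks)}

x = encode(38, 240, note2index, tick2index)
print(x)  # [0, 1, 0, 0, 0, 0, 0, 1]
```

With the real tables the same function would produce a 226-element vector with exactly two ones in it.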
The sequence that we present to the RNN does not really consist of (note, ticks) pairs but is a list of these one-hot encoded vectors:

    [ 0, 0, 1, 0, 0, 0, ..., 0 ]
    [ 1, 0, 0, 0, 0, 0, ..., 0 ]
    [ 0, 0, 0, 1, 0, 0, ..., 1 ]
    . . . and so on . . .

Because there are 52260 notes in the dataset, the entire training sequence is made up of 52260 of those 226-element vectors.
The script convert_midi.py reads the MIDI files from the dataset and outputs a new file X.npy that contains this 52260×226 matrix with the full training sequence. (The script also saves two lookup tables that tell us which note numbers and tick values correspond to the positions in the one-hot vectors.)
Note: You may be wondering why we’re one-hot encoding the ticks too as these are numerical variables and not categorical. A timespan of 200 ticks definitely means that it’s twice as long as 100 ticks. Fair question. I figured I would keep things simple and encode the note numbers and ticks in the same way. This is not necessarily the most efficient way to encode the durations of the notes but it’s good enough for this blog post.
The kind of recurrent neural network we’re using is something called an LSTM or Long Short-Term Memory. It looks like this on the inside:

The vector x is a single input that we feed into the network. It’s one of those 226-element vectors from the training sequence that combines the note number and the delay in ticks for a single drum sound.
The output y is the prediction that is computed by the LSTM. This is also a 226-element vector but this time it contains a probability distribution over the possible note numbers and tick values. The goal of training the LSTM is to get an output y that is (mostly) equal to the next element from the training sequence.
Recall that a recurrent network has “internal state” that acts as its memory. The internal state of the LSTM is given by two vectors: c and h. The c vector helps the LSTM to remember the sequence of MIDI notes it has seen so far, and h is used to predict the next notes in the sequence.
At every timestep we compute new values for c and h, and then feed these back into the network so they are used as inputs for the next timestep.
The most interesting feature of the LSTM is that it has gates whose values range from 0 (closed) to 1 (open). The gates determine how data flows through the LSTM layer.
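The gates perform different jobs:
The forget gate f determines how much of the previous memory c is kept.
The input gate i determines how much new information is written into c.
The output gate o determines how much of the memory c is passed on to h.
The gate g computes the candidate values that the input gate can add to c.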
The inputs x and h are connected to these gates using weights: Wxf, Whf, etc. When we train the LSTM, what it learns are the values of those weights. (It does not learn the values of h or c.)
Thanks to this mechanism with the gates, the LSTM can remember things over the long term, and it can even choose to forget things it no longer considers important.
Confused how this works? It doesn’t matter. Exactly how or why these gates work the way they do isn’t very important for this blog post. (If you really want to know, read the paper.) Just know this particular scheme has proven to work very well for remembering long sequences.
Our job is to make the network learn the optimal values for the weights between x and h and these gates, and for the weights between h and y.
To implement an LSTM any sane person would use a tool such as Keras which lets you simply write layer = LSTM(). However, we are going to do it the hard way, using primitive TensorFlow operations.
The reason for doing it the hard way is that we’re going to have to implement this math ourselves in the iOS app, so it’s useful to understand the formulas that are being used.
The formulas needed to implement the inner logic of the LSTM layer look like this:

    f = tf.sigmoid(tf.matmul(x[t], Wxf) + tf.matmul(h[t - 1], Whf) + bf)
    i = tf.sigmoid(tf.matmul(x[t], Wxi) + tf.matmul(h[t - 1], Whi) + bi)
    o = tf.sigmoid(tf.matmul(x[t], Wxo) + tf.matmul(h[t - 1], Who) + bo)
    g = tf.tanh(tf.matmul(x[t], Wxg) + tf.matmul(h[t - 1], Whg) + bg)

What goes on here is less intimidating than it first appears. Let’s look at the line for the f gate in detail:

    f = tf.sigmoid(
            tf.matmul(x[t], Wxf)         # 1
            + tf.matmul(h[t - 1], Whf)   # 2
            + bf                         # 3
        )

This computes whether the f gate is open (1) or closed (0). Step-by-step this is what it does:
1. First multiply the input x for the current timestep with the matrix Wxf. This matrix contains the weights of the connections between x and f.
2. Also multiply the input h with the weights matrix Whf. In these formulas, t is the index of the timestep. Because h feeds back into the network we use the value of h from the previous timestep, given by h[t - 1].
3. Add a bias value bf.
4. Finally, take the logistic sigmoid of the whole thing. The sigmoid function returns a value between 0 and 1.
The same thing happens for the other gates, except that for g we use a hyperbolic tangent function to get a number between -1 and +1 (instead of 0 and 1). Each gate has its own set of weight matrices and bias values.
Once we know which gates are open and which are closed, we can compute the new values of the internal state c and h:

    c[t] = f * c[t - 1] + i * g
    h[t] = o * tf.tanh(c[t])

We put the new values of c and h into c[t] and h[t], so that these will be used as the inputs for the next timestep.
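To see these formulas at work without TensorFlow, here is a sketch of a single LSTM timestep in plain Python. It is not the code from the training script: the helper names and the dictionary of parameters are my own, and I use tiny all-zero weights so the result is easy to check by hand.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def gate(x, h_prev, Wx, Wh, b, activation):
    """activation(x * Wx + h_prev * Wh + b), computed element by element."""
    a, r = matvec(x, Wx), matvec(h_prev, Wh)
    return [activation(a[j] + r[j] + b[j]) for j in range(len(b))]


def lstm_step(x, h_prev, c_prev, p):
    """One timestep; p is a dict holding the weights and biases."""
    f = gate(x, h_prev, p["Wxf"], p["Whf"], p["bf"], sigmoid)
    i = gate(x, h_prev, p["Wxi"], p["Whi"], p["bi"], sigmoid)
    o = gate(x, h_prev, p["Wxo"], p["Who"], p["bo"], sigmoid)
    g = gate(x, h_prev, p["Wxg"], p["Whg"], p["bg"], math.tanh)
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(len(c_prev))]
    h = [o[j] * math.tanh(c[j]) for j in range(len(c))]
    return h, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights every sigmoid gate is half open (0.5) and g is 0.
n_in, n_hidden = 3, 2
p = {}
for name in "fiog":
    p["Wx" + name] = zeros(n_in, n_hidden)
    p["Wh" + name] = zeros(n_hidden, n_hidden)
    p["b" + name] = [0.0] * n_hidden

h, c = lstm_step([1, 0, 0], [0.0, 0.0], [2.0, -4.0], p)
print(c)  # [1.0, -2.0]
```

Because the forget gate is half open here, the new memory is simply half of the old memory, and nothing new gets written since g is zero.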
Now that we know the new value for the state vector h, we can use this to predict the output y for this timestep:

    y = tf.matmul(h[t], Why) + by

This prediction performs yet another matrix multiplication, this time using the weights Why between h and y. (This is a simple affine function like the one that happens in a fully-connected layer.)
Recall that our input x is a vector with 226 elements that contains two separate data items: the MIDI note number and the delay in ticks. This means we also need to predict the note and tick values separately, and so we use two softmax functions, each on a separate portion of the y vector:

    y_note[t] = tf.nn.softmax(y[:num_midi_notes])
    y_tick[t] = tf.nn.softmax(y[num_midi_notes:])

And that’s in a nutshell how the math in the LSTM layer works. To read more about these formulas, see the Wikipedia page.
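The double softmax is easy to try out in plain Python. This is just a sketch with a made-up 5-element y vector (three note slots and two tick slots); subtracting the maximum before exponentiating is a standard trick for numerical stability and not something the post prescribes.

```python
import math


def softmax(values):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(values)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_softmax(y, num_midi_notes):
    """Apply softmax separately to the note part and the tick part of y."""
    return softmax(y[:num_midi_notes]), softmax(y[num_midi_notes:])


y = [2.0, 0.0, 0.0, 1.0, 1.0]
y_note, y_tick = split_softmax(y, 3)
print(y_tick)  # [0.5, 0.5]
```

Each half is a probability distribution in its own right, which is what lets us pick a note and a tick value independently later on.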
Note: Even though the above LSTM formulas are taken from the Python training script and use TensorFlow to do the computations, we need to implement exactly the same formulas in the iOS app. But instead of TensorFlow, we’ll use the Accelerate framework for that.
As you know, when a neural network is trained it will learn values for the weights and biases. The same is true here: the LSTM will learn the values of Wxf, Whf, bf, Why, by, and so on. Notice that this is 9 different matrices and 5 different bias values.
We can be clever and actually combine these matrices into one big matrix:

We first put the value of x for this timestep and the value of h of the previous timestep into a new vector (plus the constant 1, which gets multiplied with the bias). Likewise, we put all the weights and biases into one matrix. And then we multiply these two together.
This does the exact same thing as the eight matrix multiplies from before. The big advantage is that we now have to manage only a single weight matrix for x and h (and no bias value, since that is part of this big matrix too).
We can simplify the computation for the gates to just this:

    combined = tf.concat([x[t], h[t - 1], tf.ones(1)], axis=0)
    gates = tf.matmul(combined, Wx)

And then compute the new values of c and h as follows:

    c[t] = tf.sigmoid(gates[0])*c[t - 1] + tf.sigmoid(gates[1])*tf.tanh(gates[3])
    h[t] = tf.sigmoid(gates[2])*tf.tanh(c[t])

These two formulas for c and h didn’t really change; I just moved the sigmoid and tanh functions here.
Now when we train the LSTM we only have to deal with two weight matrices: Wx, which is the big matrix I showed you here, and Wy, the matrix for the weights between h and y. Those two matrices are the learned parameters that get loaded into the iOS app.
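If you want to convince yourself that the combined matrix gives the same answer, here is a small Python check for a single gate. The numbers are arbitrary and the helper is my own; in the real model the columns for all four gates would sit side by side in the same big matrix.

```python
def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


# Toy sizes: 2 inputs, 2 hidden units, and only the f gate.
Wxf = [[0.1, 0.2], [0.3, 0.4]]
Whf = [[0.5, 0.6], [0.7, 0.8]]
bf = [0.01, 0.02]
x = [1.0, 0.0]
h_prev = [0.5, -0.5]

# The original way: two matrix multiplies plus a bias.
separate = [a + b + c
            for a, b, c in zip(matvec(x, Wxf), matvec(h_prev, Whf), bf)]

# The combined way: stack the rows (x weights, h weights, bias row)
# and multiply once with the vector [x, h, 1].
W_combined = Wxf + Whf + [bf]
combined = matvec(x + h_prev + [1.0], W_combined)

print(all(abs(a - b) < 1e-12 for a, b in zip(separate, combined)))  # True
```

The constant 1 at the end of the input vector is what picks up the bias row, so the bias no longer needs to be stored separately.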
OK, let’s recap where we are now:
We’ve got a dataset of 52260 one-hot encoded vectors that describe MIDI notes and their timing. Together, these 52260 vectors make up a very long sequence of drum patterns.
We want to train the LSTM to memorize this sequence. In other words, for every note of the sequence the LSTM should be able to correctly predict the note that follows.
We have the formulas for computing what happens in an LSTM layer. It takes an input x, which is one of these vectors describing a single drum sound, and two state vectors h and c. The LSTM then computes new values for h and c, as well as a prediction y for what the next note in the sequence will be.
Now we need to put this all together to train the recurrent network. This will give us two matrices Wx and Wy that describe the weights of the connections between the different parts of the LSTM.
And then we can use those weights in the iOS app to play new drum patterns.
Note: The GitHub repo only contains a few drum patterns since I am not allowed to distribute the full dataset. So unless you have your own library of drum patterns, there isn’t much use in doing the training yourself. However, you can still run the iOS app, as the trained weights are included in the Xcode project.
That said, if you really want to, you can run the lstm.py script to train the neural network on the included drum patterns (see the README file for instructions). Don’t get your hopes up though: because there isn’t nearly enough data to train on, the model won’t be very good.
Training an LSTM isn’t very different from training any other neural network. We use backpropagation with an SGD (Stochastic Gradient Descent) optimizer and we train until the loss is low enough.
However, the nature of this network being recurrent, where the outputs h and c are always connected to the inputs h and c, makes backpropagation a little tricky. We don’t want to get stuck in an infinite loop!
The way to deal with this is a technique called backpropagation through time where we backpropagate through all the steps of the entire training sequence.
In the interest of keeping this blog post short, I’m not going to explain the entire training procedure here. You can find the complete implementation in lstm.py in the function train().
However, I do want to mention a few things:
The learning capacity of the LSTM is determined by the size of the h and c vectors. A size of 200 units (or neurons if you will) seems to work well. More units might work even better but at some point you’ll get diminishing returns, and you’re better off stacking multiple LSTMs on top of each other (making the network deeper rather than wider).
It’s not practical to backpropagate through all 52260 steps of the training sequence, even though that would give the best results. Instead, we only go back 200 timesteps. After a bit of experimentation this seemed like a reasonable number. To achieve this, the training script actually sticks 200 LSTM units together and processes the training sequence in chunks of 200 notes at a time.
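To give an idea of what “chunks of 200 notes” could look like, here is a Python sketch that cuts a sequence into input/target chunks, where each target is the element that follows its input. This is my guess at the general shape, not the code from the training script, and it leaves out the state: with this kind of truncated backpropagation you would typically carry h and c over from one chunk to the next while the gradients stop at the chunk boundary.

```python
def make_chunks(sequence, unroll=200):
    """Yield (inputs, targets) chunks; targets are the inputs shifted by one."""
    last = len(sequence) - 1  # the final element has no "next" element
    for start in range(0, last, unroll):
        end = min(start + unroll, last)
        yield sequence[start:end], sequence[start + 1:end + 1]


# A toy sequence of 10 "notes" with chunks of 4 instead of 200.
seq = list(range(10))
chunks = list(make_chunks(seq, unroll=4))
print(chunks[0])   # ([0, 1, 2, 3], [1, 2, 3, 4])
print(chunks[-1])  # ([8], [9])
```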
Every so often the training script computes the percentage of predictions it has correct. It does this on the training set (there is no validation set) so take it with a grain of salt, but it’s a good indicator of whether the training is still making progress or not.
The final model took a few hours to train on my iMac but that’s because it doesn’t have a GPU that TensorFlow can use (sad face). I let the training script run until the learning seemed to have stalled (the accuracy and loss did not improve), then I pressed Ctrl+C, lowered the learning rate in the script, and resumed training from the last checkpoint.
The model that is included in the GitHub repo has an accuracy score of about 92%, which means 8 in every 100 notes from the training sequence are remembered wrong. Once the model reached 92% accuracy, it didn’t seem to want to go much further than that, so we’ve probably reached the capacity of our model.
An accuracy of “only” 92% is good enough for our purposes: we don’t want the LSTM to literally remember every example from the training data, just enough to get a sense of what it means to play the drums.
Don’t fire the drummer from your band just yet. :–)

The question is: has the recurrent neural network really learned anything from the training data, or does it just output random notes?
Here’s an MP3 of randomly chosen notes and durations from the training data. It doesn’t sound like real drums at all.
Compare it with this recording that was produced by the LSTM. It’s definitely much more realistic! (In fact, it sounds a lot like the kid down the street practicing.)
Of course, the model we’re using is very basic. It’s a single LSTM layer with “only” 200 neurons. No doubt there are much better ways to train a computer to play the drums. One way is to make the network deeper by stacking multiple LSTMs. This should improve the performance by a lot!
The weights that are learned by the model take up 1.5 MB of storage. The dataset, on the other hand, is only 1.3 MB! That doesn’t seem very efficient. But just having the dataset does not mean you know how to drum — the weights are more than just a way to remember the training data, they also “understand” in some way what it means to play the drums.
The cool thing is that our neural network doesn’t really know anything about music: we just gave it examples and it has learned drumming from that (to some extent anyway). The point I’m trying to make with this blog post is that if we can make a recurrent neural network learn to drum, then we can teach it to understand any kind of sequential data.
Finally we get to the Swift portion of this blog post!
Note: Even though you cannot train the model yourself unless you have a large dataset of MIDI drum patterns, you can run the iOS app to hear the drums in action. The learned parameters are included in the GitHub project.
The iOS app is very simple: it just has a button. When you tap the button, the app generates a new drum sequence of 1,000 notes and plays it using AVMIDIPlayer.
The interesting part of the app is in how it generates the drum sequence. The code uses the same LSTM math as above, except this time it’s not implemented using TensorFlow but with functions from the Accelerate framework.
In pseudocode the algorithm looks like this:

    randomize the c and h vectors
    choose a random starting note and ticks
    repeat 1000 times:
        one-hot encode the note/ticks pair
        do the LSTM math
        add the predicted next note and duration to an array
        use this new note/ticks pair as the new input

Each iteration of the loop predicts a new note and duration, then uses this to predict another note and duration, and so on. This gives us an array of 1,000 (note, ticks) pairs.
The full code is in Drummer.swift in the function sample(). I’m not going to show every single step, but here are a few snippets.
First, we start with random c and h vectors. Recall that these make up the internal state of the LSTM. During training this internal state kept track of where we were in the training sequence. But now we want to create original drum patterns, so we start off with a randomized memory state.

    Math.uniformRandom(&c, hiddenSize, 0.1)
    Math.uniformRandom(&h, hiddenSize, 0.1)

The Math namespace is an enum that contains static functions that wrap around the Accelerate functionality. It makes the Accelerate framework a little easier to use. (See Math.swift for details.)
uniformRandom() fills up the c and h arrays with random floating-point values between -0.1 and +0.1. Feel free to experiment and make these values larger or smaller to get different prediction results.
Inside the loop, we first one-hot encode the current note number and ticks value. As before, we put these into a 226-element vector. We also append h to this vector and add a 1 at the end. As you’ve seen above, this allows us to do all of the LSTM weight calculations with a single matrix.
The code for doing the matrix multiply is:

    Wx_data.withUnsafeBytes { Wx in
      Math.matmul(&x, Wx, &gates, 1, Wx_cols, Wx_rows)
    }

where Wx_data is a Data object with the contents of the file Wx.bin. We multiply x, which is the vector containing the one-hot note and ticks as well as h, with the big weight matrix Wx and we store the result in the new array gates.
Let’s take a look at Math.matmul() to see what it does:

    enum Math {
      static func matmul(_ A: UnsafePointer<Float>,
                         _ B: UnsafePointer<Float>,
                         _ C: UnsafeMutablePointer<Float>,
                         _ M: Int, _ N: Int, _ K: Int) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    Int32(M), Int32(N), Int32(K), 1, A, Int32(K),
                    B, Int32(N), 0, C, Int32(N))
      }
    }

This is just a simple wrapper around the cblas_sgemm() function, which multiplies matrix (or vector) A with matrix B and stores the result in matrix C. (M, N, and K are the dimensions of the matrices.)
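In case the roles of M, N, and K are not obvious, here is what that call works out to, written as a slow but readable Python sketch. All three matrices are flat lists in row-major order, just like the raw float buffers in the Swift code: A is M×K, B is K×N, and the result C is M×N. The function is for illustration only and is not part of the app.

```python
def matmul(A, B, M, N, K):
    """C = A * B for row-major flat lists: A is M x K, B is K x N, C is M x N."""
    C = [0.0] * (M * N)
    for m in range(M):
        for n in range(N):
            # Row m of A times column n of B.
            C[m * N + n] = sum(A[m * K + k] * B[k * N + n] for k in range(K))
    return C


# A 1x2 vector times a 2x3 matrix gives a 1x3 vector.
A = [1.0, 2.0]
B = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]
print(matmul(A, B, 1, 3, 2))  # [9.0, 12.0, 15.0]
```

In the app M is 1 because x is a single row vector, K is the number of rows of the weight matrix, and N is its number of columns.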
Most of the other Math functions are simple one-liners too, but some, such as sigmoid() and softmax(), are more complicated, as their calculations need to combine multiple Accelerate functions.
The next bit performs the LSTM math:

    gates.withUnsafeMutableBufferPointer { ptr in
      let gateF = ptr.baseAddress!
      let gateI = gateF.advanced(by: hiddenSize)
      let gateO = gateI.advanced(by: hiddenSize)
      let gateG = gateO.advanced(by: hiddenSize)

      // Compute the activations of the gates.
      Math.sigmoid(gateF, hiddenSize*3)
      Math.tanh(gateG, hiddenSize)

      // c[t] = sigmoid(gateF) * c[t-1] + sigmoid(gateI) * tanh(gateG)
      Math.multiply(&c, gateF, &c, hiddenSize)
      Math.multiply(gateI, gateG, &tmp, hiddenSize)
      Math.add(&tmp, &c, hiddenSize)

      // h[t] = sigmoid(gateO) * tanh(c[t])
      Math.tanh(&c, &tmp, hiddenSize)
      Math.multiply(gateO, &tmp, &h, hiddenSize)
    }

In the Python code we could do this in just three lines but in Swift it’s a bit more verbose. At the end of this block of code we have the new values for the internal state vectors c and h.
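For comparison, here is the same state update as a plain Python sketch that works on one flat list of gate values laid out as F, I, O, G, which is the layout the pointer arithmetic above implies. The function name is my own, and unlike the Swift code it builds new lists instead of updating the buffers in place.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_state_update(gates, c_prev, hidden_size):
    """gates holds the raw matmul output laid out as [F | I | O | G]."""
    n = hidden_size
    f = [sigmoid(v) for v in gates[0:n]]
    i = [sigmoid(v) for v in gates[n:2 * n]]
    o = [sigmoid(v) for v in gates[2 * n:3 * n]]
    g = [math.tanh(v) for v in gates[3 * n:4 * n]]
    c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(n)]
    h = [o[k] * math.tanh(c[k]) for k in range(n)]
    return h, c


# All-zero gate values: the sigmoid gates are 0.5 and g is 0.
h, c = lstm_state_update([0.0] * 8, [2.0, -4.0], 2)
print(c)  # [1.0, -2.0]
```

Because the three sigmoid gates sit next to each other in memory, the Swift version can apply the sigmoid to all of them with a single call of length hiddenSize*3.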
Now that we know h, we can compute the prediction y:

    Wy_data.withUnsafeBytes { Wy in
      Math.matmul(&h, Wy, &y, 1, Wy_cols, Wy_rows)
    }

    y.withUnsafeMutableBufferPointer { ptr in
      let yNote = ptr.baseAddress!
      let yTick = yNote.advanced(by: noteVectorSize)
      Math.softmax(yNote, noteVectorSize)
      Math.softmax(yTick, tickVectorSize)

As before, we first multiply h by the weight matrix Wy to get the 226-element vector y. Then we take the softmax twice, once to predict the next note number and once to predict the next duration in ticks.
The softmax gives us two probability distributions: one for the note numbers and one for the ticks. What we do with those probability distributions is slightly different than in the training procedure:

      let noteIndex = Math.randomlySample(yNote, noteVectorSize)
      let tickIndex = Math.randomlySample(yTick, tickVectorSize)
      sampled.append((index2note[noteIndex], index2tick[tickIndex]))

The randomlySample() function picks a value based on the probabilities. Suppose you have an array of probabilities [ 0.2, 0.8 ] and you run randomlySample() on that array 100 times, it will pick the first element about 20 times and the second element about 80 times.
So here we choose one value from each probability distribution — the higher its probability the more likely we’ll choose it — and those are the note and ticks we’ll use for this timestep.
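One common way to sample from a probability distribution is to draw a random number between 0 and 1 and walk through the cumulative probabilities until you pass it. Here is that idea as a Python sketch; I have not checked that the app’s randomlySample() works this way, so treat it as an illustration of the technique only.

```python
import random


def randomly_sample(probs, rng=random):
    """Return index i with probability probs[i] (probs should sum to 1)."""
    u = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point rounding


rng = random.Random(42)  # fixed seed so the run is repeatable
counts = [0, 0]
for _ in range(1000):
    counts[randomly_sample([0.2, 0.8], rng)] += 1
print(counts)  # the second count should be about four times the first
```

Sampling instead of always picking the most likely note is what keeps the generated drum pattern from getting stuck repeating itself.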
Now that we’ve predicted a new note and duration, we feed the new c and h vectors back into the network and repeat the loop. And that’s all we need to do to generate a new drum sequence! We simply run the LSTM math 1,000 times in a row.
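Putting the loop together, the overall shape of the sampling procedure looks something like the Python sketch below. The step function stands in for “one-hot encode, do the LSTM math, sample”; here I plug in a dummy that just alternates kick and snare so the example runs on its own. All the names are hypothetical.

```python
import random


def generate(step, first_pair, hidden_size, count, rng):
    """Run the sampling loop; step(pair, h, c) returns (next_pair, h, c)."""
    # Start from a randomized memory with values between -0.1 and +0.1.
    h = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
    c = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
    pair, sampled = first_pair, []
    for _ in range(count):
        pair, h, c = step(pair, h, c)  # the prediction becomes the next input
        sampled.append(pair)
    return sampled


def dummy_step(pair, h, c):
    """Stand-in for the real model: alternate kick and snare every 240 ticks."""
    note, _ = pair
    return (38 if note == 36 else 36, 240), h, c


print(generate(dummy_step, (36, 0), 200, 4, random.Random(0)))
# [(38, 240), (36, 240), (38, 240), (36, 240)]
```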
To actually play the drums through the iPhone’s speaker, the app turns the array of predicted (note, ticks) pairs into a MusicSequence object and then gives that to AVMIDIPlayer. For the exact details, check out ViewController.swift.
Give it a try! Download the GitHub repo and build the Drummer.xcodeproj file with Xcode 8. Be sure to run the app on a device because MIDI playback on the simulator is broken. And remember that each time you tap the button, the drummer starts with a random memory, so sometimes the results will be better than others. 😅
The RNN training procedure is based on the min-char-rnn.py code sample from Andrej Karpathy. See also his excellent blog post “The Unreasonable Effectiveness of Recurrent Neural Networks”.
The iOS code for MIDI playback was based on this blog post by Gene De Lisa.
The SoundFont used in the iOS app was downloaded here.