In this post I will **derive the backpropagation equations** for an LSTM cell in vectorised form. The post assumes basic knowledge of LSTMs and backpropagation, which you can refresh at Understanding LSTM Networks and A Quick Introduction to Backpropagation.

**Derivations**

**Forward propagation**

We will first remind ourselves of the forward propagation equations. The nomenclature followed is demonstrated in *Figure 1*. All equations correspond to one time step, with $z_t = [h_{t-1}, x_t]$ denoting the concatenation of the previous hidden state and the current input:

$$z_t = [h_{t-1}, x_t]$$

$$f_t = \sigma(W_f z_t + b_f)$$

$$i_t = \sigma(W_i z_t + b_i)$$

$$\bar{C}_t = \tanh(W_C z_t + b_C)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \bar{C}_t$$

$$o_t = \sigma(W_o z_t + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$

$$v_t = W_v h_t + b_v$$

$$\hat{y}_t = \mathrm{softmax}(v_t)$$

**Backward propagation**

Backpropagation through an LSTM is not as straightforward as through other common Deep Learning architectures, due to the special way its underlying layers interact. Nonetheless, the approach is largely the same: identifying dependencies and recursively applying the chain rule.

Cross-entropy loss with a softmax function is used at the output layer. The standard result for the derivative of the cross-entropy loss composed with softmax, $\frac{\partial J}{\partial v_t} = \hat{y}_t - y_t$, is used directly; a detailed derivation can be found here.
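As a quick sanity check of that result, a minimal NumPy sketch (the function names are mine, not from the post) can compare the analytic gradient $\hat{y} - y$ against a central-difference numerical gradient:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

def cross_entropy(v, y):
    return -np.sum(y * np.log(softmax(v)))

v = np.array([0.5, -1.0, 2.0])   # logits (illustrative values)
y = np.array([0.0, 1.0, 0.0])    # one-hot target

# Analytic gradient: dJ/dv = softmax(v) - y
analytic = softmax(v) - y

# Numerical gradient via central differences
eps = 1e-6
numeric = np.array([
    (cross_entropy(v + eps * np.eye(3)[k], y) -
     cross_entropy(v - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```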

**output**

$$\frac{\partial J}{\partial v_t} = \hat{y}_t - y_t, \qquad \frac{\partial J}{\partial W_v} \mathrel{+}= \frac{\partial J}{\partial v_t}\, h_t^\top, \qquad \frac{\partial J}{\partial b_v} \mathrel{+}= \frac{\partial J}{\partial v_t}$$

**hidden state**

$$\frac{\partial J}{\partial h_t} = W_v^\top \frac{\partial J}{\partial v_t} + \frac{\partial J}{\partial h_t}\bigg|_{\text{next}}$$

where the second term is the gradient flowing back from the next time step ($dJ/dh_{\text{next}}$).

**output gate**

$$\frac{\partial J}{\partial o_t} = \frac{\partial J}{\partial h_t} \odot \tanh(C_t), \qquad \frac{\partial J}{\partial a_o} = \frac{\partial J}{\partial o_t} \odot o_t \odot (1 - o_t)$$

where $a_o = W_o z_t + b_o$ is the pre-activation of the output gate (and similarly for $a_f$, $a_i$, $a_C$ below).

**cell state**

$$\frac{\partial J}{\partial C_t} = \frac{\partial J}{\partial h_t} \odot o_t \odot (1 - \tanh^2(C_t)) + \frac{\partial J}{\partial C_t}\bigg|_{\text{next}}$$

$$\frac{\partial J}{\partial \bar{C}_t} = \frac{\partial J}{\partial C_t} \odot i_t, \qquad \frac{\partial J}{\partial a_C} = \frac{\partial J}{\partial \bar{C}_t} \odot (1 - \bar{C}_t^2)$$

$$\frac{\partial J}{\partial C_{t-1}} = \frac{\partial J}{\partial C_t} \odot f_t$$

**input gate**

$$\frac{\partial J}{\partial i_t} = \frac{\partial J}{\partial C_t} \odot \bar{C}_t, \qquad \frac{\partial J}{\partial a_i} = \frac{\partial J}{\partial i_t} \odot i_t \odot (1 - i_t)$$

**forget gate**

$$\frac{\partial J}{\partial f_t} = \frac{\partial J}{\partial C_t} \odot C_{t-1}, \qquad \frac{\partial J}{\partial a_f} = \frac{\partial J}{\partial f_t} \odot f_t \odot (1 - f_t)$$

**input**

$$\frac{\partial J}{\partial z_t} = W_f^\top \frac{\partial J}{\partial a_f} + W_i^\top \frac{\partial J}{\partial a_i} + W_C^\top \frac{\partial J}{\partial a_C} + W_o^\top \frac{\partial J}{\partial a_o}$$

$$\frac{\partial J}{\partial h_{t-1}} = \frac{\partial J}{\partial z_t}[{:}N_h], \qquad \frac{\partial J}{\partial x_t} = \frac{\partial J}{\partial z_t}[N_h{:}]$$

with the accumulated weight gradients

$$\frac{\partial J}{\partial W_\ast} \mathrel{+}= \frac{\partial J}{\partial a_\ast}\, z_t^\top, \qquad \frac{\partial J}{\partial b_\ast} \mathrel{+}= \frac{\partial J}{\partial a_\ast}, \qquad \ast \in \{f, i, C, o\}$$
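The backward equations for a single time step can be sketched in NumPy. This is a minimal illustration under my own assumptions about how the forward quantities are cached and how the parameters are named; it is not the implementation from the follow-up post:

```python
import numpy as np

def lstm_backward_step(dh_next, dC_next, dv, cache, params):
    """One backward time step for an LSTM with concatenated input z_t.
    cache holds the forward quantities of this step (layout assumed here);
    returns the parameter gradients plus dh_prev and dC_prev."""
    z, f, i, C_bar, C, C_prev, o, h = (cache[k] for k in
        ("z", "f", "i", "C_bar", "C", "C_prev", "o", "h"))
    Wf, Wi, WC, Wo, Wv = (params[k] for k in ("Wf", "Wi", "WC", "Wo", "Wv"))
    Nh = h.shape[0]  # hidden size

    # hidden state: output-layer gradient plus gradient from the next step
    dh = Wv.T @ dv + dh_next
    # output gate
    do = dh * np.tanh(C)
    da_o = do * o * (1 - o)
    # cell state
    dC = dh * o * (1 - np.tanh(C) ** 2) + dC_next
    # candidate cell state
    dC_bar = dC * i
    da_C = dC_bar * (1 - C_bar ** 2)
    # input gate
    di = dC * C_bar
    da_i = di * i * (1 - i)
    # forget gate
    df = dC * C_prev
    da_f = df * f * (1 - f)
    # input: sum the paths through the four gate pre-activations
    dz = Wf.T @ da_f + Wi.T @ da_i + WC.T @ da_C + Wo.T @ da_o

    grads = {
        "Wf": np.outer(da_f, z), "Wi": np.outer(da_i, z),
        "WC": np.outer(da_C, z), "Wo": np.outer(da_o, z),
        "Wv": np.outer(dv, h),
        "bf": da_f, "bi": da_i, "bC": da_C, "bo": da_o, "bv": dv,
    }
    # dh_prev is the first Nh elements of dz; dC_prev flows through f_t
    return grads, dz[:Nh], dC * f
```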

The above equations for forward propagation and backpropagation will be calculated T times (the number of time steps) in each training iteration. At the end of each training iteration, the weights will be updated using the cost gradient with respect to each weight, accumulated over all time steps. Assuming Stochastic Gradient Descent with learning rate $\lambda$, the update equations are the following:

$$W \leftarrow W - \lambda \frac{\partial J}{\partial W}, \qquad b \leftarrow b - \lambda \frac{\partial J}{\partial b}$$

for each weight matrix $W \in \{W_f, W_i, W_C, W_o, W_v\}$ and its corresponding bias.
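A minimal sketch of that update step, assuming the gradients have already been accumulated over the T time steps (the dictionary layout and the values below are illustrative, not from the post):

```python
import numpy as np

lambda_ = 0.1  # learning rate (illustrative value)

# Parameters and their gradients accumulated over all T time steps
params = {"Wf": np.ones((3, 5)), "bf": np.zeros(3)}
grads  = {"Wf": np.full((3, 5), 0.2), "bf": np.full(3, 0.5)}

# SGD: W <- W - lambda * dJ/dW, applied to every parameter
for name in params:
    params[name] -= lambda_ * grads[name]

print(params["Wf"][0, 0], params["bf"][0])
```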

In the next post, we will implement the above equations using Numpy and train the resulting LSTM model on real data.

Thanks Christina!! We can discuss if you want 🙂


What does the [:Nh:] mean?


Since we are concatenating [h_(t-1), x_t] to get z_t, dJ/dh_(t-1) is just the first Nh elements of the dJ/dz_t vector, which is what she wrote in Python slice-like notation.
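A tiny NumPy example of that slicing (the sizes here are made up for illustration):

```python
import numpy as np

Nh, Nx = 3, 2            # hidden and input sizes (illustrative)
dz = np.arange(Nh + Nx)  # stands in for the dJ/dz_t vector
dh_prev = dz[:Nh]        # dJ/dh_(t-1): the first Nh elements
dx = dz[Nh:]             # dJ/dx_t: the remaining Nx elements
print(dh_prev.tolist())  # [0, 1, 2]
print(dx.tolist())       # [3, 4]
```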


Goddess! The best post out there…

Well done! 🙂


Hey, I find your notation dJ/dh_t +- dJ/dh_next a bit confusing, it might be better to rewrite this as:

dJ/dh_t += dJ/dh_next


Thanks for the feedback! 🙂
