Implementing an LSTM from scratch with NumPy

In this post, we will implement a simple character-level LSTM using NumPy. It is trained in batches with the Adam optimiser and learns basic words after just a few training iterations.

The full code is available on GitHub.

Figure 1: Architecture of an LSTM memory cell


import numpy as np
import matplotlib.pyplot as plt

Data preparation

Our dataset is J.K. Rowling’s Harry Potter and the Philosopher’s Stone. I chose this text because the characteristic context and semantic structures present in its abundant dialogue help with evaluating the quality of the results (I am also a huge HP fan…).

data = open('HP1.txt').read().lower()

We will compute the size of our vocabulary, along with dictionaries mapping characters to indices and vice versa, which will be used to transform the input data into an appropriate format later on.

chars = set(data)
vocab_size = len(chars)
print('data has %d characters, %d unique' % (len(data), vocab_size))

char_to_idx = {w: i for i,w in enumerate(chars)}
idx_to_char = {i: w for i,w in enumerate(chars)}
data has 442744 characters, 54 unique

Initialisation of model parameters

We will wrap all functions in the LSTM class.

There is no conclusive evidence on which optimiser performs best for this type of architecture. In this implementation, we will use the Adam optimiser, as general empirical results demonstrate that it performs favourably compared to other optimisation methods: it converges fast and can effectively navigate local minima by adapting the learning rate for each parameter. The exponential decay rates for the moment estimates, \beta_{1} and \beta_{2}, are initialised as suggested in the original paper. Here’s a quick explanation of how Adam works and how it compares to other methods.
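For reference, these are the standard Adam update equations that update_params will implement later on, where g_{t} is the gradient of the loss with respect to a parameter \theta at update step t, lr is the learning rate and \epsilon = 10^{-8}:

m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1}) g_{t}
v_{t} = \beta_{2} v_{t-1} + (1 - \beta_{2}) g_{t}^{2}
\hat{m}_{t} = m_{t} / (1 - \beta_{1}^{t}), \quad \hat{v}_{t} = v_{t} / (1 - \beta_{2}^{t})
\theta_{t} = \theta_{t-1} - lr \cdot \hat{m}_{t} / (\sqrt{\hat{v}_{t}} + \epsilon)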

After the learning rate, weight initialisation is the second most important setting for LSTMs and other recurrent networks; improper initialisation could slow down the training process to the point of impracticality. We will therefore use the high-performing Xavier initialisation, which involves randomly sampling weights from a distribution \mathcal{N}(0, \sigma^{2}) with standard deviation \sigma = \frac{1}{\sqrt{n}}, where n is the number of neurons in the preceding layer.

The input to the LSTM, z, has dimensions [vocab_size + n_h, 1]. Since the LSTM layer outputs n_h hidden units, each gate weight should be of size [n_h, vocab_size + n_h] and each gate bias of size [n_h, 1]. The exception is the weight and bias of the output softmax layer (Wv, bv). The resulting output is a probability distribution over all possible characters in the vocabulary, therefore of size [vocab_size, 1]; hence Wv should be of size [vocab_size, n_h] and bv of size [vocab_size, 1].
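As a quick sanity check of these shapes, here is a minimal sketch (assuming n_h = 100 and the vocabulary size of 54 found above):

import numpy as np

n_h, vocab_size = 100, 54

h_prev = np.zeros((n_h, 1))                  # previous hidden state
x = np.zeros((vocab_size, 1))                # one-hot encoded input character
z = np.row_stack((h_prev, x))                # concatenated input

Wf = np.random.randn(n_h, n_h + vocab_size)  # forget gate weights
bf = np.zeros((n_h, 1))                      # forget gate bias
Wv = np.random.randn(vocab_size, n_h)        # output layer weights
bv = np.zeros((vocab_size, 1))               # output layer bias

print(z.shape)                               # (154, 1)
print((, z) + bf).shape)             # (100, 1): one activation per hidden unit
print((, h_prev) + bv).shape)        # (54, 1): one score per character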

class LSTM:
    def __init__(self, char_to_idx, idx_to_char, vocab_size, n_h=100, seq_len=25, 
                          epochs=10, lr=0.01, beta1=0.9, beta2=0.999):
        self.char_to_idx = char_to_idx # characters to indices mapping
        self.idx_to_char = idx_to_char # indices to characters mapping
        self.vocab_size = vocab_size # no. of unique characters in the training data
        self.n_h = n_h # no. of units in the hidden layer
        self.seq_len = seq_len # no. of time steps, also size of mini batch
        self.epochs = epochs # no. of training iterations
        self.lr = lr # learning rate
        self.beta1 = beta1 # 1st momentum parameter
        self.beta2 = beta2 # 2nd momentum parameter
        #-----initialise weights and biases-----#
        self.params = {}
        std = (1.0/np.sqrt(self.vocab_size + self.n_h)) # Xavier initialisation
        # forget gate
        self.params["Wf"] = np.random.randn(self.n_h, self.n_h + self.vocab_size) * std
        self.params["bf"] = np.ones((self.n_h,1))

        # input gate
        self.params["Wi"] = np.random.randn(self.n_h, self.n_h + self.vocab_size) * std
        self.params["bi"] = np.zeros((self.n_h,1))

        # cell gate
        self.params["Wc"] = np.random.randn(self.n_h, self.n_h + self.vocab_size) * std
        self.params["bc"] = np.zeros((self.n_h,1))

        # output gate
        self.params["Wo"] = np.random.randn(self.n_h, self.n_h + self.vocab_size) * std
        self.params["bo"] = np.zeros((self.n_h ,1))

        # output
        self.params["Wv"] = np.random.randn(self.vocab_size, self.n_h) * \
        self.params["bv"] = np.zeros((self.vocab_size ,1))

        #-----initialise gradients and Adam parameters-----#
        self.grads = {}
        self.adam_params = {}

        for key in self.params:
            self.grads["d"+key] = np.zeros_like(self.params[key])
            self.adam_params["m"+key] = np.zeros_like(self.params[key])
            self.adam_params["v"+key] = np.zeros_like(self.params[key])
        self.smooth_loss = -np.log(1.0 / self.vocab_size) * self.seq_len

Utility functions

Firstly, we will define the sigmoid activation used at the forget, input and output gates, and the softmax activation used at the output layer. A tanh activation is also needed, but we will use numpy.tanh directly.

def sigmoid(self, x):
    return 1 / (1 + np.exp(-x))

LSTM.sigmoid = sigmoid

def softmax(self, x):
    e_x = np.exp(x - np.max(x)) # max(x) subtracted for numerical stability
    return e_x / np.sum(e_x)

LSTM.softmax = softmax

Although exploding gradients are not as prevalent in LSTMs as in vanilla RNNs, we will limit the gradients to a conservative value using clip_grads. After backpropagating through all LSTM cells of a batch, we will reset the gradients using reset_grads.

def clip_grads(self):
    for key in self.grads:
        np.clip(self.grads[key], -5, 5, out=self.grads[key])

LSTM.clip_grads = clip_grads

def reset_grads(self):
    for key in self.grads:
        self.grads[key].fill(0)
LSTM.reset_grads = reset_grads

The last utility function that we will create is for updating the weights using Adam. Note that the weights are updated using the accumulated gradients for all time steps.

def update_params(self, batch_num):
    for key in self.params:
        self.adam_params["m"+key] = self.adam_params["m"+key] * self.beta1 + \
                                    (1 - self.beta1) * self.grads["d"+key]
        self.adam_params["v"+key] = self.adam_params["v"+key] * self.beta2 + \
                                    (1 - self.beta2) * self.grads["d"+key]**2

        # bias-corrected moment estimates
        m_correlated = self.adam_params["m" + key] / (1 - self.beta1**batch_num)
        v_correlated = self.adam_params["v" + key] / (1 - self.beta2**batch_num)
        self.params[key] -= * m_correlated / (np.sqrt(v_correlated) + 1e-8)

LSTM.update_params = update_params

Forward propagation for a time-step

We will propagate forwards through each LSTM cell using forward_step. The mathematical form of the forward and backward propagation equations can be found in my previous post.

An LSTM cell depends on the previous cell’s state (as in any recurrent network). forward_step therefore takes as input the previous hidden state (h_prev) and previous cell state (c_prev). At the beginning of every training iteration, the previous states are initialised to zero (i.e. at t = -1), but for subsequent time-steps they correspond to the states at t-1, where t is the current time-step.
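For reference, these are the standard LSTM equations that forward_step computes, where z_{t} is the concatenation of h_{t-1} and x_{t}, \sigma is the sigmoid function and \odot denotes element-wise multiplication:

f_{t} = \sigma(W_{f} z_{t} + b_{f})
i_{t} = \sigma(W_{i} z_{t} + b_{i})
\bar{c}_{t} = \tanh(W_{c} z_{t} + b_{c})
c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \bar{c}_{t}
o_{t} = \sigma(W_{o} z_{t} + b_{o})
h_{t} = o_{t} \odot \tanh(c_{t})
\hat{y}_{t} = \text{softmax}(W_{v} h_{t} + b_{v})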

def forward_step(self, x, h_prev, c_prev):
    z = np.row_stack((h_prev, x))

    f = self.sigmoid(["Wf"], z) + self.params["bf"])
    i = self.sigmoid(["Wi"], z) + self.params["bi"])
    c_bar = np.tanh(["Wc"], z) + self.params["bc"])

    c = f * c_prev + i * c_bar
    o = self.sigmoid(["Wo"], z) + self.params["bo"])
    h = o * np.tanh(c)

    v =["Wv"], h) + self.params["bv"]
    y_hat = self.softmax(v)
    return y_hat, v, h, o, c, c_bar, i, f, z

LSTM.forward_step = forward_step

Backward propagation for a time-step

After forward propagation, we will pass the cached values of the last LSTM cell to backward_step and propagate the gradients backwards to the first LSTM cell.

dh_next and dc_next are initialised to zero at t = -1, but take the values of dh_prev and dc_prev that backward_step returns in subsequent time steps.
In addition, it is worth clarifying:

1. As weights are shared by all time steps, the weight gradients are accumulated.
2. We are adding dh_next to dh because, as Figure 1 shows, h branches during forward propagation into the softmax output layer and into the next LSTM cell, where it is concatenated with x. Therefore, there are two gradients flowing back. The same applies to dc.
3. There are four gradients flowing towards the input layer from the gates, therefore dz is the summation of those gradients.
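As a side note, the first two lines of backward_step rely on the standard result that, for a softmax output combined with a cross-entropy loss, the gradient of the loss with respect to the pre-softmax logits v is

\frac{\partial L}{\partial v} = \hat{y} - y

where y is the one-hot encoded target. This is why dv is simply a copy of y_hat with 1 subtracted at the target index.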

def backward_step(self, y, y_hat, dh_next, dc_next, c_prev, z, f, i, c_bar, c, o, h):
    dv = np.copy(y_hat)
    dv[y] -= 1 # yhat - y

    self.grads["dWv"] +=, h.T)
    self.grads["dbv"] += dv

    dh =["Wv"].T, dv)
    dh += dh_next
    do = dh * np.tanh(c)
    da_o = do * o*(1-o)
    self.grads["dWo"] +=, z.T)
    self.grads["dbo"] += da_o

    dc = dh * o * (1-np.tanh(c)**2)
    dc += dc_next

    dc_bar = dc * i
    da_c = dc_bar * (1-c_bar**2)
    self.grads["dWc"] +=, z.T)
    self.grads["dbc"] += da_c

    di = dc * c_bar
    da_i = di * i*(1-i) 
    self.grads["dWi"] +=, z.T)
    self.grads["dbi"] += da_i

    df = dc * c_prev
    da_f = df * f*(1-f)
    self.grads["dWf"] +=, z.T)
    self.grads["dbf"] += da_f

    dz = (["Wf"].T, da_f)
         +["Wi"].T, da_i)
         +["Wc"].T, da_c)
         +["Wo"].T, da_o))

    dh_prev = dz[:self.n_h, :]
    dc_prev = f * dc
    return dh_prev, dc_prev

LSTM.backward_step = backward_step

Forward and backward propagation for all time-steps

The forward and backward propagation steps will be executed within the forward_backward function. Here, we iterate over all time steps and store the results for each time step in dictionaries. In the forward propagation loop, we also accumulate the cross entropy loss.

forward_backward returns the cross-entropy loss of the training batch, in addition to the hidden and cell states of the last time-step, which are fed to the first LSTM cell as h_prev and c_prev of the next training batch.

def forward_backward(self, x_batch, y_batch, h_prev, c_prev):
    x, z = {}, {}
    f, i, c_bar, c, o = {}, {}, {}, {}, {}
    y_hat, v, h = {}, {}, {}

    # values at t = -1
    h[-1] = h_prev
    c[-1] = c_prev

    loss = 0
    for t in range(self.seq_len): 
        x[t] = np.zeros((self.vocab_size, 1))
        x[t][x_batch[t]] = 1

        y_hat[t], v[t], h[t], o[t], c[t], c_bar[t], i[t], f[t], z[t] = \
        self.forward_step(x[t], h[t-1], c[t-1])

        loss += -np.log(y_hat[t][y_batch[t],0])


    dh_next = np.zeros_like(h[0])
    dc_next = np.zeros_like(c[0])

    for t in reversed(range(self.seq_len)):
        dh_next, dc_next = self.backward_step(y_batch[t], y_hat[t], dh_next, 
                                              dc_next, c[t-1], z[t], f[t], i[t], 
                                              c_bar[t], c[t], o[t], h[t]) 
    return loss, h[self.seq_len-1], c[self.seq_len-1]

LSTM.forward_backward = forward_backward

Sampling character sequences

As training progresses, we will use the sample function to output a sequence of characters from the model, of total length sample_size.

def sample(self, h_prev, c_prev, sample_size):
    x = np.zeros((self.vocab_size, 1))
    h = h_prev
    c = c_prev
    sample_string = "" 
    for t in range(sample_size):
        y_hat, _, h, _, c, _, _, _, _ = self.forward_step(x, h, c)        
        # sample an index from the probability distribution of y_hat (flattened with ravel())
        idx = np.random.choice(range(self.vocab_size), p=y_hat.ravel())
        x = np.zeros((self.vocab_size, 1))
        x[idx] = 1
        #find the char with the sampled index and concat to the output string
        char = self.idx_to_char[idx]
        sample_string += char
    return sample_string

LSTM.sample = sample


Next, we define the function to train the model. train takes as input a corpus of text (X) and outputs a list of losses for each training batch (J) as well as the trained parameters.

In order to speed up training, we will train our data in batches. The number of batches (num_batches) is given by the total number of characters in the input text (len(X)) divided by the number of characters that we want to use in each batch (seq_len), which is user-defined. The input text goes through the following processing steps:

1. Firstly, we trim the characters at the end of the input text that don’t form a full sequence
2. When we iterate over each training batch, we slice the input text into batches of size seq_len
3. We map each character in the input (and output) batch to an index using char_to_idx, effectively converting the input batch to a list of integers

I have mentioned earlier that h_prev and c_prev are set to zero at the beginning of every training iteration. This means that the states produced by each batch will be reused as initial states for the next batch. Keras, the high-level neural networks API, refers to this as “making the RNN stateful”. If each training batch were independent, then h_prev and c_prev should instead be reset after training each batch.
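To make the batching and index mapping above concrete, here is a minimal sketch on a made-up string (the variable names are only for illustration):

text = "harry potter"
seq_len = 5

char_to_idx = {ch: i for i, ch in enumerate(set(text))}

num_batches = len(text) // seq_len            # 2
text_trimmed = text[: num_batches * seq_len]  # "harry pott", full sequences only

# the loop stops one sequence early because the targets need one character of lookahead
for j in range(0, len(text_trimmed) - seq_len, seq_len):
    x_batch = [char_to_idx[ch] for ch in text_trimmed[j: j + seq_len]]          # inputs
    y_batch = [char_to_idx[ch] for ch in text_trimmed[j + 1: j + seq_len + 1]]  # targets, shifted by one
    print(x_batch, y_batch)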

def train(self, X, verbose=True):
    J = []  # to store losses

    num_batches = len(X) // self.seq_len
    X_trimmed = X[: num_batches * self.seq_len]  # trim input to have full sequences

    for epoch in range(self.epochs):
        h_prev = np.zeros((self.n_h, 1))
        c_prev = np.zeros((self.n_h, 1))

        for j in range(0, len(X_trimmed) - self.seq_len, self.seq_len):
            # prepare batches
            x_batch = [self.char_to_idx[ch] for ch in X_trimmed[j: j + self.seq_len]]
            y_batch = [self.char_to_idx[ch] for ch in X_trimmed[j + 1: j + self.seq_len + 1]]

            loss, h_prev, c_prev = self.forward_backward(x_batch, y_batch, h_prev, c_prev)

            # smooth out loss and store in list
            self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001
            J.append(self.smooth_loss)

            # check gradients numerically on the first batch
            # (gradient_check is defined in the full implementation on GitHub)
            if epoch == 0 and j == 0:
                self.gradient_check(x_batch, y_batch, h_prev, c_prev, num_checks=10, delta=1e-7)

            self.clip_grads()

            batch_num = epoch * num_batches + j // self.seq_len + 1
            self.update_params(batch_num)
            self.reset_grads()

            # print out loss and sample string
            if verbose:
                if j % 400000 == 0:
                    print('Epoch:', epoch, '\tBatch:', j, "-", j + self.seq_len,
                          '\tLoss:', round(self.smooth_loss, 2))
                    s = self.sample(h_prev, c_prev, sample_size=250)
                    print(s, "\n")
    return J, self.params

LSTM.train = train


Finally, we will run the training process for 5 iterations.

model = LSTM(char_to_idx, idx_to_char, vocab_size, epochs = 5, lr = 0.01)

J, params = model.train(data)
Epoch: 0 	Batch: 0 - 25 	Loss: 99.72
Jo!~zyy64 f??0xtnqr7"“jol31irp?a*bmm;-efx;vb.;a9:-5l.'7v,“a;xk.x6gx(6si8–mqpj*jq!udfgymvi“9tbp h!h4ken052cihessw-:5\2\74f~s1pt“9'nvx?ysuh0m;,jjn~yu5e48dib3paq"m2z9–56~“7597d 
E!“ qmw0p)gj'-x10341-iq:)yv)-dep.:)sy~2*9aj:6?–j'dd:78hfl4,y60ya-jntp 

Epoch: 0 	Batch: 400000 - 400025 	Loss: 47.69
ope at cou.... 

"dove the anased, hermione out on ascowkri weing thee blimesace ind y-u her ceuld you haspill, cilleler for, he tour you not ther ele jatt te you bleat purofed coed a, himsint to gonl. sbangout gotagone cos it hould," siree?"in's he. 

Epoch: 1 	Batch: 0 - 25 	Loss: 47.36
?" the muspe the swertery," harnyound." 

ait sheer in erey -- in you,?r juven a putced of the bain kem the gould the and ande it kiry stheng u he sarmy to a veingl, you hak was fory. 

"wave tay gotnts onte file somione walled to the von! hem ape ea 

Epoch: 1 	Batch: 400000 - 400025 	Loss: 42.05
antin i marst. 


"in't woated. 

"" havey to leder and reamy doyed to bopreas slop ay?" har'y and tree tho gosteng bedoning but thap hard anythiont.". the comes beining bucked to there wilss a lifged a diglt beed. i gusted got carst witsheri 

Epoch: 2 	Batch: 0 - 25 	Loss: 42.32

"rexursell been them tungerlly dednow, put one a't seaner. 

herriokge, he ston's glinged sely he mught. "ye,." bac it, he said just gally, you," said, "is ed. the some coired. ic was tookstse," 

and ron," said screchens: a bryaut of he stall 

Epoch: 2 	Batch: 400000 - 400025 	Loss: 39.16
uldeding what tears from the dooms fronlaim the one aw thr again. in he piond beate, he'piled weellay's fouddout plait," lelt can't a noich arope-ficling hermoling, and hermione, the wlithing to durglevor evevingses agothere and haw and might of them 

Epoch: 3 	Batch: 0 - 25 	Loss: 39.68
ay, he cain!" said hagrid magalay's low, and our shunds?" tarng, engon't yooh was them. see -- shat." 

it weot sift you mased was ears oworeing us i donn." 

"arwuin upcors avely from maying that one. pury. mer quted would to at the coulding tomk am 

Epoch: 3 	Batch: 400000 - 400025 	Loss: 37.28
ribla gry toued, we maingh the praying they out harry and simptsor streat, neacly sunding up inta tice of the ortepled at hergion grinned you just franteriogess, blioking, but want a firs-'s they the coods behind hermione aw they goinging toke.. he s 

Epoch: 4 	Batch: 0 - 25 	Loss: 37.92
ain, askeds to grow." 

"i was a drofey strent for yelle a pupazely unyerseached the steninied, soub, we'll it, but cemsting!" 

peeze, cillod, but would suef freens gundes, forlit brood neder high quidey plyfoid me -- more..., i'me gove, in the purs 

Epoch: 4 	Batch: 400000 - 400025 	Loss: 35.93
wing. "she --" 

"they dunned. "there, poing's nowed him i poents cuddents wils a the trank of snare-aching was what ,"drown out of the and going. how exven you a stone of his sayt aborghtaney deat who it wecenes to dight," so mach down she cleaved d 

Although we have just trained our LSTM for a few iterations, it has learnt to correctly spell words like harry and hermione, and how to start dialogue with quotation marks.

plt.plot([i for i in range(len(J))], J)
plt.xlabel("#training iterations")
plt.ylabel("training loss")


This implementation prioritised simplicity and clarity over performance, so there are a few enhancements worth mentioning:

Gradient checking: To verify the backpropagation calculation, we can numerically approximate the gradient at a point and compare it to the gradient computed by backward_step. Although this was not discussed in this post, it is included in my implementation on GitHub (a minimal sketch of the idea follows this list).
Regularisation: Large recurrent neural networks tend to overfit the training data. Dropout can be used to generalise better; however, it needs to be applied differently than in feedforward neural networks, otherwise performance is not improved.
Gradient clipping: To prevent exploding gradients, we have clipped their maximum value. This, however, does not improve performance consistently. Rather than clipping each gradient independently, clipping the global norm of the gradient yields more significant improvements.
Learning the initial state: We set the initial LSTM states to zero. Instead of fixing the initial state, we can learn it like any other parameter, which can improve performance and is also recommended by Hinton.
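Below is a minimal sketch of how such a numerical gradient check works. This is not the GitHub implementation; loss_fn, params and grads are hypothetical placeholders for a function that recomputes the batch loss and the parameter/gradient dictionaries used in this post.

import numpy as np

def numerical_gradient_check(loss_fn, params, grads, num_checks=10, delta=1e-7):
    # compare analytic gradients to central-difference estimates
    for key in params:
        for _ in range(num_checks):
            # pick a random element of the parameter array
            idx = tuple(np.random.randint(s) for s in params[key].shape)
            old_val = params[key][idx]

            params[key][idx] = old_val + delta
            loss_plus = loss_fn()
            params[key][idx] = old_val - delta
            loss_minus = loss_fn()
            params[key][idx] = old_val  # restore the original value

            grad_num = (loss_plus - loss_minus) / (2 * delta)
            grad_ana = grads["d" + key][idx]
            rel_err = abs(grad_ana - grad_num) / max(abs(grad_ana) + abs(grad_num), 1e-12)
            print(key, idx, "relative error:", rel_err)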
