Generative Adversarial Networks (GANs) have achieved tremendous success in generating high-quality synthetic images and efficiently internalising the essence of the images that they learn from. Their potential is enormous, as they can learn to mimic any distribution of data.

In order to keep up with the latest advancements, I decided to explore their theoretical underpinnings by implementing a simple GAN in Python using Numpy. In this post, I will go through the implementation steps based on Ian Goodfellow’s Generative Adversarial Nets paper.

The full code is available on GitHub.

**Architecture**

**GAN**

We will implement a GAN that generates handwritten digits. The basic principle of GANs is inspired by the two-player zero-sum game, in which the total gains of the two players are zero, and each player's gain or loss of utility is exactly balanced by the loss or gain of utility of the other player [1]. A GAN comprises two models:

**1. Generator:** learns to generate new images whose data distribution is similar to that of the real dataset. Crucially, it has no direct access to the real images; it learns only through its interaction with the discriminator.

**2. Discriminator:** learns to distinguish candidates produced by the generator from samples of the real data distribution. It outputs the probability that an input image came from the real data distribution rather than from the generator.

Although both models are typically implemented as convolutional neural nets, they could be implemented by any form of differentiable system that maps data from one space to another. **This implementation uses multilayer perceptrons**, as they are less computationally demanding and much easier to code from scratch.

**Generator**

The generator, *G*, is fed **random noise**, *z*, drawn from a **normal distribution with zero mean and a standard deviation of 1**. As we will train both the generator and the discriminator using **mini-batch gradient descent**, the input noise will be a numpy array of size *[batch size, input layer size]*.

The output of the generator will be a batch of flattened images of size *[batch size, image dimension²]*. The *image dimension²* corresponds to the number of pixels of the training images (for MNIST, each image has size 28×28 pixels, i.e. 784 values).
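As a quick sanity check, here is a minimal numpy sketch of the shapes flowing in and out of the generator (the sizes are hypothetical: a batch of 64, a 100-dimensional noise vector, 28×28 images):

```python
import numpy as np

# Hypothetical sizes: batch of 64, 100-dim noise input, 28x28 images
batch_size, input_layer_size, image_size = 64, 100, 28

z = np.random.normal(0, 1, size=(batch_size, input_layer_size))  # generator input noise
fake_images = np.zeros((batch_size, image_size ** 2))            # placeholder for the generator output

print(z.shape)            # (64, 100)
print(fake_images.shape)  # (64, 784)
```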

**Discriminator**

The discriminator, *D*, will be fed a batch of real images from the MNIST dataset and a batch of fake images from the generator. **For each image, it will output the probability that the image is real rather than fake.**

The original paper suggests training the discriminator for *k* steps before training the generator for one step. **We will choose the least computationally expensive option, k=1**, therefore training the discriminator and the generator an equal number of steps.

Practitioners have been experimenting with the more sophisticated Deep Convolutional GANs to determine the optimal activation functions for the hidden and output layers [2]. I have found their recommendations to be very effective for a simple GAN as well.

In the generator network, it is recommended to use a **ReLU activation in the hidden layers and a tanh activation in the output layer**. It was observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the colour space of the training distribution [2]. In the **discriminator network, leaky ReLU was found to work well in the hidden layers** [3,4], **as it mitigates the vanishing gradient problem.** At the output layer, a **sigmoid activation** is commonly used, as it squashes the discriminator's output into a probability in [0,1]. This is in contrast to the original GAN paper, which used the maxout activation.

**Cost function**

Training of GANs involves balancing two conflicting objectives:

**1.** Training *D* to maximise the probability of assigning the correct label to both the training examples and samples from the generator, *G(z)*. The discriminator therefore wants to maximise:

$$\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $p_{data}$ is the training data distribution and $p_z$ the noise prior, which is equivalent to minimising:

$$J^{(D)} = -\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] - \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

This is just the standard cross-entropy cost that is minimised when training a binary classifier with a sigmoid output. The only difference is that the classifier is trained on two mini-batches of data: one coming from the dataset, where the label is 1 for all examples, and one coming from the generator, where the label is 0 for all examples.

**2.** Training *G* to minimise the probability that its generated images are identified as not coming from the real data distribution. In other words, *G* is trying to maximally confuse the discriminator. It tries to minimise:

$$J^{(G)} = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Typically, an alternative, non-saturating training criterion is used for the generator:

$$J^{(G)} = -\mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]$$
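A tiny numerical example illustrates why the non-saturating criterion helps. Early in training the discriminator easily rejects the generator's samples, so D(G(z)) is close to 0; the gradient of log(1 − D(G(z))) is then tiny, while the gradient of −log(D(G(z))) stays large. The value 0.01 below is purely illustrative:

```python
# Gradients of the two generator criteria w.r.t. D(G(z)), at an illustrative D(G(z)) = 0.01
d_gz = 0.01  # discriminator confidently rejects the fake sample

# saturating:      J_G =  log(1 - D(G(z)))  ->  dJ/dD = -1 / (1 - D)
grad_saturating = -1.0 / (1.0 - d_gz)        # about -1.01: weak learning signal
# non-saturating:  J_G = -log(D(G(z)))      ->  dJ/dD = -1 / D
grad_non_saturating = -1.0 / d_gz            # -100: strong learning signal

print(abs(grad_non_saturating) / abs(grad_saturating))  # ~99x larger gradient
```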

**Implementation**

**Imports**

Let’s start by importing *numpy*, *matplotlib.pyplot* and other useful libraries. *keras.datasets* is imported to get access to the MNIST dataset, *imageio* to generate a gif from sample images at each training iteration, and *Path* to define the location where the sample images used to generate the gif will be exported.

```python
import imageio
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
```

**Data Loading**

We need a set of real handwritten digits to give the discriminator a starting point in distinguishing between real and fake images. We’ll use MNIST, a benchmark dataset in deep learning. It consists of 70k images of handwritten digits compiled by the U.S. National Institute of Standards and Technology from Census Bureau employees and high school students.

As we will only use the train data, the test data (10k images) will be ignored.

```python
(x_train, y_train), (_, _) = mnist.load_data()
print("y_train.shape", y_train.shape)
print("x_train.shape", x_train.shape)
```

```
y_train.shape (60000,)
x_train.shape (60000, 28, 28)
```

**Initialisation**

We will wrap all functions in the *GAN* class.

It takes a long time to train a GAN properly. **On a single GPU, a GAN might take hours; on a single CPU, more than a day.** In addition, GANs are difficult to optimise. For these reasons, **I recommend generating one digit at a time**, by limiting the training data from the digits 0-9 to the digit specified in the *numbers* list.

We will use **a mini-batch size of 64** (*batch_size*). The input layer of the discriminator is determined by the size of the training images, *[batch_size, image dimension²]*, which needs to match the output of the generator, i.e. the fake images. The number of neurons at the input layer of the generator (*input_layer_size_g*) as well as the hidden layers of both models (*hidden_layer_size_g*, *hidden_layer_size_d*) need to be defined by us.

Next, to visualise training performance, we can generate a gif of sample images. If *create_gif* is enabled, a grid of sample images will be saved in your local directory by default, and their filenames will be stored in the *filenames* list to enable sourcing the images for stitching at the end of training.

While GANs are commonly trained with momentum-based optimisers such as Adam to adapt the learning rate, **we will use a simple step decay**, quantified by the *decay_rate*.

Finally, all weights will be initialised from a **zero-centered Normal distribution with a standard deviation determined by the *Xavier* algorithm** [5]. It makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.

```python
class GAN:
    def __init__(self, numbers, epochs=100, batch_size=64, input_layer_size_g=100,
                 hidden_layer_size_g=128, hidden_layer_size_d=128, learning_rate=1e-3,
                 decay_rate=1e-4, image_size=28, display_epochs=5, create_gif=True):
        # -------- Initialise hyperparameters --------#
        self.numbers = numbers                # chosen numbers to be generated
        self.epochs = epochs                  # number of training iterations
        self.batch_size = batch_size          # number of training examples in each batch
        self.nx_g = input_layer_size_g        # number of neurons in the generator's input layer
        self.nh_g = hidden_layer_size_g       # number of neurons in the generator's hidden layer
        self.nh_d = hidden_layer_size_d       # number of neurons in the discriminator's hidden layer
        self.lr = learning_rate               # how much newly acquired info. overrides old info.
        self.dr = decay_rate                  # learning rate decay after every epoch
        self.image_size = image_size          # number of pixels per side of the training images
        self.display_epochs = display_epochs  # interval for displaying results
        self.create_gif = create_gif          # if True, a gif of sample images will be made

        self.image_dir = Path('./GAN_sample_images')  # new folder in current directory
        if not self.image_dir.is_dir():
            self.image_dir.mkdir()
        self.filenames = []  # stores filenames of sample images if create_gif is True

        # -------- Initialise weights with Xavier method --------#
        # -------- Generator --------#
        self.W0_g = np.random.randn(self.nx_g, self.nh_g) \
            * np.sqrt(2. / self.nx_g)                          # 100x128
        self.b0_g = np.zeros((1, self.nh_g))                   # 1x128
        self.W1_g = np.random.randn(self.nh_g, self.image_size ** 2) \
            * np.sqrt(2. / self.nh_g)                          # 128x784
        self.b1_g = np.zeros((1, self.image_size ** 2))        # 1x784
        # -------- Discriminator --------#
        self.W0_d = np.random.randn(self.image_size ** 2, self.nh_d) \
            * np.sqrt(2. / self.image_size ** 2)               # 784x128
        self.b0_d = np.zeros((1, self.nh_d))                   # 1x128
        self.W1_d = np.random.randn(self.nh_d, 1) \
            * np.sqrt(2. / self.nh_d)                          # 128x1
        self.b1_d = np.zeros((1, 1))                           # 1x1
```
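To see why the scaled initialisation matters, here is a small standalone check (the dimensions are illustrative) comparing the standard deviation of a layer's pre-activations with and without the √(2/n) factor used above:

```python
import numpy as np

np.random.seed(0)
n_in, n_out, batch = 784, 128, 64
x = np.random.randn(batch, n_in)  # unit-variance input signal

W_scaled = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)  # the factor used above
W_unscaled = np.random.randn(n_in, n_out)

print(np.std(x @ W_scaled))    # ~1.4: the signal stays in a reasonable range
print(np.std(x @ W_unscaled))  # ~28: the signal would explode layer after layer
```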

**Data Preprocessing**

Five pre-processing steps are applied to the training data:

**1.** Limiting it to the subset of digits selected by the user through the *numbers* list
**2.** Removing images that can’t be part of a full training batch
**3.** Flattening each image into an array of 784 values representing the pixel intensities
**4.** Scaling the images to the range of the tanh activation function, [-1,1]
**5.** Shuffling it to aid convergence

```python
def preprocess_data(self, x, y):
    x_train = []
    y_train = []
    # limit the data to a subset of digits from 0-9
    for i in range(y.shape[0]):
        if y[i] in self.numbers:
            x_train.append(x[i])
            y_train.append(y[i])
    x_train = np.array(x_train)
    y_train = np.array(y_train)
    # limit the data to full batches only
    num_batches = x_train.shape[0] // self.batch_size
    x_train = x_train[: num_batches * self.batch_size]
    y_train = y_train[: num_batches * self.batch_size]
    # flatten the images (_, 28, 28) -> (_, 784)
    x_train = np.reshape(x_train, (x_train.shape[0], -1))
    # scale the pixel values to the range [-1, 1]
    x_train = (x_train.astype(np.float32) - 127.5) / 127.5
    # shuffle the data
    idx = np.random.permutation(len(x_train))
    x_train, y_train = x_train[idx], y_train[idx]
    return x_train, y_train, num_batches

GAN.preprocess_data = preprocess_data
```

**Activation Functions**

Here, we will implement the activation functions used in forward propagation. Numpy’s *tanh* is used directly. **The leaky ReLU function (*lrelu*) acts exactly like the plain ReLU function when the alpha parameter is set to zero**.

```python
def lrelu(self, x, alpha=1e-2):
    return np.maximum(x, x * alpha)

GAN.lrelu = lrelu
```
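A quick usage example (using a standalone copy of the function above) confirms that setting alpha to zero recovers the plain ReLU:

```python
import numpy as np

def lrelu(x, alpha=1e-2):
    return np.maximum(x, x * alpha)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(lrelu(x))           # negative inputs are scaled by alpha: [-0.02, -0.005, 0.0, 1.5]
print(lrelu(x, alpha=0))  # identical to np.maximum(x, 0): [0.0, 0.0, 0.0, 1.5]
```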

```python
def sigmoid(self, x):
    return 1. / (1. + np.exp(-x))

GAN.sigmoid = sigmoid
```

As usual, the derivatives of the activation functions will be needed in backward propagation.

```python
def dlrelu(self, x, alpha=1e-2):
    dx = np.ones_like(x)
    dx[x < 0] = alpha
    return dx

GAN.dlrelu = dlrelu
```

```python
def dsigmoid(self, x):
    y = self.sigmoid(x)
    return y * (1. - y)

GAN.dsigmoid = dsigmoid
```

```python
def dtanh(self, x):
    return 1. - np.tanh(x) ** 2

GAN.dtanh = dtanh
```
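Analytic derivatives are easy to get subtly wrong, so here is an optional finite-difference check (with standalone copies of the functions above) that they match a central-difference approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    y = sigmoid(x)
    return y * (1.0 - y)

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 13)
eps = 1e-5

# central finite differences: (f(x + eps) - f(x - eps)) / (2 * eps)
fd_sigmoid = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
fd_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

print(np.allclose(dsigmoid(x), fd_sigmoid, atol=1e-8))  # True
print(np.allclose(dtanh(x), fd_tanh, atol=1e-8))        # True
```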

**Forward propagation**

Next, we will implement forward propagation for the generator and the discriminator. After the input layer, each layer applies an affine transformation, *z = aW + b*, followed by an activation function.

In the generator, the random noise, *z*, is propagated through the network to produce a batch of fake images, *a1_g.*

In the discriminator, a batch of images (real or fake), *x*, are propagated through the network to predict a classification (real or fake).

```python
def forward_generator(self, z):
    self.z0_g = np.dot(z, self.W0_g) + self.b0_g
    self.a0_g = self.lrelu(self.z0_g, alpha=0)         # alpha=0, i.e. plain ReLU
    self.z1_g = np.dot(self.a0_g, self.W1_g) + self.b1_g
    self.a1_g = np.tanh(self.z1_g)                     # range [-1, 1]
    return self.z1_g, self.a1_g

GAN.forward_generator = forward_generator
```

```python
def forward_discriminator(self, x):
    self.z0_d = np.dot(x, self.W0_d) + self.b0_d
    self.a0_d = self.lrelu(self.z0_d)                  # leaky ReLU
    self.z1_d = np.dot(self.a0_d, self.W1_d) + self.b1_d
    self.a1_d = self.sigmoid(self.z1_d)                # output probability in [0, 1]
    return self.z1_d, self.a1_d

GAN.forward_discriminator = forward_discriminator
```
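Before wiring these into training, it can help to trace the shapes end to end. This standalone sketch (random weights, the same hypothetical sizes as before, biases omitted for brevity) chains a generator pass into a discriminator pass:

```python
import numpy as np

np.random.seed(1)
batch, nx_g, nh_g, nh_d, img = 64, 100, 128, 128, 28

# generator: noise -> ReLU hidden layer -> tanh output
z = np.random.randn(batch, nx_g)
W0_g = np.random.randn(nx_g, nh_g) * np.sqrt(2. / nx_g)
W1_g = np.random.randn(nh_g, img ** 2) * np.sqrt(2. / nh_g)
a0_g = np.maximum(z @ W0_g, 0)
a1_g = np.tanh(a0_g @ W1_g)                  # fake images in [-1, 1]
print(a1_g.shape)                            # (64, 784)

# discriminator: image -> leaky-ReLU hidden layer -> sigmoid output
W0_d = np.random.randn(img ** 2, nh_d) * np.sqrt(2. / img ** 2)
W1_d = np.random.randn(nh_d, 1) * np.sqrt(2. / nh_d)
h = a1_g @ W0_d
a0_d = np.maximum(h, 0.01 * h)               # leaky ReLU
a1_d = 1.0 / (1.0 + np.exp(-(a0_d @ W1_d)))  # probability each image is real
print(a1_d.shape)                            # (64, 1)
```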

**Backward propagation**

The GAN setup is reminiscent of reinforcement learning, where the generator is receiving a reward signal from the discriminator letting it know whether the generated data is accurate or not. The key difference with GANs, however, is that **we can backward propagate gradient information from the discriminator to the generator,** so the generator knows how to adapt its parameters in order to produce output data that can mislead the discriminator.

We will start by backward propagating the gradients through the discriminator, and then through the generator. To do so, we need to pass the following information to *backward_discriminator*:

**1.** *x_real*: a batch of real images from the training data
**2.** *z1_real*: the discriminator’s logit outputs for the real images
**3.** *a1_real*: the discriminator’s output predictions for the real images
**4.** *x_fake*: a batch of fake images produced by the generator
**5.** *z1_fake*: the discriminator’s logit outputs for the fake images
**6.** *a1_fake*: the discriminator’s output predictions for the fake images

The gradients are derived by simply differentiating the loss function with respect to each parameter. I will not derive the gradients here as there are many tutorials online, for instance Andrew Ng’s video lectures. For an intuitive understanding of backward propagation, I recommend Andrej Karpathy’s blog.
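For reference, the first links of the chain rule used in the discriminator's backward pass (per example, with the small epsilon terms omitted) are:

$$\frac{\partial J_D}{\partial a_1^{real}} = -\frac{1}{a_1^{real}}, \qquad \frac{\partial J_D}{\partial a_1^{fake}} = \frac{1}{1 - a_1^{fake}}, \qquad \frac{\partial a_1}{\partial z_1} = \sigma(z_1)\,\big(1 - \sigma(z_1)\big)$$

where $J_D = -\log a_1^{real} - \log(1 - a_1^{fake})$. These correspond to *da1_real*, *da1_fake* and *dsigmoid* in the implementation; from there the gradients flow back through the affine layers in the usual way.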

```python
def backward_discriminator(self, x_real, z1_real, a1_real, x_fake, z1_fake, a1_fake):
    # -------- Backprop through Discriminator --------#
    # J_D = np.mean(-np.log(a1_real) - np.log(1 - a1_fake))

    # real input gradients: -np.log(a1_real)
    da1_real = -1. / (a1_real + 1e-8)             # 64x1
    dz1_real = da1_real * self.dsigmoid(z1_real)  # 64x1
    dW1_real = np.dot(self.a0_d.T, dz1_real)
    db1_real = np.sum(dz1_real, axis=0, keepdims=True)
    da0_real = np.dot(dz1_real, self.W1_d.T)
    dz0_real = da0_real * self.dlrelu(self.z0_d)
    dW0_real = np.dot(x_real.T, dz0_real)
    db0_real = np.sum(dz0_real, axis=0, keepdims=True)

    # fake input gradients: -np.log(1 - a1_fake)
    da1_fake = 1. / (1. - a1_fake + 1e-8)
    dz1_fake = da1_fake * self.dsigmoid(z1_fake)
    dW1_fake = np.dot(self.a0_d.T, dz1_fake)
    db1_fake = np.sum(dz1_fake, axis=0, keepdims=True)
    da0_fake = np.dot(dz1_fake, self.W1_d.T)
    dz0_fake = da0_fake * self.dlrelu(self.z0_d)  # same leaky slope as in the forward pass
    dW0_fake = np.dot(x_fake.T, dz0_fake)
    db0_fake = np.sum(dz0_fake, axis=0, keepdims=True)

    # -------- Combine gradients for real & fake images --------#
    dW1 = dW1_real + dW1_fake
    db1 = db1_real + db1_fake
    dW0 = dW0_real + dW0_fake
    db0 = db0_real + db0_fake

    # -------- Update gradients using SGD --------#
    self.W0_d -= self.lr * dW0
    self.b0_d -= self.lr * db0
    self.W1_d -= self.lr * dW1
    self.b1_d -= self.lr * db1

GAN.backward_discriminator = backward_discriminator
```

In *backward_generator*, we will calculate the gradients at the beginning and end of the discriminator but we won’t update the discriminator weights.

```python
def backward_generator(self, z, x_fake, z1_fake, a1_fake):
    # -------- Backprop through Discriminator (weights kept fixed) --------#
    # J_G = np.mean(-np.log(a1_fake))  (non-saturating criterion)
    da1_d = -1.0 / (a1_fake + 1e-8)  # 64x1
    dz1_d = da1_d * self.dsigmoid(z1_fake)
    da0_d = np.dot(dz1_d, self.W1_d.T)
    dz0_d = da0_d * self.dlrelu(self.z0_d)
    dx_d = np.dot(dz0_d, self.W0_d.T)

    # -------- Backprop through Generator --------#
    dz1_g = dx_d * self.dtanh(self.z1_g)
    dW1_g = np.dot(self.a0_g.T, dz1_g)
    db1_g = np.sum(dz1_g, axis=0, keepdims=True)
    da0_g = np.dot(dz1_g, self.W1_g.T)
    dz0_g = da0_g * self.dlrelu(self.z0_g, alpha=0)  # generator hidden layer uses plain ReLU
    dW0_g = np.dot(z.T, dz0_g)
    db0_g = np.sum(dz0_g, axis=0, keepdims=True)

    # -------- Update gradients using SGD --------#
    self.W0_g -= self.lr * dW0_g
    self.b0_g -= self.lr * db0_g
    self.W1_g -= self.lr * dW1_g
    self.b1_g -= self.lr * db1_g

GAN.backward_generator = backward_generator
```

**Sampling & GIF Generation**

*sample_images* will enable us to view digits from the generator’s distribution at the frequency defined by the user through the *display_epochs* hyperparameter. After each training epoch, we will generate a grid of sample images (without displaying it if the frequency criterion is not met) and save it in the *GAN_sample_images* folder in your current directory.

```python
def sample_images(self, images, epoch, show):
    images = np.reshape(images, (self.batch_size, self.image_size, self.image_size))

    fig = plt.figure(figsize=(4, 4))
    for i in range(16):
        plt.subplot(4, 4, i + 1)
        plt.imshow(images[i] * 127.5 + 127.5, cmap='gray')
        plt.axis('off')

    # save the generated images in the GAN_sample_images folder
    if self.create_gif:
        current_epoch_filename = self.image_dir.joinpath(f"GAN_epoch{epoch}.png")
        self.filenames.append(current_epoch_filename)
        plt.savefig(current_epoch_filename)

    if show:
        plt.show()
    else:
        plt.close()

GAN.sample_images = sample_images
```

At the end of training, we will generate a gif from the generator’s sample images if *create_gif* is initialised to True. This can be achieved in a few lines of code with *imageio*, which can read from filenames, file objects, http, zipfiles and bytes.

```python
def generate_gif(self):
    images = []
    for filename in self.filenames:
        images.append(imageio.imread(filename))
    imageio.mimsave("GAN.gif", images)

GAN.generate_gif = generate_gif
```

**Training**

Finally, we will define a function to train the model. *train* takes as input raw images and labels and outputs the loss of the Generator and Discriminator at each training step.

In order to speed up training, we will train our data in batches. The number of batches (*num_batches*) is determined by the total number of training images divided by the user-defined *batch_size*.
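For instance (with illustrative numbers: suppose 6,131 training images of the chosen digit and the default batch size of 64):

```python
# Illustrative numbers: 6,131 images of the chosen digit, batch size 64
num_images, batch_size = 6131, 64

num_batches = num_images // batch_size        # 95 full batches
images_used = num_batches * batch_size        # 6080 images kept
print(num_batches, num_images - images_used)  # 95 full batches, 51 images dropped
```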

```python
def train(self, x, y):
    J_Ds = []  # stores the discriminator losses
    J_Gs = []  # stores the generator losses

    # preprocess the input; note that the labels aren't needed
    x_train, _, num_batches = self.preprocess_data(x, y)

    for epoch in range(self.epochs):
        for i in range(num_batches):
            # ------- PREPARE INPUT BATCHES & NOISE -------#
            x_real = x_train[i * self.batch_size: (i + 1) * self.batch_size]  # 64x784
            z = np.random.normal(0, 1, size=[self.batch_size, self.nx_g])     # 64x100

            # ------- FORWARD PROPAGATION -------#
            z1_g, x_fake = self.forward_generator(z)
            z1_d_real, a1_d_real = self.forward_discriminator(x_real)
            z1_d_fake, a1_d_fake = self.forward_discriminator(x_fake)

            # ------- CROSS-ENTROPY LOSS -------#
            # ver1 : max  log(D(x)) + log(1 - D(G(z)))  (in the original paper)
            # ver2 : min -log(D(x)) - log(1 - D(G(z)))  (implemented here)
            J_D = np.mean(-np.log(a1_d_real) - np.log(1 - a1_d_fake))
            J_Ds.append(J_D)

            # ver1 : minimise  log(1 - D(G(z)))  (in the original paper)
            # ver2 : maximise  log(D(G(z)))
            # ver3 : minimise -log(D(G(z)))      (implemented here)
            J_G = np.mean(-np.log(a1_d_fake))
            J_Gs.append(J_G)

            # ------- BACKWARD PROPAGATION -------#
            self.backward_discriminator(x_real, z1_d_real, a1_d_real,
                                        x_fake, z1_d_fake, a1_d_fake)
            self.backward_generator(z, x_fake, z1_d_fake, a1_d_fake)

        if epoch % self.display_epochs == 0:
            print(f"Epoch:{epoch}|G loss:{J_G:.4f}|D loss:{J_D:.4f}"
                  f"|D(G(z))avg:{np.mean(a1_d_fake):.4f}"
                  f"|D(x)avg:{np.mean(a1_d_real):.4f}|LR:{self.lr:.6f}")
            self.sample_images(x_fake, epoch, show=True)   # display sample images
        else:
            self.sample_images(x_fake, epoch, show=False)

        # reduce the learning rate after every epoch
        self.lr = self.lr * (1.0 / (1.0 + self.dr * epoch))

    # generate a gif
    if self.create_gif:
        self.generate_gif()

    return J_Ds, J_Gs

GAN.train = train
```

We can now train our GAN by alternating the training of the discriminator and the generator. As discussed earlier, to get quick results, I recommend running the model for one digit only, which is defined in the *numbers* list.

```python
numbers = [3]
model = GAN(numbers, learning_rate=1e-3, decay_rate=1e-4, epochs=100)
J_Ds, J_Gs = model.train(x_train, y_train)
```

The next figure visualises the loss of the discriminator and the generator at each training step. As training progresses, the generator’s error drops, implying that the images it generates get better and better. While the generator improves, the discriminator’s error rises, because the synthetic images are becoming more realistic each time.

```python
plt.plot(range(len(J_Ds)), J_Ds)
plt.plot(range(len(J_Gs)), J_Gs)
plt.xlabel("# training steps")
plt.ylabel("training cost")
plt.legend(['Discriminator', 'Generator'])
plt.show()
```

**Final thoughts**

GANs have gained a reputation for being difficult to optimise. Without the right network architecture, hyperparameters, and training procedure, the discriminator can overpower the generator, or vice versa. You can experience this yourself by trying to optimise the GAN implemented in this tutorial for all digits (0-9). The two most common failure modes are:

**1. The generator overpowers the discriminator (mode collapse).** The generator can collapse to a parameter setting where it always emits the same point that the discriminator believes is highly realistic. You can recognise mode collapse in your GAN if it generates very similar images. Mode collapse can sometimes be corrected by strengthening the discriminator in some way, for instance by adjusting its learning rate or by reconfiguring its layers.

**2. The discriminator overpowers the generator,** classifying generated images as fake with absolute certainty. When the discriminator responds with absolute certainty, it leaves no gradient for the generator to descend.

Practitioners have amassed many strategies to mitigate these instabilities and improve the performance of GANs [5, 6]. A summary of key strategies can be found on this GitHub repository. These should be regarded as techniques that are worth trying out, not as best practices. As implementing and testing these techniques with Numpy would be extremely time-consuming, I recommend using a deep learning library like TensorFlow. You can find my improved version of a GAN, implemented with TensorFlow 2.0, here.

**References**

**[1]** Wang Kunfeng, Gou Chao, Duan Yanjie, Lin Yilun, Zheng Xinhu and Wang Fei-Yue. (2017). “Generative Adversarial Networks: Introduction and Outlook”.
**[2]** Radford Alec, Metz Luke and Chintala Soumith. (2015). “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”.
**[3]** Maas Andrew L., Hannun Awni Y. and Ng Andrew Y. (2013). “Rectifier nonlinearities improve neural network acoustic models”.
**[4]** Xu Bing, Wang Naiyan, Chen Tianqi and Li Mu. (2015). “Empirical evaluation of rectified activations in convolutional network”.
**[5]** Glorot Xavier and Bengio Yoshua. (2010). “Understanding the difficulty of training deep feedforward neural networks”.
**[6]** Salimans Tim, Goodfellow Ian, Zaremba Wojciech, Cheung Vicki, Radford Alec and Chen Xi. (2016). “Improved techniques for training GANs”.