Generative Adversarial Networks (GANs) have achieved tremendous success in generating high-quality synthetic images, efficiently internalising the essence of the images they learn from. Their potential is enormous, as they can learn to mimic any distribution of data.
In order to keep up with the latest advancements, I decided to explore their theoretical underpinnings by implementing a simple GAN in Python using Numpy. In this post, I will go through the implementation steps based on Ian Goodfellow’s Generative Adversarial Nets paper.
The full code is available on GitHub.
We will implement a GAN that generates handwritten digits. The basic principle of GANs is inspired by the two-player zero-sum game, in which the total gains of the two players are zero, and each player's gain or loss of utility is exactly balanced by the loss or gain of utility of the other player [1]. It comprises two models:
1. Generator: learns to generate new images, which have a similar data distribution to the real dataset. Crucially, it has no direct access to the real images; it only learns through its interaction with the discriminator.
2. Discriminator: learns to distinguish candidates produced by the generator from the real data distribution. It outputs the probability that an input image is from the real data distribution rather than the generator distribution.
Although both models are typically implemented as convolutional neural nets, they could be any form of differentiable system that maps data from one space to another. This implementation uses multilayer perceptrons, as they are computationally cheaper and much easier to code from scratch.
The generator, G, is fed random noise, z, drawn from a normal distribution with zero mean and a standard deviation of 1. As we will train both the generator and discriminator using mini-batch gradient descent, the input noise will be a numpy array of size [batch size, input layer size].
The output of the generator will be a batch of flattened images of size [batch size, image dimension**2]. The image dimension corresponds to the width (in pixels) of the training images; for MNIST, each image has size 28×28 pixels, so each flattened image has 784 values.
The discriminator, D, will be fed a batch of real images from the MNIST dataset and a batch of fake images from the generator. It will output the probability that each input image is real rather than fake.
The original paper suggests training the discriminator for k steps before training the generator for one step. We will choose the least computationally expensive solution, k=1, therefore training the discriminator and generator equally.
Practitioners have been experimenting with the more sophisticated Deep Convolutional GANs to determine the optimal activation functions for the hidden and output layers [2]. I have found their recommendations to be very effective for a simple GAN as well.
In the generator network, it is recommended to use a ReLU activation in the hidden layers and a tanh activation in the output layer. It was observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the colour space of the training distribution [2]; tanh also squeezes pixels that would appear grey toward either black or white, resulting in crisper images. In the discriminator network, leaky ReLU was found to work well in the hidden layers [3, 4], as it mitigates the vanishing gradient problem. At the output layer, a sigmoid activation is commonly used, so the discriminator outputs a probability. This is in contrast to the original GAN paper, which used the maxout activation in the discriminator.
Training of GANs involves balancing two conflicting objectives:
1. Training D to maximise the probability of assigning the correct label to both the training examples and the samples from the generator, G(z). The discriminator therefore wants to maximise:

$$\mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

where $p_{data}$ is the training data distribution and $p_z$ the noise prior,

which is equivalent to minimising:

$$-\mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] - \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
This is just the standard cross-entropy cost that is minimised when training a binary classifier with a sigmoid output. The only difference is that the classifier is trained on two mini-batches of data; one coming from the dataset, where the label is 1 for all examples, and one coming from the generator, where the label is 0 for all examples.
2. Training G to minimise the probability that the discriminator correctly classifies the generated images as fake. In other words, G is trying to maximally confuse the discriminator. It tries to minimise:

$$\mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
Typically, an alternative, non-saturating training criterion is used for the generator, which maximises the probability of the fake images being classified as real. The generator then minimises:

$$-\mathbb{E}_{z \sim p_z(z)}\left[\log D(G(z))\right]$$
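In NumPy, both objectives reduce to cross-entropy losses over the discriminator's output probabilities. A minimal sketch (the function and argument names a1_real/a1_fake are my own, chosen to match the activations used later in the post):

```python
import numpy as np

def discriminator_loss(a1_real, a1_fake, eps=1e-8):
    # Cross-entropy with real images labelled 1 and fake images labelled 0.
    # eps guards against log(0).
    return -np.mean(np.log(a1_real + eps)) - np.mean(np.log(1 - a1_fake + eps))

def generator_loss(a1_fake, eps=1e-8):
    # Non-saturating criterion: maximise log D(G(z)), i.e. minimise -log D(G(z)).
    return -np.mean(np.log(a1_fake + eps))
```

When the discriminator is maximally confused (outputting 0.5 everywhere), the discriminator loss equals 2·log 2 ≈ 1.386 and the generator loss log 2 ≈ 0.693.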
Let’s start by importing numpy, matplotlib.pyplot and other useful libraries.
keras.datasets is imported to access the MNIST dataset, imageio to generate a gif from sample images at each training iteration, and Path to define the location where the sample images used to build the gif will be exported.
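The imports described above might look like this (assuming Keras is installed standalone; with TensorFlow you would import from tensorflow.keras.datasets instead):

```python
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

import imageio                    # stitches the sample images into a gif
from keras.datasets import mnist  # gives access to the MNIST dataset
```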
We need a set of real handwritten digits to give the discriminator a starting point in distinguishing between real and fake images. We’ll use MNIST, a benchmark dataset in deep learning. It consists of 70k images of handwritten digits compiled by the U.S. National Institute of Standards and Technology from Census Bureau employees and high school students.
As we will only use the train data, the test data (10k images) will be ignored.
y_train.shape  # (60000,)
x_train.shape  # (60000, 28, 28)
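Loading the data and checking the shapes above takes a couple of lines (note that mnist.load_data downloads the dataset on first use):

```python
from keras.datasets import mnist

# Load MNIST; we only use the 60k training images and ignore the 10k test images.
(x_train, y_train), (_, _) = mnist.load_data()

print(y_train.shape)  # (60000,)
print(x_train.shape)  # (60000, 28, 28)
```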
We will wrap all functions in the GAN class.
It takes a long time to train a GAN properly. On a single GPU, a GAN might take hours, and on a single CPU more than a day. In addition, GANs are difficult to optimise. For these reasons, I recommend trying to generate one digit at a time, by limiting the training data from the digits 0-9 to the digit specified in the numbers list.
We will use a mini-batch size of 64 (batch_size). The input layer of the discriminator is determined by the size of the training images [batch_size, image dimension **2], which needs to match the output of the generator, i.e. the fake images. The number of neurons at the input layer of the generator (input_layer_size_g) as well as the hidden layers of both models (hidden_layer_size_g, hidden_layer_size_d) need to be defined by us.
Next, to visualise training performance, we can generate a gif of sample images. If create_gif is enabled, a grid of sample images will be saved in your local directory by default and their filename will be stored in the filenames list to enable sourcing the images for stitching at the end of training.
While GANs are commonly trained with optimisers that adapt the learning rate, such as momentum-based methods or Adam, we will use a simple step decay quantified by the decay_rate.
Finally, all weights will be initialised from a zero-centred normal distribution with a standard deviation determined by the Xavier initialisation scheme [5]. It makes sure the weights are 'just right', keeping the signal in a reasonable range of values through many layers.
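Putting the hyperparameters together, the constructor of the GAN class might look like the following sketch. The attribute names mirror the ones mentioned in the text; everything else (layer sizes, the exact Xavier variant using std = 1/sqrt(fan_in)) is an assumption, not the author's exact code:

```python
import numpy as np

class GAN:
    def __init__(self, numbers, epochs=100, batch_size=64,
                 input_layer_size_g=100, hidden_layer_size_g=128,
                 hidden_layer_size_d=128, learning_rate=1e-3,
                 decay_rate=1e-4, image_size=28, create_gif=True):
        self.numbers = numbers            # digits to train on, e.g. [3]
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate      # step decay applied per epoch
        self.image_size = image_size
        self.create_gif = create_gif
        self.filenames = []               # sample-image files for the gif

        # Xavier initialisation: zero-mean normal, std = 1/sqrt(fan_in)
        def xavier(fan_in, fan_out):
            return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

        # Generator: noise -> hidden -> flattened image
        self.W0_g = xavier(input_layer_size_g, hidden_layer_size_g)
        self.b0_g = np.zeros(hidden_layer_size_g)
        self.W1_g = xavier(hidden_layer_size_g, image_size ** 2)
        self.b1_g = np.zeros(image_size ** 2)

        # Discriminator: flattened image -> hidden -> scalar probability
        self.W0_d = xavier(image_size ** 2, hidden_layer_size_d)
        self.b0_d = np.zeros(hidden_layer_size_d)
        self.W1_d = xavier(hidden_layer_size_d, 1)
        self.b1_d = np.zeros(1)
```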
Five pre-processing steps were applied to the training data:
1. Limiting it to the subset of digits selected by the user through the numbers list
2. Removing images that can’t be part of a full training batch
3. Flattening each image into an array of 784 values representing the pixels' intensities
4. Scaling the images to the range of the tanh activation function, [-1, 1]
5. Shuffling it to enable convergence
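The five steps above can be sketched as a standalone function (a hypothetical helper, not the author's exact implementation):

```python
import numpy as np

def preprocess(x, y, numbers, batch_size=64):
    # 1. Keep only the digits selected by the user.
    keep = np.isin(y, numbers)
    x = x[keep]
    # 2. Drop trailing images that would leave an incomplete final batch.
    x = x[: (x.shape[0] // batch_size) * batch_size]
    # 3. Flatten each 28x28 image into a 784-value vector.
    x = x.reshape(x.shape[0], -1)
    # 4. Scale pixel intensities from [0, 255] to tanh's range [-1, 1].
    x = (x.astype(np.float64) - 127.5) / 127.5
    # 5. Shuffle to help convergence.
    np.random.shuffle(x)
    return x
```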
Here, we will implement the activation functions that will be used in forward propagation. Numpy’s tanh is used directly. The leaky ReLU function (lrelu) effectively acts as the relu function when the alpha parameter is set to zero.
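The three activations can be written in a few lines (the names tanh, lrelu and sigmoid are assumptions matching the post's conventions):

```python
import numpy as np

def tanh(z):
    # Numpy's tanh is used directly.
    return np.tanh(z)

def lrelu(z, alpha=1e-2):
    # Leaky ReLU; with alpha=0 it reduces to the ordinary ReLU.
    return np.where(z >= 0, z, alpha * z)

def sigmoid(z):
    # Squashes logits into (0, 1) probabilities.
    return 1.0 / (1.0 + np.exp(-z))
```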
As usual, the derivatives of the activation functions will be needed in backward propagation.
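The corresponding derivatives, needed for the chain rule in backward propagation (again with assumed helper names):

```python
import numpy as np

def dtanh(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def dlrelu(z, alpha=1e-2):
    # Slope 1 for non-negative inputs, alpha otherwise.
    return np.where(z >= 0, 1.0, alpha)

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    return s * (1.0 - s)
```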
Next, we will implement forward propagation for the generator and discriminator networks. After the input layer, each layer applies an affine transformation, z = aW + b, followed by an activation function, a = f(z).
In the generator, the random noise, z, is propagated through the network to produce a batch of fake images, a1_g.
In the discriminator, a batch of images (real or fake), x, is propagated through the network to predict whether each image is real or fake.
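A self-contained sketch of both forward passes, returning the intermediate values needed later for backward propagation (the z0/a0/z1/a1 naming follows the post; passing weights as arguments rather than class attributes is my simplification):

```python
import numpy as np

def lrelu(z, alpha=1e-2):
    # Repeated here for self-containment.
    return np.where(z >= 0, z, alpha * z)

def forward_generator(z, W0_g, b0_g, W1_g, b1_g):
    # Hidden layer: affine transform followed by ReLU (leaky ReLU with alpha=0).
    z0_g = z @ W0_g + b0_g
    a0_g = lrelu(z0_g, alpha=0)
    # Output layer: tanh keeps the fake images' pixels in [-1, 1].
    z1_g = a0_g @ W1_g + b1_g
    a1_g = np.tanh(z1_g)          # a1_g is the batch of fake images
    return z0_g, a0_g, z1_g, a1_g

def forward_discriminator(x, W0_d, b0_d, W1_d, b1_d):
    # Hidden layer with leaky ReLU.
    z0_d = x @ W0_d + b0_d
    a0_d = lrelu(z0_d)
    # Output layer: sigmoid turns the logit into a probability of "real".
    z1_d = a0_d @ W1_d + b1_d
    a1_d = 1.0 / (1.0 + np.exp(-z1_d))
    return z0_d, a0_d, z1_d, a1_d
```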
The GAN setup is reminiscent of reinforcement learning, where the generator receives a reward signal from the discriminator letting it know whether the generated data looks realistic or not. The key difference with GANs, however, is that we can backward propagate gradient information from the discriminator to the generator, so the generator knows how to adapt its parameters in order to produce output data that can mislead the discriminator.
We will start by backward propagating the real image gradients through the discriminator and then the fake image gradients through the generator. To do so, we need to pass the following information to backward_discriminator:
1. x_real: a batch of real images from the training data
2. z1_real: the logit output from the discriminator for the real images
3. a1_real: the discriminator’s output predictions for the real images
4. x_fake: a batch with fake images produced by the generator
5. z1_fake: the logit output from the discriminator for the fake images
6. a1_fake: the discriminator’s output predictions for the fake images
The gradients are derived by simply differentiating the loss function with respect to each parameter. I will not derive the gradients here as there are many tutorials online, for instance Andrew Ng’s video lectures. For an intuitive understanding of backward propagation, I recommend Andrej Karpathy’s blog.
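For completeness, here is one way backward_discriminator could look. The gradients follow from differentiating the cross-entropy loss: with a sigmoid output, the gradient at the logit is -(1 - a) for real images and a for fake images. Note that this sketch also passes in the hidden-layer values z0/a0 (which the class version would hold as attributes), and the parameter packing is my own assumption:

```python
import numpy as np

def dlrelu(z, alpha=1e-2):
    return np.where(z >= 0, 1.0, alpha)

def backward_discriminator(x_real, z0_real, a0_real, z1_real, a1_real,
                           x_fake, z0_fake, a0_fake, z1_fake, a1_fake,
                           params, lr=1e-3):
    W0_d, b0_d, W1_d, b1_d = params
    m = x_real.shape[0]

    # Real branch: loss term -log D(x) gives dL/dz1 = -(1 - a1_real) / m
    dz1_real = -(1.0 - a1_real) / m
    dW1_real = a0_real.T @ dz1_real
    db1_real = dz1_real.sum(axis=0)
    dz0_real = (dz1_real @ W1_d.T) * dlrelu(z0_real)
    dW0_real = x_real.T @ dz0_real
    db0_real = dz0_real.sum(axis=0)

    # Fake branch: loss term -log(1 - D(G(z))) gives dL/dz1 = a1_fake / m
    dz1_fake = a1_fake / m
    dW1_fake = a0_fake.T @ dz1_fake
    db1_fake = dz1_fake.sum(axis=0)
    dz0_fake = (dz1_fake @ W1_d.T) * dlrelu(z0_fake)
    dW0_fake = x_fake.T @ dz0_fake
    db0_fake = dz0_fake.sum(axis=0)

    # Gradient-descent update on the summed real + fake gradients.
    W1_d -= lr * (dW1_real + dW1_fake)
    b1_d -= lr * (db1_real + db1_fake)
    W0_d -= lr * (dW0_real + dW0_fake)
    b0_d -= lr * (db0_real + db0_fake)
    return W0_d, b0_d, W1_d, b1_d
```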
In backward_generator, we will calculate the gradients at the beginning and end of the discriminator but we won’t update the discriminator weights.
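A matching sketch of backward_generator: the non-saturating loss -log D(G(z)) is differentiated at the discriminator's logit, propagated back through the (frozen) discriminator to get the gradient with respect to the fake images, and then through the generator, whose weights alone are updated. The signature and parameter packing are again my own assumptions:

```python
import numpy as np

def dlrelu(z, alpha=1e-2):
    return np.where(z >= 0, 1.0, alpha)

def backward_generator(z, z0_g, a0_g, x_fake, z0_d, a1_fake,
                       d_params, g_params, lr=1e-3):
    W0_d, b0_d, W1_d, b1_d = d_params   # read-only: never updated here
    W0_g, b0_g, W1_g, b1_g = g_params
    m = z.shape[0]

    # Non-saturating loss -log D(G(z)): gradient at the discriminator logit.
    dz1_d = -(1.0 - a1_fake) / m
    # Propagate through the discriminator WITHOUT updating its weights.
    dz0_d = (dz1_d @ W1_d.T) * dlrelu(z0_d)
    dx_fake = dz0_d @ W0_d.T

    # Continue through the generator's tanh output layer (x_fake = tanh(z1_g)).
    dz1_g = dx_fake * (1.0 - x_fake ** 2)
    dW1_g = a0_g.T @ dz1_g
    db1_g = dz1_g.sum(axis=0)
    # ReLU hidden layer (leaky ReLU with alpha=0).
    dz0_g = (dz1_g @ W1_g.T) * dlrelu(z0_g, alpha=0.0)
    dW0_g = z.T @ dz0_g
    db0_g = dz0_g.sum(axis=0)

    # Update only the generator.
    W1_g -= lr * dW1_g
    b1_g -= lr * db1_g
    W0_g -= lr * dW0_g
    b0_g -= lr * db0_g
    return W0_g, b0_g, W1_g, b1_g
```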
Sampling & GIF generation
sample_images will enable us to view digits from the generator's distribution at the frequency defined by the user through the display_epoch hyperparameter. After each training batch, we will generate a grid of sample images (but not display it if the frequency criterion is not met) and save it in the GAN_sample_images folder in your current directory.
At the end of training, we will generate a gif from the sample images of the generator if create_gif is initialised to True. This can be achieved with imageio in a few lines of code, which can read from filenames, file objects, http, zipfiles and bytes.
Finally, we will define a function to train the model. train takes as input raw images and labels and outputs the loss of the Generator and Discriminator at each training step.
In order to speed up training, we will train our data in batches. The number of batches (num_batches) is determined by the total number of training images divided by the user-defined batch_size.
We can now train our GAN by alternating the training of the discriminator and the generator. As discussed earlier, to get quick results, I recommend running the model for one digit only, which is defined in the numbers list.
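To make the overall training structure concrete, here is a heavily simplified, self-contained version of the loop: batching, step decay of the learning rate, the discriminator update (k=1), and the non-saturating generator update. For brevity the hidden layers are dropped (single-layer G and D), so this is an illustration of the alternation, not the full model from the post:

```python
import numpy as np

def train(x_train, epochs=3, batch_size=64, z_dim=100, lr=1e-3, decay_rate=1e-4):
    rng = np.random.default_rng(0)
    n_pixels = x_train.shape[1]
    # Single-layer stand-ins for the generator and discriminator.
    Wg = rng.standard_normal((z_dim, n_pixels)) * np.sqrt(1.0 / z_dim)
    Wd = rng.standard_normal((n_pixels, 1)) * np.sqrt(1.0 / n_pixels)

    num_batches = x_train.shape[0] // batch_size
    d_losses, g_losses = [], []
    for epoch in range(epochs):
        lr_t = lr / (1.0 + decay_rate * epoch)          # simple step decay
        for i in range(num_batches):
            x_real = x_train[i * batch_size:(i + 1) * batch_size]
            z = rng.standard_normal((batch_size, z_dim))
            x_fake = np.tanh(z @ Wg)

            # --- discriminator step (k = 1) ---
            a_real = 1 / (1 + np.exp(-(x_real @ Wd)))
            a_fake = 1 / (1 + np.exp(-(x_fake @ Wd)))
            d_loss = (-np.mean(np.log(a_real + 1e-8))
                      - np.mean(np.log(1 - a_fake + 1e-8)))
            grad_Wd = (x_real.T @ (a_real - 1) + x_fake.T @ a_fake) / batch_size
            Wd -= lr_t * grad_Wd

            # --- generator step (non-saturating loss) ---
            a_fake = 1 / (1 + np.exp(-(x_fake @ Wd)))
            g_loss = -np.mean(np.log(a_fake + 1e-8))
            dz1_d = -(1 - a_fake) / batch_size          # gradient at D's logit
            dx_fake = dz1_d @ Wd.T                      # through D, no update
            grad_Wg = z.T @ (dx_fake * (1 - x_fake ** 2))
            Wg -= lr_t * grad_Wg

            d_losses.append(d_loss)
            g_losses.append(g_loss)
    return d_losses, g_losses
```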
The next figure visualises the loss of the discriminator and generator at each training step. As training progresses, the generator's error decreases, implying that the images it generates get better and better. While the generator improves, the discriminator's error increases, because the synthetic images become more realistic each time.
GANs have gained a reputation for being difficult to optimise. Without the right network architecture, hyperparameters, and training procedure, the discriminator can overpower the generator, or vice-versa. You can experience this yourself by trying to optimise the GAN implemented in this tutorial for all digits (0-9). The two most common failure modes are:
1. The generator overpowers the discriminator (mode collapse). The generator can collapse to a parameter setting where it always emits the same point that the discriminator believes is highly realistic. You can recognise mode collapse in your GAN if it generates very similar images. Mode collapse can sometimes be corrected by strengthening the discriminator in some way—for instance, by adjusting its learning rate or by reconfiguring its layers.
2. The discriminator overpowers the generator, classifying generated images as fake with absolute certainty. When the discriminator responds with absolute certainty, it leaves no gradient for the generator to descend.
Practitioners have amassed many strategies to mitigate these instabilities and improve the performance of GANs [5, 6]. A summary of key strategies can be found in this GitHub repository. These should be regarded as techniques worth trying out, not as best practices. As implementing and testing these techniques with Numpy would be extremely time-consuming, I recommend using a deep learning library like TensorFlow. You can find my improved version of a GAN, implemented with TensorFlow 2.0, here.
[1] Wang, Kunfeng; Gou, Chao; Duan, Yanjie; Lin, Yilun; Zheng, Xinhu; Wang, Fei-Yue. (2017). "Generative Adversarial Networks: Introduction and Outlook".
[2] Radford, Alec; Metz, Luke; Chintala, Soumith. (2015). "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks".
[3] Maas, Andrew L.; Hannun, Awni Y.; Ng, Andrew Y. (2013). "Rectifier Nonlinearities Improve Neural Network Acoustic Models".
[4] Xu, Bing; Wang, Naiyan; Chen, Tianqi; Li, Mu. (2015). "Empirical Evaluation of Rectified Activations in Convolutional Network".
[5] Glorot, Xavier; Bengio, Yoshua. (2010). "Understanding the Difficulty of Training Deep Feedforward Neural Networks".
[6] Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi. (2016). "Improved Techniques for Training GANs".