Variational autoencoder for Lego faces
I spent some of my time off this winter reading David Foster's excellent book Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play. The book is about generative models: models that learn to describe how data sets are structured and that can be used to create new data. Some of the specific models it covers include autoencoders, generative adversarial networks (GANs), and recurrent neural networks (RNNs).
I decided to try writing one of the models described in the book on my own, and I went with a variational autoencoder (VAE) because they train relatively quickly. That meant I wouldn't have to wait long to find out whether I had built the model correctly or mangled it. I also wanted to do something kind of novel with the model, so I compiled a new data set for it to train on: 3,800 images of Lego minifig faces. The end product can be used for interesting applications, including generating new Lego faces and morphing smoothly between Lego faces.
This post will cover the following topics:
- Background information about VAEs
- Implementation details (training set, model details)
- Model evaluation (image reconstruction, visualizing the latent space)
- The fun stuff (generating new images, image morphing)
All of the data and code I used to make the model, plots, and animations in this post are publicly available in the project repo.
Background: the high-level, hand-wavy explanation of how a variational autoencoder works
This section covers what a VAE is and what it does at a high level. If you already know how a VAE works, you can safely skip this section.
An autoencoder (AE) has two main parts: an encoder and a decoder. The encoder takes high-dimensional input, an image in this case, and maps it to a point in a low-dimensional space. The point is called a latent vector, and the low-dimensional space is called the latent space. Ideally, each dimension in the latent space contains some kind of information. You could imagine that for the case of images of Lego faces, one dimension might indicate face color while another might indicate eye size, and so on. The following shows an example of how an AE's encoder might map images to a latent space:
The decoder, on the other hand, takes a latent vector from the latent space as input and expands it back to the form of the original input (in this case, an image).
During training, the encoder and decoder are connected, and their weights are optimized to minimize the difference between each training image that enters the encoder and each reconstructed image that exits the decoder. The following diagram shows the process of an image being compressed into a latent vector by the encoder and then being reconstructed by the decoder:
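To make "the difference between each training image and each reconstructed image" concrete, here is a minimal sketch of one common way to measure it: mean squared error over pixels. The post doesn't say which reconstruction loss the model uses, so treat the choice of MSE, the function name, and the toy pixel values as assumptions for illustration. It's written in plain Python so it runs without any dependencies.

```python
def reconstruction_loss(original, reconstructed):
    """Mean squared error between two images given as flat lists of pixel values."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    squared_errors = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared_errors) / len(original)


# Toy 4-pixel "images" with values in [0, 1] (made up for this example).
original = [0.0, 0.5, 1.0, 0.25]
perfect = list(original)            # a reconstruction identical to the input
blurry = [0.1, 0.4, 0.9, 0.35]      # every pixel is off by 0.1

print(reconstruction_loss(original, perfect))  # 0.0
print(reconstruction_loss(original, blurry))   # roughly 0.01
```

Training nudges the encoder's and decoder's weights so that this number, averaged over the training images, gets smaller.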
A vanilla AE's encoder is connected directly and deterministically to the latent vector. Variational autoencoders (VAEs) change things up by having their encoders define a distribution from which the latent vector is randomly drawn. When a VAE trains, it not only minimizes the difference between input images and reconstructions, but also minimizes the divergence between the encoder's distribution and a standard normal distribution.
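The two ideas in that paragraph can be sketched in a few lines. This assumes the usual formulation, in which the encoder outputs a mean (`mu`) and a log-variance (`log_var`) for each latent dimension; the post doesn't spell this out, so the function names and example numbers below are illustrative rather than taken from the project repo.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * epsilon, with epsilon ~ N(0, 1) for each dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to the standard normal N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)          # seeded so the draw is repeatable
mu = [0.5, -1.0]                # hypothetical encoder output for one image
log_var = [0.0, -2.0]

z = sample_latent(mu, log_var, rng)
print(z)                                              # a random point near mu
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(kl_to_standard_normal(mu, log_var))             # positive penalty
```

The divergence is zero only when the encoder's distribution already matches the standard normal, so adding it to the reconstruction loss pulls every image's distribution toward the same region of the latent space.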
The upshot of this added randomness is that when two nearby vectors are put into a VAE's decoder, the reconstructed images are also similar. No such assurance exists for a vanilla AE. This property gives you the ability to do some fun stuff with a VAE that I'll get into later.
One last important note before you move on is that the meaning of the latent space's dimensions varies from model to model. There is no canonical set of image features that's encoded into the latent space (for example, the first dimension doesn't always control color). Instead, the model is free to choose which features to encode and how to encode them so long as it improves its performance. By optimizing in this way, the model gains a deep understanding of the underlying structure of the data. With this understanding, the model can take on complex tasks, including generating novel images it hasn't encountered before.
If you want a bit more detail, check out Kevin Frans' blog post. If you're looking for a more in-depth discussion of the theory and math behind VAEs, Tutorial on Variational Autoencoders by Carl Doersch is quite thorough.
Implementation details
This section covers the specifics of the VAE model I trained on images of Lego faces. It describes how I obtained and curated the training set, and it discusses the model's structure and how I trained it.
The training set
I got images of Lego faces from two sources:
- Bricklink, a marketplace for Lego parts and sets. I made a simple scraper to get the URL of the largest image for each Lego head on the site and to save that image locally.
- Christoph Bartneck, Ph.D., a researcher and Lego fan. He's taken thousands of photos of Lego minifig faces, and they're all excellent. If you're interested in Lego minifigures, you should check out his books. I learned of his work by way of an article by Daniel Wolfe about the emotions of Lego faces.
Once I had all the images in one place, I wrote some quick scripts to make them uniform. For example, several hundred of the Bricklink images contained two or three photos side-by-side, so I made a script to identify these and split them into individual photos. The scripts I used to pull and process the images are in the dataset_scripts directory in my project repo.
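As a rough illustration of the splitting step, here is a minimal sketch that cuts a wide image into equal-width photos. It is not the script from the dataset_scripts directory: the function names are hypothetical, the image is a plain list of pixel rows, and guessing the photo count from the aspect ratio assumes each individual photo is roughly square.

```python
def guess_photo_count(image):
    """Guess how many square photos sit side by side, from the aspect ratio."""
    height, width = len(image), len(image[0])
    return max(1, round(width / height))


def split_side_by_side(image, parts):
    """Split an image (a list of pixel rows) into `parts` equal-width images."""
    width = len(image[0])
    if width % parts != 0:
        raise ValueError("image width must divide evenly into parts")
    step = width // parts
    return [[row[i * step:(i + 1) * step] for row in image]
            for i in range(parts)]


# A toy 2x6 "image" holding three 2x2 photos side by side.
combined = [
    [1, 1, 2, 2, 3, 3],
    [1, 1, 2, 2, 3, 3],
]

count = guess_photo_count(combined)
photos = split_side_by_side(combined, count)
print(count)   # 3
print(photos)  # [[[1, 1], [1, 1]], [[2, 2], [2, 2]], [[3, 3], [3, 3]]]
```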
I eventually reached the end of what I could reasonably automate and had to manually pick through the images. I removed several hundred images for quality issues, including bad lighting and strange camera angles. Several hundred others did not contain faces. In the end, I had a 3,800-image training set, and I was a bit concerned that its small size would preclude me from doing anything interesting with it.
Model details
I trained the model on images from the training set downsized from 128x128 to 64x64 (only 4 MB of training data total). The VAE's encoder has 5 convolutional layers, and its decoder has 5 transposed convolutional layers. I used 200 dimensions for the latent vector. The overall model has 2.7 million parameters and weighs in at 10 MB when trained. I trained it for two hours on a Google Colab GPU instance.
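A quick back-of-the-envelope check shows how these numbers relate. The parameter count, image size, and latent size come from the figures above; storing weights as 32-bit floats and having 3 color channels per image are assumptions on my part.

```python
PARAMETERS = 2_700_000     # from the post
BYTES_PER_WEIGHT = 4       # assumption: weights stored as 32-bit floats

model_megabytes = PARAMETERS * BYTES_PER_WEIGHT / 1e6
print(f"{model_megabytes:.1f} MB")  # 10.8 MB, in line with the ~10 MB model

IMAGE_SIDE = 64            # from the post
CHANNELS = 3               # assumption: RGB images
LATENT_DIMENSIONS = 200    # from the post

values_per_image = IMAGE_SIDE * IMAGE_SIDE * CHANNELS
print(values_per_image)                      # 12288
print(values_per_image / LATENT_DIMENSIONS)  # 61.44
```

Under those assumptions, the encoder squeezes each image's 12,288 numbers down to 200, a reduction of roughly 61x.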
I made no special effort to improve the VAE's performance or to choose an optimal architecture. I recognize that the number of layers and the size of the latent vector are probably overkill for this task. My sole aim was to get a usable model to do the fun stuff below.
My VAE implementation and some associated utilities are in the ml directory of my project repo. The code is largely based on snippets from the book Generative Deep Learning by David Foster. I also put a pretrained model in the trained_model directory of my project repo.
Model evaluation
This section covers two ways to informally evaluate the model's performance and its understanding of the underlying structure of the training set.
Image reconstruction
A quick way to assess the quality of a trained VAE model is to compare raw images with their reconstructed counterparts. The following shows images of Lego faces and their reconstructions:
The results are okay! At a high level, it looks like the model gets that a minifig's face should have eyes, eyebrows, and a mouth. It also renders the shape, color, and lighting of the heads with fair accuracy.
More interesting than what the model gets right is what it gets wrong. Across the board, it has trouble with expressions. Take the first raw image: the face is angry, mustached, and goateed. Meanwhile, its reconstruction lacks any kind of facial hair, and it has an expression that's closer to DreamWorks Face than anger.
The last raw image and reconstruction go in the opposite direction: from happy to angry, or at least displeased. The raw image is of a simple smiley face with eyebrows. You'd think it'd be hard to mess up, but its reconstruction is much sterner, with a frown that approaches a sneer.
If your initial reaction to all this is "okay, so you managed to take 4 MB of images and store it in 10 MB of model and still end up with terrible reproductions" - you're right! Luckily, VAEs can do more than encode and decode single images, and I'll get to that in the model applications section.
Visualizing the latent space
Another way to build some understanding of a trained VAE model is to see where in the latent space it thinks each of the images belongs. If we can plot each image's location in the latent space, we can begin to understand which images the model thinks are similar.
You can start by putting all the images into the VAE's encoder and getting latent vectors out (each of which represents an image's location in the latent space). However, each of the latent vectors has 200 dimensions. It's inadvisable to plot data in three dimensions, let alone 200.
Fortunately, there exist many methods for dimensionality reduction. One such method is t-SNE (t-distributed stochastic neighbor embedding). I'm not going to get into the details of how it works because that's well out of scope for this post, but the upshot is this: you can use t-SNE to turn 200-dimensional vectors into 2-dimensional vectors while keeping points that were close together in the original space close together in the plot, as far as it can.
I put all the images in the training set through the VAE encoder to get their 200-dimensional latent vectors. I then used t-SNE to turn these 200-dimensional vectors into 2-dimensional vectors. Finally, I plotted the input images at these coordinates:
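For reference, here is a minimal sketch of the t-SNE step using scikit-learn's `TSNE` class. The post doesn't say which t-SNE implementation or settings were used, so the library choice and the `perplexity` value are assumptions, and random vectors stand in for real encoder output.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the encoder's output: 100 random 200-dimensional latent vectors.
# With the real model, each row would be the latent vector for one image.
rng = np.random.default_rng(0)
latent_vectors = rng.normal(size=(100, 200))

# Reduce from 200 dimensions to 2. The perplexity must be smaller than the
# number of samples; 30 is an arbitrary choice for this sketch.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coordinates = tsne.fit_transform(latent_vectors)

print(coordinates.shape)  # (100, 2): one (x, y) plot position per image
```

Each row of `coordinates` gives the position at which to draw the corresponding image in the 2-dimensional plot.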
The following are some crops from the full plot that demonstrate how faces with similar features cluster together:
In the first image, you can see that faces with black hair and mustaches are in a group. In the second image, four heads with dark balaclavas and white headbands sit close to one another. In the last image, heads with orange visors (and a jack-o-lantern?) are clustered together. The full-size image has other little clusters like these.
A word of warning: it's important that you don't assign too much meaning to this visualization, or really to any visualization made using t-SNE. Because the latent vectors have been compressed so severely, you are unable to see many of the distinctions that the model actually makes. Still, you can begin to build some intuition about what the model understands about faces. And I think it looks neat, too!
Model applications (the fun stuff)
This section covers two neat things that you can do with the trained VAE: generating new images and morphing between existing images.
Generating new images
A VAE is called a generative model because you can use it to create new images it's never seen before. So let's make some! The mechanics are pretty simple. First, you choose a random vector from the latent space. You then pass this vector into the VAE's decoder and get a brand new image out that most likely does not exist in the training set. The following image depicts this process:
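The generation step can be sketched as follows. Drawing each coordinate of the random vector from a standard normal distribution is an assumption that matches how VAEs are trained, and `toy_decoder` is a hypothetical stand-in for the trained decoder so that the example runs on its own; a real decoder would return a 64x64 image.

```python
import math
import random

LATENT_DIMENSIONS = 200  # from the post


def sample_random_latent(dimensions, rng):
    """Draw a random latent vector, one standard normal value per dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(dimensions)]


def toy_decoder(latent_vector):
    """Hypothetical stand-in for the trained decoder.

    It squashes the first 12 latent values into [0, 1] "pixels" so there is
    something to look at; it knows nothing about Lego faces.
    """
    return [1.0 / (1.0 + math.exp(-value)) for value in latent_vector[:12]]


def generate_image(decoder, rng):
    """Sample a random point in the latent space and decode it into an image."""
    return decoder(sample_random_latent(LATENT_DIMENSIONS, rng))


rng = random.Random(42)  # seeded so the example is repeatable
image = generate_image(toy_decoder, rng)
print(len(image))        # 12
print(image[:3])         # three pixel values between 0 and 1
```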
So how well does this work in practice? The results range from nightmarish to somewhat convincing. The following are faces generated randomly by the process described above:
Nearly all the generated images have the standard features of a Lego face: eyes, eyebrows, and a mouth. They almost all get the shape of a Lego head right, too. But geez, when the VAE misses, it really misses. Some of the generated images put the facial features in the wrong place entirely while others are essentially just noise.
Image morphing
Look at the following five images. Hopefully you noticed that they're actually animated and morphing between two different faces.
The process of morphing between images is a little different from generating new images. Instead of jumping straight into the decoder, you actually need to use the encoder, too. There are four main steps:
- Choose two images that you want to morph between
- Put both images into the VAE's encoder and get a latent vector out for each
- Choose several intermediate vectors between the two latent vectors
- Take the intermediate vectors and pass them into the VAE's decoder to generate images
The following image depicts this process:
Note that none of the faces in the transition between the two input faces exist in the training set. These are brand new Lego faces that nobody has ever seen before!
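The four morphing steps listed above can be sketched like this. Picking the intermediate vectors by evenly spaced linear interpolation is an assumption (the post only says to choose vectors between the two latent vectors), and the identity `encoder` and `decoder` are hypothetical stand-ins for the trained model so that the example is self-contained.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` vectors evenly spaced from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("need at least 2 steps to include both endpoints")
    vectors = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 (start) to 1.0 (end)
        vectors.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return vectors


def morph(encoder, decoder, image_a, image_b, steps):
    """Encode both images, walk between their latent vectors, decode each stop."""
    z_a, z_b = encoder(image_a), encoder(image_b)
    return [decoder(z) for z in interpolate(z_a, z_b, steps)]


# Hypothetical stand-ins: pretend the "images" are already their latent vectors.
def encoder(image):
    return list(image)


def decoder(latent_vector):
    return list(latent_vector)


frames = morph(encoder, decoder, [0.0, 2.0], [1.0, 0.0], steps=3)
print(frames)  # [[0.0, 2.0], [0.5, 1.0], [1.0, 0.0]]
```

Using more steps gives a smoother animation, since each frame then sits closer to its neighbors in the latent space.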
Let's take a closer look at two of these morphs:
Notice how every single intermediate face is completely plausible on its own. I don't think I'd be able to guess which of these is real and which of these is fake if I didn't already know. Going from left to right, you can see the mouth naturally close and the eyebrows widen.
I think this morph is particularly neat because you can see the first face's eyebrows move up and spread out to become hair. Additionally, you can see the mustache subtly grow and the smile slowly contract at each step.
Be aware that not all transitions are as clean or as neat as the ones shown above - I specifically chose these because they look good. Still, if you run through some combinations on your own, it's likely that you can find morphs that look even better than these.
I hope this was an interesting look at my variational autoencoder trained on Lego faces! Again, all of the data and code I used to make the model, plots, and animations in this post are publicly available in the project repo.
The project repo also includes a quickstart guide to running the code on Google Colab, a free environment to run Jupyter Notebooks with free optional GPU/TPU instances. It takes less than a minute to set up, and once it's ready, you can generate your own Lego faces and make your own face morph animations.
Finally, if you find any errors in this post or if you just want to chat, feel free to reach out via email at [email protected]. Thanks for reading!
If you want to buy the book Generative Deep Learning from Amazon and you want to throw a buck my way, here's an affiliate link. I didn't write this post to sell you anything, I just liked the book. None of the other links in this post are affiliate links.
Thanks to Devin Logan for her incredibly helpful input on earlier drafts of this post.