Max Maton, Jan van Gemert, Miriam Huijser, Osman Kayhan
Creating large datasets is often difficult or expensive, which leads people to augment their datasets with rendered images. This often fails to significantly improve accuracy due to a difference in distribution between the real and rendered data. This paper shows that the gap between synthetic and real-world image distributions can be closed by using a GAN to convert the synthetic data into a dataset that has the same distribution as the real data. Training this GAN requires only a fraction of the real dataset traditionally required to reach a high classification accuracy. The converted data can subsequently be used to train a classifier that is more accurate than a classifier trained on the real dataset alone.
Deep learning has revolutionised visual object recognition. Thanks to huge datasets and fast hardware (GPUs), current object recognition approaches have near-human accuracy.
Because creating big datasets is often very expensive, researchers are starting to turn to rendered images to augment their datasets. However, networks trained on rendered images may not achieve the desired accuracy due to a gap between the synthetic and real image distributions.
Another development in current research is the increased focus on Generative Adversarial Networks (GANs) to generate images that look similar to the images they were trained on.
In this paper, the distribution gap between synthetic and real images is reduced by using a GAN to modify rendered images so that they follow the same distribution as real images. This technique is shown to be useful for inflating very small datasets to a size at which they can be used to train more accurate classifiers.
Rendered data can sometimes be used to train networks, e.g. using rendered images to segment indoor scenes, classifying font characters with networks trained on interpolated real samples, and analysing facial expressions using rendered faces. These networks are trained by creating a rendered dataset whose statistical distribution is as close as possible to that of the real dataset. Generally this is very difficult or expensive to achieve.
The problem of creating rendered datasets with a distribution close to a real dataset can be solved using domain adaptation. This is generally done with Generative Adversarial Networks [1, 2, 3] trained to create samples, based on images in the rendered dataset, that are indistinguishable from images in the real dataset. We chose to base this research on the GAN described by Bousmalis et al. because of its available TensorFlow implementation.
The basis of this research is a GAN that converts images from the rendered dataset distribution into images that appear to come from the distribution of the real dataset. This GAN consists of three parts, shown in figure _. The first part is a generator network that performs the image conversion. The second part is a discriminator, which tries to distinguish the output of the generator from images in the real dataset. The third part is a classifier that predicts the label of images coming from the real dataset, the generator or the rendered dataset. All three networks are trained in parallel: the discriminator and classifier are trained to reduce their number of misclassifications, while the generator is trained to minimise the number of pixels it changes in the image, to minimise the classifier's loss on its output and to fool the discriminator into classifying the image as real.
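The generator's objective can thus be read as a weighted sum of three terms. The sketch below illustrates that objective in TensorFlow; it is not the authors' implementation, and the function signature, loss weights alpha and beta and all variable names are assumptions made for illustration.

    import tensorflow as tf

    bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    def generator_loss(rendered, generated, labels,
                       disc_logits_fake, cls_logits_fake,
                       alpha=0.1, beta=1.0):
        # similarity term: penalise changing pixels of the rendered input
        similarity = tf.reduce_mean(tf.abs(generated - rendered))
        # task term: the classifier should still predict the rendered image's label
        task = cce(labels, cls_logits_fake)
        # adversarial term: fool the discriminator into labelling the output as real
        adversarial = bce(tf.ones_like(disc_logits_fake), disc_logits_fake)
        return alpha * similarity + beta * task + adversarial

The relative weights of the three terms are hyperparameters; the values above are placeholders, not the ones used in the experiments.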
To test the effect of inflating datasets with this technique, we first trained the GAN with a real and a rendered dataset. We then applied the trained GAN to the rendered dataset to create a new synthetic dataset. A combined dataset consisting of the synthetic dataset and the real dataset was used to train a second classifier.
We evaluated this technique on the MNIST dataset together with a rendered dataset generated by rendering open-source font digits in multiple rotations for all ten digit classes.
The synthetic dataset contains samples from 149 fonts, with each digit rendered in 47 variations, each with a distinct rotation between -47 and 46 degrees. All images were normalised using the same algorithm used to normalise the MNIST samples. A random sample of 10,030 of these images was set aside as a test set and the remaining 60,000 images were used as the rendered dataset.
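As an illustration of how such a rendered digit dataset could be produced, the sketch below draws a digit from a TrueType font at a given rotation using Pillow. The font path, canvas size and rotation grid are assumptions, and the MNIST-style normalisation step is not reproduced here.

    from PIL import Image, ImageDraw, ImageFont

    def render_digit(font_path, digit, angle, size=28):
        canvas = Image.new("L", (size, size), color=0)        # black background
        font = ImageFont.truetype(font_path, int(size * 0.8))
        draw = ImageDraw.Draw(canvas)
        draw.text((size // 2, size // 2), str(digit), fill=255,
                  font=font, anchor="mm")                     # centred digit (Pillow >= 8.0)
        return canvas.rotate(angle, resample=Image.BILINEAR)  # rotation in degrees

    # e.g. one sample per (font, digit, rotation); here 47 distinct rotations per digit
    samples = [render_digit("SomeFont.ttf", d, a)             # hypothetical font file
               for d in range(10) for a in range(-47, 46, 2)]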
The GAN was modified to use 10% of the training data as a validation set instead of a constant 1,000 samples as this allowed for smaller sample sizes.
For multiple ratios r, where 0 < r < 1, we created the following datasets:
- 60,000 * r images from the original MNIST dataset.
- 60,000 * (1 - r) images from the rendered font dataset.
We used these two datasets to train the GAN for 375,000 steps with batch size 32 and subsequently applied the trained GAN to the MNISTfont dataset to create the MNISTGAN dataset. As a result, the MNISTGAN dataset has exactly the same size as MNISTfont.
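A minimal sketch of this r-based split follows; the function and array names are illustrative and not taken from the authors' code.

    import numpy as np

    def make_training_mix(mnist_images, font_images, r, total=60_000, seed=0):
        rng = np.random.default_rng(seed)
        n_real = int(total * r)
        # sample 60,000 * r real images and 60,000 * (1 - r) rendered images
        real_idx = rng.choice(len(mnist_images), size=n_real, replace=False)
        font_idx = rng.choice(len(font_images), size=total - n_real, replace=False)
        return mnist_images[real_idx], font_images[font_idx]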
With these datasets we trained five instances of the following classifiers: C-MNISToriginal (trained on the real MNIST subset), C-MNISTfont (trained on the rendered font dataset), C-MNISTGAN (trained on the GAN-converted dataset), C-MNISToriginal+GAN (trained on the real subset combined with the GAN-converted dataset) and C-MNISToriginal+font (trained on the real subset combined with the rendered font dataset).
Each classifier was tested on the MNIST test set, which resulted in a classification accuracy per classifier.
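The paper does not state the classifier architecture; the sketch below is a generic small MNIST-style CNN in tf.keras, used purely as a stand-in for the classifiers named above.

    import tensorflow as tf

    def build_classifier(num_classes=10):
        return tf.keras.Sequential([
            tf.keras.layers.Input(shape=(28, 28, 1)),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(num_classes),  # logits
        ])

    # each variant (e.g. C-MNISToriginal+GAN) is trained on its own training mix
    # and evaluated on the MNIST test set
    model = build_classifier()
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])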
To validate whether the GAN was creating useful images, we looked at the resulting images in MNISTGAN for r = 0.001, r = 0.007, r = 0.100 and r = 0.700 (in each figure, the top image is from MNISTfont and the bottom image is the generated image in MNISTGAN).
From visual inspection it is apparent that the GAN has trouble keeping the labels consistent when there is not enough real training data. Examples of this are the 1 and the 2 at r = 0.007 in figure 3: both are converted into something that looks like the same 8 instead of something that looks like a 1 or a 2 respectively. This means that a classifier trained on this new dataset will receive two very similar samples with conflicting labels. We measured the accuracy of C-MNISTGAN on MNIST and found low accuracy for lower ratios. This supports the conclusion that there is a labelling issue when the GAN is not trained with enough real data.
To make sure we actually test whether the GAN is able to close the distribution gap, we looked at the accuracy of C-MNISTfont. For this dataset the error shows an inverse relation between the number of samples and the performance of the classifier, as can be seen in figure 4. This indicates that a gap exists between the rendered font dataset and the MNIST test dataset.
We compared the accuracy of C-MNISTfont, C-MNISTGAN, C-MNISToriginal and C-MNISToriginal+GAN to see whether training with the MNISToriginal+GAN dataset is better than training with one of the individual datasets. The result of this comparison can be seen in figure 5.
C-MNISToriginal+GAN is only more accurate than C-MNISToriginal when the GAN is trained on more than 385 real images (r = 0.007). There seems to be a minimum number of real samples after which the GAN starts to produce meaningful data.
Using MNIST as a test case, this technique is able to close the distribution gap with as little as 0.7% real data in the resulting dataset.
We also compared the difference in wrong predictions between C-MNISToriginal and C-MNISToriginal+GAN. As shown in figure _, the highest reduction in error was achieved with 5,500 MNIST images and 49,500 font images. This indicates that, for MNIST, this technique becomes more effective when the ratio between real and rendered images becomes less extreme.
To confirm that the improved results of C-MNISToriginal+GAN cannot be explained solely by the addition of the rendered font data, we compared C-MNISToriginal with C-MNISToriginal+font to see whether the addition of font data increased the accuracy. From the results in figure 6 we conclude that C-MNISToriginal+font performs roughly equal to C-MNISToriginal at the ratios where C-MNISToriginal+GAN is more accurate. The addition of more font data therefore cannot explain the additional accuracy of C-MNISToriginal+GAN, indicating that this increase is a property of the transformation made by the GAN.
Due to limited time, no fine-tuning has been done on the other hyperparameters of the GAN. Optimising these hyperparameters might further improve classification accuracy, which should result in a further reduction in the number of required samples from the real dataset.
We measured the performance of C-MNISToriginal+GAN versus C-MNISToriginal. Even though the generated samples are not good samples for their target class, C-MNISToriginal+GAN quickly becomes much more accurate than C-MNISToriginal, long before the generated samples start to look like useful samples. This effect can partially be explained by the results of Rolnick et al., which indicate that deep learning is robust to massive label noise. This would mean that the network is still capable of learning higher-level features from the mislabelled samples and is able to successfully ignore the bad labelling.
Our goal was to use a GAN to close the distribution gap between rendered and real datasets.
Using MNIST as an example, we were able to show that it is possible to create a rendered dataset. We have shown that a distribution gap exists between the dataset we created and the real MNIST dataset, and that this gap can be closed using the described technique.
We have shown that GANs can be used to inflate training sets by reducing the gap between synthetic and real datasets, and that they can do so with very little real training data. This technique can be very useful for problems where only a small real dataset is available.