It must have been over a year ago that I first had this idea.
I was admiring a “speed modelling” video: a time lapse of someone creating a digital 3D sculpture based on a single picture.
As someone who was only just starting to explore the world of ML, I remember naively thinking “Ah! This is a perfect job for AI! If a person can do it, then it must be easy to train a neural network to do the same thing!”
I started thinking about ways to generate a dataset for this, and soon I had an answer. I had a way to synthesise as many data pairs as I liked. One piece of data would be a 2D image, and its partner would be a 3D model of the scene in the picture.
The original idea
Conceptually, my idea was simple. I would write a script in Blender 3D that creates random scenes, and renders them.
On the face of it, a lot of rendered randomness doesn’t sound very useful. But alongside the rendered image, my script would also export the 3D model that generated it. This would give you exactly the type of labelled data that supervised machine learning algorithms love. Your “X”, the input, would be the rendered image. Your “Y”, the output, would be the 3D geometry of the picture.
At the time, I had a fairly limited understanding of machine learning. My idea of a neural network was something that magically takes one type of data and transforms it into another.
As such, my plan was rather simplistic, as it just involved creating arbitrary landscapes in Blender, rendering them, and hoping for the best.
After gaining just a small amount of experience in the field, I realised that I had to make a few changes if this was to have any chance of succeeding.
Making the idea more feasible
Completing Andrew Ng’s excellent machine learning course gave me more time to reflect on the idea, and the limitations of my existing model.
I realised that a model trained on complete randomness would not generalise to real life objects, because objects in real life are not random. They follow a number of patterns, with subtle relationships between any two parts of an object. For example, a trained 3D artist may be able to model a whole 3D head from a single profile photo, but only because she knows that most faces are roughly symmetrical. Or, she could model a table with a hidden leg, because she knows it is likely to look like the other legs.
This sort of interpolation of unseen data was essential, as I wanted the model to reconstruct the unseen parts of objects. The aim was for this trained model to take a single photograph, taken from any angle, and reconstruct the whole object, including the side facing away from the camera. As a side effect (at least in theory), a model with this sort of understanding would also be able to handle obscured data in photos.
So the challenge was to build a dataset that conveys these relationships between parts of an object. A simple solution would be to train on 3D scans of everyday objects. However, the available datasets that I could find were quite small. I feared that a model trained on so few examples might not generalise well to new objects. There is also a danger of overfitting a dataset like this, because a neural network capable of this sort of reconstruction is likely to be complex.
I faced a dilemma. On one hand, I wanted a neural network trained on this dataset to be as robust as possible, which calls for a large and diverse dataset. On the other hand, purely random data would not capture the patterns found in real objects. The challenge would be to produce a diverse dataset without resorting to generating noise.
I settled on a compromise. My script would take a seed set of 3D models (no corresponding image necessary), and produce permutations of them. For example, it would put these objects into a number of realistically lit scenarios, and render from different angles. It would also add distortions to the objects so that even a single seed object could lead to infinite variations.
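To make the distortion idea concrete, here is a minimal sketch in plain Python rather than Blender's API. It is not the actual script: the function name `distort`, its parameters, and the choice to scale about the centroid of the selected vertices are all assumptions made for illustration. It picks a random subset of vertices, then scales and translates that subset.

```python
import random


def distort(vertices, fraction=0.3, max_scale=0.2, max_shift=0.1, rng=None):
    """Return a distorted copy of a list of (x, y, z) vertices.

    A random subset of the vertices is scaled about its own centroid and
    then translated. All parameter names and defaults are illustrative.
    """
    if rng is None:
        rng = random.Random()
    vertices = [tuple(v) for v in vertices]  # copy, so the input is untouched
    if not vertices:
        return []

    # Choose how many vertices to move (at least one, at most all of them).
    k = min(len(vertices), max(1, round(len(vertices) * fraction)))
    chosen = rng.sample(range(len(vertices)), k)

    # Centroid of the chosen subset, used as the origin of the scaling.
    centre = tuple(sum(vertices[i][axis] for i in chosen) / k for axis in range(3))

    # One scale factor and one offset shared by the whole subset.
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    shift = tuple(rng.uniform(-max_shift, max_shift) for _ in range(3))

    for i in chosen:
        vertices[i] = tuple(
            c + (p - c) * scale + s
            for p, c, s in zip(vertices[i], centre, shift)
        )
    return vertices
```

Because every call draws a new subset, scale and offset, a single seed mesh can yield an effectively unlimited number of variants. A more realistic version would probably select neighbouring vertices rather than arbitrary ones, so that the distortion stays smooth.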
Because the ultimate aim is to create a reconstruction model that can generalise to unseen classes, using a seed set of models might seem to defeat the point. However, I believe a framework like this can learn to reconstruct a large number of classes very easily. Training on an additional class will take just a single example 3D model. It is conceivable that a network capable of creating 3D reconstructions of a sufficiently diverse set of classes may generalise to new ones, as long as they bear some resemblance to a previously seen one.
I’ve created a simple Blender script that aims to do this. As I write this post, it is creating its first batch of training data. Once I have attempted to train a model on it I will be posting the results here on my website.
Before running the script, the user must choose 3 options:
- The script is capable of simulating outdoor or indoor lighting. As such, the user can choose what percentage of the rendered images should be indoors, and what percentage should be outdoors. This will be useful in sets that are dominated by e.g. cars and houses (all largely outdoors), or furniture and household items (all largely indoors).
- The second setting is how many pairs of data to produce. If there is no limit, this can be set to 0 to produce data indefinitely. I recommend setting it to 1 for testing, as Blender becomes unresponsive while the script runs.
- Thirdly, and lastly, the user should point the script to a folder containing the seed dataset in the form of .stl files. The outputs will be in a subfolder called “exports” that is placed within this folder.
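The three options above could be grouped into a small settings object. This is only a sketch: the class `GeneratorOptions` and its field names are hypothetical, and the real script may store these values quite differently. Only the meaning of each option, the ".stl" extension and the "exports" subfolder name come from the description above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class GeneratorOptions:
    """Hypothetical container for the three user-facing options."""

    indoor_fraction: float  # share of renders lit as indoor scenes, 0.0 to 1.0
    num_pairs: int          # number of data pairs; 0 means "run indefinitely"
    seed_dir: Path          # folder containing the seed .stl files

    def __post_init__(self):
        if not 0.0 <= self.indoor_fraction <= 1.0:
            raise ValueError("indoor_fraction must be between 0 and 1")
        if self.num_pairs < 0:
            raise ValueError("num_pairs must be 0 (no limit) or positive")

    @property
    def export_dir(self) -> Path:
        # Outputs go in an "exports" subfolder of the seed folder.
        return self.seed_dir / "exports"

    def seed_files(self):
        # All .stl files directly inside the seed folder, in a stable order.
        return sorted(
            p for p in self.seed_dir.iterdir() if p.suffix.lower() == ".stl"
        )
```

For a quick test run, something like `GeneratorOptions(indoor_fraction=0.5, num_pairs=1, seed_dir=Path("seeds"))` would match the recommendation to produce a single pair first. The folder name "seeds" is just a placeholder.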
How the script works: basic steps
Here is a brief outline of how my script works:
- Firstly, it loads all .stl files found in the selected folder. These are added to a separate layer in Blender.
- A loop then runs once for each data pair the user wants to create. It runs indefinitely if the user sets the number of data pairs to “0”.
- Each time the loop runs, it carries out these steps:
- The script randomly picks an object and duplicates it to the main Blender layer (layer 1).
- It decides whether to simulate an indoors or an outdoors scene, depending on the probability assigned by the user. The script then simulates the scene.
- The script creates the camera in a random location, though it always points at the subject. Focal length is chosen so that the subject always takes up a substantial portion of the view, regardless of the camera distance.
- A simple distortion is applied to the object to add variation. At the moment, this is rudimentary: it simply takes a random subset of vertices and scales and translates them.
- The resulting object is rendered, and the resulting image is saved.
- The final, distorted .stl is also exported to the same folder with the same file name.
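The control flow of the steps above can be sketched without Blender at all. The code below only plans each data pair (which seed, which lighting, where the camera goes, what focal length) and leaves out rendering and export. Every name in it is hypothetical. The upper-hemisphere camera placement, the distance range, the 36 mm sensor width and the 80% fill factor are assumptions, and the focal length uses the simple pinhole relation (image size = focal length × object size / distance), which may not be how the script itself chooses it.

```python
import itertools
import math
import random


def pair_indices(num_pairs):
    """Yield 0, 1, 2, ...; stop after num_pairs, or never if num_pairs is 0."""
    return itertools.count() if num_pairs == 0 else iter(range(num_pairs))


def choose_scene(indoor_fraction, rng):
    """Pick 'indoor' with probability indoor_fraction, else 'outdoor'."""
    return "indoor" if rng.random() < indoor_fraction else "outdoor"


def random_camera_position(distance, rng):
    """Random point at the given distance from a subject at the origin.

    Assumption: the camera stays above the ground plane (z >= 0).
    The sampling is simple, not uniform over the hemisphere's area.
    """
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = rng.uniform(0.0, math.pi / 2.0)
    return (
        distance * math.cos(elevation) * math.cos(azimuth),
        distance * math.cos(elevation) * math.sin(azimuth),
        distance * math.sin(elevation),
    )


def focal_length_mm(distance, subject_diameter, sensor_width_mm=36.0, fill=0.8):
    """Focal length at which the subject spans `fill` of the sensor width."""
    return fill * sensor_width_mm * distance / subject_diameter


def plan_pairs(seed_names, num_pairs, indoor_fraction, rng=None):
    """Yield one dictionary of settings per data pair to be rendered."""
    if rng is None:
        rng = random.Random()
    for index in pair_indices(num_pairs):
        distance = rng.uniform(2.0, 10.0)  # assumed range, in scene units
        yield {
            "index": index,
            "seed": rng.choice(seed_names),
            "scene": choose_scene(indoor_fraction, rng),
            "distance": distance,
            "camera": random_camera_position(distance, rng),
            # Assumes each subject is normalised to a diameter of 1 unit.
            "focal_mm": focal_length_mm(distance, subject_diameter=1.0),
        }
```

Since the focal length grows in proportion to the camera distance, the subject keeps roughly the same size in the frame wherever the camera lands. Writing the loop as a generator also makes the "0 means indefinitely" case easy to handle: the caller simply stops consuming it.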
The script currently has a few limitations:
- There is currently a bug which means that when the name of an .stl file starts with a letter, it must be a capital letter.
- To introduce more variation, the current form of the script randomly rotates objects in all axes. It would perhaps be more realistic to only rotate them about the z axis. After all, a chair can be facing left or right, but it is rare to see one upside down.
- It is heavily dependent on the quality of the dataset (like all projects involving synthetic datasets).
- It is likely to be a high variance problem, requiring training on vast datasets.
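The z-axis-only rotation suggested in the list above is straightforward to express. The following is a sketch in plain Python operating on a list of (x, y, z) vertices. The function names are made up for illustration, and it assumes z is the vertical axis, as it is in Blender.

```python
import math
import random


def rotate_about_z(vertices, angle):
    """Rotate (x, y, z) vertices by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    # Standard 2D rotation in the xy-plane; heights (z) are left unchanged.
    return [(x * c - y * s, x * s + y * c, z) for x, y, z in vertices]


def random_upright_rotation(vertices, rng=None):
    """Turn an object to face a random direction while keeping it upright."""
    if rng is None:
        rng = random.Random()
    return rotate_about_z(vertices, rng.uniform(0.0, 2.0 * math.pi))
```

With this in place of a rotation about all three axes, a chair can still face any direction, but it never ends up upside down.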
I considered alternative ways to create this type of dataset:
- A 3D scanner could create many scans of everyday objects, with a corresponding picture taken of each. Though this data would be high quality, and representative of real life objects, it would be prohibitively time consuming to 3D scan large numbers of objects.
I can think of a few ways that this can be useful.
- Aiding 3D artists. This is one of the most obvious ones. It can be used in any situation that photogrammetry can, but with a more streamlined pipeline. There are also situations where photogrammetry is not suitable; for example, photogrammetry does not deal well with transparent or reflective subjects. The method I have described can easily create training examples that contain reflective or transparent objects. A neural network trained successfully on such a dataset should be robust to objects of any material.
- Improve robustness of neural networks to new viewpoints. One disadvantage of convolutional neural networks has to do with the viewpoint of the training examples. For example, if someone trained a cat or dog classifier on photos that are all head on, it might not recognise a profile shot of a cat or a dog. A model with spatial understanding may be able to learn classifications that are more robust to pictures taken from angles not seen in the training set. I just wanted to share a quick thought. If you wanted to use this sort of dataset to improve the robustness of a neural network to viewpoint variation, here is one way to do it. It would involve two neural nets: a “reconstructor” and a “classifier”. The “reconstructor” is a neural net that should be trained on the dataset above. In other words, it would take 2D images as inputs and learn to reconstruct their 3D structure as the output. This 3D structure data can be fed to a classification network (the “classifier”). Instead of taking images as inputs, the classifier takes the output of the first neural net. In other words, it will learn to classify the 3D structure of objects rather than 2D images. In principle, this architecture should be invariant to the point of view from which an image was taken.
- Aiding scientists. I have a paper in the works that compares the accuracy of traditional photogrammetry with gold standard CT scanning. Though it’s in early stages, our initial findings are promising. For many models, the median error between our photogrammetry model and the ground truth is smaller than the resolution of the CT scanner. I would love to see how a model like this compares.
- Guard against adversarial examples. This one is purposefully last on the list, because I think it’s the least likely to hold true. I still thought it was worth sharing, because of the great difficulty, and potential dangers, that adversarial examples pose. Essentially, my thinking was that with a well designed dataset generator, and infinite computing power, you could train on hundreds of millions of training examples and far surpass the abilities of any 3D modelling artist. It may be reasonable to think that this neural network would be more robust to adversarial examples, because of the ability to train it on arbitrarily large datasets and the ease of introducing noise into this dataset. The reason I doubt it now is that the second neural network described in point 2 would not enjoy these same benefits. As such, an adversarial example would simply have to create a 3D structure that “tricks” the classifier network.
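The two-network arrangement from the second point above amounts to chaining two functions. The sketch below shows only that wiring. The type aliases, the function `make_pipeline` and the stand-in models are all hypothetical, and real networks would replace the two toy callables.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]        # assumed: a 2D grid of pixel values
Mesh = List[Tuple[float, float, float]]  # assumed: a list of 3D vertices


def make_pipeline(
    reconstructor: Callable[[Image], Mesh],
    classifier: Callable[[Mesh], str],
) -> Callable[[Image], str]:
    """Chain the two models: image -> 3D structure -> class label."""

    def classify_image(image: Image) -> str:
        structure = reconstructor(image)  # trained on the synthetic pairs
        return classifier(structure)      # only ever sees 3D structure

    return classify_image


# Toy stand-ins so the wiring can be run end to end.
def toy_reconstructor(image: Image) -> Mesh:
    # Treat each pixel value as a height above the image plane.
    return [
        (float(x), float(y), float(value))
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    ]


def toy_classifier(mesh: Mesh) -> str:
    return "tall" if max(z for _, _, z in mesh) > 0.5 else "flat"
```

The classifier never sees pixels, which is where the hoped-for viewpoint robustness would come from. It is also why the adversarial concern in the last point still applies: an attacker only needs an input whose reconstruction fools the classifier.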
A note on licensing
I have created a script for Blender 3D that creates datasets, as described above. It is currently very rudimentary, and produces training examples that do not subjectively look like real photos. That said, if anyone would like to have a look at the code I am perfectly happy to open source it. Let me know in a comment below.