Digital twins and synthetic data generation

Nvidia's annual GTC conference, showcasing its recent innovations, crowned the latest trend in machine learning: the use of synthetic data 🦾. Nvidia announced an off-the-shelf data generation platform, the Omniverse, primarily addressing the case of logistics warehouses 👷📦

Omniverse image for warehouses

Behind this name lies a simple concept: by replicating the context of data capture in a digital environment, it becomes possible to generate a much larger quantity 📈 and higher quality 🥇 of datapoints of interest. Indeed, before an AI model can be trained and deployed, it is crucial to prepare a quality dataset that is representative of the targeted use case. Let's take the one we are interested in: the detection of dangerous objects at airport security checks.

The immediate solution would be to pack luggage, scan it, collect the images, and annotate them. This approach is very costly in time and resources, but it guarantees that the collected images follow the same distribution as the images the model will actually receive once deployed. However, it has the major drawback of requiring continuous updates for every new object to be detected.

The Omniverse solution, on the other hand, promises greater flexibility. It requires reproducing the X-ray 🩻 acquisition conditions in a sufficiently realistic 3D rendering engine (typically Blender) and drawing up a list of 3D models of interest. The resulting famous "digital twins" are extremely realistic images, close to those captured in reality.
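Once the acquisition conditions and the 3D model library are in place, each synthetic image is typically produced by randomizing the scene before rendering, so that the dataset covers many poses, clutter levels, and materials. A minimal sketch of such a randomization loop, in plain Python — the parameter names and ranges are illustrative assumptions, not Omniverse's or Blender's actual API:

```python
import random

# Hypothetical 3D models of interest for the airport-security use case.
THREAT_MODELS = ["gun", "knife", "scissors"]

def sample_scene(rng: random.Random) -> dict:
    """Draw one randomized luggage scene to hand to the rendering engine.

    Every field here is an assumed, illustrative parameter: a real
    pipeline would translate these into engine-specific scene settings.
    """
    return {
        "threat": rng.choice(THREAT_MODELS),                       # label comes for free
        "rotation_deg": [rng.uniform(0, 360) for _ in range(3)],   # random object pose
        "position_cm": [rng.uniform(-20, 20) for _ in range(3)],   # placement in the bag
        "clutter_items": rng.randint(5, 30),                       # distractor objects
        "material_density": rng.uniform(0.5, 8.0),                 # drives X-ray absorption
    }

rng = random.Random(42)  # fixed seed: the dataset is reproducible
dataset = [sample_scene(rng) for _ in range(1000)]  # 1000 annotated scenes, no manual labeling
```

The key economic point is visible in the last line: the annotation (which object is in the bag, and where) is known by construction, so scaling the dataset is just a matter of rendering time.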

The advantages of this solution are considerable, as the comparison below shows:

Real scan of a gun without magazine
High definition synthetic scan of a gun

The only downside: are models trained on this artificial data suited to the real world 🌍? Nothing is less certain, and even if validation results on real data are acceptable, a fine-tuning phase on real data alone is recommended after training on large volumes of synthetic data. Validation on synthetic data is the critical topic that will, in the future, make it possible to approve production models almost automatically.
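The recommended schedule above (large-scale synthetic pretraining, then a short fine-tuning pass on real data) can be sketched with a deliberately toy model. Everything here — the one-parameter linear model, the simulated "domain gap" between synthetic and real data, and the learning rates — is an illustrative assumption, not the article's actual training setup:

```python
import random

def make_data(n, slope, noise, rng):
    """Toy 1-D regression data; `slope` plays the role of the true domain."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, slope * x + rng.gauss(0, noise)) for x in xs]

def train(w, data, lr, epochs):
    """Plain SGD on squared error for a one-parameter model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return w

rng = random.Random(0)
# Synthetic data is plentiful but has a slight domain gap (slope 2.9 vs 3.0).
synthetic = make_data(1000, slope=2.9, noise=0.05, rng=rng)
real = make_data(50, slope=3.0, noise=0.05, rng=rng)  # real data is scarce

w = train(0.0, synthetic, lr=0.1, epochs=3)   # stage 1: pretrain on synthetic volume
w = train(w, real, lr=0.01, epochs=10)        # stage 2: fine-tune on real data only
```

The design choice mirrored here is the lower learning rate in stage 2: fine-tuning should correct the residual synthetic-to-real gap without erasing what the large synthetic set already taught.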

Continued advances in rendering engines, driven by hardware improvements, promise increasingly realistic and faster data generation, along with shorter AI development and deployment cycles 🚀

Louis Combaldieu