Data is the lifeblood of machine learning models. But what happens when there is limited access to this much-coveted resource? As many projects and companies are beginning to show, this is where synthetic data can be a viable if not superior alternative.
What is synthetic data?
Synthetic data can be defined as information that is manufactured artificially rather than obtained by direct measurement. The idea of “fake” data is not new or revolutionary at its core. It is really a new label for an established practice: generating test or training data for models that lack the real data they need to function.
In the past, a lack of data has led to the convenient approach of using a randomly generated set of data points. Although this may have been sufficient for educational and testing purposes, random data is not something you would want to train any kind of prediction model on. This is where synthetic data differs: it is built to be reliable enough to train on.
Synthetic data is, essentially, the distinct idea that we can be smart with how we produce randomized data. Such an approach can thereby be applied to more sophisticated use cases rather than just tests.
How is synthetic data manufactured?
Synthetic data is created in much the same way as random data, only from more complex sets of inputs, but it serves a different purpose and therefore has unique requirements.
The synthetic approach is based on, and limited to, criteria that are fed in as input beforehand. In practice, it’s not random at all. It’s based on a sample set of data whose distributions and criteria guide the possible range, distribution and frequency of the generated data points. Basically, the aim is to replicate real data in order to populate a larger dataset, which will then be expansive enough to train machine learning models.
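As a rough illustration of that idea, the sketch below measures the mean, spread and range of a small sample and then draws a much larger set of values that respects them. The sample values, the `synthesize` function and the choice of a normal distribution are all assumptions made for the example; real projects would pick a model that suits their data.

```python
import random
import statistics


def synthesize(sample, n, seed=0):
    """Draw n synthetic values that follow the sample's mean, spread and range."""
    rng = random.Random(seed)          # fixed seed so the output is repeatable
    mu = statistics.mean(sample)       # centre of the observed data
    sigma = statistics.stdev(sample)   # spread of the observed data
    lo, hi = min(sample), max(sample)  # observed range
    values = []
    while len(values) < n:
        x = rng.gauss(mu, sigma)
        if lo <= x <= hi:              # discard values outside the observed range
            values.append(x)
    return values


# Hypothetical measurements standing in for a small real sample.
real_sample = [34.0, 41.5, 29.0, 38.2, 45.1, 31.7, 36.4, 40.0]
synthetic = synthesize(real_sample, 1000)
```

Eight real measurements become a thousand synthetic ones, all inside the observed range and centred close to the original mean.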
This method becomes particularly interesting when exploring the deep learning methods used to refine synthetic data. Algorithms can be pitted against one another with the goal of outperforming each other: one produces synthetic data while the other tries to identify it. Essentially, the aim here is to create an artificial arms race for producing hyper-realistic data.
Why is synthetic data needed in the first place?
If we can’t collect the valuable resources we need to advance our civilization, which applies to anything from farming food to generating fuel, then we find a way of creating it. The same principle now applies to the area of data for machine learning and AI.
It’s crucial to have a very large sample size of data when training algorithms; otherwise the patterns identified by the algorithm are at risk of being too simple for real-world applications. It’s actually pretty logical. Just as human intelligence tends to take the easiest path to solve a problem, so do machine learning and AI models during training.
For instance, let’s apply this to an object recognition algorithm that can accurately identify a dog from a selection of cat images. With too small an amount of data, the AI runs the risk of relying on patterns that are not fundamental features of the objects it’s trying to identify. In this case, it may still work, but when it encounters data that does not follow the initially identified pattern, it falls flat.
How is synthetic data used to train AI?
So, the solution? We draw a lot of animals that are slightly different in order to force the network to find the underlying structure of the image, not just the placement of certain pixels. But rather than draw a million dogs by hand, it’s better to build a system designed solely to draw dogs, whose output can be used to train the classification algorithm. That is essentially what we’re doing when providing synthetic data to train machine learning algorithms.
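A much simpler stand-in for such a drawing system is to take one image and produce many slightly shifted copies of it, so that no single pixel position can serve as a shortcut. The sketch below does this on a tiny grid of 0s and 1s. The `shift` and `variants` functions and the 5x5 `base` shape are hypothetical names invented for the example, and a real generator would vary far more than position.

```python
import random


def shift(image, dx, dy, fill=0):
    """Return a copy of a 2D pixel grid moved dx columns right and dy rows down."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:  # drop pixels pushed off the grid
                out[ny][nx] = image[y][x]
    return out


def variants(image, n, max_shift=1, seed=0):
    """Produce n randomly shifted copies of the same image."""
    rng = random.Random(seed)
    return [
        shift(image,
              rng.randint(-max_shift, max_shift),
              rng.randint(-max_shift, max_shift))
        for _ in range(n)
    ]


# Hypothetical 5x5 "drawing": a hollow square with a one-pixel margin.
base = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
samples = variants(base, 20)
```

Every copy contains the same shape, but its pixels land in different places, which is the property the training set needs.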
There are, however, obvious pitfalls in this method. Simply generating data from nothing is not going to be representative of the real world and will therefore result in an algorithm that most likely can’t function when it encounters real data. The solution is to collect a subset of data, analyze and identify trends and ranges in it, and then use this data to generate a large pool of random data that very likely represents how the data would look if we collected all of it ourselves.
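The collect, analyze and generate steps described above might look something like the following for simple labeled records. This is only a sketch: the `(animal, weight)` records are invented, `fit_profile` and `generate` are hypothetical helpers, and sampling uniformly inside each observed range is a deliberate simplification of what a real generator would do.

```python
import random
from collections import Counter


def fit_profile(records):
    """Analyze a small real subset: label frequencies and per-label value ranges."""
    counts = Counter(label for label, _ in records)
    ranges = {}
    for label, value in records:
        lo, hi = ranges.get(label, (value, value))
        ranges[label] = (min(lo, value), max(hi, value))
    return counts, ranges


def generate(profile, n, seed=0):
    """Generate n synthetic records that follow the fitted frequencies and ranges."""
    counts, ranges = profile
    rng = random.Random(seed)
    labels = list(counts)
    weights = [counts[label] for label in labels]
    pool = []
    for label in rng.choices(labels, weights=weights, k=n):
        lo, hi = ranges[label]
        pool.append((label, rng.uniform(lo, hi)))  # simplification: uniform within range
    return pool


# Hypothetical subset collected from the real world: (animal, weight in kg).
subset = [("dog", 12.0), ("dog", 30.5), ("dog", 8.2), ("cat", 4.1), ("cat", 5.3)]
profile = fit_profile(subset)
pool = generate(profile, 10000)
```

The generated pool keeps roughly the same ratio of dogs to cats as the subset and never produces a weight outside what was observed for each animal.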
This is where the value of synthetic data truly lies. No longer do we have to run around tirelessly collecting data which then needs to be cleaned and processed before use.
How is synthetic data a solution to the growing focus on data privacy?
The world is currently experiencing a very strong shift, especially in the EU, toward the increased protection of privacy and the data we generate with our online presence. In the fields of machine learning and AI, the tightening of data protection proves to be a recurring hurdle. Quite often, restricted data is exactly what’s needed for training algorithms to perform and provide value for end users, especially for B2C solutions.
Generally, the problem of privacy is overcome when a private individual decides to use a solution and therefore approves the use of their data. The problem here is that it’s very hard to get users to hand over their private data before you have a solution that provides enough value to justify it. As a result, providers can often get stuck in a chicken-and-egg dilemma.
The solution may be the synthetic approach, in which a company obtains a subset of data through early adopters. It can then use this information as the foundation for generating enough data to train its machine learning and AI models. This approach can drastically reduce the time-consuming and costly need for private data while still producing algorithms that work for actual users.
For certain industries embroiled in the bureaucratic slog for data, such as healthcare, banking and legal, synthetic data provides an easier approach to accessing previously unobtainable volumes of data, removing what is often a limitation for new and more advanced algorithms.
Can synthetic data replace real data?
The problem with real data is that it isn’t generated with the intention to train machine learning and AI algorithms; it’s simply a byproduct of the events that happen all around us. As stated before, this obviously limits the availability and ease of collection, but it also leaves the parameters of the data uncontrolled and raises the chance of flaws (outliers) that might disrupt the results. This is why synthetic data, which can be tailored and controlled, is more efficient at training models.
However, despite its superior training applications, synthetic data will, inevitably, always rely on at least a small subset of real data for its own creation. So no, synthetic data will never replace the initial data it needs to be based on. More realistically, it will significantly reduce the amount of real data required for training the algorithms, a process that requires significantly more data than testing. A common convention is to use roughly 80% of the data for training and the remaining 20% for testing.
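For reference, an 80/20 split is simple to express in code. The sketch below shuffles a dataset and cuts it at the 80% mark. The `train_test_split` function here is a hand-written illustration using only the standard library, not any particular library's API, and the integer rows are placeholder data.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into a training part and a testing part."""
    rng = random.Random(seed)
    shuffled = list(rows)   # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))     # placeholder dataset of 100 records
train, test = train_test_split(rows)
```

With 100 records, 80 end up in the training set and 20 in the test set, and no record appears in both.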
Ultimately, if approached correctly, synthetic data provides a quicker and more efficient way to obtain the data we need at a lower cost than obtaining it from the real world and with a reduced need to poke the hornet’s nest of data privacy.
Christian Lawaetz Halvorsen is the chief technology officer and co-founder of Valuer, the AI-driven platform revolutionizing the way businesses obtain information crucial to their strategizing and decision-making. With an MSc in Engineering, Product Development and Innovation from the University of Southern Denmark, Christian continues to refine Valuer’s technical infrastructure using the most optimal combination of human and machine intelligence.