
No, the World doesn’t Need Another Synthetic Data Company

Let’s begin with the obvious: every machine learning project starts with data. Whether that data needs to be labeled, collected, generated, cleaned, munged, or fussed with in any way, shape, or form, we all understand that machine learning requires not only data, but good data — and enough of it.

This is a serious problem for a lot of ML practitioners. They might have good data but not enough of it. They might have a lot of data but too much of it is unhelpful or even harmful. But at the end of the day, what that means is they’ll need to get more data for their models to be successful. And increasingly, those machine learning professionals are looking not to data collection but to synthetic data as a solution.

It’s easy to understand why. For starters, data collection is surprisingly difficult. Collection operations can be costly. It can be hard to know what you need to collect. And even when you know what you think you need, the data you gather sometimes doesn’t appreciably help your models. On top of that, for a lot of projects, that data will need to be labeled. We all know the pitfalls of labeling: it generally requires a third-party company, and it’s slow, expensive, and often error-prone.

Synthetic data skirts a lot of those issues. It’s usually cheaper, you can tweak it, and there’s often no need for labeling since the data is generated for an explicit, known purpose. What’s more, synthetic data is getting pretty mature. Just take a gander at the wonderfully named thispersondoesnotexist.com, for example: those aren’t real people, but you wouldn’t guess that immediately, would you?

But while synthetic data has its uses, it’s nowhere near the silver bullet some prognosticators and VCs with money in the game want you to believe it is. The synthetic facial data we used as an example? That’s a good use case. Something like receipt generation to train an optical character recognition model for an expense reporting company? Same idea. There’s plenty of real, historical data you can use to generate a sufficient amount of synthetic data. Receipts, after all, are fairly similar to one another.

That said, what if your machine learning project doesn’t need common data like receipts or faces? What if you need pictures of weeds? What if you’re trying to identify products in videos? Those spaces are far less amenable to synthetic data simply because they’re less common. There aren’t millions of examples from which to generate a database of weeds or of soda cans in a soap opera. And because there aren’t a ton of examples, the chances that the generated synthetic data will be redundant or unhelpful are a lot higher.

It goes beyond these boutique use cases too. What if you’re trying to identify explicit imagery? Or do just about any work in natural language processing? These domains carry a lot of nuance that synthetic data generation struggles with. After all, the definition of “explicit” varies by use case and business. In other words: what’s acceptable on Reddit is different from what’s acceptable for Disney. And language is full of nuance, colloquialisms, irony, turns of phrase, and so on. You can’t simply generate a bunch of conversations that read like a human actually said them. A lot of generated language sits in a strange uncanny valley where it feels stilted, or as if it has been translated one too many times.

In other words, some data is simply easier to generate than other data. If your machine learning project doesn’t fall into one of those buckets, chances are, synthetic data won’t be helpful for you.

And that’s not all. It’s far easier to generate “common” data than outlier data. Oftentimes, models already understand that common stuff, so you’re simply adding redundant examples that can sometimes lead to overfitting or other biases. Beyond that, outlier data is often very important. Take, for example, the problem of autonomous vehicles. It’s much easier to generate common frames, like a street scene with a few cars rolling down the highway, than it is to generate an accident. But that accident data is really valuable and often crucial, and you don’t want to generate unrealistic accident data either. That could lead to serious problems.

Then there’s the issue of novelty. New cars are released every year. Scooters became prevalent in cities just a few years back. Our world changes constantly. Because synthetic data is, by its nature, generated from historical records, it cannot account for these changes until you’ve collected and labeled data the old-fashioned way.

Moreover, the biases that synthetic data introduces into models are not yet widely understood. Say a person is helping choose which data is created. Suddenly, you’re injecting the possibility of confirmation bias. Say that person controls almost no part of the data generation process. Now you’re running into a lot of the same issues you have with general data collection.

Lastly, synthetic data often carries with it pretty hefty compute costs. If you’re creating your own data, you need to figure out what trade-off you’re willing to make here, whereas if you’re using a third party, those costs are likely being passed on to you regardless.

So what does that all mean? Well, let’s remember where we started. Every machine learning project requires data, and most machine learning projects need more data than they have. Can you solve that with synthetic data? For certain use cases, of course you can. Synthetic data brings with it a lot of the same issues regular data has (it can be useless or harmful, it can introduce biases, etc.), but there are times when it’s appropriate.

If you do choose to use synthetic data, it’s important not to assume it will all be useful, or that it won’t have some of the problems that come with regularly collected data. We went through a few of those above. What that means is you’ll want to analyze which synthetic data is helpful to your model, which is harmful, and which isn’t particularly useful either way. By doing that, you’ll understand what’s really helping your model learn, and then you can prioritize the creation of those specific types of data. At Alectio, we use active learning to monitor your model’s performance over each learning loop and interpret the signals it’s sending to help you make that choice.
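To make that idea concrete, here is a minimal sketch of one common way to triage a pool of synthetic samples: score each one by the current model’s predictive uncertainty and keep only the samples the model is genuinely unsure about, since confidently predicted samples are likely redundant. This is not Alectio’s actual pipeline; the model interface, the pool, and the 20% keep fraction are all illustrative assumptions.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of class probabilities (higher = more uncertain)."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def triage_synthetic_pool(model, synthetic_pool: np.ndarray, keep_fraction: float = 0.2):
    """Rank synthetic samples by the current model's uncertainty and keep the top slice.

    `model` is assumed to expose a scikit-learn-style `predict_proba`; the pool,
    the keep fraction, and the entropy criterion are illustrative choices.
    """
    probs = model.predict_proba(synthetic_pool)      # shape: (n_samples, n_classes)
    scores = predictive_entropy(probs)               # one uncertainty score per sample
    order = np.argsort(scores)[::-1]                 # most uncertain first
    n_keep = max(1, int(keep_fraction * len(scores)))
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]
    return keep_idx, drop_idx
```

Uncertainty alone won’t flag actively harmful samples (say, unrealistic accident frames or mislabeled generations), so in practice you would also watch validation metrics after each learning loop and back out any batch that hurts them.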

Similarly, if you don’t want to use (or can’t use) synthetic data, consider doing something a little less glamorous: creating a data collection and curation strategy. We touched on this before, but active learning can be really valuable here. With active learning, you learn how a model’s preferences and needs change over time. Those signals can allow you to collect data that will actually be helpful (and not redundant or harmful) for your model, along the lines of the sketch below.
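As a rough illustration of that kind of loop (again a sketch, not Alectio’s implementation: it assumes a scikit-learn-style classifier, and `collect_and_label` is a hypothetical callback standing in for your own collection pipeline), the strategy is simply: train on what you have, ask the model where it is least certain, collect and label only that, and repeat.

```python
import numpy as np

def margin_scores(probs: np.ndarray) -> np.ndarray:
    """Gap between the top-2 class probabilities; a small margin means the model is unsure."""
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def active_collection_loop(model, X_labeled, y_labeled, candidate_pool,
                           collect_and_label, rounds: int = 5, batch_size: int = 100):
    """Iteratively grow the training set with the samples the model asks for.

    `collect_and_label` fetches ground truth for the selected candidates;
    `rounds` and `batch_size` are illustrative knobs, not recommended values.
    """
    for _ in range(rounds):
        model.fit(X_labeled, y_labeled)
        probs = model.predict_proba(candidate_pool)
        # Smallest margins = the data the model most "wants to see" next.
        query_idx = np.argsort(margin_scores(probs))[:batch_size]
        X_new, y_new = collect_and_label(candidate_pool[query_idx])
        X_labeled = np.concatenate([X_labeled, X_new])
        y_labeled = np.concatenate([y_labeled, y_new])
        candidate_pool = np.delete(candidate_pool, query_idx, axis=0)
    return model, X_labeled, y_labeled
```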

At Alectio, we use active learning to understand these exact signals. We can understand what data your model wants to see next and when your model runs out of the data it really needs. That helps you prioritize the data to collect and curate next, which means you won’t be wasting resources collecting useless or harmful data or squandering budget to create synthetic data that won’t really move the needle. And if you’d like us to show you how we do it, we’d be happy to.
