How to save on data labeling without sacrificing model quality
Technologists will remember the 2010s as the decade of Big Data.
Data storage became cheap enough that companies started hoarding data without even knowing quite what to do with it. Data collection became ubiquitous, thanks in no small part to the internet of things, which allowed for entirely new streams of valuable, actionable data. And data processing benefited immensely from the emerging power of GPUs and TPUs to train more robust deep learning models.
Having a preponderance of data to power business operations is, generally speaking, a really positive thing. But with 90% of the ML models in production today using a supervised learning approach, the success of these projects is hugely dependent on a company’s ability to label its data accurately and efficiently.
And that’s easier said than done.
If you’re unfamiliar with the way labels are typically collected, you might assume that labeling is an easy task. Those whose careers have been dedicated to data labeling know this couldn’t be further from the truth.
Of course, the concept of annotation is in itself deceptively simple. Typically, data is annotated by people tasked with generating what experts call “ground truth.” That generated label (we speak of an annotation when the label is more sophisticated than a single value, as with segmentation masks or bounding boxes) is the class or value that we want models trained on that dataset to predict for that specific data point. For a simple example, an annotator might look at an image and label an object using a pre-existing ontology of classes. That label is then fed to the machine during training. In a sense, labeling is the injection of human knowledge into the machine, which makes it a critically important step in developing a high-performance ML model. At the risk of being reductive: good labels drive good models.
The problem is, in the real world, just because these labels seem easy on their face doesn’t mean they’re easy in practice.
Here, we have a simple image with some simple instructions. A human annotator is asked to draw a rectangle around any person they see in the raw, unlabeled image. Easy, right? The image is clear; there is only one person, and that person is not occluded. Yet there are many ways to get it wrong: drawing a box that is too large could cause the model to overfit on background noise and lose its ability to pick out relevant patterns, while a box that is too small might cut off the interesting patterns. And that doesn’t even account for the cases where the annotator, pressured to process data faster or to earn more money, either makes a mistake (honest or not!) or misunderstands the instructions.
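How far a drawn box strays from the ideal one is usually quantified with intersection-over-union (IoU), the standard metric in object detection. A minimal sketch, with purely hypothetical box coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Overlap rectangle (may be empty, hence the max(0, ...) clamps).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

ground_truth = (100, 50, 200, 300)  # hypothetical reviewer-approved box
too_large = (60, 10, 240, 340)      # sloppy box that swallows background
print(round(iou(ground_truth, too_large), 3))  # → 0.421
```

An IoU of 1.0 is a perfect match; the oversized box above scores well below the 0.5 cutoff that many detection benchmarks treat as a correct detection.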
Consider, too, that this is actually an easy use case: the image doesn’t contain many objects (real-life cases look much messier), and the task at hand is objective. Images with multiple, partially occluded people are tougher. The same goes for tasks like sentiment analysis, where there is inherent subjectivity.
And we could go on. But the point is: even simple labeling tasks are fraught with potential errors. Those errors amount to bad data you’re feeding your models. And chances are, you know where that leads.
Trust the mob!
With all those challenges lurking, it can seem much harder to ever gather labels of sufficiently high quality to train a decent ML model. And yet supervised machine learning has still made huge strides over the last few years. So what gives?
The “trick” is that while it is unwise to trust a single human annotator, relying on a larger group typically gets us pretty far. If one annotator makes a mistake, it’s fairly unlikely that another one will make that exact same mistake. In other words, by collecting labels from multiple people for every single record, we can weed out most outliers and raise the likelihood that the consensus label matches the actual ground truth.
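In practice, collecting labels from several people means the raw annotations must be collapsed into one consensus label per record. A minimal majority-vote sketch (the record IDs and labels are invented for illustration):

```python
from collections import Counter

def aggregate(annotations):
    """Collapse several annotators' labels for one record into a single
    consensus label via majority vote (ties broken by first-seen label)."""
    return Counter(annotations).most_common(1)[0][0]

# Hypothetical raw annotations: three annotators per record.
records = {
    "img_001": ["person", "person", "mannequin"],  # one outlier, outvoted
    "img_002": ["person", "person", "person"],
}
consensus = {rec: aggregate(labels) for rec, labels in records.items()}
print(consensus)  # → {'img_001': 'person', 'img_002': 'person'}
```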
For instance, if there is a 10% chance of an annotator making a mistake on a binary use case, the probability of two annotators making the same mistake already drops to 0.1 x 0.1 = 0.01 = 1%, which is definitely better, but still not good enough for every (or even most) applications. Three or four annotators drop that number even further.
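That back-of-the-envelope math is easy to reproduce. A small sketch under the same simplifying assumptions the example makes: annotators err independently, and on a binary task there is only one wrong label to pick, so a wrong consensus requires every annotator to be wrong:

```python
def consensus_error(p, n):
    """Chance that n independent annotators all make the same mistake on a
    binary task, given that each one errs with probability p."""
    return p ** n

for n in range(1, 5):
    print(n, f"{consensus_error(0.10, n):.4%}")  # 10%, 1%, 0.1%, 0.01%
```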
Quality vs. quantity: can volume make up for bad labels?
The next question is whether you can combat inaccurately labeled training data with just more and more training data. In other words, can quantity make up for poor quality? We discussed some of those tradeoffs elsewhere, but one thing we were able to show is that some classes are more sensitive to bad labels than others, and that the amount of additional data needed to offset the effect of bad labels varies immensely across the dataset. And it didn’t really matter which model was used. Bad labels pollute training data and are significantly more harmful than simply having less data that is well labeled.
In other words, not all data should be given the same attention when it comes to labeling. We also showed that sensitivity to labeling noise is nearly independent of the model. That is great news, because it means a generic model can be used to identify the problematic clusters of data requiring higher labeling accuracy, and that information can then be used strategically when labeling the data.
What is active learning, and how does it relate to data labeling?
In other words, having a smart labeling strategy is much more important than you might think. Admittedly, it’s not as sexy as researching the newest model, but it’s oftentimes more vital to pushing a successful model to production. And if you ask any data labeling expert what comes to mind when discussing how to reduce their labeling budgets, they’ll often say active learning.
Active learning is a category of ML algorithms and a subset of semi-supervised learning. It relies on an incremental learning approach and offers an elegant framework that lets ML scientists work with datasets that are only partially labeled.
Active learning is based on the simple yet powerful assumption that not all data is equally impactful to the model, and that not all data is learned at the same pace. That’s both because datasets often contain a lot of duplicative information (meaning some records can be valuable individually but offer very little incremental improvement when used alongside similar records) and because some data simply doesn’t contain any relevant information.
An example? Say you want to train an OCR (Optical Character Recognition) algorithm but have a dataset where a large fraction of records contain no text at all. By incrementally adding training records, active learning dynamically finds the records that are the hardest for the model to learn (or the most beneficial to it), allowing the ML scientist to focus on the most important data first. Which is to say: active learning selects the “right” records to label. And that’s something budget holders can appreciate.
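To make that loop concrete, here is a deliberately tiny, self-contained sketch of uncertainty sampling, the most common active learning strategy. Everything in it is invented for illustration: a 1-D toy pool, a trivial threshold "model," and an oracle function standing in for the human annotator.

```python
import random

random.seed(0)

# Toy pool of 1-D points; a hidden oracle labels a point 1 when x > 0.
pool = [random.uniform(-5, 5) for _ in range(500)]

def oracle(x):
    """Stand-in for the human annotator."""
    return int(x > 0)

# Seed the labeled set with the two extremes so both classes are present.
labeled = {x: oracle(x) for x in (min(pool), max(pool))}

def fit_threshold(labeled):
    """Trivial stand-in model: decision threshold halfway between class means."""
    pos = [x for x, y in labeled.items() if y == 1]
    neg = [x for x, y in labeled.items() if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

for _ in range(5):
    threshold = fit_threshold(labeled)
    unlabeled = [x for x in pool if x not in labeled]
    # Uncertainty sampling: query the oracle on the points the current
    # model is least sure about, i.e. those closest to the boundary.
    for x in sorted(unlabeled, key=lambda x: abs(x - threshold))[:10]:
        labeled[x] = oracle(x)

print(len(labeled))  # 2 seed labels + 5 rounds of 10 queries = 52
```

The point of the sketch: after five rounds, only 52 of the 500 pool points have been sent to the annotator, and they cluster around the decision boundary where labels carry the most information.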
But active learning misses something important: it has no way to determine how many annotations need to be collected for a given record.
Companies have traditionally treated labeling frequency (the number of annotations per record) as a static parameter, predetermined as a function of their labeling budget and the required labeling accuracy. In other words: you label each image five times, or each sentence three times. There was no theory behind how to choose an optimal frequency, and no framework to dynamically tune that frequency as a function of a record’s criticality to the model, its tolerance to noise, and how difficult it is to label.
Our recent study changes that. We now have data to model the impact of labeling accuracy on the model’s accuracy and to build strategies for optimizing how we spend our labeling budget. Understanding which records need more labels and which don’t opens the door to smartly optimized labeling budget strategies.
How to save money with a better labeling strategy
So: labeling accuracy is a function of the number of annotations per record and the probability of mislabeling a record, while model accuracy is a function of the accuracy of the labeling process and the size of the training set. Combine the two, and you can start to figure out exactly how to match a model’s accuracy to its labeling budget.
We simulated a 250,000-sample dataset for which each annotation costs 25¢, assumed a mislabeling probability of 40% across the dataset, and got the following relationships between the budget spent on labeling and the model’s accuracy for 5 different volumes of training data:
Right away, you can see that the strategic criterion for optimizing the budget is actually not the volume of data but rather the number of annotations per record.
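One half of that tradeoff, how consensus label accuracy grows with the number of annotations per record, can be sketched analytically. The helper below is our own construction, not the study's code; it assumes a binary task, independent annotators, and majority voting with ties broken by a coin flip, using the 40% mislabeling probability and 25¢ per-annotation cost quoted above:

```python
from math import comb

def label_accuracy(k, p):
    """Probability that a record's consensus label is correct when it is
    annotated k times, each annotation independently wrong with probability p
    (binary task, majority vote, ties broken by a coin flip)."""
    q = 1 - p
    acc = 0.0
    for correct in range(k + 1):
        prob = comb(k, correct) * q**correct * p**(k - correct)
        if 2 * correct > k:      # clear majority is right
            acc += prob
        elif 2 * correct == k:   # tie: right half the time
            acc += prob / 2
    return acc

BUDGET, COST, P_MISLABEL = 50_000, 0.25, 0.40
for k in (1, 2, 3, 4, 5):
    records = int(BUDGET / COST / k)  # records affordable at k annotations each
    print(f"{k} annotations/record: {records:>7,} records, "
          f"label accuracy {label_accuracy(k, P_MISLABEL):.1%}")
```

Under these assumptions, label accuracy climbs from 60% at one annotation to roughly 68% at five, while the number of records the budget covers shrinks fivefold; the simulation in the study weighs exactly this tension.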
For instance, with a budget of $50K, the customer is better off labeling 50K records 4 times each: this strategy yields a model 16% more accurate than labeling 200K records once.
However, if the customer needs her model to reach 86% accuracy, she would be better off using 200K training records labeled 3 times each; this would save her about $37.5K (20%) compared to a strategy of 5 labels per record on a training set of 150K records.
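The raw annotation costs behind those comparisons are simple arithmetic at 25¢ per annotation:

```python
COST_PER_ANNOTATION = 0.25  # 25 cents, as in the simulation above

def labeling_cost(records, annotations_per_record, cost=COST_PER_ANNOTATION):
    """Total labeling spend for a given record count and labeling frequency."""
    return records * annotations_per_record * cost

# Budget-constrained choice: both options cost exactly $50K.
print(labeling_cost(50_000, 4))   # → 50000.0
print(labeling_cost(200_000, 1))  # → 50000.0

# Accuracy-constrained choice: 200K records x 3 vs 150K records x 5.
print(labeling_cost(200_000, 3))  # → 150000.0
print(labeling_cost(150_000, 5))  # → 187500.0
```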
If we now represent the same simulated data grouped by number of annotations per record, it’s very clear that labeling each record only once is simply a bad idea. Even two annotations grossly outperform a single, fallible labeler. Three annotations per record seems a safe bet most of the time, while much above that, you may have trouble justifying the cost.
Now, it’s important to note that so far, we have still used a constant number of annotations for all records. But since we recently determined that some classes are more sensitive than others to labeling noise, it seems reasonable to exploit that fact to further optimize our labeling budget.
In another simulation, we considered a binary image classification problem on a balanced 500K-record dataset, assumed a more favorable mislabeling probability of 25%, and modeled one class with a significantly higher sensitivity to noise than the other. We analyzed four different labeling strategies for this problem:
- Strategy 1: we label the least sensitive class twice, and the most sensitive one three times
- Strategy 2: we label the least sensitive class three times, and the most sensitive one twice
- Strategy 3: we label both classes three times
- Strategy 4: we label both classes twice
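Setting the accuracy simulation itself aside, the annotation volume each strategy consumes on the balanced 500K-record dataset (250K records per class) is easy to tabulate:

```python
RECORDS_PER_CLASS = 250_000  # balanced 500K-record dataset

# Per-strategy annotation counts as (least sensitive class, most sensitive class).
strategies = {
    "strategy 1": (2, 3),
    "strategy 2": (3, 2),
    "strategy 3": (3, 3),
    "strategy 4": (2, 2),
}

for name, (k_least, k_most) in strategies.items():
    total = RECORDS_PER_CLASS * (k_least + k_most)
    print(f"{name}: {total:,} annotations")
```

Strategies 1 and 2 consume the same 1.25M annotations; strategy 3 is the priciest at 1.5M and strategy 4 the cheapest at 1M, which is what makes the accuracy comparison between them interesting.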
We can see that strategy 3 applied to the entire dataset leads to the strongest accuracy; however, it is barely better than strategy 1, which is significantly cheaper. Strategy 1 seems to be the best for really low labeling budgets, while strategy 4 is better for medium budgets.
Tuning the labeling strategy at the class level of course leads to a chicken-and-egg problem, since labels are needed to classify the data in the first place. However, the same sensitivity study can be combined with an unsupervised approach and applied to clusters instead of classes; this research is one of our current focus areas.
So where does this all leave us? For starters, we hope you’ve internalized that more data isn’t always a good thing. Cleaner, better-labeled, more accurate data, on the other hand, is.
But really, what’s most important to take away from our experiments is this: you’re likely spending too much money labeling data. If you’re like a lot of ML practitioners, you’re labeling too much without digging into label quality and how much both your good and bad labels affect your models. Remember: different classes and different problems require different labeling schemas. Leveraging active learning to understand which classes need extra attention from labelers holds the promise of building more accurate models at a fraction of today’s cost and time. And that’s something that should get everyone in your company on board.