We tried to ‘lie’ to our classification model by feeding mis-labeled data and analyzed the results…
[This article has been coauthored by Jennifer Prendki and Akanksha Devkar from Alectio, and has been published in Analytics Vidhya]
In a previous blog post, we focused on trying to understand how particular classes could be more or less sensitive to the induction of noise in their labels, and to variations in the volume of training data used. If some classes were more ‘similar’ to each other, it would be easier for a model to make mistakes between instances from look-alike classes than between examples pulled from entirely different ones. We felt that this relative sensitivity was an understudied topic in Machine Learning, especially considering the potential gains that research in this area could yield. For instance:
- If not all classes, and eventually not all records, are equally sensitive to bad labels, then shouldn’t we spend more effort getting the most sensitive ones labeled properly?
- If two classes were too similar to each other, wouldn’t that be an indication that the data is not well adapted to the number of classes we chose? That the ML scientist’s best bet would be to merge those two classes?
- Could class sensitivity be “fixed” by increasing the size of our training dataset? Or would such sensitivity remain, regardless of the volume of data used?
- Is class sensitivity model-dependent, or is it a property intrinsic to the data itself?
To attempt to answer some of these questions, we performed a series of “stress-testing” experiments on an easy and intuitive classification problem whose training set we modified.
In particular, we ran two different experiments:
1. In the first one, focused on labeling noise induction, we gradually injected noise into the training labels of the original version of the CIFAR-10 dataset and observed its effect on accuracy and interclass confusion.
2. In the second one, we studied the impact of data volume reduction by gradually decreasing the size of the CIFAR-10 training set and observing its effect on accuracy and interclass confusion.
We found that the ‘cat’ class within the CIFAR-10 dataset was the most sensitive in both experiments, followed by the ‘bird’ and ‘dog’ classes.
After starting out with a small (shallow) custom deep neural network in these experiments, we naturally started wondering whether class sensitivity was model-dependent. In other words, would it be a wise investment of time for data scientists to optimize their model if such effects were observed, or would their time be better spent on getting more data or better labels? If we used a deeper network, would the same classes be affected the most? Taking the experiments from the last blog post one step further, we studied the effect of labeling noise induction and data volume reduction on different models.
We sampled a few models that are well-known to most ML scientists for the purpose of repeating our study and tested class sensitivity:
- LeNet
- ResNet18
- UnResNet18 (ResNet18 without the skip connections)
- GoogLeNet
So far, we have focused solely on Deep Learning models, but we chose relatively diverse architectures in order to get closer to an answer to the questions asked earlier.
If you haven’t read the details about our earlier work, fear not: we will explain the protocol in detail.
Prior to running our series of experiments, we trained a baseline version of each of the models cited above, so that we could study the discrepancies obtained when inducing noise or reducing data volume. In Figure 1, we show the baseline results for each model; each time, the model is trained using the full (50K-example) training set, which we refer to as S_0% to indicate that 0% noise has been injected into the training set.
Here are the details of the baseline results for all models:
A first look at the baseline confusion matrices across the different models shows (unsurprisingly) that not all models perform equally well, even though the training set is the same. LeNet does very poorly compared to the other (deeper) models, even without any noise induced; the ‘bird’ class, which was particularly sensitive to noise in our original study, is problematic even when using a perfectly clean training set. The ‘dog’ class appears to be the least accurate in both ResNet18 and UnResNet18, while LeNet shows the ‘cat’ class as being the most problematic, just as our original custom (and shallow) model did. The ‘cat’ and ‘dog’ classes are equally inaccurate in the case of the GoogLeNet model.
Yet while many differences can be noted when comparing the confusion matrices, we also observe a lot of commonalities; for instance, cats are always confused mostly for dogs and, to a lesser degree, for frogs. This is a strong indication that class similarity is in fact intrinsic to the data, and hence that some sensitivity will predominate regardless of the model that is used. After all, it isn’t hard to see how cats are more easily mistaken for dogs than they would be for planes. This is an important result since it clearly shows the existence of intrinsic weaknesses in a training set that models
We can also observe, across the different model baselines, that cross-class confusion is directional (i.e., the confusion matrix is not symmetrical), just as we observed in the analysis from the previous blog post.
- Overall, the same groups of classes are confused with each other, regardless of the model that is chosen.
- Cross-class confusion is directional, i.e. class A can be easily confused for class B even if class B is rarely confused for class A.
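The directional confusion summarized above can be read directly off a row-normalized confusion matrix. The following is a minimal sketch in plain Python, not the code used for this study; the function names and the three-class toy labels are illustrative assumptions.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix: entry [t][p] is how often true class t was predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def row_normalize(matrix):
    """Convert counts to per-true-class rates (each non-empty row sums to 1)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def directional_gaps(rates):
    """Return (gap, a, b) tuples, largest gap first, where
    gap = rate(a predicted as b) - rate(b predicted as a) >= 0."""
    gaps = []
    n = len(rates)
    for a in range(n):
        for b in range(a + 1, n):
            diff = rates[a][b] - rates[b][a]
            gaps.append((diff, a, b) if diff >= 0 else (-diff, b, a))
    return sorted(gaps, reverse=True)


# Toy labels (made up for illustration): class 0 is often predicted as
# class 1, but class 1 is never predicted as class 0.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0]
rates = row_normalize(confusion_matrix(y_true, y_pred, num_classes=3))
print(directional_gaps(rates)[0])  # (0.5, 0, 1)
```

A symmetric matrix would give gaps of zero everywhere; large gaps flag the pairs where confusion flows mostly in one direction.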
Experiment 1: Labeling Noise Induction
We follow the same protocol as in our previous study, repeating the following steps to add labeling noise to the clean CIFAR-10 training dataset:
- Randomly select x% of the training set,
- Randomly shuffle the labels for those selected records, and call this sample S_x,
- Train all different models on S_x,
- Analyze the confusion matrix across all models.
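The first two steps of this protocol can be sketched as follows. This is an illustrative version in plain Python, not the code used for the study: the function name, the `seed` parameter and the synthetic label list are assumptions. It interprets "shuffle" as permuting the labels among the selected records, so a few of them may receive their original label back by chance.

```python
import random


def inject_label_noise(labels, fraction, seed=0):
    """Shuffle the labels of a random `fraction` of records among themselves.

    Returns (noisy_labels, chosen_indices); the input list is left untouched.
    """
    rng = random.Random(seed)
    n_noisy = round(fraction * len(labels))
    # Step 1: randomly select x% of the training set.
    chosen = rng.sample(range(len(labels)), n_noisy)
    # Step 2: shuffle the labels of the selected records only.
    shuffled = [labels[i] for i in chosen]
    rng.shuffle(shuffled)
    noisy = list(labels)
    for idx, new_label in zip(chosen, shuffled):
        noisy[idx] = new_label
    return noisy, sorted(chosen)


# Synthetic stand-in for the 50K CIFAR-10 training labels (10 classes).
labels = [i % 10 for i in range(50_000)]
noisy_labels, touched = inject_label_noise(labels, fraction=0.10, seed=42)
print(len(touched))  # 5000
```

Because the labels are permuted rather than redrawn, the overall class distribution of the training set is unchanged; only the pairing between images and labels is corrupted.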
As we did for our custom model, we ran the same experiment 5 times and averaged the results, to account for the variance that comes from selecting such a sample randomly.
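Averaging over repeated runs amounts to an element-wise mean of the confusion matrices obtained from each run. Here is a small sketch under that assumption; the function name and the 2x2 toy matrices are hypothetical and only serve to show the arithmetic.

```python
def average_confusion(matrices):
    """Element-wise mean of equally sized confusion matrices (lists of lists)."""
    if not matrices:
        raise ValueError("need at least one confusion matrix")
    n_runs = len(matrices)
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[r][c] for m in matrices) / n_runs for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Two made-up runs on a two-class problem; the study averages five runs.
runs = [
    [[8, 2], [1, 9]],
    [[6, 4], [3, 7]],
]
print(average_confusion(runs))  # [[7.0, 3.0], [2.0, 8.0]]
```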
Below, we show the average confusion matrix over the five samples S_10%,i (1 ≤ i ≤ 5) for the different models.
We can see from Figure 2 that LeNet reacts very badly to labeling pollution. This was to be expected from its baseline confusion matrix: a model that already fails to perform well on the clean CIFAR-10 can be anticipated to perform very badly once pollution is added. ‘cat’ and ‘bird’ are the weakest classes in both ResNet18 and UnResNet18. Looking at the relative confusion matrix across models, we can see that GoogLeNet is very stable under labeling pollution compared to the other models, although even for GoogLeNet, ‘cat’ and ‘bird’ are the weakest classes.
Figure 3 below shows the 4 least accurate classes for each model. ‘cat’ is almost unequivocally the weakest class, while the models disagree somewhat on whether ‘bird’ or ‘dog’ is the second weakest. It is worth noting that ‘cat’ is the least accurate class for all models except LeNet, and that ‘bird’ and ‘cat’ turn out to be the classes most sensitive to labeling noise. This suggests that sensitive classes remain sensitive regardless of the model used on them.
From Figure 3, we can infer that ‘cat’, ‘bird’ and ‘dog’ are the classes most affected by labeling noise induction. Across neural networks of varying depth and distinct architectures, the models tend to find the same classes confusing or hard to learn.
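A ranking of the least accurate classes can be derived from a confusion matrix by dividing each diagonal entry by its row total. The sketch below assumes that definition of per-class accuracy; the function names, placeholder class names and counts are made up for illustration and are not results from the study.

```python
def per_class_accuracy(matrix):
    """Correct predictions divided by the number of true examples, per class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


def least_accurate(matrix, class_names, k=4):
    """Return the k (class_name, accuracy) pairs with the lowest accuracy."""
    ranked = sorted(zip(per_class_accuracy(matrix), class_names))
    return [(name, acc) for acc, name in ranked[:k]]


# Hypothetical 3-class count matrix (rows = true class, columns = prediction).
toy_matrix = [
    [90, 5, 5],
    [10, 60, 30],
    [20, 10, 70],
]
print(least_accurate(toy_matrix, ["class_a", "class_b", "class_c"], k=2))
# [('class_b', 0.6), ('class_c', 0.7)]
```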
- The same classes are impacted by labeling pollution, regardless of the model used.
Experiment 2: Data Reduction
The next experiment we performed consisted of changing the data volume and observing the effects on accuracy and class sensitivity. The motivation is that when increasing the amount of labeling noise, we also automatically reduce the amount of ‘good’ data available to the model; hence the results of experiment 1 show a combination of both effects (less good data + more bad data), which we are trying to decouple.
In this experiment, we chose to reduce the size of the training set and observe the effects on the confusion matrix. Note that this data reduction experiment was performed on the clean CIFAR-10, i.e., without injecting any labeling noise.
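One simple way to reduce the training set is to keep a uniformly random subset of the records and leave their labels untouched. The sketch below assumes that sampling scheme (the text does not say whether the reduction was stratified by class), and the function name, the `seed` parameter and the 50% fraction are illustrative choices.

```python
import random


def subsample_indices(n_records, keep_fraction, seed=0):
    """Pick a uniformly random subset of record indices, without replacement."""
    rng = random.Random(seed)
    n_keep = round(keep_fraction * n_records)
    return sorted(rng.sample(range(n_records), n_keep))


# Keep half of a 50K-example training set; labels stay clean.
kept = subsample_indices(50_000, keep_fraction=0.5, seed=7)
print(len(kept))  # 25000
```

The returned indices can then be used to slice both the images and their labels, so that each retained image keeps its correct label.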
We can see that the ‘cat’ and ‘bird’ classes are the most affected by data volume reduction across the different models. The classes impacted by volume reduction are overall the same as those impacted by pollution, which raises the question: is labeling noise causing issues only because it reduces the volume of good training data, or are bad labels themselves confusing the model? (Note: we will answer this question in depth in our third and last blog post on the topic.)
- The same classes are impacted when reducing the amount of training data, regardless of the model used.
- The classes that were impacted by injection of labeling noise are, generally speaking, the same ones that are sensitive to volume reduction.
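One way to make the comparison between the two experiments concrete is to compute, for each class, the drop in accuracy relative to the baseline, and then check how much the sets of most-impacted classes overlap. The following sketch assumes that approach; the function names, class names and accuracy values are placeholders invented for illustration, not measurements from the study.

```python
def accuracy_drop(baseline_acc, stressed_acc):
    """Per-class accuracy lost relative to the baseline (positive = worse)."""
    return {name: baseline_acc[name] - stressed_acc[name] for name in baseline_acc}


def most_impacted(drops, k=3):
    """Names of the k classes with the largest accuracy drop."""
    ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
    return {name for name, _ in ranked[:k]}


# Placeholder per-class accuracies (made up).
baseline = {"class_a": 0.80, "class_b": 0.82, "class_c": 0.85, "class_d": 0.95}
with_noise = {"class_a": 0.60, "class_b": 0.70, "class_c": 0.68, "class_d": 0.93}
less_data = {"class_a": 0.65, "class_b": 0.76, "class_c": 0.70, "class_d": 0.92}

noise_hit = most_impacted(accuracy_drop(baseline, with_noise), k=2)
volume_hit = most_impacted(accuracy_drop(baseline, less_data), k=2)
print(sorted(noise_hit & volume_hit))  # ['class_a', 'class_c']
```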
This study led us to an interesting observation: the same classes were consistently affected both by the reduction of the size of the training set and by the injection of wrong labels, regardless of the model we chose. This supports the hypothesis that class sensitivity is in fact not model-dependent, but rather a data-intrinsic property. If that is true, it is strong evidence that data scientists would be better advised to spend time curating their data as opposed to tuning their models.
But our study also brought us to some follow-up questions: since the same classes were sensitive to noise injection and volume reduction, and since every noise injection experiment also caused the reduction of good (correctly labeled) data, it is fair to ask:
- Whether labeling noise would still have an impact on a significantly larger training set (i.e., does labeling quality matter as much for large datasets?), and,
- Whether the effect of bad labels could be compensated for by the volume of data.
These are the questions we will cover in the final installment of this series.