Sampling Bias in Data Science
How to Deal with Your Model’s Hidden Enemy
Let’s imagine the following situation. You are working on a multi-class image classification problem, say, the MNIST Handwritten Digits problem (one of the most popular image classification datasets in the world).
You have ensured that the data to be used for training your Convolutional Neural Network (CNN) model is properly pre-processed and cleaned. You even split the dataset into training and validation sets, following the data science modeling and evaluation conventions.
Now, as you begin the modeling phase, you notice that no matter which CNN architecture you choose, your model shows a very high error rate during evaluation, especially when predicting the digit ‘3’. You even try various hyperparameter optimization and regularization techniques, but the problem persists.
So, is it that this problem just can’t be solved? Or will you have to wait for months or even years, hoping for a groundbreaking discovery in Computer Vision and Deep Learning that might solve it?
The answer to both questions is, thankfully, no. There is a good chance that the actual problem does not lie in the CNN architecture but is hidden in the data you are using for training. This hidden problem within your dataset is known as sampling bias: a problem that might seem trivial enough to overlook, but that can ruin your model’s inference performance.
In this article, we will learn what sampling bias is, and the different techniques that you can use in order to handle the bias within your data.
What is Sampling Bias?
We will understand sampling bias with the help of our MNIST example. Let’s say that after splitting the dataset of around 60,000 instances, we get training and validation sets of 50K and 10K instances respectively. We performed this split randomly, which means we had no control over how many instances of each target class (the digits 0 to 9) would end up in each set. Ideally, for equal representation, each digit class should have around 5,000 instances in the training set. However, upon checking the label counts, we observe that the digit ‘3’ has only around 400 instances in the training set.
In other words, the sample we used for training (the training set) was biased against the digit ‘3’. I hope you can now connect this with the poor inference performance of our model on the digit ‘3’. Because of this bias in the training sample, the model did not get enough data to learn to recognize the target class ‘3’.
This is sampling bias. It occurs when members of the population belonging to one class are more likely to be selected into a sample than members belonging to another class. The reverse also holds: a class that is less likely to be selected ends up under-represented in the sample, as in the MNIST example.
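The label-count check described above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `class_counts` and `underrepresented`, the `tolerance` threshold, and the synthetic label list are assumptions made for the example, not part of the MNIST dataset or of any library.

```python
from collections import Counter

def class_counts(labels):
    """Return a {class: count} mapping, sorted by class label."""
    return dict(sorted(Counter(labels).items()))

def underrepresented(labels, tolerance=0.5):
    """List classes whose count falls below tolerance * the count
    each class would have under equal representation."""
    counts = class_counts(labels)
    expected = len(labels) / len(counts)
    return [c for c, n in counts.items() if n < tolerance * expected]

# Hypothetical training labels that mimic the example:
# roughly 50K instances, with only 400 instances of the digit 3.
train_labels = []
for digit in range(10):
    n = 400 if digit == 3 else 5500
    train_labels.extend([digit] * n)

print(class_counts(train_labels))
print(underrepresented(train_labels))
```

Running a check like this right after splitting, and before training, makes an under-represented class visible early.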
Here, the ‘population’ refers to the entire dataset. ‘Sample’ on the other hand refers to a subset of the population, for example, the training or validation splits of the dataset.
Now that we know what sampling bias is, let us look at the different ways to overcome it. Since this article is aimed at beginners, we won’t go into complex statistical terminology.
More Data is Always Better
There are two primary reasons why your training sample might be biased.
The first and most common reason is bias within the population (i.e., our original data) itself. If your original dataset already suffers from sampling bias, there is a good chance that your training sample will suffer from it too. The original dataset can be biased for many reasons, mostly related to the data collection process or to geographical and socio-economic factors.
The second reason is sheer coincidence. The data within the population may be perfectly balanced, yet the random splitting of the data into training and validation sets may place more instances of a certain class in the training set. A model trained on such a sample will perform significantly better on the classes the sample was biased towards than on the classes it was biased against.
But what if I tell you that both these causes, the bias within the population itself and the bias generated during random sampling, have one thing in common? While the two may seem completely different, they share one fix: more data. Here’s how more data can fix the problem.
In the first case, where the bias lies within the population itself, we can identify the classes that are under-represented in the dataset. Then, based on these observations, we can collect more data for these classes, thus balancing the representation of each target class within the population.
In the second case, where the sampling bias arose by chance during the random selection of instances into the training sample, adding more data to the population makes large chance deviations less likely: the larger the pool we draw from, the closer the class proportions of a random sample tend to stay to those of the population. While this won’t eliminate the bias with certainty, it will most likely reduce the chances of getting a biased training sample.
Data collection can be a bit tricky. Sometimes you will easily find similar data on the Internet from sources like Kaggle, GitHub, etc. Other times, no similar data will be directly available, and in such a case you will have to scrape your data.
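To make the second case concrete, here is a small simulation using only the Python standard library. It is a sketch under assumed numbers: the function name `share_spread`, the 10% minority share, the 80% training fraction, and the two population sizes are all illustrative choices, not values from the MNIST example.

```python
import random
import statistics

def share_spread(population_size, train_fraction=0.8,
                 minority_share=0.1, trials=200, seed=0):
    """Repeat a random split many times and return the standard
    deviation of the minority class's share in the training sample."""
    rng = random.Random(seed)
    n_minority = int(population_size * minority_share)
    labels = [1] * n_minority + [0] * (population_size - n_minority)
    n_train = int(population_size * train_fraction)
    shares = []
    for _ in range(trials):
        train = rng.sample(labels, n_train)  # random split, no stratification
        shares.append(sum(train) / n_train)  # minority share in this split
    return statistics.pstdev(shares)

small = share_spread(100)      # small population
large = share_spread(10_000)   # 100x larger population
print(f"spread with 100 instances:    {small:.4f}")
print(f"spread with 10,000 instances: {large:.4f}")
```

The spread of the minority share across random splits should come out noticeably smaller for the larger population, which is the sense in which more data lowers the risk of an accidentally biased split.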
Here’s a list of resources that you can use to find more data:
- GitHub repository: awesome-public-datasets by awesomedata
- Google Public Data
- UCI Machine Learning Repository
Stratified Sampling to the Rescue
The problem with random splitting to generate the training and validation samples is that we have no control over the final class distribution of these samples. As a result, despite having a well-balanced dataset, you might end up with a training or validation sample with a large sampling bias.
Here’s where stratified splitting comes to the rescue. Performing stratified splitting on the dataset allows the class distribution to be preserved while making the splits. This means if you had a balanced original dataset, the training and validation samples will also be balanced. Thus, by using a stratified sampling technique, you can have control over how the final distribution within the samples will turn out to be.
Now, let us see how you can implement stratified sampling using Scikit-Learn and Python.
First, let us generate a random dataset.
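The original listing is not available here, so the following is a minimal stand-in, assuming NumPy. The sizes (10 instances of class ‘0’ and 40 of class ‘1’, i.e. a 1:4 ratio), the two random features, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Assumed class sizes giving a 1:4 ratio of class 0 to class 1.
n_class0, n_class1 = 10, 40

# Two random features per instance; the feature values are irrelevant
# for stratification, only the labels matter.
X = rng.normal(size=(n_class0 + n_class1, 2))
y = np.array([0] * n_class0 + [1] * n_class1)

# Shuffle so that the classes are not stored in contiguous blocks.
order = rng.permutation(len(y))
X, y = X[order], y[order]

print(X.shape)         # (50, 2)
print(np.bincount(y))  # [10 40]
```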
[Output: the generated dataset (not shown)]
Now that we have created the dataset, we will perform the stratified splits. Upon completion, we will get the indexes of the data instances for the training and validation split.
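One way to obtain such indexes is Scikit-Learn’s `StratifiedShuffleSplit`. The sketch below is not necessarily the code the article originally used; it rebuilds a small 1:4 dataset so that it runs on its own, and the 20% validation size and `random_state` are assumed values.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Assumed toy labels with a 1:4 ratio of class 0 to class 1.
y = np.array([0] * 10 + [1] * 40)
# Placeholder features; only the labels drive the stratification.
X = np.zeros((len(y), 1))

# A single stratified split with 20% of the data held out for validation.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, y))

print("train indexes:", train_idx)
print("validation indexes:", val_idx)
print("train class counts:", np.bincount(y[train_idx]))
print("validation class counts:", np.bincount(y[val_idx]))
```

With these sizes, the training split should contain 8 instances of class ‘0’ and 32 of class ‘1’, and the validation split 2 and 8, so both keep the 1:4 ratio. For the common case where you want the split arrays rather than the indexes, `train_test_split` accepts a `stratify` argument that serves the same purpose.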
[Output: indexes of the training and validation splits (not shown)]
Here, the blue shaded indexes belong to class ‘0’ and the non-shaded ones belong to class ‘1’. As you can see, the 1:4 ratio of class ‘0’ to class ‘1’ in the original dataset is maintained across the training and validation splits.
So, now we know how to deal with sampling bias within our data. I hope these techniques come in handy the next time you work on your Machine Learning or Deep Learning projects. If you liked this article, share it with your friends and other Data Science enthusiasts.