This article was originally published on our company website as a series of three articles. Lakera’s developer platform enables ML teams to ship fail-safe computer vision models.
All machine learning models deployed into production should be tested for a few fundamental properties. In previous blogs, we have written extensively on testing machine learning models for model robustness (read our blog on fuzz testing and metamorphic relations), testing for data bugs, and techniques from traditional software engineering like regression testing. In this post, we will take a look at model bias as well as data bias, and what can be done to prevent either.
1. Data representativity.
“Data is a reflection of the inequalities that exist in the world”–Beena Ammanath, AI for Good Keynote. While this might be true, developers have great potential to curb model bias and data bias in their computer vision systems.
Testing whether bias is present in a computer vision system is key to understanding how it will perform in operation. Bias manifests itself in numerous ways, from data collection and annotation to the features that a model uses for prediction.
Let’s start by looking at data representativity and the model tests that empower you to uncover pesky biases early in your development process.
Collecting data.
Bias can first appear when collecting and annotating data. The data that you use to build and evaluate a computer vision model must reflect what you intend to use it for: this is referred to as data representativity.
A radiology diagnostic tool to be deployed in southern France must be evaluated on data that reflects the local patient demographics. The diagnostic tool should also be evaluated on images captured with the machines present in the target hospitals. Past research has proposed guidelines for collecting and annotating training and test data that help mitigate such bias.
How do you know if you have the data that matters?
Once you have collected data, it is essential to confirm that it is representative of the target population. While establishing this from image data alone is challenging, image metadata can prove to be very useful. In previous posts, we have introduced the notion of metadata and why it contains semantic information that is key to evaluating machine learning models, in particular in computer vision. If the sex and age of patients are available, as well as the model of the machine used to capture the images, we can create unit tests that check that data for each relevant slice is present in the datasets. This way we build up a comprehensive test suite that ensures the data as a whole is representative and identifies areas where it isn't, effectively guiding the data collection process.
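As a minimal sketch of what such tests could look like, assuming the metadata is available as a pandas DataFrame and using hypothetical column names like `sex` and `machine_model` (run with pytest or any test runner):

```python
# A minimal sketch of a metadata coverage test; `sex` and `machine_model`
# are hypothetical column names, and the threshold is illustrative.
import pandas as pd

MIN_SAMPLES_PER_SLICE = 50

# Slices the product is expected to support.
REQUIRED_SLICES = [
    {"sex": "F", "machine_model": "scanner_a"},
    {"sex": "M", "machine_model": "scanner_a"},
    {"sex": "F", "machine_model": "scanner_b"},
    {"sex": "M", "machine_model": "scanner_b"},
]

def test_slice_coverage(metadata: pd.DataFrame) -> None:
    """Fail if any required slice is missing or under-represented."""
    for slice_spec in REQUIRED_SLICES:
        mask = pd.Series(True, index=metadata.index)
        for column, value in slice_spec.items():
            mask &= metadata[column] == value
        count = int(mask.sum())
        assert count >= MIN_SAMPLES_PER_SLICE, (
            f"Slice {slice_spec}: only {count} samples, "
            f"need at least {MIN_SAMPLES_PER_SLICE}"
        )
```

Each failing assertion points at a concrete slice to collect more data for, which is exactly the guidance you want from such a test suite.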
Leave no outlier behind.
Finally, representativity in the literature refers to a match to the target population: for example, if 99.9% of the target population is between 20 and 70 years old, an evaluation dataset should reflect this. This however disregards the importance of the tails of the distribution and is a key difference between building prototypes and production-ready systems. Indeed, an ML model may achieve excellent accuracy on an evaluation dataset containing data in the 20 to 70-year-old range, even if it performs poorly on 80-year-olds. If the product is intended to work on patients of all ages, then it is paramount to explicitly test on slices belonging to the tail of the distribution, even if they are rarely encountered in practice.
Aggregate evaluation metrics such as accuracy, precision, and recall may be misleading: it is important to explicitly measure performance for each relevant slice.
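To make this concrete, here is a small sketch of slice-based evaluation, assuming predictions and labels are pandas Series aligned with a metadata DataFrame that has a hypothetical `age` column:

```python
# A sketch of slice-based evaluation; `age` is a hypothetical metadata column,
# and the age bands are illustrative.
import pandas as pd
from sklearn.metrics import accuracy_score

def accuracy_per_age_band(y_true: pd.Series, y_pred: pd.Series,
                          metadata: pd.DataFrame) -> pd.Series:
    """Report accuracy per age band rather than a single aggregate number."""
    age_bands = pd.cut(metadata["age"], bins=[0, 20, 70, 120])
    results = {}
    for band, group in metadata.groupby(age_bands, observed=True):
        results[str(band)] = accuracy_score(y_true.loc[group.index],
                                            y_pred.loc[group.index])
    return pd.Series(results, name="accuracy")
```

A single aggregate score over the whole evaluation set can hide a weak band entirely; reporting one number per band makes the tail visible.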
In conclusion, find out who your target groups are, big or small, and make sure you have enough data for all of them. You can use metadata as a tool to find the groups that matter.
2. Shortcut learning.
Nobel Prize-winning economist Daniel Kahneman once remarked:
“By their very nature, heuristic shortcuts will produce biases, and that is true for both humans and artificial intelligence, but the heuristics of AI are not necessarily the human ones”. This is certainly the case when we talk about “shortcut learning”.
Despite careful testing on the data side, model bias can reveal itself more directly in what the computer vision system learns. This issue of a computer vision model using the wrong visual features for prediction is referred to as shortcut learning.
Looking in the wrong places.
The black-box nature of many computer vision models renders such shortcuts difficult to find, and as a result, trained models tend not to generalize well to unknown environments. In the paper Recognition in Terra Incognita, Caltech researchers showcase a classification model that does well at finding cows on an evaluation set but fails when asked to classify cows on the beach or in other unusual environments. For a computer vision model, visual features indicating grass and mountains may contribute to detecting a cow in the image, while beach or indoor features may weigh heavily against it. It is expected that the model uses such features, but their impact should be understood before deploying such models in production. A company building a cow detector unaware of this fact would disappoint some coastal clients, creating reputational risk.
How to detect shortcuts.
In this paper, the authors show that models evaluated on face detection benchmarks achieve above-random performance even after the hair, face, and clothes of the subjects are removed. This indicates that irrelevant background features are being used for prediction. Another piece of research identifies an initial list of such biases that can appear in practice in medical applications. Similar ablation experiments, where the parts of the image relevant for prediction are masked out, can be useful in identifying such shortcuts. Metadata can be a powerful tool to detect and test for some of these shortcuts as well. Statistical dependence between metadata dimensions and the performance of the model can surface concerning shortcuts: if the demographic of a patient is highly correlated with performance, then further investigation is needed!
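One lightweight way to run that investigation is a dependence test between a metadata dimension and per-sample correctness. The sketch below assumes predictions and metadata are aligned pandas objects; `machine_model` is a hypothetical column name:

```python
# A sketch of a dependence check between a metadata dimension and per-sample
# correctness; `machine_model` is a hypothetical column name.
import pandas as pd
from scipy.stats import chi2_contingency

def performance_depends_on(metadata: pd.DataFrame, column: str,
                           y_true: pd.Series, y_pred: pd.Series,
                           alpha: float = 0.01) -> bool:
    """Flag a statistically significant link between `column` and correctness."""
    correct = (y_true == y_pred)
    contingency = pd.crosstab(metadata[column], correct)
    _, p_value, _, _ = chi2_contingency(contingency)
    # A significant link is a prompt for investigation, not proof of a shortcut.
    return p_value < alpha

# Example usage:
# performance_depends_on(metadata, "machine_model", y_true, y_pred)
```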
To summarize, shortcut learning happens when your computer vision system is looking at the wrong visual features to make predictions. Such shortcuts can be detected from image data alone, for instance, by measuring reasonable performance despite masking out the regions of the image that matter for prediction. They can also be detected by referring back to your metadata: if there is a strong link between metadata parameters and the performance of the model, then it’s worth taking a closer look. Having practices in place during the machine learning model evaluation process to detect these shortcuts is key to a high-performing model.
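One such practice is the ablation experiment mentioned above: blank out the regions that should matter and check that performance drops towards chance. A minimal sketch, assuming a hypothetical `predict_batch` inference function and boolean `foreground_masks` marking the relevant pixels:

```python
# A sketch of an ablation experiment; `predict_batch` and `foreground_masks`
# are hypothetical: the model's inference function and boolean masks marking
# the pixels that should matter for the prediction.
import numpy as np

def ablation_accuracy(predict_batch, images, labels, foreground_masks):
    """Accuracy after blanking out the regions that should drive the prediction.

    If this stays well above chance, the model is likely relying on
    background shortcuts rather than the object of interest.
    """
    masked = images.copy()
    masked[foreground_masks] = 0  # zero out the relevant pixels
    predictions = predict_batch(masked).argmax(axis=-1)
    return float((predictions == labels).mean())
```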
3. Drift and monitoring.
If the past three years have taught us anything, it is that the world around us can take unexpected turns. The same can be true for your computer vision models.
Unforeseen data may be presented to the computer vision model during operation despite careful curation of datasets and mitigation of shortcuts. One such phenomenon is data drift.
A hospital may change its x-ray machine and keep using the same computer vision model for diagnosis, even though the system was not trained on this kind of input data. Similarly, an autonomous car built solely for European streets, notable for their twists and turns, may not perform as expected when deployed in an American city.
Fail, but fail gracefully.
ML models tend to fail silently: they make predictions regardless, albeit erroneous ones. Operational bias can be reduced with the right mitigation strategies: the wider ML system should detect in operation when an image looks “suspicious” or “unknown”, and fail gracefully (for example, by asking the doctor for a closer look).
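As a sketch of what this could look like in code, with `predict` and `is_out_of_distribution` as hypothetical callables supplied by the wider system:

```python
# A sketch of graceful failure; `predict` and `is_out_of_distribution` are
# hypothetical callables provided by the wider system.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Diagnosis:
    label: Optional[str]      # None when the system abstains
    needs_human_review: bool

def diagnose(image, predict, is_out_of_distribution) -> Diagnosis:
    """Return a prediction, or abstain and route the case to a doctor."""
    if is_out_of_distribution(image):
        return Diagnosis(label=None, needs_human_review=True)
    return Diagnosis(label=predict(image), needs_human_review=False)
```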
Out-of-distribution detection.
The problem of finding such problematic inputs is called out-of-distribution detection. It is a challenging problem that involves comparing distributions of high-dimensional objects. If you’re interested in learning more, the research in the area is extensive [1], [2], [3]. Note that out-of-distribution detection is a key part of many learning systems: for example, Generative Adversarial Networks train a discriminator network whose sole task is to detect whether a generated image is “suspicious” when judged against a reference dataset. Systems in production should be endowed with an out-of-distribution detector so that problematic samples are detected on the fly. If a problematic image is detected, the system should fail gracefully, reducing the risk of silent failures in your computer vision system.
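As one illustrative approach, and not the specific methods of the papers cited above, a simple detector can threshold the Mahalanobis distance between a sample’s embedding and the in-distribution training features. The sketch below assumes penultimate-layer features are available as numpy arrays:

```python
# A sketch of a feature-space out-of-distribution detector; it assumes
# penultimate-layer features of the training set are available as a numpy
# array, and the threshold is calibrated on in-distribution data.
import numpy as np

class MahalanobisOODDetector:
    def fit(self, reference_features: np.ndarray, quantile: float = 0.99):
        """Estimate mean/covariance of in-distribution features, set a threshold."""
        self.mean = reference_features.mean(axis=0)
        covariance = np.cov(reference_features, rowvar=False)
        self.precision = np.linalg.pinv(covariance)
        self.threshold = np.quantile(self._score(reference_features), quantile)
        return self

    def _score(self, features: np.ndarray) -> np.ndarray:
        centered = features - self.mean
        # Squared Mahalanobis distance to the in-distribution mean.
        return np.einsum("nd,de,ne->n", centered, self.precision, centered)

    def is_suspicious(self, features: np.ndarray) -> np.ndarray:
        """True for samples whose distance exceeds the calibrated threshold."""
        return self._score(features) > self.threshold
```

Samples flagged as suspicious would then be routed to the graceful-failure path described above rather than silently predicted on.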
It is essential to keep data drift in mind once your system is in production. Keeping the data and model up to date is just one part of any AI system’s lifecycle. In the meantime, ensure that mitigation strategies are in place so that suspicious inputs are detected and reviewed by humans in the loop.
- [1] “Deep Anomaly Detection with Outlier Exposure”, Hendrycks et al., 2019
- [2] “FRODO: Free rejection of out-of-distribution samples: application to chest x-ray analysis”, Çallı et al., 2019
- [3] “Efficient Out-of-Distribution Detection in Digital Pathology Using Multi-Head Convolutional Neural Networks”, Linmans et al., 2020