Test machine learning the right way: Metamorphic relations

Lakera
Nov 8, 2022


This article was originally posted on our company website as part of a four-article series on machine learning testing. Lakera’s developer platform enables ML teams to ship fail-safe computer vision models.

In this part of our machine learning testing series, we’ll look at metamorphic relations, a technique used to multiply your available data and labels, and discuss how they can be used for machine learning model evaluation. Metamorphic relations help extend the test coverage of your ML (machine learning) system beyond what can be achieved through normal data collection. Earlier articles in this series covered other aspects of evaluating machine learning models, such as testing for data bugs and regression testing.

The test oracle problem

The test oracle problem is not specific to ML; it is well known from traditional software [1]. It refers to the difficulty of determining the correct output for a given test input.

Let’s look at an example from medical imaging. Imagine that you are building an ML system that is used as a diagnostic tool for cancer histopathology. The input is an image of a histopathology sample. The output is a diagnosis: cancer or no cancer.

The test oracle problem presents itself because you have some input image data, but you don’t know the label (cancer/no cancer).

This is solved by having the image annotated. You can send these images to histopathologists, who can play the role of the test oracle by adding a label to each sample image. The problem is that these images are scarce to begin with and the ones that you do have will be expensive to annotate.

Thorough machine learning evaluation has to cover a combinatorial number of scenarios, which requires more data and labels than can realistically be collected. For example, the number of relevant scenarios quickly becomes too large once you consider variations in the color of the image, the type of microscope used to take the image, the zoom level, etc. As a result, only a subset of the relevant conditions can be tested, leading to insufficient test coverage.

In come metamorphic relations. Take an image that you already have a label for and rotate it. You could send this rotated image to be re-annotated to solve the test oracle problem. But you don’t need to: you already know that rotation does not change the diagnosis, so the label for the rotated image is still cancer.

That’s how metamorphic relations can contribute to solving the oracle problem.

What are metamorphic relations?

Metamorphic relations are a great way of extending the test coverage of your ML model. A metamorphic relation [2]:

“Refers to the relationship between the software input change and output change.”

Consider a function f that squares its input. An easily tested metamorphic relation for it (ignoring numerical issues) is:

f(-input) = f(input)
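
As a minimal sketch, the relation above can be turned into an automated check. The helper name `check_even_symmetry` and the random sampling strategy are our own illustrative choices, not part of any library. Note that the check never needs to know the correct value of f(input); it only compares two outputs with each other.

```python
import math
import random


def square(x: float) -> float:
    """The function under test."""
    return x * x


def check_even_symmetry(f, trials: int = 1000, seed: int = 0) -> bool:
    """Check the metamorphic relation f(-x) == f(x) on random inputs.

    No expected output is needed: the relation links two outputs to
    each other, which sidesteps the test oracle problem.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-1e6, 1e6)
        # isclose tolerates small numerical differences between the two calls.
        if not math.isclose(f(-x), f(x), rel_tol=1e-12):
            return False
    return True
```

Calling `check_even_symmetry(square)` passes, while a function that is not symmetric around zero, such as a cube, fails the same check.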

This is a powerful concept that can be applied to ML as well! Two classes of metamorphic relations that are well known in computer vision are:

a) Image augmentations (e.g., rotation) that affect the label in a known way and act as a dataset multiplier;

b) Using temporal relations in video sequences (e.g., two successive image frames in a 30Hz video sequence are likely similar) that act as supervisory signals.

Both have been applied in the context of (self-)supervised learning to create more robust ML models [3].
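
To make class a) concrete, here is a small sketch of rotation used as a dataset multiplier. Images are represented as plain nested lists to keep the example dependency-free; in practice you would likely use an array library. The function names are hypothetical. The sketch assumes that rotation leaves the label unchanged, which is plausible for histopathology samples but does not hold for every task (a rotated "6" looks like a "9").

```python
from typing import Iterator, List, Tuple

Image = List[List[int]]


def rotate90(image: Image) -> Image:
    """Rotate an image (list of pixel rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def multiply_with_rotations(
    samples: List[Tuple[Image, str]],
) -> Iterator[Tuple[Image, str]]:
    """Yield each sample at 0, 90, 180 and 270 degrees.

    Assumption: the label is invariant to rotation, so the original
    annotation is reused instead of asking for a new one.
    """
    for image, label in samples:
        rotated = image
        for _ in range(4):
            yield rotated, label
            rotated = rotate90(rotated)
```

One annotated image becomes four labeled test samples without any additional annotation effort.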

How can we leverage this concept for model testing in machine learning?

Example: Using metamorphic relations for medical image testing

Let’s return to our histopathology example to illustrate how metamorphic relations help test machine learning models. We can use them to write model unit tests and increase the test coverage of our ML testing suites.

Figure: An example of test specifications based on metamorphic relations. Rotating, shifting the color of, or defocusing the histopathology image should lead to the same label (cancer/no cancer).

We’d certainly expect this ML system to work if the input image is rotated by 180 degrees. Shifts in the color intensity of the image should also not change the system output. Neither should slightly out-of-focus samples.

These problem insights (in this context, metamorphic relations) can be used to create clear test specifications and to build model unit tests from them. This not only multiplies your available test data but also verifies, through machine learning unit testing, that your ML model behaves according to the specifications.
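
Such a specification can be sketched as a small test harness. Everything here is illustrative: `metamorphic_failures` is a hypothetical helper, and the two toy "models" are simple stand-ins for a real classifier, chosen so that one satisfies the specification and one violates it. Images are nested lists of intensities in [0, 1].

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]
Model = Callable[[Image], str]


def rotate180(image: Image) -> Image:
    """Rotate the image by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def shift_intensity(image: Image, delta: float = 0.2) -> Image:
    """Shift every pixel intensity by delta, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in image]


# The specification: transformations that must not change the output.
TRANSFORMS: Dict[str, Callable[[Image], Image]] = {
    "rotate180": rotate180,
    "intensity_shift": shift_intensity,
}


def metamorphic_failures(
    predict: Model, images: List[Image]
) -> List[Tuple[int, str]]:
    """Return (image index, transform name) for every violated relation.

    The model's output on the transformed image is compared with its own
    output on the original image, so no ground-truth label is required.
    """
    failures = []
    for index, image in enumerate(images):
        baseline = predict(image)
        for name, transform in TRANSFORMS.items():
            if predict(transform(image)) != baseline:
                failures.append((index, name))
    return failures


def contrast_model(image: Image) -> str:
    """Toy stand-in: decides on contrast, which ignores rotation and shifts."""
    pixels = [p for row in image for p in row]
    return "cancer" if max(pixels) - min(pixels) > 0.5 else "no cancer"


def mean_model(image: Image) -> str:
    """Toy stand-in: decides on mean brightness, so intensity shifts break it."""
    pixels = [p for row in image for p in row]
    return "cancer" if sum(pixels) / len(pixels) > 0.5 else "no cancer"
```

Running `metamorphic_failures(mean_model, images)` on a few unlabeled images reports which relation the brittle model violates, while `contrast_model` passes. A defocus (blur) transformation could be added to `TRANSFORMS` in the same way.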

Figure: Specification and tests for your ML model. For example, if the model should be robust to 180-degree rotations, a unit test can rotate the image and verify that the model’s output remains unchanged.

So, why bother applying augmentations to your test data if you’re already adding them to your training set? Truth be told, training on augmented data does not guarantee that the trained model behaves as desired. We have observed state-of-the-art object detection models that are not robust to the very augmentations used during their training. Testing your model for the desired behavior gives you confidence for the inputs you cover and will likely uncover and prevent many ML model bugs.

Not convinced? Similar metamorphic relations were applied to the testing of neural networks for autonomous driving by Tian et al. in DeepTest [4]. They found thousands of erroneous (and sometimes grave) behaviors in state-of-the-art deep neural networks for self-driving cars.

To summarize, metamorphic relations are a great way to thoroughly test your ML system. Alongside regression tests, they deserve a place in your development cycle when testing ML models. Our follow-up article on fuzz testing illustrates how to leverage metamorphic relations to stress-test ML models.

  • [1] “Test oracle”, wikipedia.org, 2022.
  • [2] “Machine Learning Testing: Survey, Landscapes and Horizons”, Zhang et al., arXiv.org, 2019.
  • [3] “Self-supervised Learning”, Zisserman A., Prairie Artificial Intelligence Summer School, 2018.
  • [4] “DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars”, Tian et al., arXiv.org, 2017.
