[Updated for YOLOv8] How robust are pre-trained object detection ML models like YOLO or DETR?

Nov 2, 2022

This article was originally posted on our company website. Lakera’s developer platform enables ML teams to ship fail-safe computer vision models.

26 Jan 2023: We have updated this article to include the new YOLOv8 models. The update includes an extensive model evaluation and robustness benchmark of YOLOv8 models of different sizes (n, s, m, l, x). The new models are compared against YOLOv5 and DETR. Spoiler: YOLOv8’s performance improvements did not bring a corresponding improvement in model robustness.

We tested the robustness of state-of-the-art computer vision models to assess their generalization ability. Here is what we found.

TL;DR

The commonly used machine learning (ML) metrics can be misleading when assessing performance in the real world.
The models are not robust to real-world factors such as geometric changes, blur, noise, and lighting changes.
Interestingly, this also holds when augmentations were used as part of training.
Larger models are not more robust. On some dimensions, they get worse. YOLO has better ML robustness properties than the larger, transformer-based DETR.
ML testing and robustness testing help you assess the generalization abilities of your system.
The performance improvements from YOLOv5 to YOLOv8 do not come with a corresponding improvement in robustness. The two families have roughly similar robustness properties. In this case, model size did not matter.

State-of-the-art pre-trained object detection models can be easily fine-tuned to achieve competitive ML metrics on our own validation datasets. But what does it take to prepare these for production? What are the potential weaknesses of these models that we should be aware of? And how do those impact your choice of a model to fine-tune on your own datasets?

To find out, we took several standard high-performance open source models like Ultralytics’ YOLOv8 and YOLOv5, and Meta’s DETR for a test drive with MLTest to benchmark their generalization capabilities.

But before we get there, why are ML testing and robustness testing important for assessing model generalization?

The problem.

Today the following workflow is a common experience for us computer vision developers:

Use an open-source model pre-trained on, say, COCO that achieves competitive metrics.
Fine-tune it on data specific to our use case.

When such models are launched into real-world environments, however, “unknown unknowns” are often encountered, and the models start surfacing issues.

So the main challenge becomes assessing model generalization during development: how will the model behave once it is in the complex real world?

In this blog, after dissecting the robustness properties of state-of-the-art computer vision models, we will argue that the gap between a pre-trained model and a real-world high-performer is often significant. As a result, fine-tuning such a model on our initial datasets is only the beginning, and most of the work lies ahead.

The good news: while validation metrics only provide limited insights into real-world model performance, many of the issues leading to poor model generalization are identifiable before release.

Why model robustness matters and how to check the robustness of a model.

Variations in image properties like lighting, motion artifacts, or image quality loss are ubiquitous in production. The real-world data distribution is undoubtedly richer, and a moving, ever-evolving target. By measuring the ML robustness of your system to such factors, you can assess the risk that your model is overfitting to the properties of your data distribution, as well as its ability to handle the inevitable variations it will encounter. Low robustness is indicative of poor generalization.

We ran Lakera’s MLTest to assess the robustness of our candidate models. As part of this, MLTest generated universal robustness tests: image modifications that are barely perceptible to the naked eye and that occur frequently during operation. No white-box or black-box adversarial attacks were used. Judge for yourself:

Can you tell which of the following are original COCO images, and which have been modified?
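To make the idea of mild, naturally occurring perturbations concrete, here is a minimal NumPy sketch of three of them: a brightness shift, sensor-like noise, and a horizontal motion blur. This is not how MLTest generates its tests. The function names, the parameter values, and the simple box-filter blur are illustrative assumptions only.

```python
import numpy as np


def shift_brightness(image: np.ndarray, delta: float) -> np.ndarray:
    """Add a constant offset to every pixel of a uint8 image, clipped to [0, 255]."""
    shifted = image.astype(np.float32) + delta
    return np.clip(shifted, 0, 255).astype(np.uint8)


def add_gaussian_noise(image: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma (in pixel units)."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def horizontal_motion_blur(image: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Average each pixel with its horizontal neighbours (box filter along the width)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    pad = kernel_size // 2
    # Repeat edge pixels so the output keeps the input width.
    padded = np.pad(image.astype(np.float32), ((0, 0), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros(image.shape, dtype=np.float32)
    width = image.shape[1]
    for offset in range(kernel_size):
        blurred += padded[:, offset:offset + width, :]
    return np.clip(np.rint(blurred / kernel_size), 0, 255).astype(np.uint8)


# Hypothetical stand-in for a COCO image: height x width x RGB, uint8.
example = np.full((4, 6, 3), 100, dtype=np.uint8)
variants = {
    "brighter": shift_brightness(example, 20),
    "noisy": add_gaussian_noise(example, sigma=5.0),
    "blurred": horizontal_motion_blur(example, kernel_size=5),
}
```

Running a detector on each variant and comparing its predictions against those on the original image is the basic building block of a robustness test.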

So which models did we put to the test?

The candidate models.

We benchmarked the robustness of the following models:

YOLOv8 family: YOLOv8 is fresh off the shelf and has shown some impressive improvements over YOLOv5. But how does it fare in terms of robustness? Has robustness suffered now that the models are larger than their YOLOv5 counterparts? We investigated five different model sizes (n, s, m, l, x).
YOLOv5 family: We also took a look at five older YOLOv5 models from Ultralytics of different sizes (n, s, m, l, x). These models are not state-of-the-art but fare well compared to the best models out there as judged by aggregate metrics.
DETR transformers: Additionally we considered Meta’s DETR models based on transformer architectures, taking sizes S, M, L, and H into account.

For reference, these models achieve the following competitive validation metrics on COCO:

Aggregate metrics of the candidate models. We use our own implementation of COCO mAP and related metrics, so numbers may differ from those reported elsewhere.

The COCO validation set, however, does not represent the real world. What are these metrics hiding? What is the likelihood that the model will generalize once released into the wild or fine-tuned?

What we found.

The following plots summarize MLTest’s risk score for different models and model robustness tests. The score ranges from 0 to 100, where 100 represents the highest risk and 0 the lowest. It represents the percentage of the dataset where the model’s behavior is heavily impacted by MLTest’s robustness testing. We plot the aggregate risk score (lower is better) computed by MLTest for all main risk factors, each of which consists of several individual tests:

Model robustness vs model size for YOLOv8.

Model robustness vs model size for YOLOv5.

Model robustness vs model size for DETR.
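A score of this kind, the percentage of the dataset on which a perturbation heavily impacts the model, can be approximated in a few lines. The sketch below is not MLTest's actual definition. It assumes a per-image quality score (for example, the fraction of labeled objects still detected) and a hypothetical `max_drop` threshold for what counts as "heavily impacted".

```python
from typing import Sequence


def risk_score(
    original_scores: Sequence[float],
    perturbed_scores: Sequence[float],
    max_drop: float = 0.2,  # assumed threshold, not a value taken from MLTest
) -> float:
    """Percentage of images whose score drops by more than max_drop after perturbation."""
    if len(original_scores) != len(perturbed_scores):
        raise ValueError("score lists must have the same length")
    if not original_scores:
        raise ValueError("need at least one image")
    impacted = sum(
        1
        for before, after in zip(original_scores, perturbed_scores)
        if before - after > max_drop
    )
    return 100.0 * impacted / len(original_scores)


# Hypothetical per-image scores before and after a mild blur.
before = [0.9, 0.8, 0.7, 0.6]
after = [0.85, 0.4, 0.7, 0.1]
print(risk_score(before, after))  # 50.0: two of the four images are heavily impacted
```

Counting impacted images rather than averaging the metric drop keeps a handful of catastrophic failures from being diluted by many unaffected images.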

Here are a few side-by-side examples of how the smallest image changes affect model performance, as identified by MLTest.

*Left: original image. Right: image augmented with motion artifacts.*

*Left: original image. Right: image with a modified lighting source.* *Green: true positive. Red: false positive. Yellow: original label.*

We take away a couple of insights from our experiments:

Mild transformations have significant effects on model robustness.

As the plots above show, mild transformations have a dramatic impact on model robustness for both YOLO and DETR models. YOLO models tend to become more robust as size increases, though not uniformly: robustness to low image quality, for example, gets worse with size. DETR models do not become more robust as size increases.

Robustness issues are found even if training-time augmentations are used.

Interestingly, for YOLO models, this also applies to augmentations that were used during training (e.g. median blur, equalization, grayscale). This may seem unexpected at first, but it is not surprising on reflection: adding a few lines of code with these augmentations is not a silver bullet.

We should explicitly test that these augmentations have the intended effect and calibrate the augmentation pipeline carefully. Interestingly, these augmentations also do not trivially transfer: while median blur was used during training, the overall blur risk factor still fared poorly. Based on our code inspection, very few augmentations were used during training in the case of DETR models. Does this explain the robustness issues we observe here?
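Testing that an augmentation has the intended effect can start very simply: measure how often it actually fires and how much it changes the image. The sketch below does this for a hypothetical random brightness augmentation with two knobs, an application probability `p` and a strength `max_delta`. Neither the augmentation nor the parameter values come from the YOLO or DETR training code.

```python
import numpy as np


def random_brightness(image, p, max_delta, rng):
    """With probability p, shift brightness by a random offset in [-max_delta, max_delta]."""
    if rng.random() >= p:
        return image  # augmentation not applied for this sample
    delta = rng.uniform(-max_delta, max_delta)
    shifted = image.astype(np.float32) + delta
    return np.clip(np.rint(shifted), 0, 255).astype(np.uint8)


def measure_augmentation(image, p, max_delta, n_trials=1000, seed=0):
    """Return (fraction of trials that changed the image, mean absolute pixel change)."""
    rng = np.random.default_rng(seed)
    changed = 0
    total_change = 0.0
    for _ in range(n_trials):
        out = random_brightness(image, p, max_delta, rng)
        # Use a signed dtype so the difference cannot wrap around.
        diff = np.abs(out.astype(np.int16) - image.astype(np.int16))
        changed += int(diff.any())
        total_change += float(diff.mean())
    return changed / n_trials, total_change / n_trials


# Hypothetical mid-gray test image.
gray = np.full((8, 8, 3), 128, dtype=np.uint8)
for p in (0.1, 0.5, 1.0):
    fraction, change = measure_augmentation(gray, p=p, max_delta=30.0)
    print(f"p={p}: changed {fraction:.0%} of samples, mean pixel change {change:.1f}")
```

If the measured fraction or magnitude is far from what the pipeline was meant to produce, the augmentation is unlikely to deliver the robustness it was added for.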

Larger models achieve higher metric scores but not higher model robustness.

We see a consistent pattern: larger models are not universally better. Within the YOLO family, there is no significant and consistent increase in robustness as models grow, and for some risk factors, such as image quality, robustness even gets worse.

Transformer-based models achieve better metrics than YOLO on the validation set but fare much worse in terms of model robustness. These properties should be taken into account when selecting the core of a production model.

What can we do about it?

Our experiments indicate that these pre-trained systems are likely far from robust computer vision. This has implications when choosing a model to fine-tune: the models with the highest validation metrics may not be the most robust, and thus may generalize poorly on your specific problem. A few practices that help us build systems that generalize to the real world:

Include systematic ML testing in your development process well beyond standard metrics. To begin with, analyze the robustness of your model to assess red flags for generalization. Insights can be used to guide data collection and synthetic augmentations. We’re clearly biased but believe that MLTest is a fantastic solution for this.
Adding augmentations during training is not enough. We have found that even with these augmentations, model robustness is not a given. Explicitly test whether the augmentations have the intended effect and carefully calibrate your training-time augmentation pipeline (e.g. by setting the right probability and strength).
Model robustness is just the beginning. Aggregate metrics also hide underperforming slices of data. Identifying those slices and collecting the required data is a critical component of systematic testing, which we will cover in a follow-up post.
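As a small illustration of how an aggregate metric can hide an underperforming slice, the sketch below breaks a per-image correctness flag down by a metadata field. The record layout, the `lighting` field, and the numbers are made up for the example.

```python
from collections import defaultdict


def metric_by_slice(records, slice_key):
    """Return the fraction of correct predictions for each value of slice_key."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [correct, total]
    for record in records:
        bucket = totals[record[slice_key]]
        bucket[0] += int(record["correct"])
        bucket[1] += 1
    return {name: correct / total for name, (correct, total) in totals.items()}


# Hypothetical evaluation records: 10 daytime images and 2 night images.
records = (
    [{"lighting": "day", "correct": True}] * 9
    + [{"lighting": "day", "correct": False}]
    + [{"lighting": "night", "correct": True}]
    + [{"lighting": "night", "correct": False}]
)

overall = sum(r["correct"] for r in records) / len(records)
print(f"overall: {overall:.2f}")              # overall: 0.83
print(metric_by_slice(records, "lighting"))   # {'day': 0.9, 'night': 0.5}
```

Here the overall number looks acceptable, while the night slice performs no better than a coin flip and is too small to move the aggregate.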

We can’t wait to test some more SOTA models very soon. So stay tuned for updates here!

Want to test your own models?

MLTest is the easiest way to assess the generalization capabilities of your models. You can learn more about it here or get started right away. Also, feel free to get in touch with us at mateo@lakera.ai.