A Beginner's Guide to Certifiable Robustness -

Image Credit: Generated by Google Gemini 3.

Machine Learning (ML) models will be a cornerstone of our technical progress in this and the following decades. Especially since the launch of ChatGPT in November 2022, the transformative power of these models across a wide range of areas in our society has become clear to the wider public. What is less known are the risks stemming from the multitude of failure modes machine learning models exhibit. In this blog post, we will focus on a particular threat relevant to the application of such models in safety-critical areas, namely adversarial examples. After introducing the problem, we will discuss a potential way to mitigate this issue, called robustness certification.

The Threat of Adversarial Examples

So what is an adversarial example? In essence, it is an input to the machine learning system, e.g., an image, particularly crafted to induce unwanted behavior, e.g., misclassification. For example, Sharif et al.¹ showed that specially colored eyeglass frames, when worn, can fool facial recognition systems into classifying the person as someone else. Similarly, Eykholt et al.² showed that strategically placing stickers on traffic signs can lead to incorrect road sign classification by machine learning systems for autonomous driving.

However, adversarial examples do not only refer to malicious input encountered when deploying a trained ML system. Another risk comes from an adversary potentially altering the training data ML models are trained on to induce faulty behavior (e.g., misclassification of test data) in the trained models. This is called data poisoning and is of particular interest in the era of large ML systems trained on large, often unchecked corpora from the internet. Famously, Carlini et al.³ showed how, for only $60, they took control of several URLs used to source parts of two famous web-scale datasets and thus could have inserted malicious training datapoints at will. They also disclosed that it is easy to insert malicious changes into Wikipedia at the right time, such that they will be forever archived as Wikipedia snapshots commonly used for training datasets.

The Attack-Defense Arms Race

Since adversarial examples were first described for deep neural networks in 2014 by Szegedy et al.⁴, a multitude of so-called "defenses" have been proposed. These try to mitigate the issue of adversarial examples by proposing changes to model architectures, preprocessing schemes for data points, or specific training schemes to improve the robustness of ML models to adversarial examples. However, these defenses are "empirical" in the sense that they cannot guarantee the non-existence of adversarial examples. In particular, it is usually only a matter of time until a proposed defense is broken by a later, more advanced attack, i.e., a more advanced way to create adversarial examples.⁵ This has led to an arms race where defenses are developed for new attacks and new attacks are developed for new defenses. This is where robustness certificates step into the picture to break out of this vicious cycle.

The Way Out: Certifiable Robustness

So what is certifiable robustness? Let us first focus on the case where one is given a trained ML model f. Then, a certifiable robustness method takes as input (i) a test data point x, and (ii) a set ℬ(x) of potentially corrupted versions of x, and outputs whether the prediction of f on x stays the same for all potentially corrupted x̃ ∈ ℬ(x). As the method proves the existence or non-existence of an adversarial example in ℬ(x), if it outputs that the model f is robust, we can be sure that there are indeed no attacks that can threaten our model. However, if it outputs "unrobust", we know that there exists an adversarial example and we can take appropriate domain-specific measures. A good and recent overview detailing the approaches developed so far to tackle this problem can be found in Li et al.⁶ However, achieving certifiable robustness comes with certain challenges.

Challenges in Certification

Realistic perturbations: To solve the certification problem, one has to mathematically define a certain set of possible perturbed inputs ℬ(x). This can be challenging, as it is not always clear how to mathematically capture realistic perturbations. Thus, the largest body of work has been developed to certify against so-called ℓ_p-bounded adversaries. There, the perturbation set is usually defined based on a given ℓ_p-norm constraint as ℬ(x) = { x' | ‖x − x'‖_p ≤ δ }, where ℓ₂ and ℓ∞ are common choices,⁶ and δ represents some predefined positive scalar often called the "attack budget" or "attack strength". However, there are many realistic scenarios, e.g., the stickers on the stop sign in the first figure, that cannot be captured by ℓp-norm constraints. A new approach is to use advances in generative modeling to generate adversarial examples outside common ℓ_p-bounded constraints.⁷ Abstract interpretation is one method that has shown early success in certifying against perturbations generated by generative models,⁸ but the space of certifying against realistic adversaries is still mostly unexplored.
Scalability: It has been shown that solving the certification problem exactly is NP-hard.⁹ This has led to the development of so-called inexact or incomplete verification methods that on some inputs can output "I don't know" instead of robust or unrobust. This allows the certification process to be significantly sped up. However, state-of-the-art inexact verifiers such as α,β-CROWN¹⁰ are still only able to verify at most medium-sized datasets and, e.g., do not scale to ImageNet. Scalability can further be increased by allowing the verification output to hold only with a certain (high) probability. The most famous approach doing so is called Randomized Smoothing,¹¹ and it can scale to ImageNet. However, it has other disadvantages such as deriving only a probabilistic certificate for so-called "smoothed" classifiers, inference still coming with significant cost (and not scaling to web-scale datasets), and certified radii suffering from the curse of dimensionality.¹²
Data Poisoning: In data poisoning, the adversary is allowed to perturb the training dataset 𝒟. Thus, a certificate takes as input the set of all possible perturbed datasets ℬ(𝒟), and one is interested in the question of whether a test datapoint x is correctly classified by models resulting from training on any 𝒟' ∈ ℬ(𝒟). This problem is inherently harder as it requires a certain mathematical understanding of the training dynamics of neural networks to quantify the effect that changes to the training data have on the final predictions. Therefore, work in this area is still in its infancy, with approaches either (i) extending randomized smoothing to smooth over training datasets, or only applying to (ii) simple classifiers, (iii) ensemble classifiers, or (iv) differentially private learners. An overview of work on poisoning certification can be found in Gosch et al.¹³ All approaches currently suffer from issues such as large computational complexity and applicability to limited types of classifiers that exclude vanilla neural networks such as MLPs or CNNs. Gosch et al.¹³ use the neural tangent kernel to capture the training dynamics of neural networks to, for the first time, allow certification against poisoning of classic neural networks. However, they also suffer from poor scalability. In Sabanayagam et al.¹⁴, this approach is extended to exact certification against label poisoning, the first exact certificate against a poisoning perturbation model for neural networks, yet scalability is still a crucial limitation.

Conclusion

One can conclude that certifiable robustness is an exciting field of study with many open challenges still remaining. A crucial question is how the insights and methods developed in rather academic settings can be transferred to real-world use cases. This can span applications such as LLM safety,¹⁵ verifying neural-network-based controllers in robotic systems,¹⁶ or helping translate regulatory requirements, e.g., established by the AI Act, into clear technical specifications.

References

Sharif et al. "Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition." SIGSAC 2016.
Eykholt et al. "Robust Physical-World Attacks on Deep Learning Visual Classification." CVPR 2018.
Carlini et al. "Poisoning Web-Scale Training Datasets is Practical." IEEE S&P 2024.
Szegedy et al. "Intriguing Properties of Neural Networks." ICLR 2014.
Carlini & Wagner. "Towards Evaluating the Robustness of Neural Networks." IEEE S&P 2017.
Li et al. "SoK: Certified Robustness for Deep Neural Networks." IEEE S&P 2023.
Kollovieh et al. "Assessing Robustness via Score-Based Adversarial Image Generation." arXiv:2310.04285, 2023.
Mirman et al. "Robustness Certification with Generative Models." PLDI 2021.
Katz et al. "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks." CAV 2017.
Verified-Intelligence. "alpha-beta-CROWN." GitHub 2024. https://github.com/Verified-Intelligence/alpha-beta-CROWN
Cohen et al. "Certified Adversarial Robustness via Randomized Smoothing." ICML 2019.
Wu et al. "Completing the Picture: Randomized Smoothing Suffers from the Curse of Dimensionality for a Large Family of Distributions." AISTATS 2021.
Gosch et al. "Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks." TMLR 2025.
Sabanayagam et al. "Exact Certification of (Graph) Neural Networks Against Label Poisoning." ICLR 2025.
Kumar et al. "Certifying LLM Safety Against Adversarial Prompting."
Yang et al. "Lyapunov-Stable Neural Control for State and Output Feedback: A Novel Formulation." ICML 2024.

Author: Lukas Gosch

The Threat of Adversarial Examples

The Attack-Defense Arms Race

The Way Out: Certifiable Robustness

Challenges in Certification

Conclusion

References

RELATED

Can We Trust the Uncertainty of Causal Foundation Models?

Exploring XAI Methods for Interpretability of Large Language Models

Robotic Decision Making via Diffusion Models

Responsible Textual Generative Models (Part I): Generating Truthful Content

Random Convolutions: A Simple Way to Boost Generalization

Neuromorphic Computing: A Brain-inspired Approach to Robot Intelligence

Introduction to Embodied Instruction Following

From Unlucky Strikers to Statistical Learning Theory

Performative Prediction

What even is differential privacy?

Mitigating Domain shifts

A gentle introduction to uncertainty quantification

Welcome to the relAI Blog

A Beginner’s Guide to Certifiable Robustness

Author: Lukas Gosch

The Threat of Adversarial Examples

The Attack-Defense Arms Race

The Way Out: Certifiable Robustness

Challenges in Certification

Conclusion

References

RELATED