A Beginner’s Guide to Certifiable Robustness

Image Credit: Generated by Google Gemini 3.

Machine learning (ML) models will be a cornerstone of technical progress in this and the coming decades. Especially since the launch of ChatGPT in November 2022, the transformative power of these models across a wide range of areas of society has become clear to the wider public. Less well known are the risks stemming from the many failure modes that machine learning models exhibit. In this blog post, we focus on one threat that is particularly relevant when such models are applied in safety-critical areas: adversarial examples. After introducing the problem, we discuss a potential way to mitigate it, called robustness certification.

The Threat of Adversarial Examples

So what is an adversarial example? In essence, it is an input to a machine learning system, e.g., an image, specifically crafted to induce unwanted behavior, e.g., misclassification. For example, Sharif et al. [1] showed that specially colored eyeglass frames, when worn, can fool facial recognition systems into classifying the person as someone else. Similarly, Eykholt et al. [2] showed that strategically placing stickers on traffic signs can lead to incorrect road sign classification by machine learning systems for autonomous driving.

However, adversarial examples do not only refer to malicious inputs encountered when deploying a trained ML system. Another risk comes from an adversary altering the data that ML models are trained on to induce faulty behavior (e.g., misclassification of test data) in the trained models. This is called data poisoning and is of particular interest in the era of large ML systems trained on large, often unchecked corpora from the internet. Famously, Carlini et al. [3] showed how, for only $60, they took control of several URLs used to source parts of two famous web-scale datasets and thus could have inserted malicious training datapoints at will. They also disclosed that it is easy to insert malicious changes into Wikipedia at the right time, such that they will be forever archived in the Wikipedia snapshots commonly used for training datasets.

The Attack-Defense Arms Race

Since adversarial examples were first described for deep neural networks in 2014 by Szegedy et al. [4], a multitude of so-called "defenses" have been proposed. These try to mitigate the issue by proposing changes to model architectures, preprocessing schemes for data points, or specific training schemes that improve the robustness of ML models to adversarial examples. However, these defenses are "empirical" in the sense that they cannot guarantee the non-existence of adversarial examples. In particular, it is usually only a matter of time until a proposed defense is broken by a later, more advanced attack, i.e., a more advanced way to create adversarial examples [5]. This has led to an arms race in which defenses are developed for new attacks and new attacks are developed for new defenses. Robustness certificates aim to break out of this vicious cycle.

The Way Out: Certifiable Robustness

So what is certifiable robustness? Let us first focus on the case where one is given a trained ML model f. Then, a certifiable robustness method takes as input (i) a test data point x, and (ii) a set ℬ(x) of potentially corrupted versions of x, and outputs whether the prediction of f on x stays the same for all potentially corrupted x̃ ∈ ℬ(x). Because the method proves the existence or non-existence of an adversarial example in ℬ(x), an output of "robust" guarantees that no attack restricted to ℬ(x) can change the prediction of our model. If it instead outputs "unrobust", we know that an adversarial example exists in ℬ(x) and we can take appropriate domain-specific measures. A good and recent overview detailing the approaches developed so far to tackle this problem can be found in Li et al. [6]. However, achieving certifiable robustness comes with certain challenges.
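To make this input-output behavior concrete, here is a minimal sketch for a case where exact certification is easy. It assumes a linear binary classifier f(x) = sign(w·x + b) and takes ℬ(x) to be an ℓ_p ball of radius δ around x with p ≥ 1 (a choice discussed in the next section). Both the classifier and the perturbation set are assumptions made for illustration, and the function names are hypothetical. By Hölder's inequality, the worst-case perturbation shifts the score w·x + b by exactly δ‖w‖_q, where q is the dual exponent of p, so the certificate reduces to a single comparison.

```python
import math


def dual_norm(w, p):
    """Return ||w||_q, where q is the dual exponent of p (1/p + 1/q = 1)."""
    if p == math.inf:
        return sum(abs(wi) for wi in w)   # dual of l_inf is l_1
    if p == 1:
        return max(abs(wi) for wi in w)   # dual of l_1 is l_inf
    q = p / (p - 1)
    return sum(abs(wi) ** q for wi in w) ** (1 / q)


def certify_linear(w, b, x, delta, p=math.inf):
    """Exact certificate for the (assumed) classifier sign(w.x + b).

    The perturbation set is the l_p ball of radius delta around x.
    A score of exactly zero is conservatively counted as a changed prediction.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Largest possible change of the score over the whole ball (Hoelder).
    worst_case_shift = delta * dual_norm(w, p)
    return "robust" if abs(score) > worst_case_shift else "unrobust"


# Illustrative (made-up) weights and input: score = 1.1, ||w||_1 = 3.
w, b, x = [1.0, -2.0], 0.5, [1.0, 0.2]
print(certify_linear(w, b, x, delta=0.3))  # shift is about 0.9 < 1.1
print(certify_linear(w, b, x, delta=0.5))  # shift is 1.5 > 1.1
```

For neural networks no such closed form is available, which is where the challenges discussed next come in.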

Challenges in Certification

  • Realistic perturbations: To solve the certification problem, one has to mathematically define a set of possible perturbed inputs ℬ(x). This can be challenging, as it is not always clear how to mathematically capture realistic perturbations. Thus, the largest body of work has been developed to certify against so-called ℓ_p-bounded adversaries. There, the perturbation set is usually defined based on a given ℓ_p-norm constraint as ℬ(x) = { x' | ‖x − x'‖_p ≤ δ }, where ℓ_2 and ℓ_∞ are common choices [6], and δ is a predefined positive scalar often called the "attack budget" or "attack strength". However, there are many realistic scenarios, e.g., the stickers on the stop sign in the first figure, that cannot be captured by ℓ_p-norm constraints. A newer approach is to use advances in generative modeling to generate adversarial examples outside common ℓ_p-bounded constraints [7]. Abstract interpretation is one method that has shown early success in certifying against perturbations generated by generative models [8], but certification against realistic adversaries is still mostly unexplored.

  • Scalability: It has been shown that solving the certification problem exactly is NP-hard [9]. This has led to the development of so-called inexact or incomplete verification methods, which on some inputs can output "I don't know" instead of "robust" or "unrobust". This allows the certification process to be significantly sped up. However, state-of-the-art inexact verifiers such as α,β-CROWN [10] are still only able to verify at most medium-sized datasets and, e.g., do not scale to ImageNet. Scalability can be increased further by allowing the verification output to hold only with a certain (high) probability. The most famous approach doing so is called Randomized Smoothing [11], and it can scale to ImageNet. However, it has other disadvantages: it derives only a probabilistic certificate for so-called "smoothed" classifiers, inference still comes with significant cost (and does not scale to web-scale datasets), and certified radii suffer from the curse of dimensionality [12].

  • Data Poisoning: In data poisoning, the adversary is allowed to perturb the training dataset 𝒟. Thus, a certificate takes as input the set of all possible perturbed datasets ℬ(𝒟), and one is interested in whether a test datapoint x is correctly classified by every model resulting from training on any 𝒟' ∈ ℬ(𝒟). This problem is inherently harder, as it requires a mathematical understanding of the training dynamics of neural networks to quantify the effect that changes to the training data have on the final predictions. Therefore, work in this area is still in its infancy, with approaches either (i) extending randomized smoothing to smooth over training datasets, or only applying to (ii) simple classifiers, (iii) ensemble classifiers, or (iv) differentially private learners. An overview of work on poisoning certification can be found in Gosch et al. [13]. All of these approaches currently suffer from issues such as large computational complexity and applicability to limited types of classifiers that exclude vanilla neural networks such as MLPs or CNNs. Gosch et al. [13] use the neural tangent kernel to capture the training dynamics of neural networks and thereby, for the first time, allow certification against poisoning of classic neural networks. However, this approach also suffers from poor scalability. In Sabanayagam et al. [14], it is extended to exact certification against label poisoning, the first exact certificate against a poisoning perturbation model for neural networks, yet scalability is still a crucial limitation.
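The "I don't know" outcome of incomplete verifiers mentioned under scalability can be illustrated with interval bound propagation, a simple incomplete technique that is not discussed in this post and is considerably looser than tools such as α,β-CROWN. The sketch below assumes a small fully connected ReLU network given as a list of (W, b) pairs and an ℓ_∞ ball of radius δ around x. The network, its weights, and all function names are hypothetical and chosen only for illustration.

```python
def forward(layers, x):
    """Plain forward pass; ReLU after every layer except the last."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def affine_bounds(W, b, lo, hi):
    """Propagate elementwise bounds lo <= x <= hi through y = W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, lj, hj in zip(row, lo, hi):
            # A positive weight maps lower to lower; a negative weight swaps them.
            l += w * (lj if w >= 0 else hj)
            h += w * (hj if w >= 0 else lj)
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def ibp_certify(layers, x, delta):
    """Return "robust" if the interval bounds prove it, otherwise "unknown"."""
    logits = forward(layers, x)
    pred = max(range(len(logits)), key=logits.__getitem__)
    lo = [v - delta for v in x]
    hi = [v + delta for v in x]
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(W, b, lo, hi)
        if i < len(layers) - 1:
            lo = [max(0.0, v) for v in lo]  # ReLU is monotone, so bounds pass through
            hi = [max(0.0, v) for v in hi]
    # Sufficient (not necessary) condition: the predicted logit's lower bound
    # beats every other logit's upper bound.
    if all(lo[pred] > hi[k] for k in range(len(lo)) if k != pred):
        return "robust"
    return "unknown"


# Made-up 2-2-2 network for illustration.
layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
          ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
print(ibp_certify(layers, [1.0, 0.0], delta=0.1))
print(ibp_certify(layers, [1.0, 0.0], delta=0.6))
```

Because the bounds over-approximate the set of reachable outputs, a failed check does not prove that an adversarial example exists, so the sketch never returns "unrobust". That outcome would require a complete verifier or an explicit counterexample.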

Conclusion

Certifiable robustness is an exciting field of study with many open challenges. A crucial question is how the insights and methods developed in rather academic settings can be transferred to real-world use cases. This can span applications such as LLM safety [15], verifying neural-network-based controllers in robotic systems [16], or helping translate regulatory requirements, e.g., those established by the AI Act, into clear technical specifications.

References

  1. Sharif et al. "Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition." ACM CCS 2016.
  2. Eykholt et al. "Robust Physical-World Attacks on Deep Learning Visual Classification." CVPR 2018.
  3. Carlini et al. "Poisoning Web-Scale Training Datasets is Practical." IEEE S&P 2024.
  4. Szegedy et al. "Intriguing Properties of Neural Networks." ICLR 2014.
  5. Carlini & Wagner. "Towards Evaluating the Robustness of Neural Networks." IEEE S&P 2017.
  6. Li et al. "SoK: Certified Robustness for Deep Neural Networks." IEEE S&P 2023.
  7. Kollovieh et al. "Assessing Robustness via Score-Based Adversarial Image Generation." arXiv:2310.04285, 2023.
  8. Mirman et al. "Robustness Certification with Generative Models." PLDI 2021.
  9. Katz et al. "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks." CAV 2017.
  10. Verified-Intelligence. "alpha-beta-CROWN." GitHub 2024. https://github.com/Verified-Intelligence/alpha-beta-CROWN
  11. Cohen et al. "Certified Adversarial Robustness via Randomized Smoothing." ICML 2019.
  12. Wu et al. "Completing the Picture: Randomized Smoothing Suffers from the Curse of Dimensionality for a Large Family of Distributions." AISTATS 2021.
  13. Gosch et al. "Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks." TMLR 2025.
  14. Sabanayagam et al. "Exact Certification of (Graph) Neural Networks Against Label Poisoning." ICLR 2025.
  15. Kumar et al. "Certifying LLM Safety Against Adversarial Prompting."
  16. Yang et al. "Lyapunov-Stable Neural Control for State and Output Feedback: A Novel Formulation." ICML 2024.
