Success stories about artificial intelligence (AI) focus on its remarkable predictive power. Take, for instance, your smartphone’s ability to recognize your face in photos and collect them into a “Selfies” folder ready to supply snaps for social media.
When it comes to more safety-critical tasks, like using facial recognition for security at a high-stakes research lab, simple label predictions ("that's you!") won't cut it. There, a mistaken identity carries far bigger risks than confusing you with a friend. If we are to actually deploy predictive models—usually in conjunction with humans—we need some measure of prediction quality so we can step in when the current output can’t be trusted. With every prediction we’d like some sort of confidence assignment, along the lines of “I think it’s most likely A but it could also be B or C”. Such statements are the result of a machine learning subtask called Uncertainty Quantification (UQ).
What we would like to see
Ideally, UQ happens alongside prediction and takes into account different sources of uncertainty. Most researchers in the field agree on the existence of two distinct types: aleatoric and epistemic uncertainty (e.g., Hüllermeier & Waegeman, 2021).
Aleatoric uncertainty is taken to be a property of the data that no amount of skillful modeling can alleviate. It arises from the fact that the observed features are not enough to predict the target beyond doubt, be it due to measurement errors, random processes in the physical world, or simply because variables other than the observed features are causative for the target. Epistemic uncertainty, on the other hand, relates to a lack of knowledge about the relationship between features and target. In simple terms, epistemic uncertainty tells us how confident we are that the trained model is the right one. As such, it is reducible (in theory) by observing more data.
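For readers who like a formula, one common way to make this split precise is the law of total variance (the notation below, with a prediction Y at input x and uncertain model parameters θ, is our addition rather than part of the original argument):

```latex
% Law of total variance for a prediction Y at input x, with uncertain
% model parameters theta (notation is ours, not from the text above).
\[
\operatorname{Var}(Y \mid x) \;=\;
  \underbrace{\mathbb{E}_{\theta}\!\bigl[\operatorname{Var}(Y \mid x, \theta)\bigr]}_{\text{aleatoric: noise that remains for a fixed model}}
  \;+\;
  \underbrace{\operatorname{Var}_{\theta}\!\bigl(\mathbb{E}[Y \mid x, \theta]\bigr)}_{\text{epistemic: spread across plausible models}}
\]
```

The first term persists no matter how much data we collect; the second shrinks as the set of plausible models narrows down.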
However, as Gruber et al. (2023) would argue, other sources of uncertainty are no less important. We might start by challenging the frequently held belief that our data (literally “the given” in Latin) reflect an infallible truth. For example, labels in the training data, which are usually assigned by human annotators, can be ambiguous (think of images that contain multiple objects, or natural language with various conceivable interpretations). It might also happen that poorly designed data collection processes introduce a systematic gap between the quantity of interest and what’s actually measured, or that specific types of observations never make it into the dataset (e.g., young people, who are less likely to own a landline, will be underrepresented in telephone surveys). Further down the road, most learning processes come with algorithmic uncertainty: finite machine precision, noise introduced by minibatching huge amounts of data, or computations that need to be distributed across several processing units.
What we actually see
You might think that for these powerful new models, which can chat with you about modern Shakespeare adaptations, expressing confidence should be a piece of cake. Indeed, reasoning about uncertainty is second nature to classical statisticians who’ve been designing models for decades. They deal in a currency called probability distributions, and distributions exist to express uncertainty.
However, today's massive AI models often struggle with a few key issues:
- a fundamental inability to express (all types of) uncertainty,
- a lack of resources to compute proper uncertainty estimates, and
- a tendency toward overconfidence.
What we face
Let’s look at each of these challenges in turn.
Expressing uncertainty
Start with aleatoric uncertainty (the observed features are not sufficient to predict the label beyond doubt). Such a scenario is easy to imagine in medical imaging, where inferring patients’ conditions from MRI scans can be hard even for human experts. Enabling a model to express aleatoric uncertainty typically requires training it to output distributional or set-valued predictions. Rather than merely classifying an image as either tumor A, tumor B or tumor-free, it will return a probability distribution (“70% tumor A, 5% tumor B, 25% tumor-free”) or a set (“with ≥ 95% probability it’s one of: tumor A, tumor-free”). Both variants carry information that humans-in-the-loop can ponder in subsequent decisions (send patient home, subject them to further tests, operate immediately?).
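To make both output formats concrete, here is a minimal sketch. The class names and the 70/5/25 probabilities simply mirror the example above; the 95% set is built with a naive “smallest sufficient set” rule rather than a full-blown method such as conformal prediction:

```python
import numpy as np

# Hypothetical softmax output of a classifier for one MRI scan
classes = ["tumor A", "tumor B", "tumor-free"]
probs = np.array([0.70, 0.05, 0.25])

# Distributional prediction: report the full probability vector
for c, p in zip(classes, probs):
    print(f"{c}: {p:.0%}")

# Set-valued prediction: smallest group of classes whose combined
# probability reaches the desired coverage level (here 95%)
order = np.argsort(probs)[::-1]              # classes sorted by probability
cumulative = np.cumsum(probs[order])
k = int(np.argmax(cumulative >= 0.95)) + 1   # first position reaching 95%
prediction_set = [classes[i] for i in order[:k]]
print("With >= 95% probability it's one of:", prediction_set)
```

Either output hands the human in the loop something to weigh, rather than a bare label.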
Epistemic uncertainty is a tougher nut to crack, or at least a more expensive one. Recall that we want to quantify our confidence about having found the right model. We can’t judge that without considering alternative models that would have been plausible under the given data. If many different models are similarly conceivable, we have high epistemic uncertainty. Typically, this means reasoning about an entire distribution over infinitely many admissible models, adding a layer of (computational) complexity to the problem.
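A rough but popular way to approximate that distribution over models is an ensemble: train several models on resamples of the same data and measure how much their predictions disagree. The sketch below is purely illustrative (toy dataset, arbitrary ensemble size), and the entropy-based split into aleatoric and epistemic parts is one convention among several:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# "Distribution over models", crudely approximated by an ensemble of
# classifiers trained on bootstrap resamples of the same data
ensemble = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))
    ensemble.append(LogisticRegression().fit(X[idx], y[idx]))

x_new = np.array([[0.5, 0.25]])   # an arbitrary query point
probs = np.stack([m.predict_proba(x_new)[0] for m in ensemble])  # shape (10, 2)

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

total = entropy(probs.mean(axis=0))   # uncertainty of the averaged prediction
aleatoric = entropy(probs).mean()     # average uncertainty within each model
epistemic = total - aleatoric         # disagreement between the models
print(f"total={total:.3f}  aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}")
```

Ten retrained models are, of course, a very coarse stand-in for “infinitely many admissible models”, which is exactly the computational tension discussed below.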
It turns out that dealing with aleatoric and epistemic uncertainty is already so difficult that few researchers have ventured into the more exotic types. Ideas to handle label ambiguity, missing data, algorithmic instability, and the like will hopefully become more numerous in the future.
Calculating uncertainty
You might have noticed by now that much in UQ is about considering multiple alternative hypotheses, and that translates pretty much immediately to higher modeling costs. Extending analyses to further sources of uncertainty only exacerbates the matter. Limited resources, therefore, often force practitioners to resort to crude approximations in UQ. You can picture this endeavor as trying to measure the expanse of the Alps when your (time) budget is just barely enough to walk around the foot of the Matterhorn.
Overconfidence
Lastly, overconfidence is an issue mainly with the fancy large-scale models that have received much of the recent hype. It’s somewhat ironic that the very properties responsible for their success should have a negative effect on UQ capabilities. Loosely speaking, these models succeed by warping the input data, in which the relationship between features and target is highly complicated, into a new space where it’s suddenly very simple.
Classifying cats vs dogs in images is hard if all you have is RGB values per pixel, but if you manage to sort the images along dimensions like “has whiskers” or “has floppy ears”, your task becomes much easier (humans do this naturally and subconsciously; for a mathematical model to emulate that requires a lot of complex design). Cycling back to UQ, the model will get very confident in its predictions when it has found this warped space (or data representation) that separates dogs from cats. This can become problematic, however, if it loses the ability to say “I don’t know”. Imagine feeding the model the image of a bunny: it will happily apply its rules about whiskers and pointy ears to label the bunny as a cat with high certainty. We have ways to address overconfidence, but they are all imperfect because of one fundamental problem: there is no ground-truth uncertainty. All we know is that it’s a bunny, but can we really say if it’s more appropriate to predict 50/50 probability for cat/dog (because a bunny is neither) or perhaps 60/40 (because bunnies look a little more like cats)?
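The bunny problem is easy to reproduce in miniature. In the toy sketch below (our own illustration, not the cat/dog model discussed above), a linear classifier trained on two well-separated clusters assigns a near-certain label to a point far from anything it has ever seen, simply because that point lands clearly on one side of the decision boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Two well-separated training clusters, standing in for "cats" and "dogs"
cats = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))
dogs = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(100, 2))
X = np.vstack([cats, dogs])
y = np.array([0] * 100 + [1] * 100)   # 0 = cat, 1 = dog

clf = LogisticRegression().fit(X, y)

# A "bunny": a point far from both clusters, i.e. out of distribution,
# but clearly on the "dog" side of the learned decision boundary
bunny = np.array([[25.0, 25.0]])
print(clf.predict_proba(bunny))       # near-certain "dog", although it is neither
```

More flexible networks tend to behave the same way far from the training data, which is why calibration and out-of-distribution detection methods exist, and why, without ground-truth uncertainty, none of them settles the 50/50-versus-60/40 question.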
All is not lost
UQ is definitely a thorny problem, yet the recent achievements in AI suggest that we will see innovative and smart solutions in the near future. As technology advances, we can also anticipate the resolution of many computational barriers. It is the desire to incorporate AI in real-world decision-making processes, though, that will drive innovation by sheer need. The path may be shrouded in uncertainty (pun intended), but there will be many brilliant minds to walk it.