NeurIPS 2025 Tutorial
We are living through a moment that once belonged to science fiction: generative foundation models can write, reason, design, diagnose, and increasingly, decide. They are no longer just predicting the next word — they are shaping knowledge, influencing choices, and becoming collaborators in science, medicine, education, and daily life. But here's the tension: as their capabilities accelerate, our ability to trust them has not kept pace.
Trustworthiness can't remain a "patch after failure" or a moral hope layered on top of engineering. It must evolve into a science—a discipline as rigorous as the one that created these models in the first place. In this tutorial, we explore what that science looks like: how we understand model behaviors, measure and stress-test trust, and design systems that earn it. We'll build the foundations together, then step into the frontier—where models begin to exhibit human-like cognitive behaviors that inspire wonder, but also demand responsibility and new forms of alignment.
This session is an invitation: to move beyond building models that impress us, toward building models we can trust with what matters.
Generative foundation models are rapidly evolving from pattern imitators into reasoning, decision-shaping systems that influence science, healthcare, education, and society. Yet alongside remarkable breakthroughs, we've seen hallucinations presented as facts, biased outputs amplified at scale, and models behaving unpredictably when the world shifts even slightly from their training data. This section grounds the audience in how these models work, why failures emerge, and why "trustworthy by design" is no longer optional—it is the prerequisite for real-world deployment.
To build generative models that society can rely on, we must anchor them in clear principles that go beyond performance metrics. Trustworthy models must be reliable, safe, fair, transparent, and aligned with human values—not only in average cases, but especially in ambiguous, high-stakes, and cross-cultural contexts. This section defines the north-star principles that shape how such systems should behave, and what it truly means for a generative model to be worthy of trust.
Trustworthiness is not a single metric—it is a multi-dimensional, evolving framework. In this section, we introduce key dimensions that illustrate how trustworthiness manifests in generative models, such as fairness, safety, robustness, and machine ethics, along with others that shape responsible behavior. Rather than treating these as a fixed checklist, we focus on how to evaluate, stress-test, and strengthen trustworthiness across diverse contexts and stakeholders. The aim is to equip the audience with a flexible mental model and a set of practical strategies to assess and enhance trustworthiness throughout the model's lifecycle.
As models become more capable, the bar for trustworthiness rises. We face unresolved challenges: trustworthiness is context-dependent and hard to define, must be dynamically reinterpreted as models evolve, and has to hold even at the edges and tails of rare, high-impact scenarios. Progress will require deep interdisciplinary collaboration and a new research agenda to address emerging human-like cognitive behaviors—self-reflection, strategic reasoning, persuasion, even deception—that pose unprecedented AI risks. We close by outlining the frontier questions that will shape the next generation of trustworthy generative AI.
This tutorial is designed for a diverse audience.
For questions about this tutorial, please contact the organizers at yhuang37@nd.edu