NeurIPS 2025 Tutorial
We are living through a moment that once belonged to science fiction: generative foundation models can write, reason, design, diagnose, and increasingly, decide. They are no longer just predicting the next word — they are shaping knowledge, influencing choices, and becoming collaborators in science, medicine, education, and daily life. But here's the tension: as their capabilities accelerate, our ability to trust them has not kept pace.
Trustworthiness can't remain a "patch after failure" or a moral hope layered on top of engineering. It must evolve into a science—a discipline as rigorous as the one that created these models in the first place. In this tutorial, we explore what that science looks like: how we understand model behaviors, measure and stress-test trust, and design systems that earn it. We'll build the foundations together, then step into the frontier—where models begin to exhibit human-like cognitive behaviors that inspire wonder, but also demand responsibility and new forms of alignment.
This session is an invitation: to move beyond building models that impress us, toward building models we can trust with what matters.
Generative foundation models are rapidly evolving from pattern imitators into reasoning, decision-shaping systems that influence science, healthcare, education, and society. Yet alongside remarkable breakthroughs, we've seen hallucinations presented as facts, biased outputs amplified at scale, and models behaving unpredictably when the world shifts even slightly from their training data. This section grounds the audience in how these models work, why failures emerge, and why "trustworthy by design" is no longer optional—it is the prerequisite for real-world deployment.
To build generative models that society can rely on, we must anchor them in clear principles that go beyond performance metrics. Trustworthy models must be reliable, safe, fair, transparent, and aligned with human values—not only in average cases, but especially in ambiguous, high-stakes, and cross-cultural contexts. This section defines the north-star principles that shape how such systems should behave, and what it truly means for a generative model to be worthy of trust.
Trustworthiness is not a single metric—it is a multi-dimensional, evolving framework. In this section, we introduce key dimensions that illustrate how trustworthiness manifests in generative models, such as fairness, safety, robustness, and machine ethics, along with others that shape responsible behavior. Rather than treating these as a fixed checklist, we focus on how to evaluate, stress-test, and strengthen trustworthiness across diverse contexts and stakeholders. The aim is to equip the audience with a flexible mental model and a set of practical strategies to assess and enhance trustworthiness throughout the model's lifecycle.
As models become more capable, the bar for trustworthiness rises. We face unresolved challenges: trustworthiness is context-dependent and hard to define, must be dynamically reinterpreted as models evolve, and has to hold even at the edges and tails of rare, high-impact scenarios. Progress will require deep interdisciplinary collaboration and a new research agenda to address emerging human-like cognitive behaviors—self-reflection, strategic reasoning, persuasion, even deception—that pose unprecedented AI risks. We close by outlining the frontier questions that will shape the next generation of trustworthy generative AI.
This tutorial is designed for a diverse audience.
For questions about this tutorial, please contact the organizers at yhuang37@nd.edu