Position Paper · AI Alignment · 2026

General Alignment Has Hit a Ceiling;
Edge Alignment Must Be Taken Seriously

Yue Huang et al.
University of Notre Dame

Large language models are deployed in complex socio-technical systems that expose the limits of current General Alignment practice (the dominant paradigm: RLHF training that compresses diverse human values into a single scalar reward, summarized as Helpful, Honest, Harmless). We argue that compressing diverse human values into a single scalar reward reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible uncertainty. These failures are mathematical, not incidental: they follow from the incentives of scalarization (reducing a vector of values such as safety, helpfulness, and fairness to a single number by weighted sum, the core assumption of RLHF) and lead to value flattening, representation loss, and uncertainty blindness. We introduce Edge Alignment, a paradigm in which systems preserve multi-dimensional value structure, support plural representation, and incorporate epistemic mechanisms for interaction and clarification, and we propose seven interdependent pillars organized into three phases.

3 failure modes · 7 pillars · 3 phases · ~12 min read
Three moments when alignment breaks
Scene 01  ·  The Security Dilemma

GitHub Copilot is asked to write code involving common security patterns. Its scalar reward says "be helpful" and "be harmless," but it cannot trade the two off. The result: 40% of generated code contains exploitable vulnerabilities. Not because the model is broken, but because the optimization cannot reach the safe region: that region lies in the concave part of the Pareto frontier, forever inaccessible to linear scalarization.

→ Structural Failure: Value Flattening
Scene 02  ·  The Cultural Flattener

ChatGPT is assessed across 84 psychological and 13 cultural dimensions. Its profile matches no real human population: a 16.92% collectivism score combined with a 1.61% power distance score, a statistical composite that describes nobody. A user from East Asia asking about family obligations gets the same sanitized "pursue your passion" advice as a Western individualist. The majority was averaged. Everyone else was erased.

→ Normative Failure: Loss of Representation
Scene 03  ·  The Confident Mistake

Google Bard's inaugural demo, February 2023. A user asks about the JWST's achievements. Bard confidently claims the telescope "took the very first pictures of a planet outside our solar system." It did not: the first exoplanet images were captured in 2004. The claim was delivered with complete syntactic authority and zero epistemic hesitation. Alphabet's market cap fell by $100 billion in a single day.

→ Cognitive Failure: Uncertainty Blindness
None of these systems were fundamentally broken. None were given malicious instructions. The failures emerged from what happens when you force multi-dimensional human values through a single scalar bottleneck — and then deploy confidently at scale.

Three structural limits of scalar optimization

The failures of General Alignment are not data problems or scaling problems. They are architectural: they follow from the mathematics of scalarization. These three ceilings define the frontier that Edge Alignment must address.

Core Thesis

General Alignment (RLHF-style training that compresses diverse human preferences into a single reward signal R ∈ ℝ) rests on the Scalar Reward Hypothesis: that complex human values can be compressed into a single real-valued signal. In unambiguous cases, this works. But at decision boundaries, where values conflict, stakeholders disagree, and intent is underspecified, the scalar fallacy induces three structural ceilings that no amount of RLHF data or model scale can overcome.
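To make the contrast concrete, here is a minimal formalization in our own notation (the symbols below are illustrative, not the paper's):

```latex
% Scalar Reward Hypothesis: a single real-valued signal orders all outputs.
\pi^{*}_{\mathrm{scalar}}
  = \arg\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\bigl[\, R(x, y) \,\bigr],
  \qquad R : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}.

% Edge Alignment keeps the value vector intact and compares outputs by
% Pareto dominance, so conflicting objectives are never collapsed to one number.
\mathbf{R}(x, y)
  = \bigl( R_{1}(x, y), \ldots, R_{k}(x, y) \bigr) \in \mathbb{R}^{k},
\qquad
y \succ_{x} y' \;\iff\;
\mathbf{R}(x, y) \geq \mathbf{R}(x, y') \text{ componentwise, strictly in some coordinate.}
```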

01
Structural Limit
Value Flattening
Linear scalarization cannot recover Pareto-optimal solutions in non-convex regions of the trade-off frontier. Scalar rewards permit hierarchical collapse: sufficiently high helpfulness numerically offsets severe safety violations. The model cannot represent the non-fungible structure of human ethics.
Addressed by Pillars 1–2  ·  Phase I
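To see why this is structural rather than a tuning problem, consider a minimal sketch (ours, not the paper's): three Pareto-optimal candidates in (helpfulness, safety) space, where the balanced option sits in a concave dent of the frontier. Sweeping every weight of a linear scalarization never selects it.

```python
import numpy as np

# Three Pareto-optimal candidates in (helpfulness, safety) space.
# C is the balanced option; it sits in a concave dent of the frontier,
# off the convex hull spanned by A and B.
candidates = {
    "A: maximally helpful, unsafe": np.array([1.00, 0.00]),
    "B: maximally safe, unhelpful": np.array([0.00, 1.00]),
    "C: balanced, Pareto-optimal":  np.array([0.45, 0.45]),
}

# Sweep every linear trade-off weight w in [0, 1].
winners = set()
for w in np.linspace(0.0, 1.0, 1001):
    scores = {name: w * v[0] + (1 - w) * v[1] for name, v in candidates.items()}
    winners.add(max(scores, key=scores.get))

print(winners)
# {'A: maximally helpful, unsafe', 'B: maximally safe, unhelpful'}
# C is never selected: for every w, max(w, 1 - w) >= 0.5 > 0.45.
```

A Chebyshev (max-shortfall) scalarization or an explicitly vector-valued objective can recover C; no reweighting of a linear sum can.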
02
Normative Limit
Loss of Representation
Minimizing expected loss over aggregated preferences drives models toward the geometric mean of human values, erasing minority viewpoints (a consequence of Jensen's inequality). The resulting behavior reflects no actual user population and amplifies epistemic and power asymmetries.
Addressed by Pillars 3–5  ·  Phase II
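A toy illustration of the averaging pathology, assuming squared loss over pooled one-dimensional preferences (the paper's Jensen's-inequality argument is more general; the numbers here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two value clusters on one preference dimension (e.g. collectivism):
# 80% of annotators near +1, 20% near -1.
majority = rng.normal(loc=+1.0, scale=0.1, size=8000)
minority = rng.normal(loc=-1.0, scale=0.1, size=2000)
population = np.concatenate([majority, minority])

# A single scalar-reward policy minimizing expected squared loss over
# the pooled preferences lands at the population mean.
y_star = population.mean()

print(f"pooled optimum: {y_star:+.2f}")                    # ~ +0.60
print(f"distance to majority: {abs(y_star - 1.0):.2f}")    # ~ 0.40
print(f"distance to minority: {abs(y_star + 1.0):.2f}")    # ~ 1.60
# The optimum describes nobody: it sits 0.4 from the majority and 1.6
# from the minority, which bears roughly 4x the misalignment cost.
```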
03
Cognitive Limit
Uncertainty Blindness
Alignment is framed as one-shot prediction: given context x, output the optimal y. When the entropy of user intent given the prompt, H(I|x), is high, RLHF incentivizes assertive responses, producing intent hallucination. Models cannot recognize knowledge boundaries or seek clarification.
Addressed by Pillars 6–7  ·  Phase III
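What an epistemic mechanism could look like in code, as a minimal sketch: gate the one-shot answer on estimated intent entropy H(I|x). The intent posterior, the threshold, and every name below are assumptions for illustration, not the paper's method.

```python
import math

def intent_entropy(probs):
    """Shannon entropy H(I|x) of the model's posterior over user intents."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def respond(intents, probs, answer_for, threshold=1.0):
    """Answer one-shot only when intent is unambiguous; otherwise clarify."""
    if intent_entropy(probs) > threshold:
        return "Did you mean: " + " / ".join(intents) + "?"
    best = max(range(len(probs)), key=probs.__getitem__)
    return answer_for[intents[best]]

intents = ["python the language", "python the snake", "Monty Python"]
answer_for = {i: f"<answer about {i}>" for i in intents}

print(respond(intents, [0.95, 0.03, 0.02], answer_for))  # low H(I|x): answer directly
print(respond(intents, [0.40, 0.35, 0.25], answer_for))  # high H(I|x): ask first
```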

Seven pillars of Edge Alignment

The seven pillars are organized into three phases that form a stack of capabilities: Mathematical Structure provides the capacity to represent conflicting values; Normative Governance populates the model with pluralistic content; Dynamic Cognition grants the agency to arbitrate conflicts at inference time.

Phase I · Mathematical Structure · Pillars 1–2 · addresses Value Flattening
Phase II · Normative Governance · Pillars 3–5 · addresses Loss of Representation
Phase III · Dynamic Cognition · Pillars 6–7 · addresses Uncertainty Blindness

Four implementation challenges

Edge Alignment problems are systemic. Choices at data collection shape training objectives; training objectives determine what evaluation can detect; evaluation and deployment practices create the incentives systems follow in production.

01
Data Collection as Normative Design
Existing alignment datasets underrepresent conflicted, arbitration-style scenarios. HH-RLHF analysis shows only ~25% of samples exhibit a noticeable quality gap, and fewer than 0.5% show large differences. Annotation protocols further collapse disagreement into a single label, erasing legitimate normative ambiguity. Fix: conflict-augmented datasets with justificatory traces and diverse deliberative participation.
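To make "conflict-augmented datasets with justificatory traces" concrete, here is a hypothetical record schema (ours, not from the paper) that preserves the full label distribution and per-annotator rationales instead of collapsing them to one label:

```python
from dataclasses import dataclass, field

@dataclass
class ConflictAnnotation:
    """One preference comparison that preserves disagreement rather than collapsing it."""
    prompt: str
    response_a: str
    response_b: str
    # Full label distribution across annotators, not a single majority label.
    votes: dict = field(default_factory=dict)        # e.g. {"a": 7, "b": 5, "tie": 3}
    # Justificatory traces: which value each annotator weighed, and why.
    rationales: list = field(default_factory=list)   # [{"vote": "a", "value": "safety", "why": "..."}]
    # Demographic / cultural strata, for auditing representation.
    annotator_strata: list = field(default_factory=list)

    def is_contested(self, margin: float = 0.15) -> bool:
        """Flag samples where no option wins by more than `margin` of the vote share."""
        total = sum(self.votes.values()) or 1
        shares = sorted((v / total for v in self.votes.values()), reverse=True)
        return len(shares) > 1 and (shares[0] - shares[1]) <= margin
```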
02
Training Objectives & Representational Limits
Scalar reward compression restricts optimization to the convex hull of the Pareto frontier — provably missing non-convex trade-off regions. MGDA incurs O(n²d) cost, prohibitive at scale. Fix: PAMA reduces complexity to O(n) with closed-form updates; MODPO adapts DPO for multi-objective settings; lexicographic methods handle categorical constraints.
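For intuition on the MGDA side of this trade-off, the two-objective special case has a well-known closed form (the min-norm point in the convex hull of the two gradients). The sketch below is that textbook case, costing O(d) per step; it is not the paper's PAMA method.

```python
import numpy as np

def mgda_two_objective_direction(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    """Min-norm point in the convex hull of {g1, g2} (two-objective MGDA).

    Descending along -d decreases both objectives simultaneously
    whenever a common descent direction exists.
    """
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:                    # identical gradients: no conflict
        return g1
    # Minimize ||a*g1 + (1-a)*g2||^2 over a in [0, 1]; closed form:
    a = np.clip((g2 @ (g2 - g1)) / denom, 0.0, 1.0)
    return a * g1 + (1.0 - a) * g2

# Conflicting safety / helpfulness gradients (toy numbers):
g_safe, g_help = np.array([1.0, 0.0]), np.array([0.6, 0.8])
d = mgda_two_objective_direction(g_safe, g_help)
print(d, d @ g_safe, d @ g_help)  # positive inner products: -d descends on both
```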
03
Evaluation & Certification
Current benchmarks penalize clarification and reward confident one-shot answers. MT-Bench data shows substantial performance degradation in multi-turn settings. Collapse probability follows a fragile scaling law: λ ∝ exp(m/σ²). Fix: process-aware metrics Q(π) ≈ E[αC + βA + γO] measuring conflict recognition, arbitration quality, and outcome appropriateness.
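A sketch of the process-aware metric as written, with the per-episode judgments C, A, O and the weights α, β, γ treated as given inputs (how they are elicited is the hard part and is not shown):

```python
import numpy as np

def process_aware_score(episodes, alpha=0.3, beta=0.3, gamma=0.4):
    """Q(pi) ~ E[alpha*C + beta*A + gamma*O] over evaluation episodes.

    Each episode carries three per-trajectory judgments in [0, 1]:
      C: did the policy recognize the value conflict?
      A: was the arbitration (clarifying, weighing, refusing) sound?
      O: was the final outcome appropriate for the user's context?
    """
    c = np.array([e["C"] for e in episodes])
    a = np.array([e["A"] for e in episodes])
    o = np.array([e["O"] for e in episodes])
    return float(np.mean(alpha * c + beta * a + gamma * o))

episodes = [
    {"C": 1.0, "A": 0.8, "O": 0.9},   # recognized the conflict, asked first
    {"C": 0.0, "A": 0.0, "O": 0.6},   # confident one-shot guess that got lucky
]
print(process_aware_score(episodes))
```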
04
Governance & Community Participation
Current governance frames alignment as a developer-centric engineering task, encoding WEIRD population preferences by default. Harms cannot be resolved through technical fixes alone. Fix: enforceable participatory rights — community steering bodies, staged approval for high-risk contexts, transparency artifacts, and independent audits with community representation.

Three case studies of structural failure

Each case study illustrates a different failure mode. Together, they demonstrate that alignment ceilings are real, measurable, and consequential — not hypothetical concerns about future systems.

Value Flattening
GitHub Copilot's Security Code Generation Dilemma

Pearce et al. designed 89 scenarios requiring Copilot to generate code involving Common Weakness Enumerations. The result: 40% of generated code contained exploitable security vulnerabilities.

Copilot exhibited two pathological extremes: over-compliance (replicating ~33% of vulnerabilities in user code) and contextual blindness (generating SQL injection, OS command injection, and hardcoded credentials in 18% of cases).

The model could not represent the Pareto-optimal solution, "sanitized educational code with security annotations," because this point lies in a concave region of the frontier that is mathematically inaccessible to linear scalarization.

Structural root: Scalar rewards cannot navigate non-convex trade-off spaces. The safe region is geometrically unreachable.
Loss of Representation
ChatGPT's Hybrid Cultural Profile

Yuan et al. assessed ChatGPT across 84 psychological and 13 cultural value dimensions. ChatGPT exhibits a hybrid profile that matches no actual human population: a 16.92% collectivism score combined with a 1.61% power distance score, a statistical composite that corresponds to no real cultural group (Mantel test: r = −0.04, p = 0.507).

In the Ultimatum Game, ChatGPT stereotyped high-collectivism cultures as "less concerned with social fairness and more focused on self-interest" (r = −0.74, p = 0.002) — directly contradicting empirical research.

Users from collectivist backgrounds receive the same sanitized "pursue your passion" advice as Western individualists. The majority was averaged. Everyone else was erased.

Normative root: Jensen's inequality — minimizing expected loss over aggregate preferences erases minority viewpoints in the distribution tails.
Uncertainty Blindness
Google Bard's $100B Astronomical Error

In February 2023, during Bard's inaugural public demo, the model confidently claimed that the JWST "took the very first pictures of a planet outside our own solar system." This was factually wrong. The first exoplanet images were captured in 2004 by the European Southern Observatory.

The error was delivered with complete syntactic confidence, in response not to an adversarial query but to a straightforward factual question in a controlled promotional demo. The model's RLHF training incentivized assertiveness over epistemic caution.

Scientists identified the error within minutes. Alphabet's stock fell ~8%, erasing over $100 billion in market capitalization in a single day.

Cognitive root: One-shot RLHF incentivizes assertiveness when H(I|x) is high — producing hallucination instead of calibrated uncertainty.