Large language models are deployed in complex socio-technical systems that expose the limits of current General Alignment practice: the dominant paradigm of using RLHF to compress diverse human values into a single scalar reward (Helpful, Honest, Harmless). We argue that this scalarization, which reduces a vector of values (safety, helpfulness, fairness, ...) to a single number by weighted sum, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible uncertainty. These failures are mathematical, not incidental: they follow from the incentives of scalarization and lead to value flattening, representation loss, and uncertainty blindness. We introduce Edge Alignment as a distinct approach, one in which systems preserve multi-dimensional value structure, support plural representation, and incorporate epistemic mechanisms for interaction and clarification, and propose seven interdependent pillars organized into three phases.
GitHub Copilot is asked to write code involving common security patterns. Its scalar reward says "be helpful" and "be harmless" — but it cannot weigh them. The result: 40% of generated code contains exploitable vulnerabilities. Not because the model is broken. Because the optimization cannot reach the safe region — it lies in the concave part of the Pareto frontier, forever inaccessible to linear scalarization.
→ Structural Failure: Value Flattening

ChatGPT is assessed across 84 psychological and 13 cultural dimensions. Its profile matches no real human population. It combines 16.92% collectivism with low power distance, a statistical composite that describes nobody. A user from East Asia asking about family obligations gets the same sanitized "pursue your passion" advice as a Western individualist. The majority was averaged. Everyone else was erased.
→ Normative Failure: Loss of Representation

Google Bard's inaugural demo. A user asks about the JWST's achievements. Bard confidently claims the telescope "took the very first pictures of a planet outside our solar system." It did not. The first exoplanet images were captured in 2004. The claim was delivered with complete syntactic authority and zero epistemic hesitation. Alphabet's market cap fell by $100 billion in a single day.
→ Cognitive Failure: Uncertainty Blindness

The failures of General Alignment are not data problems or scaling problems. They are architectural: they follow from the mathematics of scalarization. These three ceilings define the frontier that Edge Alignment must address.
General Alignment (RLHF-style training that compresses diverse human preferences into a single reward signal R ∈ ℝ) rests on the Scalar Reward Hypothesis: that complex human values can be compressed into a single real-valued signal. In unambiguous cases, this works. But at decision boundaries, where values conflict, stakeholders disagree, and intent is underspecified, the scalar fallacy induces three structural ceilings that no amount of RLHF data or model scale can overcome.
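The core problem with the Scalar Reward Hypothesis can be sketched in a few lines. All dimension names and weights below are hypothetical, chosen only to illustrate how a weighted sum collapses genuinely different value trade-offs into indistinguishable rewards:

```python
import numpy as np

# Hypothetical value dimensions and fixed scalarization weights.
DIMS = ["helpfulness", "harmlessness", "honesty"]
w = np.array([0.4, 0.4, 0.2])

def scalar_reward(values: np.ndarray) -> float:
    """Compress a value vector into a single reward via weighted sum."""
    return float(w @ values)

# Two very different responses...
a = np.array([0.9, 0.1, 0.5])  # helpful but unsafe
b = np.array([0.1, 0.9, 0.5])  # safe but unhelpful

# ...receive identical scalar rewards: the conflict between
# helpfulness and harmlessness is erased, not resolved.
print(scalar_reward(a), scalar_reward(b))
```

Once the vector is collapsed, no downstream optimizer can recover which trade-off a given reward value represents; the conflict itself becomes invisible to training.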
The seven pillars are organized into three phases that form a stack of capabilities: Mathematical Structure provides the capacity to represent conflicting values; Normative Governance populates the model with pluralistic content; Dynamic Cognition grants the agency to arbitrate conflicts at inference time.
| Pillar | Name | Description | Addresses |
|---|---|---|---|
The problems Edge Alignment targets are systemic: choices at data collection shape training objectives; training objectives determine what evaluation can detect; and evaluation and deployment practices create the incentives systems follow in production.
Each case study illustrates a different failure mode. Together, they demonstrate that alignment ceilings are real, measurable, and consequential — not hypothetical concerns about future systems.
Pearce et al. designed 89 scenarios requiring Copilot to generate code involving patterns from the Common Weakness Enumeration (CWE). The result: 40% of generated code contained exploitable security vulnerabilities.
Copilot exhibited two pathological extremes: over-compliance (replicating ~33% of vulnerabilities in user code) and contextual blindness (generating SQL injection, OS command injection, and hardcoded credentials in 18% of cases).
The model could not represent the Pareto-optimal solution — "sanitized educational code with security annotations" — because this point lies in the concave region of the frontier, which is mathematically inaccessible to linear scalarization.
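The inaccessibility claim is easy to verify numerically. In this sketch, candidates A, B, and C and their two-axis scores are illustrative stand-ins (not measurements from the study): C is Pareto-optimal but sits in the concave region below the segment AB, so no linear weighting ever selects it:

```python
import numpy as np

# Hypothetical candidate completions scored as (helpfulness, security).
candidates = {
    "A: fully functional, insecure": np.array([1.00, 0.00]),
    "B: refusal, perfectly safe":    np.array([0.00, 1.00]),
    "C: sanitized + annotations":    np.array([0.45, 0.45]),  # Pareto-optimal, concave region
}

def winner(lam: float) -> str:
    """Candidate maximizing the linear scalarization lam*help + (1-lam)*sec."""
    return max(candidates, key=lambda k: lam * candidates[k][0] + (1 - lam) * candidates[k][1])

# Sweep every weighting in [0, 1]. C never wins, because for any lam
# the best of A and B scores max(lam, 1-lam) >= 0.5 > 0.45.
winners = {winner(lam) for lam in np.linspace(0, 1, 101)}
print(winners)
```

Moving C above the AB segment (e.g. to (0.6, 0.6)) would make it reachable; the failure is specific to concave regions of the frontier, which is exactly where "balance both values" solutions tend to live.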
Yuan et al. assessed ChatGPT across 84 psychological and 13 cultural value dimensions. ChatGPT exhibits a unique hybrid profile that matches no actual human population: 16.92% collectivism combined with 1.61% power distance — a statistical composite that satisfies no real cultural group (Mantel test: r = −0.04, p = 0.507).
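The Mantel statistic cited above (r = −0.04, p = 0.507) measures correlation between two distance matrices with a permutation-based p-value. This is a generic sketch of the test, not the study's code; the input matrices here are synthetic:

```python
import numpy as np

def mantel(D1, D2, n_perm=999, seed=0):
    """Mantel test: Pearson r between the upper triangles of two distance
    matrices, with a p-value from jointly permuting rows/columns of D1."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(D1, k=1)
    r_obs = np.corrcoef(D1[iu], D2[iu])[0, 1]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(D1.shape[0])
        r = np.corrcoef(D1[perm][:, perm][iu], D2[iu])[0, 1]
        if abs(r) >= abs(r_obs):
            count += 1
    return r_obs, (count + 1) / (n_perm + 1)

# Synthetic demo: distance matrices from two unrelated point clouds,
# standing in for "model profile distances" vs. "human profile distances".
rng = np.random.default_rng(1)
X, Y = rng.random((8, 3)), rng.random((8, 3))
D1 = np.linalg.norm(X[:, None] - X[None], axis=-1)
D2 = np.linalg.norm(Y[:, None] - Y[None], axis=-1)
r, p_val = mantel(D1, D2)
print(r, p_val)
```

An r near zero with a large p-value, as the study reports for ChatGPT, means the model's cross-cultural distance structure shows no detectable correspondence to the human one.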
In the Ultimatum Game, ChatGPT stereotyped high-collectivism cultures as "less concerned with social fairness and more focused on self-interest" (r = −0.74, p = 0.002) — directly contradicting empirical research.
Users from collectivist backgrounds receive the same sanitized "pursue your passion" advice as Western individualists. The majority was averaged. Everyone else was erased.
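The erasure mechanism is simple to demonstrate: averaging distinct value profiles yields a composite that is distant from every population that contributed to it. The profiles and numbers below are illustrative, not the study's measurements:

```python
import numpy as np

# Hypothetical value profiles on two Hofstede-style dimensions
# (collectivism, power distance), scaled 0-1.
profiles = {
    "population_1": np.array([0.9, 0.8]),
    "population_2": np.array([0.2, 0.3]),
}

# Pooling preference data under a single scalar reward implicitly
# computes something like this average...
composite = np.mean(list(profiles.values()), axis=0)

# ...which sits far from *every* real population it was built from.
dists = {name: float(np.linalg.norm(composite - p)) for name, p in profiles.items()}
print(composite, dists)
```

The composite minimizes average disagreement, which is exactly why it represents no one: with polarized populations, the point of least total error is a profile nobody holds.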
In February 2023, during Bard's inaugural public demo, the model confidently claimed that the JWST "took the very first pictures of a planet outside our own solar system." This was factually wrong. The first exoplanet images were captured in 2004 by the European Southern Observatory.
The error was delivered with complete syntactic confidence — not an adversarial query, but a straightforward factual question in a controlled promotional demo. The model's RLHF training incentivized assertiveness over epistemic caution.
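The incentive structure can be made concrete with a toy reward model. The reward values below are hypothetical, chosen only to encode two documented rater tendencies (correctness is hard to verify; hedging reads as unhelpful); under them, the assertive policy dominates at any accuracy level:

```python
# Toy rater reward (hypothetical numbers): correctness is only weakly
# rewarded because raters rarely fact-check, while hedging is penalized
# because it reads as unhelpful.
def rater_reward(correct: bool, hedged: bool) -> float:
    reward = 1.0 if correct else 0.6
    return reward - (0.3 if hedged else 0.0)

def expected_reward(p_correct: float, hedged: bool) -> float:
    """Expected rater reward for a policy that is right with prob p_correct."""
    return (p_correct * rater_reward(True, hedged)
            + (1 - p_correct) * rater_reward(False, hedged))

# Even at 50% accuracy, confident assertion beats calibrated hedging.
print(expected_reward(0.5, hedged=False), expected_reward(0.5, hedged=True))
```

Because the hedging penalty applies regardless of whether the answer is right, the reward-maximizing policy asserts everything with full confidence — the Bard failure mode in miniature.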
Scientists identified the error within minutes. Alphabet's stock fell ~8%, erasing over $100 billion in market capitalization in a single day.