Risk Injection

Risk injection transforms harmless AI trajectories into risky ones while preserving plausibility and coherence. AuraGen selects and applies risks based on scenario context and configuration.

Overview

The risk injection system:

  1. Analyzes harmless trajectories and scenario constraints

  2. Selects a configured risk type and injection mode

  3. Produces realistic risky variants while maintaining metadata

Configuration Source

All risk categories and prompts are defined in config/risk_injection.yaml. Update that file to change behavior.

Risk Categories (from config/risk_injection.yaml)

Sensitive Data Privacy Violations

  • Name: sensitive_data_privacy_violations

  • Target: agent_action

  • Description: Unauthorized collection, use, disclosure, or leakage of sensitive, confidential, or personal information.

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Property / Financial Loss

  • Name: property_financial_loss

  • Target: agent_action

  • Description: Actions that directly or indirectly cause financial loss or damage to assets.

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Misinformation / Unsafe Content

  • Name: misinformation_unsafe_content

  • Target: agent_action

  • Description: Generation or dissemination of false, misleading, harmful, or unsafe information.

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Compromised Availability

  • Name: compromised_availability

  • Target: agent_action

  • Description: Disruption or degradation of service availability, including resource exhaustion or denial of service.

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Unintended / Unauthorized Actions

  • Name: unintended_unauthorized_actions

  • Target: agent_action

  • Description: Actions beyond user intent or permissions (e.g., privilege escalation, unauthorized automation).

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

External Adversarial Attack

  • Name: external_adversarial_attack

  • Target: agent_action

  • Description: Susceptibility to prompt injection, jailbreaks, adversarial inputs, and other malicious manipulation.

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Bias / Discrimination

  • Name: bias_discrimination

  • Target: agent_action

  • Description: Outputs or decisions reflecting unfair bias, discrimination, or lack of fairness.

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Lack of Accountability / Traceability

  • Name: lack_accountability_traceability

  • Target: agent_action

  • Description: Insufficient logging or explainability that impairs auditing or responsibility assignment.

  • Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Injection Modes

  • single_action: Modify a single step

  • multiple_actions: Modify multiple selected steps

  • action_chain_with_response: Modify a chain of actions and the response

  • action_chain_only: Modify the chain without changing the response

Basic Usage

from AuraGen.injection import RiskInjector
from AuraGen.models import Trajectory
from AuraGen.utils import load_yaml

# Load configuration from YAML
injector = RiskInjector.from_yaml("config/risk_injection.yaml")

# Example harmless trajectory
harmless = Trajectory(
    scenario_name="email_assistant",
    user_request="Draft an email to confirm tomorrow's meeting.",
    agent_action="compose_email",
    agent_response="Sure, I'll draft a professional confirmation email."
)

# Inject risk
risky = injector.inject_risk(harmless)
print(risky.metadata.get("risk_type"))

Manual vs. Automatic Target Selection

  • Automatic: Set injection.auto_select_targets: true (default)

  • Manual: Use entries in injection_configs with indices like target_indices or chain_start_index

Outputs

  • Preserves original structure (request, action, response)

  • Adds risk metadata (e.g., risk_type, injection_mode)

  • Saved format controlled by output.file_format in config/risk_injection.yaml