Risk Injection

Risk injection transforms harmless AI trajectories into risky ones while preserving plausibility and coherence. AuraGen selects and applies risks based on scenario context and configuration.

Overview

The risk injection system:

Analyzes harmless trajectories and scenario constraints
Selects a configured risk type and injection mode
Produces realistic risky variants while maintaining metadata

Configuration Source

All risk categories and prompts are defined in config/risk_injection.yaml. Update that file to change behavior.

Risk Categories (from config/risk_injection.yaml)

Sensitive Data Privacy Violations

Name: sensitive_data_privacy_violations
Target: agent_action
Description: Unauthorized collection, use, disclosure, or leakage of sensitive, confidential, or personal information.
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Property / Financial Loss

Name: property_financial_loss
Target: agent_action
Description: Actions that directly or indirectly cause financial loss or damage to assets.
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Misinformation / Unsafe Content

Name: misinformation_unsafe_content
Target: agent_action
Description: Generation or dissemination of false, misleading, harmful, or unsafe information.
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Compromised Availability

Name: compromised_availability
Target: agent_action
Description: Disruption or degradation of service availability, including resource exhaustion or denial of service.
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Unintended / Unauthorized Actions

Name: unintended_unauthorized_actions
Target: agent_action
Description: Actions beyond user intent or permissions (e.g., privilege escalation, unauthorized automation).
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

External Adversarial Attack

Name: external_adversarial_attack
Target: agent_action
Description: Susceptibility to prompt injection, jailbreaks, adversarial inputs, and other malicious manipulation.
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Bias / Discrimination

Name: bias_discrimination
Target: agent_action
Description: Outputs or decisions reflecting unfair bias, discrimination, or lack of fairness.
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only

Lack of Accountability / Traceability

Name: lack_accountability_traceability
Target: agent_action
Description: Insufficient logging or explainability that impairs auditing or responsibility assignment.
Injection modes: single_action, multiple_actions, action_chain_with_response, action_chain_only
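
Because all eight entries above share the same shape, a (risk, mode) pair can be checked before it is used. The following is a minimal sketch, not AuraGen's implementation: `RiskCategory`, `CATEGORIES`, and `check_selection` are hypothetical names introduced for illustration, and only the category names, target, and mode names are taken from the list above.

```python
from dataclasses import dataclass

# The four injection modes listed for every category above.
ALL_MODES = (
    "single_action",
    "multiple_actions",
    "action_chain_with_response",
    "action_chain_only",
)


@dataclass(frozen=True)
class RiskCategory:
    """Hypothetical record mirroring one category entry (not an AuraGen class)."""
    name: str
    target: str = "agent_action"
    injection_modes: tuple = ALL_MODES


# Category names copied from the list above.
CATEGORIES = {
    name: RiskCategory(name)
    for name in (
        "sensitive_data_privacy_violations",
        "property_financial_loss",
        "misinformation_unsafe_content",
        "compromised_availability",
        "unintended_unauthorized_actions",
        "external_adversarial_attack",
        "bias_discrimination",
        "lack_accountability_traceability",
    )
}


def check_selection(risk_name: str, mode: str) -> RiskCategory:
    """Return the category if the (risk, mode) pair is valid, else raise ValueError."""
    category = CATEGORIES.get(risk_name)
    if category is None:
        raise ValueError(f"unknown risk category: {risk_name!r}")
    if mode not in category.injection_modes:
        raise ValueError(f"{risk_name!r} does not support mode {mode!r}")
    return category
```

A check like this catches typos in a category or mode name early, before any trajectory is processed.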

Injection Modes

single_action: Modify a single step
multiple_actions: Modify multiple selected steps
action_chain_with_response: Modify a chain of actions and the response
action_chain_only: Modify a chain of actions without changing the response
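
The difference between the four modes can be illustrated by listing which parts of a multi-step trajectory each one touches. This sketch rests on assumptions that the text above does not confirm: that a chain runs from a start index to the last step, and that single_action and multiple_actions leave the response unchanged. The function `plan_injection` and its parameters are hypothetical and not part of AuraGen.

```python
def plan_injection(mode, num_steps, target_indices=(0,), chain_start_index=0):
    """Return (indices of steps to modify, whether the response is modified).

    Illustrative only: how AuraGen actually picks steps is not described here.
    """
    if num_steps < 1:
        raise ValueError("trajectory needs at least one step")
    if not target_indices:
        raise ValueError("target_indices must not be empty")
    out_of_range = any(not 0 <= i < num_steps for i in target_indices)
    if out_of_range or not 0 <= chain_start_index < num_steps:
        raise ValueError("index out of range")

    if mode == "single_action":
        # Assumption: the response is left as-is in this mode.
        return [target_indices[0]], False
    if mode == "multiple_actions":
        # Assumption: the response is left as-is in this mode.
        return sorted(set(target_indices)), False
    if mode in ("action_chain_with_response", "action_chain_only"):
        # Assumption: a chain runs from chain_start_index to the last step.
        chain = list(range(chain_start_index, num_steps))
        return chain, mode == "action_chain_with_response"
    raise ValueError(f"unknown injection mode: {mode!r}")


for mode in ("single_action", "multiple_actions",
             "action_chain_with_response", "action_chain_only"):
    print(mode, plan_injection(mode, num_steps=4,
                               target_indices=(1, 3), chain_start_index=2))
```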

Basic Usage

from AuraGen.injection import RiskInjector
from AuraGen.models import Trajectory

# Load configuration from YAML
injector = RiskInjector.from_yaml("config/risk_injection.yaml")

# Example harmless trajectory
harmless = Trajectory(
    scenario_name="email_assistant",
    user_request="Draft an email to confirm tomorrow's meeting.",
    agent_action="compose_email",
    agent_response="Sure, I'll draft a professional confirmation email."
)

# Inject risk
risky = injector.inject_risk(harmless)
print(risky.metadata.get("risk_type"))

Manual vs. Automatic Target Selection

Automatic: Set injection.auto_select_targets: true (the default)
Manual: Add entries to injection_configs that specify indices such as target_indices or chain_start_index
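
The two selection paths could be resolved along the following lines. This is a sketch under stated assumptions: plain dicts stand in for the `injection` section and for one `injection_configs` entry, the automatic branch uses a random step as a placeholder because the real selection policy is not described here, and `resolve_targets` is a hypothetical helper rather than an AuraGen function.

```python
import random


def resolve_targets(settings, manual_entry, num_steps, rng=None):
    """Pick step indices either automatically or from a manual entry.

    settings     -- dict standing in for the `injection` section
    manual_entry -- dict standing in for one `injection_configs` entry
    """
    if rng is None:
        rng = random.Random()
    if settings.get("auto_select_targets", True):
        # Placeholder policy: one uniformly random step. The real
        # selection logic is not specified in this section.
        return [rng.randrange(num_steps)]

    if "target_indices" in manual_entry:
        indices = list(manual_entry["target_indices"])
    elif "chain_start_index" in manual_entry:
        # Assumption: a chain runs from the start index to the last step.
        indices = list(range(manual_entry["chain_start_index"], num_steps))
    else:
        raise ValueError("manual selection needs target_indices or chain_start_index")

    if not indices or any(not 0 <= i < num_steps for i in indices):
        raise ValueError(f"indices {indices} invalid for {num_steps} steps")
    return indices


manual = {"auto_select_targets": False}
print(resolve_targets(manual, {"target_indices": [0, 2]}, num_steps=3))
print(resolve_targets(manual, {"chain_start_index": 1}, num_steps=4))
```

Validating indices against the trajectory length up front keeps a misconfigured manual entry from failing later, in the middle of an injection run.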

Outputs

Preserves original structure (request, action, response)
Adds risk metadata (e.g., risk_type, injection_mode)
Saves files in the format set by output.file_format in config/risk_injection.yaml
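
To make the output shape concrete, the sketch below builds a risky record that keeps the request, action, and response fields and adds the two metadata keys named above. `Record` and `with_risk` are hypothetical stand-ins, not AuraGen's `Trajectory` or injector, and the risky action string is invented. JSON is used only as an example serialization, since the values accepted by output.file_format are not listed here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a trajectory record (not AuraGen's Trajectory)."""
    user_request: str
    agent_action: str
    agent_response: str
    metadata: dict = field(default_factory=dict)


def with_risk(record, risky_action, risk_type, injection_mode):
    """Return a copy with the action replaced and risk metadata added."""
    metadata = {
        **record.metadata,
        "risk_type": risk_type,
        "injection_mode": injection_mode,
    }
    return Record(record.user_request, risky_action, record.agent_response, metadata)


harmless = Record(
    user_request="Draft an email to confirm tomorrow's meeting.",
    agent_action="compose_email",
    agent_response="Sure, I'll draft a professional confirmation email.",
)
risky = with_risk(
    harmless,
    risky_action="compose_email_and_cc_external_address",  # made-up risky variant
    risk_type="sensitive_data_privacy_violations",
    injection_mode="single_action",
)

# Same top-level fields as the harmless record, plus populated metadata.
print(json.dumps(asdict(risky), indent=2))
```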