helia/plans/safety-guardrail-architecture.md
Santiago Martinez-Avial 5ef0fc0ccc DEL
2025-12-22 18:46:58 +01:00

Plan: Safety Guardrail Architecture (Post-MVP)

Overview

This plan describes a dedicated, parallel Safety Guardrail Agent that monitors clinical sessions for immediate risks (self-harm, suicidal ideation, harm to others) and intervenes regardless of the primary assessment agent's state. The component is critical for "Duty of Care" compliance but is scoped out of the initial MVP so that work can focus on the core scoring pipeline.

Problem Statement

General-purpose reasoning agents (like the PHQ-8 scorer) often exhibit "tunnel vision," focusing exclusively on their analytical task while missing or delaying the flagging of critical safety signals. In a clinical context, waiting for a 60-second reasoning loop to finish before flagging a suicide risk is unacceptable.

Proposed Solution

Use a Parallel Supervisor pattern in which the Safety Agent runs asynchronously alongside the main Assessment Agent.

Architecture

graph TD
    Router{Router}

    subgraph "Main Flow"
        Router --> Assessment[Assessment Agent]
    end

    subgraph "Safety Layer"
        Router --> Safety[Safety Guardrail]
        Safety --> |Risk Detected| Interrupt[Interrupt Signal]
    end

    Assessment --> Merger
    Interrupt --> Merger
    Merger --> Handler{Risk Handling}

Technical Approach

1. The Safety Agent Node

  • Model: Uses a smaller, faster model (e.g., Llama-3-8B-Instruct or a specialized BERT classifier) optimized for classification, not reasoning.
  • Prompting: Few-shot prompted specifically for:
    • Suicidal Ideation (Passive vs Active)
    • Self-Harm Intent
    • Harm to Others
  • Output: Boolean flag (risk_detected) + risk_category + evidence_snippet.
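
The output contract above could be sketched as follows. This is a minimal illustration, not the project's schema: it uses a stdlib dataclass as a stand-in for the planned Pydantic model, and the enum values, the `parse_classifier_output` helper, and the assumption that the classifier replies with a JSON object are all hypothetical.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RiskCategory(str, Enum):
    # Categories mirror the few-shot prompt list; the string values are assumptions.
    PASSIVE_SUICIDAL_IDEATION = "passive_suicidal_ideation"
    ACTIVE_SUICIDAL_IDEATION = "active_suicidal_ideation"
    SELF_HARM_INTENT = "self_harm_intent"
    HARM_TO_OTHERS = "harm_to_others"


@dataclass(frozen=True)
class SafetyAlert:
    # Stand-in for the planned Pydantic model; same three fields as the plan.
    risk_detected: bool
    risk_category: Optional[RiskCategory] = None
    evidence_snippet: Optional[str] = None


def parse_classifier_output(raw: str) -> SafetyAlert:
    """Convert the classifier's JSON reply (assumed format) into a SafetyAlert."""
    data = json.loads(raw)
    # A missing "risk_detected" key raises KeyError rather than silently passing.
    if not bool(data["risk_detected"]):
        return SafetyAlert(risk_detected=False)
    return SafetyAlert(
        risk_detected=True,
        # Raises ValueError if the model returns a label outside the enum.
        risk_category=RiskCategory(data["risk_category"]),
        evidence_snippet=data.get("evidence_snippet"),
    )
```

Restricting `risk_category` to a closed enum means an unexpected label from the model surfaces as an error instead of flowing into downstream routing unnoticed.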

2. Parallel Execution in LangGraph

  • Fan-Out: The Router (supervisor) node dispatches every transcript chunk to both assessment_node and safety_node.
  • Race Condition Handling:
    • If safety_node returns risk_detected=True, it must either raise a NodeInterrupt or write a high-priority state update that overrides the Assessment Agent's output, without waiting for assessment_node to finish.
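
The fan-out and override behaviour can be illustrated with plain `asyncio`. This sketch deliberately does not use the LangGraph API; it only shows the intended timing, where the fast safety verdict is acted on before the slow assessment completes. Both node bodies are placeholders (the substring check in particular is not a real classifier), and `process_chunk` is a hypothetical name.

```python
import asyncio


async def assessment_node(chunk: str) -> dict:
    # Stand-in for the slow PHQ-8 reasoning loop.
    await asyncio.sleep(0.2)
    return {"phq8_partial": "scored", "chunk": chunk}


async def safety_node(chunk: str) -> dict:
    # Stand-in for the fast classifier; the substring check is a placeholder only.
    await asyncio.sleep(0.01)
    return {"risk_detected": "hurt myself" in chunk.lower()}


async def process_chunk(chunk: str) -> dict:
    # Fan-out: start both nodes concurrently for the same transcript chunk.
    assessment = asyncio.create_task(assessment_node(chunk))
    safety = asyncio.create_task(safety_node(chunk))

    # Act on the safety verdict as soon as it arrives, not when assessment ends.
    verdict = await safety
    if verdict["risk_detected"]:
        assessment.cancel()
        try:
            await assessment
        except asyncio.CancelledError:
            pass  # Expected: the assessment was cancelled on purpose.
        return {"is_session_halted": True, "assessment": None}

    return {"is_session_halted": False, "assessment": await assessment}


if __name__ == "__main__":
    print(asyncio.run(process_chunk("I want to hurt myself")))
    print(asyncio.run(process_chunk("I slept badly this week")))
```

In the real graph, the cancel-and-return branch would correspond to the interrupt or overriding state update described above. Whether the chosen LangGraph mechanism can actually pre-empt a branch that is already running is worth verifying early.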

3. Integration Points (Post-MVP)

  • State Schema:
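    from typing import List
    from pydantic import BaseModel, Field

    class ClinicalState(BaseModel):
        # ... existing fields ...
        # Alerts raised by the safety node; default_factory gives each state a fresh list.
        safety_flags: List[SafetyAlert] = Field(default_factory=list)
        is_session_halted: bool = False  # True routes the graph to the Crisis Protocol node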
    
  • Transition Logic: If is_session_halted becomes True, the graph routes immediately to a "Crisis Protocol" node, bypassing all remaining PHQ-8 items.
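
The transition rule could be expressed as a small routing function, sketched here under stated assumptions. The node names, the `remaining_items` field, and the `route_after_merge` name are hypothetical, and the dataclass is a stdlib stand-in for the Pydantic state so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical node names; the plan only names the "Crisis Protocol" node.
CRISIS_NODE = "crisis_protocol"
NEXT_ITEM_NODE = "assess_next_item"
FINALIZE_NODE = "finalize"


@dataclass
class ClinicalState:
    # Stand-in for the Pydantic state; remaining_items is an assumed field.
    safety_flags: List[str] = field(default_factory=list)
    is_session_halted: bool = False
    remaining_items: List[str] = field(default_factory=list)


def route_after_merge(state: ClinicalState) -> str:
    """Return the name of the next node to run."""
    # The halt check comes first so it bypasses any remaining PHQ-8 items.
    if state.is_session_halted:
        return CRISIS_NODE
    if state.remaining_items:
        return NEXT_ITEM_NODE
    return FINALIZE_NODE
```

A function of this shape, taking the state and returning a node name, is the kind of callable that LangGraph's `add_conditional_edges` accepts as its routing function. Checking the halt flag first is what guarantees the bypass.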

Implementation Plan

  1. Define Safety Schema: Create SafetyAlert Pydantic model.
  2. Implement Guardrail Node: Create src/helia/agent/nodes/safety.py.
  3. Update Graph: Modify src/helia/agent/graph.py to add the parallel edge.
  4. Test Scenarios: Create synthetic transcripts with hidden self-harm indicators to verify interruption works.
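
Step 4 might start from a small table of synthetic cases like the one below. Everything here is illustrative: the transcripts are invented, `placeholder_guardrail` is a substring stub standing in for the real safety node, and `failed_cases` is a hypothetical helper.

```python
from typing import Callable, List, Tuple

# (transcript, expected_halt) pairs; all text is synthetic.
CASES: List[Tuple[str, bool]] = [
    ("Work has been busy, and I mostly sleep fine.", False),
    # The risk statement is buried between unrelated remarks on purpose.
    ("I'm tired a lot. Honestly, sometimes I think about hurting myself. "
     "Anyway, my appetite is okay.", True),
    ("My appetite is low and I feel down most days.", False),
]


def placeholder_guardrail(transcript: str) -> bool:
    # Placeholder only: the real node would call the safety classifier.
    return "hurting myself" in transcript.lower()


def failed_cases(
    guardrail: Callable[[str], bool],
    cases: List[Tuple[str, bool]],
) -> List[str]:
    """Return the transcripts where the guardrail disagrees with the label."""
    return [text for text, expected in cases if guardrail(text) != expected]


if __name__ == "__main__":
    print(failed_cases(placeholder_guardrail, CASES))
```

Keeping the cases as data makes it straightforward to add missed-risk and false-alarm examples as they are discovered, and to run the same table against each candidate model.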

References