Plan: Modular Agentic Framework for Clinical Assessment (Helia)

Overview

Implement a production-grade, privacy-first Agentic Framework using LangGraph to automate PHQ-8 clinical assessments. The system allows dynamic switching between Local (Tier 1), Self-Hosted (Tier 2), and Cloud (Tier 3) models to benchmark performance.

Problem Statement

The current system relies on a monolithic script (src/helia/agent/workflow.py is a placeholder) and single-pass evaluation logic that likely underperforms on smaller local models. To prove the thesis hypothesis (that local models can match cloud performance), we need a Stateful Architecture that implements Multi-Stage Reasoning (the "RISEN" pattern) and robust Human-in-the-Loop (HITL) workflows.

Proposed Solution

A Hierarchical Agent Supervisor architecture built with LangGraph:

  1. Supervisor: Orchestrates the workflow and manages state.
  2. Assessment Agent: Implements the "RISEN" (Reasoning Improvement via Stage-wise Evaluation Network) pattern:
    • Extract: Quote relevant patient text.
    • Map: Align quotes to PHQ-8 criteria.
    • Score: Assign 0-3 value.
  3. Ingestion: Standardizes data from S3/Local into a ClinicalState.
  4. Benchmarking: Automates the comparison between Generated Scores vs. Ground Truth (DAIC-WOZ labels).

Note: A dedicated Safety Guardrail agent has been designed but is scoped out of this MVP. See plans/safety-guardrail-architecture.md for details.

Technical Approach

Architecture: The "Helia Graph"

```mermaid
graph TD
    Start --> Ingestion
    Ingestion --> Router{Router}

    subgraph "Assessment Agent (RISEN)"
        Router --> Extract[Extract Evidence]
        Extract --> Map[Map to Criteria]
        Map --> Score[Score Item]
        Score --> NextItem{Next Item?}
        NextItem -- Yes --> Extract
    end

    NextItem -- No --> HumanReview["Human Review (HITL)"]
    HumanReview --> Finalize[Finalize & Persist]
```
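
A minimal wiring sketch of this graph in LangGraph is shown below. ClinicalState and the node functions are the deliverables of Phases 1, 2, and 4; the router helper and the folding of the diagram's Router node into a direct edge are illustrative assumptions, not the final design.

```python
# Sketch only: wiring the Helia Graph in LangGraph.
# ClinicalState and the node functions are assumed from the phases below;
# the diagram's Router node is folded into a plain edge here for brevity.
from langgraph.graph import StateGraph, START, END

def route_next_item(state: ClinicalState) -> str:
    # Loop back to Extract until all 8 PHQ-8 items have been scored.
    return "extract" if state.current_item < 8 else "human_review"

builder = StateGraph(ClinicalState)
builder.add_node("ingestion", ingestion_node)
builder.add_node("extract", extract_node)
builder.add_node("map", map_node)
builder.add_node("score", score_node)
builder.add_node("human_review", human_review_node)
builder.add_node("finalize", finalize_node)

builder.add_edge(START, "ingestion")
builder.add_edge("ingestion", "extract")
builder.add_edge("extract", "map")
builder.add_edge("map", "score")
builder.add_conditional_edges("score", route_next_item)  # "Next Item?" decision
builder.add_edge("human_review", "finalize")
builder.add_edge("finalize", END)

graph = builder.compile()
```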

Implementation Phases

Phase 1: Core Graph & State Management (Foundation)

  • Goal: Establish the LangGraph structure and Pydantic State.
  • Deliverables:
    • src/helia/agent/state.py: Define ClinicalState (transcript, current_item, scores); see the sketch after this list.
    • src/helia/agent/graph.py: Define the main StateGraph with Ingestion -> Assessment -> Persistence nodes.
    • src/helia/ingestion/loader.py: Add "Ground Truth" loading for DAIC-WOZ labels (critical for benchmarking).
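
A rough sketch of what state.py could define, assuming a Pydantic model. Only transcript, current_item, and scores come from the deliverable above; the remaining fields are illustrative working state.

```python
# Sketch of src/helia/agent/state.py. Fields beyond transcript/current_item/scores
# are illustrative working state, not confirmed deliverables.
from pydantic import BaseModel, Field

class ClinicalState(BaseModel):
    transcript: str                                        # ingested interview text
    current_item: int = 0                                  # PHQ-8 item being assessed (0-7)
    evidence: list[str] = Field(default_factory=list)      # Extract-stage quotes for the current item
    criteria_mapping: str | None = None                    # Map-stage output for the current item
    scores: dict[int, int] = Field(default_factory=dict)   # item index -> 0-3 score
    ground_truth: dict[int, int] | None = None             # DAIC-WOZ labels for benchmarking
```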

Phase 2: The "RISEN" Assessment Logic

  • Goal: Replace monolithic PHQ8Evaluator with granular nodes.
  • Deliverables:
    • src/helia/agent/nodes/assessment.py: Implement extract_node, map_node, score_node (sketched below).
    • src/helia/prompts/: Create specialized prompt templates for each stage (optimized for Llama 3).
    • Refactor: Update PHQ8Evaluator to be callable as a tool/node rather than a standalone class.
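
As a hedged illustration of the intended granularity in assessment.py: `call_llm` and the *_PROMPT templates are placeholders for the tier-resolved model client and the Phase 2 prompt files, and each node returns a partial ClinicalState update.

```python
# Sketch of src/helia/agent/nodes/assessment.py. `call_llm` and the *_PROMPT
# templates are placeholders; each node returns a partial state update.
def extract_node(state: ClinicalState) -> dict:
    # Assumed to return a parsed list of verbatim quotes for the current item.
    quotes = call_llm(EXTRACT_PROMPT.format(
        transcript=state.transcript, item=state.current_item))
    return {"evidence": quotes}

def map_node(state: ClinicalState) -> dict:
    mapping = call_llm(MAP_PROMPT.format(
        evidence=state.evidence, item=state.current_item))
    return {"criteria_mapping": mapping}

def score_node(state: ClinicalState) -> dict:
    score = call_llm(SCORE_PROMPT.format(mapping=state.criteria_mapping))
    return {
        "scores": {**state.scores, state.current_item: int(score)},
        "current_item": state.current_item + 1,
        "evidence": [],               # reset working fields for the next item
        "criteria_mapping": None,
    }
```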

Phase 3: Tier Switching & Execution

  • Goal: Implement dynamic model configuration so the same graph can run on any tier.
  • Deliverables:
    • src/helia/configuration.py: Ensure RunConfig (Tier 1/2/3) propagates to LangGraph's configurable params (see the sketch below).
    • src/helia/agent/runner.py: CLI entry point to run batch benchmarks.
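
One plausible shape for the tier switch, assuming the runner passes the tier through LangGraph's per-invocation config. The tier names and model identifiers below are placeholders for whatever RunConfig actually defines.

```python
# Sketch: resolving a chat model from LangGraph's per-run configurable params.
# Tier names and model identifiers are placeholders for the real RunConfig mapping.
from langchain_core.runnables import RunnableConfig

TIER_MODELS = {
    "tier1": "llama3",      # local (e.g. Ollama)
    "tier2": "llama3:70b",  # self-hosted
    "tier3": "gpt-4",       # cloud
}

def resolve_model(config: RunnableConfig) -> str:
    tier = config.get("configurable", {}).get("tier", "tier1")
    return TIER_MODELS[tier]

# The runner would then invoke the same compiled graph, changing only the config:
# graph.invoke(initial_state, config={"configurable": {"tier": "tier3"}})
```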

Phase 4: Human-in-the-Loop & Persistence

  • Goal: Enable clinician review and data saving.
  • Deliverables:
    • Checkpointing: Configure MongoDB/Postgres checkpointer for LangGraph.
    • Review Flow: Implement the interrupt_before logic for the "Finalize" node (sketched after this list).
    • Metrics: Calculate "Item-Level Agreement" (MAE/Kappa) between Agent and Ground Truth.
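
A minimal sketch of the review flow, using LangGraph's in-memory checkpointer as a stand-in for the MongoDB/Postgres one; the thread id and the reviewed_scores update are illustrative.

```python
# Sketch: pause before Finalize so a clinician can review, then resume.
# MemorySaver stands in for the MongoDB/Postgres checkpointer planned here.
from langgraph.checkpoint.memory import MemorySaver

graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["finalize"])

thread = {"configurable": {"thread_id": "participant-0042"}}   # illustrative id
graph.invoke(initial_state, config=thread)    # runs Ingest -> Assess, then pauses

# Clinician reviews/edits the scores, then execution resumes from the checkpoint:
graph.update_state(thread, {"scores": reviewed_scores})
graph.invoke(None, config=thread)             # None input = resume, runs Finalize
```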

Acceptance Criteria

Functional Requirements

  • Stateful Workflow: System successfully transitions Ingest -> Assess -> Persist using LangGraph.
  • Multi-Stage Scoring: Each PHQ-8 item is scored using the Extract -> Map -> Score pattern.
  • Model Swapping: The exact same graph can be run with gpt-4 (Tier 3) and llama3 (Tier 1) by changing only the run config.
  • Benchmarking: Automatically output a CSV comparing Model_Score vs Human_Label for all 8 items.
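
A possible shape for that output step, which also covers the Phase 4 Item-Level Agreement metrics. pandas and scikit-learn are assumptions, not confirmed dependencies; the column names follow the requirement above.

```python
# Sketch: write the Model_Score vs Human_Label comparison and report agreement.
# pandas/scikit-learn are assumed dependencies, not confirmed ones.
import pandas as pd
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def write_benchmark(rows: list[dict], path: str = "benchmark.csv") -> None:
    df = pd.DataFrame(rows, columns=["participant", "item", "Model_Score", "Human_Label"])
    df.to_csv(path, index=False)
    print("MAE:  ", mean_absolute_error(df["Human_Label"], df["Model_Score"]))
    print("Kappa:", cohen_kappa_score(df["Human_Label"], df["Model_Score"], weights="linear"))
```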

Non-Functional Requirements

  • Privacy: Tier 1 execution sends ZERO bytes to external APIs.
  • Reproducibility: Every run logs the exact prompts used and model version to MongoDB.

Dependencies & Risks

  • Risk: Local models (Tier 1) may hallucinate formatting in the "Map" stage.
    • Mitigation: Use instructor or constrained decoding (JSON mode) for Tier 1 (see the sketch after this list).
  • Dependency: Requires DAIC-WOZ dataset (assumed available locally or mocked).
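
To illustrate the Map-stage mitigation: a sketch of constraining the output with instructor plus Pydantic over an OpenAI-compatible local endpoint. The endpoint URL, model name, prompt variable, and schema fields are assumptions.

```python
# Sketch: constrained Map-stage output via instructor + Pydantic (Tier 1 mitigation).
# Endpoint, model name, map_prompt, and schema fields are illustrative assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class CriteriaMapping(BaseModel):
    item: int           # PHQ-8 item index (0-7)
    supported: bool     # does the extracted evidence meet the criterion?
    rationale: str

# Ollama (and vLLM) expose an OpenAI-compatible API, so the same wrapper works locally.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)

mapping = client.chat.completions.create(
    model="llama3",
    response_model=CriteriaMapping,   # instructor retries until the output validates
    messages=[{"role": "user", "content": map_prompt}],
)
```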

References