Plan: Modular Agentic Framework for Clinical Assessment (Helia)

Overview

Implement a production-grade, privacy-first Agentic Framework using LangGraph to automate PHQ-8 clinical assessments. The system allows dynamic switching between Local (Tier 1), Self-Hosted (Tier 2), and Cloud (Tier 3) models to benchmark performance.

Problem Statement

The current system relies on a monolithic script (src/helia/agent/workflow.py is a placeholder) and single-pass evaluation logic that likely underperforms on smaller local models. To prove the thesis hypothesis (that local models can match cloud performance), we need a Stateful Architecture that implements Multi-Stage Reasoning (the "RISEN" pattern) and robust Human-in-the-Loop (HITL) workflows.

Proposed Solution

A Hierarchical Agent Supervisor architecture built with LangGraph:

  1. Supervisor: Orchestrates the workflow and manages state.
  2. Assessment Agent: Implements the "RISEN" (Reasoning Improvement via Stage-wise Evaluation Network) pattern:
    • Extract: Quote relevant patient text.
    • Map: Align quotes to PHQ-8 criteria.
    • Score: Assign 0-3 value.
  3. Ingestion: Standardizes data from S3/Local into a ClinicalState.
  4. Benchmarking: Automates the comparison between Generated Scores vs. Ground Truth (DAIC-WOZ labels).

Note: A dedicated Safety Guardrail agent has been designed but is scoped out of this MVP. See plans/safety-guardrail-architecture.md for details.

Technical Approach

Architecture: The "Helia Graph"

```mermaid
graph TD
    Start --> Ingestion
    Ingestion --> Router{Router}

    subgraph "Assessment Agent (RISEN)"
        Router --> Extract[Extract Evidence]
        Extract --> Map[Map to Criteria]
        Map --> Score[Score Item]
        Score --> NextItem{Next Item?}
        NextItem -- Yes --> Extract
    end

    NextItem -- No --> HumanReview["Human Review (HITL)"]
    HumanReview --> Finalize[Finalize & Persist]
```
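
A minimal wiring sketch of this graph in LangGraph is shown below. ClinicalState and the node functions are the deliverables of Phases 1, 2, and 4; the router helper and the folding of the diagram's Router node into a direct edge are illustrative assumptions, not the final design.

```python
# Sketch only: wiring the Helia Graph in LangGraph.
# ClinicalState and the node functions are assumed from the phases below;
# the diagram's Router node is folded into a plain edge here for brevity.
from langgraph.graph import StateGraph, START, END

def route_next_item(state: ClinicalState) -> str:
    # Loop back to Extract until all 8 PHQ-8 items have been scored.
    return "extract" if state.current_item < 8 else "human_review"

builder = StateGraph(ClinicalState)
builder.add_node("ingestion", ingestion_node)
builder.add_node("extract", extract_node)
builder.add_node("map", map_node)
builder.add_node("score", score_node)
builder.add_node("human_review", human_review_node)
builder.add_node("finalize", finalize_node)

builder.add_edge(START, "ingestion")
builder.add_edge("ingestion", "extract")
builder.add_edge("extract", "map")
builder.add_edge("map", "score")
builder.add_conditional_edges("score", route_next_item)  # "Next Item?" decision
builder.add_edge("human_review", "finalize")
builder.add_edge("finalize", END)

graph = builder.compile()
```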

Implementation Phases

Phase 1: Core Graph & State Management (Foundation)

  • Goal: Establish the LangGraph structure and Pydantic State.
  • Deliverables:
    • src/helia/agent/state.py: Define ClinicalState (transcript, current_item, scores); see the sketch after this list.
    • src/helia/agent/graph.py: Define the main StateGraph with Ingestion -> Assessment -> Persistence nodes.
    • src/helia/ingestion/loader.py: Add "Ground Truth" loading for DAIC-WOZ labels (critical for benchmarking).
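
A rough sketch of what state.py could define, assuming a Pydantic model. Only transcript, current_item, and scores come from the deliverable above; the remaining fields are illustrative working state.

```python
# Sketch of src/helia/agent/state.py. Fields beyond transcript/current_item/scores
# are illustrative working state, not confirmed deliverables.
from pydantic import BaseModel, Field

class ClinicalState(BaseModel):
    transcript: str                                        # ingested interview text
    current_item: int = 0                                  # PHQ-8 item being assessed (0-7)
    evidence: list[str] = Field(default_factory=list)      # Extract-stage quotes for the current item
    criteria_mapping: str | None = None                    # Map-stage output for the current item
    scores: dict[int, int] = Field(default_factory=dict)   # item index -> 0-3 score
    ground_truth: dict[int, int] | None = None             # DAIC-WOZ labels for benchmarking
```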

Phase 2: The "RISEN" Assessment Logic

  • Goal: Replace monolithic PHQ8Evaluator with granular nodes.
  • Deliverables:
    • src/helia/agent/nodes/assessment.py: Implement extract_node, map_node, score_node (sketched below).
    • src/helia/prompts/: Create specialized prompt templates for each stage (optimized for Llama 3).
    • Refactor: Update PHQ8Evaluator to be callable as a tool/node rather than a standalone class.
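
As a hedged illustration of the intended granularity in assessment.py: `call_llm` and the *_PROMPT templates are placeholders for the tier-resolved model client and the Phase 2 prompt files, and each node returns a partial ClinicalState update.

```python
# Sketch of src/helia/agent/nodes/assessment.py. `call_llm` and the *_PROMPT
# templates are placeholders; each node returns a partial state update.
def extract_node(state: ClinicalState) -> dict:
    # Assumed to return a parsed list of verbatim quotes for the current item.
    quotes = call_llm(EXTRACT_PROMPT.format(
        transcript=state.transcript, item=state.current_item))
    return {"evidence": quotes}

def map_node(state: ClinicalState) -> dict:
    mapping = call_llm(MAP_PROMPT.format(
        evidence=state.evidence, item=state.current_item))
    return {"criteria_mapping": mapping}

def score_node(state: ClinicalState) -> dict:
    score = call_llm(SCORE_PROMPT.format(mapping=state.criteria_mapping))
    return {
        "scores": {**state.scores, state.current_item: int(score)},
        "current_item": state.current_item + 1,
        "evidence": [],               # reset working fields for the next item
        "criteria_mapping": None,
    }
```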

Phase 3: Tier Switching & Execution

  • Goal: Implement dynamic model configuration so the same graph can run on any tier.
  • Deliverables:
    • src/helia/configuration.py: Ensure RunConfig (Tier 1/2/3) propagates to LangGraph's configurable params (see the sketch below).
    • src/helia/agent/runner.py: CLI entry point to run batch benchmarks.
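
One plausible shape for the tier switch, assuming the runner passes the tier through LangGraph's per-invocation config. The tier names and model identifiers below are placeholders for whatever RunConfig actually defines.

```python
# Sketch: resolving a chat model from LangGraph's per-run configurable params.
# Tier names and model identifiers are placeholders for the real RunConfig mapping.
from langchain_core.runnables import RunnableConfig

TIER_MODELS = {
    "tier1": "llama3",      # local (e.g. Ollama)
    "tier2": "llama3:70b",  # self-hosted
    "tier3": "gpt-4",       # cloud
}

def resolve_model(config: RunnableConfig) -> str:
    tier = config.get("configurable", {}).get("tier", "tier1")
    return TIER_MODELS[tier]

# The runner would then invoke the same compiled graph, changing only the config:
# graph.invoke(initial_state, config={"configurable": {"tier": "tier3"}})
```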

Phase 4: Human-in-the-Loop & Persistence

  • Goal: Enable clinician review and data saving.
  • Deliverables:
    • Checkpointing: Configure MongoDB/Postgres checkpointer for LangGraph.
    • Review Flow: Implement the interrupt_before logic for the "Finalize" node (sketched after this list).
    • Metrics: Calculate "Item-Level Agreement" (MAE/Kappa) between Agent and Ground Truth.
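
A minimal sketch of the review flow, using LangGraph's in-memory checkpointer as a stand-in for the MongoDB/Postgres one; the thread id and the reviewed_scores update are illustrative.

```python
# Sketch: pause before Finalize so a clinician can review, then resume.
# MemorySaver stands in for the MongoDB/Postgres checkpointer planned here.
from langgraph.checkpoint.memory import MemorySaver

graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["finalize"])

thread = {"configurable": {"thread_id": "participant-0042"}}   # illustrative id
graph.invoke(initial_state, config=thread)    # runs Ingest -> Assess, then pauses

# Clinician reviews/edits the scores, then execution resumes from the checkpoint:
graph.update_state(thread, {"scores": reviewed_scores})
graph.invoke(None, config=thread)             # None input = resume, runs Finalize
```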

Acceptance Criteria

Functional Requirements

  • Stateful Workflow: System successfully transitions Ingest -> Assess -> Persist using LangGraph.
  • Multi-Stage Scoring: Each PHQ-8 item is scored using the Extract -> Map -> Score pattern.
  • Model Swapping: The exact same graph can be run with gpt-4 (Tier 3) and llama3 (Tier 1) by changing only the run config.
  • Benchmarking: Automatically output a CSV comparing Model_Score vs Human_Label for all 8 items.
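
A possible shape for that output step, which also covers the Phase 4 Item-Level Agreement metrics. pandas and scikit-learn are assumptions, not confirmed dependencies; the column names follow the requirement above.

```python
# Sketch: write the Model_Score vs Human_Label comparison and report agreement.
# pandas/scikit-learn are assumed dependencies, not confirmed ones.
import pandas as pd
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def write_benchmark(rows: list[dict], path: str = "benchmark.csv") -> None:
    df = pd.DataFrame(rows, columns=["participant", "item", "Model_Score", "Human_Label"])
    df.to_csv(path, index=False)
    print("MAE:  ", mean_absolute_error(df["Human_Label"], df["Model_Score"]))
    print("Kappa:", cohen_kappa_score(df["Human_Label"], df["Model_Score"], weights="linear"))
```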

Non-Functional Requirements

  • Privacy: Tier 1 execution sends ZERO bytes to external APIs.
  • Reproducibility: Every run logs the exact prompts used and model version to MongoDB.

Dependencies & Risks

  • Risk: Local models (Tier 1) may hallucinate formatting in the "Map" stage.
    • Mitigation: Use instructor or constrained decoding (JSON mode) for Tier 1 (see the sketch after this list).
  • Dependency: Requires DAIC-WOZ dataset (assumed available locally or mocked).
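
To illustrate the Map-stage mitigation: a sketch of constraining the output with instructor plus Pydantic over an OpenAI-compatible local endpoint. The endpoint URL, model name, prompt variable, and schema fields are assumptions.

```python
# Sketch: constrained Map-stage output via instructor + Pydantic (Tier 1 mitigation).
# Endpoint, model name, map_prompt, and schema fields are illustrative assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class CriteriaMapping(BaseModel):
    item: int           # PHQ-8 item index (0-7)
    supported: bool     # does the extracted evidence meet the criterion?
    rationale: str

# Ollama (and vLLM) expose an OpenAI-compatible API, so the same wrapper works locally.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)

mapping = client.chat.completions.create(
    model="llama3",
    response_model=CriteriaMapping,   # instructor retries until the output validates
    messages=[{"role": "user", "content": map_prompt}],
)
```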

References