Plan: Modular Agentic Framework for Clinical Assessment (Helia)
Overview
Implement a production-grade, privacy-first Agentic Framework using LangGraph to automate PHQ-8 clinical assessments. The system allows dynamic switching between Local (Tier 1), Self-Hosted (Tier 2), and Cloud (Tier 3) models to benchmark performance.
Problem Statement
The current system relies on a monolithic script (src/helia/agent/workflow.py is only a placeholder) and on single-pass evaluation logic that likely underperforms on smaller local models. To test the thesis hypothesis that local models can match cloud performance, we need a stateful architecture that implements multi-stage reasoning (the "RISEN" pattern) and robust Human-in-the-Loop (HITL) workflows.
Proposed Solution
A Hierarchical Agent Supervisor architecture built with LangGraph:
- Supervisor: Orchestrates the workflow and manages state.
- Assessment Agent: Implements the "RISEN" (Reasoning Improvement via Stage-wise Evaluation Network) pattern:
  - Extract: Quote relevant patient text.
  - Map: Align quotes to PHQ-8 criteria.
  - Score: Assign a 0-3 value.
- Ingestion: Standardizes data from MongoDB into a ClinicalState.
- Benchmarking: Automates the comparison of Generated Scores against Ground Truth (DAIC-WOZ labels).
Note: A dedicated Safety Guardrail agent has been designed but is scoped out of this MVP. See plans/safety-guardrail-architecture.md for details.
Technical Approach
Architecture: The "Helia Graph"
graph TD
Start --> Ingestion
Ingestion --> Router{Router}
subgraph "Assessment Agent (RISEN)"
Router --> Extract[Extract Evidence]
Extract --> Map[Map to Criteria]
Map --> Score[Score Item]
Score --> NextItem{Next Item?}
NextItem -- Yes --> Extract
end
NextItem -- No --> HumanReview["Human Review (HITL)"]
HumanReview --> Finalize[Finalize & Persist]
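The control flow in the diagram can be traced with a small plain-Python sketch. It is not LangGraph code: the function name run_helia_graph, the trace labels, and the placeholder score of 0 are assumptions made only to show the order in which nodes would be visited (one Extract -> Map -> Score pass per PHQ-8 item, then review, then finalize).

```python
from typing import Callable, Dict, List, Tuple

PHQ8_ITEMS = 8  # the PHQ-8 questionnaire has eight items

Scores = Dict[int, int]


def run_helia_graph(review: Callable[[Scores], Scores]) -> Tuple[List[str], Scores]:
    """Visit nodes in the order drawn in the diagram; return (trace, final scores)."""
    trace = ["Start", "Ingestion", "Router"]
    scores: Scores = {}
    item = 1
    while True:
        # Assessment subgraph: one Extract -> Map -> Score pass per item.
        trace += [f"Extract:{item}", f"Map:{item}", f"Score:{item}"]
        scores[item] = 0  # placeholder; a real Score node would call a model
        if item == PHQ8_ITEMS:  # "Next Item?" answers No
            break
        item += 1  # "Next Item?" answers Yes, loop back to Extract
    trace.append("HumanReview")
    final_scores = review(scores)  # reviewer may correct scores before persisting
    trace.append("Finalize")
    return trace, final_scores
```

Passing the reviewer in as a callable keeps the human step explicit: whatever the reviewer returns is what reaches Finalize.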
Implementation Phases
Phase 1: Core Graph & State Management (Foundation)
- Goal: Establish the LangGraph structure and Pydantic State.
- Deliverables:
  - src/helia/agent/state.py: Define ClinicalState (transcript, current_item, scores).
  - src/helia/agent/graph.py: Define the main StateGraph with Ingestion -> Assessment -> Persistence nodes.
  - src/helia/ingestion/loader.py: Refactor to load Transcript documents from MongoDB.
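As a rough illustration of the state object, the sketch below models the three fields named above. The plan calls for a Pydantic model; a stdlib dataclass is used here as a stand-in so the example runs without dependencies. The helper names record_score, is_complete, and total are assumptions, not part of the plan.

```python
from dataclasses import dataclass, field
from typing import Dict

PHQ8_ITEMS = 8  # PHQ-8 has eight items, each scored 0-3


@dataclass
class ClinicalState:
    """Stand-in for the planned state: transcript, current_item, scores."""

    transcript: str
    current_item: int = 1  # 1-based index of the item being assessed
    scores: Dict[int, int] = field(default_factory=dict)

    def record_score(self, value: int) -> None:
        """Store the score for the current item and advance to the next one."""
        if not 0 <= value <= 3:
            raise ValueError(f"PHQ-8 item scores must be 0-3, got {value}")
        self.scores[self.current_item] = value
        self.current_item += 1

    @property
    def is_complete(self) -> bool:
        """True once all eight items have a score."""
        return len(self.scores) == PHQ8_ITEMS

    @property
    def total(self) -> int:
        """Sum of the recorded item scores (0-24 when complete)."""
        return sum(self.scores.values())
```

Validating the 0-3 range at the point of writing means a malformed model output fails early rather than silently skewing the benchmark.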
Phase 2: The "RISEN" Assessment Logic
- Goal: Replace monolithic PHQ8Evaluator with granular nodes.
- Deliverables:
  - src/helia/agent/nodes/assessment.py: Implement extract_node, map_node, score_node that fetch prompts from DB.
  - migrations/init_risen_prompts.py: Database migration to seed the Extract/Map/Score prompts.
  - Refactor: Update PHQ8Evaluator to be callable as a tool/node rather than a standalone class.
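One possible shape for the three nodes is sketched below. Each node takes the current state and returns only the keys it changes, which is the usual convention for graph nodes. Everything specific here is an assumption: the PROMPTS dict stands in for the prompt collection the migration would seed, the prompt wording is invented, state is a plain dict rather than ClinicalState, and the model is injected as a simple callable so the example runs without any LLM.

```python
from typing import Any, Callable, Dict

LLM = Callable[[str], str]  # hypothetical: prompt text in, completion text out
State = Dict[str, Any]

# Hypothetical stand-in for prompts fetched from the database.
PROMPTS = {
    "extract": "Quote the patient text relevant to PHQ-8 item {item}:\n{transcript}",
    "map": "Align these quotes to the criteria of PHQ-8 item {item}:\n{evidence}",
    "score": "Given this mapping, answer with a single digit 0-3:\n{mapping}",
}


def extract_node(state: State, llm: LLM) -> State:
    prompt = PROMPTS["extract"].format(
        item=state["current_item"], transcript=state["transcript"]
    )
    return {"evidence": llm(prompt)}


def map_node(state: State, llm: LLM) -> State:
    prompt = PROMPTS["map"].format(
        item=state["current_item"], evidence=state["evidence"]
    )
    return {"mapping": llm(prompt)}


def score_node(state: State, llm: LLM) -> State:
    raw = llm(PROMPTS["score"].format(mapping=state["mapping"])).strip()
    value = int(raw)  # raises ValueError if the model did not return a number
    if not 0 <= value <= 3:
        raise ValueError(f"score out of range: {value}")
    scores = dict(state.get("scores", {}))  # copy so the input state is not mutated
    scores[state["current_item"]] = value
    return {"scores": scores, "current_item": state["current_item"] + 1}


def assess_item(state: State, llm: LLM) -> State:
    """Run Extract -> Map -> Score once and merge each node's update."""
    for node in (extract_node, map_node, score_node):
        state = {**state, **node(state, llm)}
    return state
```

Because the model is a parameter, the same three functions can be exercised with a stub in tests and with any tier's client at run time.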
Phase 3: Tier Switching & Execution
- Goal: Implement dynamic model config.
- Deliverables:
  - src/helia/configuration.py: Ensure RunConfig (Tier 1/2/3) propagates to LangGraph configurable params.
  - src/helia/agent/runner.py: CLI entry point to run batch benchmarks using MongoDB transcripts.
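Tier switching could look roughly like the sketch below, which turns a tier selection into the {"configurable": ...} mapping passed to a graph at invocation time. The RunConfig fields, the Tier enum, and all URLs are placeholders invented for illustration; the real RunConfig in src/helia/configuration.py may be shaped differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class Tier(Enum):
    LOCAL = 1        # Tier 1
    SELF_HOSTED = 2  # Tier 2
    CLOUD = 3        # Tier 3


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical fields; not the project's actual RunConfig."""

    tier: Tier
    model_name: str
    base_url: str


# Placeholder endpoints, assumed for illustration only.
TIER_DEFAULTS: Dict[Tier, RunConfig] = {
    Tier.LOCAL: RunConfig(Tier.LOCAL, "llama3", "http://localhost:11434"),
    Tier.SELF_HOSTED: RunConfig(Tier.SELF_HOSTED, "llama3", "https://llm.example.internal"),
    Tier.CLOUD: RunConfig(Tier.CLOUD, "gpt-4", "https://api.example.com"),
}


def to_graph_config(run: RunConfig) -> Dict[str, Dict[str, Any]]:
    """Build the per-run config mapping; nodes would read values from 'configurable'."""
    return {
        "configurable": {
            "tier": run.tier.value,
            "model_name": run.model_name,
            "base_url": run.base_url,
        }
    }
```

With this shape, swapping tiers changes only the argument to to_graph_config; the graph definition itself stays untouched.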
Phase 4: Human-in-the-Loop & Persistence
- Goal: Enable clinician review and data saving.
- Deliverables:
  - Checkpointing: Configure MongoDB/Postgres checkpointer for LangGraph.
  - Review Flow: Implement the interrupt_before logic for the "Finalize" node.
  - Metrics: Calculate "Item-Level Agreement" (MAE/Kappa) between Agent and Ground Truth.
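The two agreement metrics can be computed with the standard library alone. The sketch below assumes unweighted Cohen's kappa, since the plan does not say whether a weighted variant is intended for the ordinal 0-3 scale; the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def mean_absolute_error(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Average absolute difference between agent scores and ground truth."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def cohens_kappa(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    pred_counts, truth_counts = Counter(pred), Counter(truth)
    labels = set(pred_counts) | set(truth_counts)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(pred_counts[c] * truth_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, with agent scores [0, 1, 2, 3, 1, 0, 2, 3] and labels [0, 1, 2, 2, 1, 0, 3, 3], six of eight items match: MAE is 0.25, observed agreement is 0.75, chance agreement is 0.25, and kappa is (0.75 - 0.25) / (1 - 0.25) = 2/3.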
Acceptance Criteria
Functional Requirements
- Stateful Workflow: System successfully transitions Ingest -> Assess -> Persist using LangGraph.
- Multi-Stage Scoring: Each PHQ-8 item is scored using the Extract -> Map -> Score pattern.
- Model Swapping: Can run the exact same graph with gpt-4 (Tier 3) and llama3 (Tier 1) just by changing config.
- Benchmarking: Automatically output a CSV comparing Model_Score vs Human_Label for all 8 items.
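The benchmarking output could be produced along these lines. Only the Model_Score and Human_Label column names come from the requirement above; the participant_id and item columns, the long (one row per item) layout, and the function name are assumptions.

```python
import csv
import io
from typing import Dict, List

PHQ8_ITEMS = 8


def benchmark_csv(
    model_scores: Dict[str, List[int]], human_labels: Dict[str, List[int]]
) -> str:
    """Return CSV text with one row per (participant, PHQ-8 item)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["participant_id", "item", "Model_Score", "Human_Label"])
    for pid in sorted(model_scores):
        preds, labels = model_scores[pid], human_labels[pid]
        if len(preds) != PHQ8_ITEMS or len(labels) != PHQ8_ITEMS:
            raise ValueError(f"{pid}: expected {PHQ8_ITEMS} scores and labels")
        for item, (pred, label) in enumerate(zip(preds, labels), start=1):
            writer.writerow([pid, item, pred, label])
    return buffer.getvalue()
```

Returning a string rather than writing to disk keeps the function easy to test; the runner can decide where the file goes.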
Non-Functional Requirements
- Privacy: Tier 1 execution sends ZERO bytes to external APIs.
- Reproducibility: Every run logs the exact prompts used and model version to MongoDB.
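One inexpensive safeguard for the privacy requirement is to refuse to start a Tier 1 run unless the configured model endpoint resolves to the local machine. The sketch below is a configuration-level check only, with hypothetical function names; it does not by itself prove that no bytes leave the host, which would need network-level verification.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """True if the URL's host is 'localhost' or a loopback IP address."""
    host = urlparse(url).hostname  # None when the URL has no scheme/netloc
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a non-local hostname, not an IP literal


def assert_tier1_is_local(base_url: str) -> None:
    """Fail fast if a Tier 1 run is pointed at a non-local endpoint."""
    if not is_loopback_url(base_url):
        raise RuntimeError(f"Tier 1 endpoint is not local: {base_url}")
```

Calling assert_tier1_is_local before the graph is built turns a misconfigured endpoint into an immediate error instead of a silent data leak.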
Dependencies & Risks
- Risk: Local models (Tier 1) may hallucinate formatting in the "Map" stage.
  - Mitigation: Use instructor or constrained decoding (JSON mode) for Tier 1.
- Dependency: Requires DAIC-WOZ dataset (loaded in MongoDB).
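To make the formatting risk concrete, the sketch below shows a validate-and-retry loop around the "Map" stage output. It is a stdlib approximation of the idea, not the instructor API and not constrained decoding. The required keys (item, quotes, criterion) are a hypothetical schema chosen for the example.

```python
import json
from typing import Any, Callable, Dict, Optional

REQUIRED_KEYS = {"item", "quotes", "criterion"}  # hypothetical Map-stage schema


def parse_map_output(raw: str) -> Dict[str, Any]:
    """Parse and validate one model reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    quotes = data["quotes"]
    if not isinstance(quotes, list) or not all(isinstance(q, str) for q in quotes):
        raise ValueError("'quotes' must be a list of strings")
    return data


def map_with_retry(
    llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Dict[str, Any]:
    """Call the model until its reply validates, up to max_attempts times."""
    last_error: Optional[ValueError] = None
    for _ in range(max_attempts):
        try:
            return parse_map_output(llm(prompt))
        except ValueError as exc:
            last_error = exc  # remember why the reply was rejected
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Retrying helps with occasional malformed replies; a model that rarely produces valid JSON would still need constrained decoding, as the mitigation suggests.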
References
- LangGraph: State Management
- Clinical Best Practice: RISEN Framework (2025)
- Project Config: src/helia/configuration.py