# Plan: Modular Agentic Framework for Clinical Assessment (Helia)

## Overview

Implement a production-grade, privacy-first agentic framework using LangGraph to automate PHQ-8 clinical assessments. The system allows dynamic switching between Local (Tier 1), Self-Hosted (Tier 2), and Cloud (Tier 3) models to benchmark performance.

## Problem Statement

The current system relies on a monolithic script (`src/helia/agent/workflow.py` is a placeholder) and single-pass evaluation logic that likely underperforms on smaller local models. To test the thesis hypothesis (that local models can match cloud performance), we need a **stateful architecture** that implements multi-stage reasoning (the "RISEN" pattern) and robust Human-in-the-Loop (HITL) workflows.

## Proposed Solution

A **Hierarchical Agent Supervisor** architecture built with **LangGraph**:

1. **Supervisor**: Orchestrates the workflow and manages state.
2. **Assessment Agent**: Implements the "RISEN" (Reasoning Improvement via Stage-wise Evaluation Network) pattern:
   * **Extract**: Quote relevant patient text.
   * **Map**: Align quotes to PHQ-8 criteria.
   * **Score**: Assign a 0-3 value.
3. **Ingestion**: Standardizes data from MongoDB into a `ClinicalState`.
4. **Benchmarking**: Automates the comparison of generated scores vs. ground truth (DAIC-WOZ labels).

**Note:** A dedicated **Safety Guardrail** agent has been designed but is scoped out of this MVP. See `plans/safety-guardrail-architecture.md` for details.
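The Extract -> Map -> Score loop over a shared state can be sketched as follows. This is a minimal, framework-agnostic sketch: in the plan each stage is a LangGraph node that calls an LLM with a prompt fetched from MongoDB, so the stage functions here are stubs, and all field, item, and function names are illustrative assumptions (the plan also specifies Pydantic for the state; stdlib dataclasses keep the sketch dependency-free).

```python
from dataclasses import dataclass, field

# Assumed PHQ-8 item keys; the real system would load these from config.
PHQ8_ITEMS = ["interest", "depressed", "sleep", "tired",
              "appetite", "failure", "concentrating", "moving"]


@dataclass
class ClinicalState:
    """Hypothetical shape of the state carried through the graph."""
    transcript: str = ""
    current_item: int = 0                        # index into PHQ8_ITEMS
    scores: dict = field(default_factory=dict)   # item key -> 0-3 score


def extract_node(state: ClinicalState) -> list:
    """Extract: quote patient text relevant to the current item (stub)."""
    item = PHQ8_ITEMS[state.current_item]
    return [line for line in state.transcript.splitlines()
            if item in line.lower()]


def map_node(quotes: list, item: str) -> str:
    """Map: align the quotes to the PHQ-8 criterion (stub rationale)."""
    return f"{len(quotes)} quote(s) mapped to '{item}'"


def score_node(quotes: list) -> int:
    """Score: assign a 0-3 severity value (stub heuristic)."""
    return min(len(quotes), 3)


def run_assessment(state: ClinicalState) -> ClinicalState:
    """The per-item loop ('Next Item?' conditional edge in the graph)."""
    while state.current_item < len(PHQ8_ITEMS):
        item = PHQ8_ITEMS[state.current_item]
        quotes = extract_node(state)
        _rationale = map_node(quotes, item)      # kept for auditability
        state.scores[item] = score_node(quotes)
        state.current_item += 1
    return state
```

In the real graph the loop is expressed as a conditional edge back to the Extract node rather than a `while` loop, and the state is checkpointed between steps.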
## Technical Approach

### Architecture: The "Helia Graph"

```mermaid
graph TD
    Start --> Ingestion
    Ingestion --> Router{Router}
    subgraph "Assessment Agent (RISEN)"
        Router --> Extract[Extract Evidence]
        Extract --> Map[Map to Criteria]
        Map --> Score[Score Item]
        Score --> NextItem{Next Item?}
        NextItem -- Yes --> Extract
    end
    NextItem -- No --> HumanReview["Human Review (HITL)"]
    HumanReview --> Finalize[Finalize & Persist]
```

### Implementation Phases

#### Phase 1: Core Graph & State Management (Foundation)

* **Goal**: Establish the LangGraph structure and Pydantic state.
* **Deliverables**:
  * `src/helia/agent/state.py`: Define `ClinicalState` (transcript, current_item, scores).
  * `src/helia/agent/graph.py`: Define the main `StateGraph` with Ingestion -> Assessment -> Persistence nodes.
  * `src/helia/ingestion/loader.py`: Refactor to load transcript documents from MongoDB.

#### Phase 2: The "RISEN" Assessment Logic

* **Goal**: Replace the monolithic `PHQ8Evaluator` with granular nodes.
* **Deliverables**:
  * `src/helia/agent/nodes/assessment.py`: Implement `extract_node`, `map_node`, and `score_node`, which fetch their prompts from the DB.
  * `migrations/init_risen_prompts.py`: Database migration to seed the Extract/Map/Score prompts.
* **Refactor**: Update `PHQ8Evaluator` to be callable as a tool/node rather than a standalone class.

#### Phase 3: Tier Switching & Execution

* **Goal**: Implement dynamic model configuration.
* **Deliverables**:
  * `src/helia/configuration.py`: Ensure `RunConfig` (Tier 1/2/3) propagates to LangGraph `configurable` params.
  * `src/helia/agent/runner.py`: CLI entry point to run batch benchmarks using MongoDB transcripts.

#### Phase 4: Human-in-the-Loop & Persistence

* **Goal**: Enable clinician review and data persistence.
* **Deliverables**:
  * **Checkpointing**: Configure a MongoDB/Postgres checkpointer for LangGraph.
  * **Review Flow**: Implement `interrupt_before` logic for the "Finalize" node.
  * **Metrics**: Calculate item-level agreement (MAE/Kappa) between agent scores and ground truth.

## Acceptance Criteria

### Functional Requirements

- [ ] **Stateful Workflow**: The system successfully transitions Ingest -> Assess -> Persist using LangGraph.
- [ ] **Multi-Stage Scoring**: Each PHQ-8 item is scored using the Extract -> Map -> Score pattern.
- [ ] **Model Swapping**: Can run the *exact same graph* with `gpt-4` (Tier 3) and `llama3` (Tier 1) just by changing config.
- [ ] **Benchmarking**: Automatically output a CSV comparing `Model_Score` vs. `Human_Label` for all 8 items.

### Non-Functional Requirements

- [ ] **Privacy**: Tier 1 execution sends zero bytes to external APIs.
- [ ] **Reproducibility**: Every run logs the exact prompts used and the model version to MongoDB.

## Dependencies & Risks

- **Risk**: Local models (Tier 1) may hallucinate formatting in the "Map" stage.
  * *Mitigation*: Use `instructor` or constrained decoding (JSON mode) for Tier 1.
- **Dependency**: Requires the DAIC-WOZ dataset (loaded in MongoDB).

## References

- **LangGraph**: [State Management](https://langchain-ai.github.io/langgraph/concepts/high_level/#state)
- **Clinical Best Practice**: [RISEN Framework (2025)](https://pubmed.ncbi.nlm.nih.gov/40720397/)
- **Project Config**: `src/helia/configuration.py`
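As a concrete reference for the item-level agreement metrics named in Phase 4 and the acceptance criteria, here is a minimal pure-Python sketch of MAE and Cohen's kappa. Function names are illustrative; in practice `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.cohen_kappa_score` could be used instead.

```python
def mean_absolute_error(pred: list, truth: list) -> float:
    """Mean absolute difference between agent scores and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def cohens_kappa(pred: list, truth: list) -> float:
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    labels = sorted(set(pred) | set(truth))
    n = len(truth)
    # Observed agreement: fraction of exact score matches.
    p_o = sum(p == t for p, t in zip(pred, truth)) / n
    # Expected agreement under independent marginal label distributions.
    p_e = sum((pred.count(k) / n) * (truth.count(k) / n) for k in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)
```

Unweighted kappa is shown for brevity; since PHQ-8 scores are ordinal (0-3), a weighted kappa may be a fairer agreement measure in the final benchmark.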