# Plan: Modular Agentic Framework for Clinical Assessment (Helia)

## Overview

Implement a production-grade, privacy-first Agentic Framework using LangGraph to automate PHQ-8 clinical assessments. The system allows dynamic switching between Local (Tier 1), Self-Hosted (Tier 2), and Cloud (Tier 3) models to benchmark performance.

## Problem Statement

The current system relies on a monolithic script (`src/helia/agent/workflow.py` is a placeholder) and single-pass evaluation logic that likely underperforms on smaller local models. To test the thesis hypothesis (that local models can match cloud performance), we need a sophisticated **Stateful Architecture** that implements Multi-Stage Reasoning (the "RISEN" pattern) and a robust Human-in-the-Loop (HITL) workflow.
## Proposed Solution

A **Hierarchical Agent Supervisor** architecture built with **LangGraph**:

1. **Supervisor**: Orchestrates the workflow and manages state.
2. **Assessment Agent**: Implements the "RISEN" (Reasoning Improvement via Stage-wise Evaluation Network) pattern:
    * **Extract**: Quote relevant patient text.
    * **Map**: Align the quotes to PHQ-8 criteria.
    * **Score**: Assign a 0-3 severity value.
3. **Ingestion**: Standardizes data from MongoDB into a `ClinicalState`.
4. **Benchmarking**: Automates the comparison of generated scores against ground truth (DAIC-WOZ labels).

**Note:** A dedicated **Safety Guardrail** agent has been designed but is scoped out of this MVP. See `plans/safety-guardrail-architecture.md` for details.
## Technical Approach

### Architecture: The "Helia Graph"

```mermaid
graph TD
    Start --> Ingestion
    Ingestion --> Router{Router}

    subgraph "Assessment Agent (RISEN)"
        Router --> Extract[Extract Evidence]
        Extract --> Map[Map to Criteria]
        Map --> Score[Score Item]
        Score --> NextItem{Next Item?}
        NextItem -- Yes --> Extract
    end

    NextItem -- No --> HumanReview["Human Review (HITL)"]
    HumanReview --> Finalize[Finalize & Persist]
```
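The diagram translates almost mechanically into LangGraph. Below is a minimal, runnable wiring sketch with stubbed node bodies; the state fields and node names are illustrative assumptions pending Phase 1, and the diagram's Router is collapsed into a direct edge for brevity.

```python
# Minimal wiring sketch of the Helia Graph above. Node bodies are stubs;
# state fields and node names are illustrative assumptions, not final APIs.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class ClinicalState(TypedDict):
    transcript: str
    current_item: int        # PHQ-8 item (1-8) currently being assessed
    scores: dict[int, int]   # item -> 0-3 score


def passthrough(name: str):
    """Build a stub node that just announces itself."""
    def node(state: ClinicalState) -> dict:
        print(f"[{name}] item {state['current_item']}")
        return {}
    return node


def score(state: ClinicalState) -> dict:
    """Stub Score stage: record a dummy score and advance to the next item."""
    return {
        "scores": {**state["scores"], state["current_item"]: 0},
        "current_item": state["current_item"] + 1,
    }


def next_item(state: ClinicalState) -> str:
    """The 'Next Item?' router: loop until all 8 PHQ-8 items are scored."""
    return "extract" if state["current_item"] <= 8 else "human_review"


builder = StateGraph(ClinicalState)
for name in ("ingestion", "extract", "map", "human_review", "finalize"):
    builder.add_node(name, passthrough(name))
builder.add_node("score", score)

builder.add_edge(START, "ingestion")
builder.add_edge("ingestion", "extract")
builder.add_edge("extract", "map")
builder.add_edge("map", "score")
builder.add_conditional_edges("score", next_item)  # Yes -> extract, No -> human_review
builder.add_edge("human_review", "finalize")
builder.add_edge("finalize", END)
graph = builder.compile()

# 8 items x 3 stages exceeds LangGraph's default recursion limit of 25,
# so raise it for a full pass:
graph.invoke(
    {"transcript": "stub", "current_item": 1, "scores": {}},
    config={"recursion_limit": 50},
)
```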
### Implementation Phases

#### Phase 1: Core Graph & State Management (Foundation)

* **Goal**: Establish the LangGraph structure and Pydantic state.
* **Deliverables**:
    * `src/helia/agent/state.py`: Define `ClinicalState` (transcript, current_item, scores); a minimal sketch follows this list.
    * `src/helia/agent/graph.py`: Define the main `StateGraph` with Ingestion -> Assessment -> Persistence nodes.
    * `src/helia/ingestion/loader.py`: Refactor to load transcript documents from MongoDB.
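One possible shape for the Pydantic state, refining the TypedDict stub in the wiring sketch above. Field names and types are assumptions for illustration, not the final schema.

```python
# Hypothetical sketch of src/helia/agent/state.py. Field names and types are
# assumptions; the final schema is a Phase 1 deliverable.
from pydantic import BaseModel, Field


class ItemScore(BaseModel):
    """Output of one full RISEN pass over a single PHQ-8 item."""
    item_id: int = Field(ge=1, le=8)   # which PHQ-8 criterion
    evidence: list[str] = []           # verbatim quotes (Extract)
    rationale: str = ""                # mapping explanation (Map)
    score: int = Field(ge=0, le=3)     # 0-3 severity (Score)


class ClinicalState(BaseModel):
    """State threaded through every node of the Helia graph."""
    transcript: str                    # patient transcript from MongoDB
    current_item: int = 1              # PHQ-8 item under assessment
    scores: list[ItemScore] = []       # accumulated per-item results
    reviewed: bool = False             # flipped by the HITL step
```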
#### Phase 2: The "RISEN" Assessment Logic

* **Goal**: Replace the monolithic `PHQ8Evaluator` with granular nodes.
* **Deliverables**:
    * `src/helia/agent/nodes/assessment.py`: Implement `extract_node`, `map_node`, and `score_node`, each fetching its prompt from the database.
    * `migrations/init_risen_prompts.py`: Database migration to seed the Extract/Map/Score prompts; a sketch of the seeding script follows this list.
    * **Refactor**: Update `PHQ8Evaluator` to be callable as a tool/node rather than a standalone class.
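A sketch of what the seeding migration might look like. The connection URI, collection name, document schema, and prompt wording are all assumptions.

```python
# migrations/init_risen_prompts.py -- illustrative seeding script. The
# connection URI, collection name, and prompt wording are assumptions.
from pymongo import MongoClient

PROMPTS = [
    {"name": "risen_extract",
     "template": "Quote every passage relevant to PHQ-8 item {item}:\n\n{transcript}"},
    {"name": "risen_map",
     "template": "Explain how these quotes map onto PHQ-8 item {item}:\n\n{quotes}"},
    {"name": "risen_score",
     "template": "Given this mapping, assign a 0-3 score for item {item}:\n\n{rationale}"},
]


def main() -> None:
    coll = MongoClient("mongodb://localhost:27017")["helia"]["prompts"]
    for doc in PROMPTS:
        # Upsert on name so re-running the migration stays idempotent.
        coll.update_one({"name": doc["name"]}, {"$set": doc}, upsert=True)


if __name__ == "__main__":
    main()
```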
#### Phase 3: Tier Switching & Execution

* **Goal**: Implement dynamic model configuration.
* **Deliverables**:
    * `src/helia/configuration.py`: Ensure `RunConfig` (Tier 1/2/3) propagates to LangGraph `configurable` params; see the sketch after this list.
    * `src/helia/agent/runner.py`: CLI entry point to run batch benchmarks against the MongoDB transcripts.
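One way the tier could flow from the run config into a node, assuming LangChain chat-model clients. The model names, the self-hosted URL, and the `tier` key are placeholders, not the real `RunConfig` contract.

```python
# Sketch of tier-aware model selection inside a node. LangGraph injects the
# run config into any node that declares a `config` parameter. Model names,
# the self-hosted URL, and the "tier" key are illustrative assumptions.
from langchain_core.runnables import RunnableConfig
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI


def get_model(config: RunnableConfig):
    tier = config.get("configurable", {}).get("tier", 1)
    if tier == 1:
        return ChatOllama(model="llama3")  # local: nothing leaves the machine
    if tier == 2:
        return ChatOpenAI(base_url="http://self-hosted:8000/v1",  # assumed vLLM endpoint
                          api_key="not-needed", model="helia-7b")
    return ChatOpenAI(model="gpt-4")       # cloud


def score_node(state: dict, config: RunnableConfig) -> dict:
    model = get_model(config)
    reply = model.invoke(f"Assign a 0-3 score for PHQ-8 item {state['current_item']}.")
    return {"raw_score_text": reply.content}


# runner.py would then select the tier per benchmark run, e.g.:
# graph.invoke(inputs, config={"configurable": {"tier": 1}})
```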
#### Phase 4: Human-in-the-Loop & Persistence

* **Goal**: Enable clinician review and persistence of results.
* **Deliverables**:
    * **Checkpointing**: Configure a MongoDB/Postgres checkpointer for LangGraph.
    * **Review Flow**: Implement the `interrupt_before` logic for the "Finalize" node; a sketch follows this list.
    * **Metrics**: Calculate item-level agreement (MAE / Cohen's kappa) between agent scores and ground truth.
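A sketch of the review gate, reusing the `builder` from the wiring sketch above. `MemorySaver` stands in for the production MongoDB/Postgres checkpointer; the node name and thread id are assumptions.

```python
# Sketch of the HITL gate. MemorySaver stands in for the production
# MongoDB/Postgres checkpointer; "finalize" and the thread_id are assumptions.
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
graph = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["finalize"],  # pause for clinician review
)

config = {"configurable": {"thread_id": "participant-303"}, "recursion_limit": 50}
graph.invoke({"transcript": "stub", "current_item": 1, "scores": {}}, config)

# Execution is now parked before "finalize". After the clinician edits the
# scores, patch the state and resume from the checkpoint:
graph.update_state(config, {"scores": {3: 2}})  # clinician's correction
graph.invoke(None, config)                      # resume: runs "finalize"
```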
## Acceptance Criteria

### Functional Requirements

- [ ] **Stateful Workflow**: The system successfully transitions Ingest -> Assess -> Persist using LangGraph.
- [ ] **Multi-Stage Scoring**: Each PHQ-8 item is scored using the Extract -> Map -> Score pattern.
- [ ] **Model Swapping**: The *exact same graph* runs with `gpt-4` (Tier 3) and `llama3` (Tier 1) by changing only the config.
- [ ] **Benchmarking**: Automatically output a CSV comparing `Model_Score` vs. `Human_Label` for all 8 items; see the metrics sketch below.
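The item-level agreement over that CSV could be computed along these lines; the file name, column handling, and the quadratic weighting are assumptions.

```python
# Sketch of the agreement metrics over a benchmark CSV with Model_Score and
# Human_Label columns. File name and quadratic weighting are assumptions.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("benchmark_results.csv")  # assumed output of the runner
mae = (df["Model_Score"] - df["Human_Label"]).abs().mean()
kappa = cohen_kappa_score(df["Model_Score"], df["Human_Label"],
                          weights="quadratic")
print(f"MAE={mae:.2f}  weighted kappa={kappa:.2f}")
```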
### Non-Functional Requirements

- [ ] **Privacy**: Tier 1 execution sends ZERO bytes to external APIs.
- [ ] **Reproducibility**: Every run logs the exact prompts used and the model version to MongoDB.

## Dependencies & Risks

- **Risk**: Local models (Tier 1) may hallucinate formatting in the "Map" stage.
    * *Mitigation*: Use `instructor` or constrained decoding (JSON mode) for Tier 1; see the sketch after this list.
- **Dependency**: Requires the DAIC-WOZ dataset (loaded into MongoDB).
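A sketch of the mitigation with `instructor` against a local Ollama OpenAI-compatible endpoint. The endpoint, model name, and schema are assumptions; `Mode.JSON` constrains the model to JSON output that is validated (and retried) against the Pydantic schema.

```python
# Sketch of schema-constrained decoding for Tier 1 via instructor. The Ollama
# OpenAI-compatible endpoint and model name are assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class MappedItem(BaseModel):
    item_id: int = Field(ge=1, le=8)
    score: int = Field(ge=0, le=3)
    rationale: str


client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,  # JSON mode: retries until output validates
)
result = client.chat.completions.create(
    model="llama3",
    response_model=MappedItem,  # instructor validates/repairs the output
    messages=[{"role": "user", "content": "Map and score PHQ-8 item 3."}],
)
print(result.score)
```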
## References

- **LangGraph**: [State Management](https://langchain-ai.github.io/langgraph/concepts/high_level/#state)
- **Clinical Best Practice**: [RISEN Framework (2025)](https://pubmed.ncbi.nlm.nih.gov/40720397/)
- **Project Config**: `src/helia/configuration.py`