# Plan: Modular Agentic Framework for Clinical Assessment (Helia)

## Overview

Implement a production-grade, privacy-first Agentic Framework using LangGraph to automate PHQ-8 clinical assessments. The system allows dynamic switching between Local (Tier 1), Self-Hosted (Tier 2), and Cloud (Tier 3) models to benchmark performance.

## Problem Statement

The current system relies on a monolithic script (`src/helia/agent/workflow.py` is a placeholder) and single-pass evaluation logic that likely underperforms on smaller local models. To test the thesis hypothesis (that local models can match cloud performance), we need a sophisticated **Stateful Architecture** that implements Multi-Stage Reasoning (the "RISEN" pattern) and a robust Human-in-the-Loop (HITL) workflow.
## Proposed Solution

A **Hierarchical Agent Supervisor** architecture built with **LangGraph**:

1. **Supervisor**: Orchestrates the workflow and manages state.
2. **Assessment Agent**: Implements the "RISEN" (Reasoning Improvement via Stage-wise Evaluation Network) pattern:
    * **Extract**: Quote relevant patient text.
    * **Map**: Align the quotes to PHQ-8 criteria.
    * **Score**: Assign a 0-3 severity value.
3. **Ingestion**: Standardizes data from MongoDB into a `ClinicalState`.
4. **Benchmarking**: Automates the comparison of generated scores against ground truth (DAIC-WOZ labels).

**Note:** A dedicated **Safety Guardrail** agent has been designed but is scoped out of this MVP. See `plans/safety-guardrail-architecture.md` for details.
## Technical Approach

### Architecture: The "Helia Graph"

```mermaid
graph TD
    Start --> Ingestion
    Ingestion --> Router{Router}

    subgraph "Assessment Agent (RISEN)"
        Router --> Extract[Extract Evidence]
        Extract --> Map[Map to Criteria]
        Map --> Score[Score Item]
        Score --> NextItem{Next Item?}
        NextItem -- Yes --> Extract
    end

    NextItem -- No --> HumanReview["Human Review (HITL)"]
    HumanReview --> Finalize[Finalize & Persist]
```
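The diagram translates almost mechanically into LangGraph. Below is a minimal, runnable wiring sketch with stubbed node bodies; the state fields and node names are illustrative assumptions pending Phase 1, and the diagram's Router is collapsed into a direct edge for brevity.

```python
# Minimal wiring sketch of the Helia Graph above. Node bodies are stubs;
# state fields and node names are illustrative assumptions, not final APIs.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class ClinicalState(TypedDict):
    transcript: str
    current_item: int        # PHQ-8 item (1-8) currently being assessed
    scores: dict[int, int]   # item -> 0-3 score


def passthrough(name: str):
    """Build a stub node that just announces itself."""
    def node(state: ClinicalState) -> dict:
        print(f"[{name}] item {state['current_item']}")
        return {}
    return node


def score(state: ClinicalState) -> dict:
    """Stub Score stage: record a dummy score and advance to the next item."""
    return {
        "scores": {**state["scores"], state["current_item"]: 0},
        "current_item": state["current_item"] + 1,
    }


def next_item(state: ClinicalState) -> str:
    """The 'Next Item?' router: loop until all 8 PHQ-8 items are scored."""
    return "extract" if state["current_item"] <= 8 else "human_review"


builder = StateGraph(ClinicalState)
for name in ("ingestion", "extract", "map", "human_review", "finalize"):
    builder.add_node(name, passthrough(name))
builder.add_node("score", score)

builder.add_edge(START, "ingestion")
builder.add_edge("ingestion", "extract")
builder.add_edge("extract", "map")
builder.add_edge("map", "score")
builder.add_conditional_edges("score", next_item)  # Yes -> extract, No -> human_review
builder.add_edge("human_review", "finalize")
builder.add_edge("finalize", END)
graph = builder.compile()

# 8 items x 3 stages exceeds LangGraph's default recursion limit of 25,
# so raise it for a full pass:
graph.invoke(
    {"transcript": "stub", "current_item": 1, "scores": {}},
    config={"recursion_limit": 50},
)
```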
### Implementation Phases

#### Phase 1: Core Graph & State Management (Foundation)

* **Goal**: Establish the LangGraph structure and Pydantic state.
* **Deliverables**:
    * `src/helia/agent/state.py`: Define `ClinicalState` (transcript, current_item, scores); a minimal sketch follows this list.
    * `src/helia/agent/graph.py`: Define the main `StateGraph` with Ingestion -> Assessment -> Persistence nodes.
    * `src/helia/ingestion/loader.py`: Refactor to load transcript documents from MongoDB.
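One possible shape for the Pydantic state, refining the TypedDict stub in the wiring sketch above. Field names and types are assumptions for illustration, not the final schema.

```python
# Hypothetical sketch of src/helia/agent/state.py. Field names and types are
# assumptions; the final schema is a Phase 1 deliverable.
from pydantic import BaseModel, Field


class ItemScore(BaseModel):
    """Output of one full RISEN pass over a single PHQ-8 item."""
    item_id: int = Field(ge=1, le=8)   # which PHQ-8 criterion
    evidence: list[str] = []           # verbatim quotes (Extract)
    rationale: str = ""                # mapping explanation (Map)
    score: int = Field(ge=0, le=3)     # 0-3 severity (Score)


class ClinicalState(BaseModel):
    """State threaded through every node of the Helia graph."""
    transcript: str                    # patient transcript from MongoDB
    current_item: int = 1              # PHQ-8 item under assessment
    scores: list[ItemScore] = []       # accumulated per-item results
    reviewed: bool = False             # flipped by the HITL step
```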
#### Phase 2: The "RISEN" Assessment Logic

* **Goal**: Replace the monolithic `PHQ8Evaluator` with granular nodes.
* **Deliverables**:
    * `src/helia/agent/nodes/assessment.py`: Implement `extract_node`, `map_node`, and `score_node`, each fetching its prompt from the database.
    * `migrations/init_risen_prompts.py`: Database migration to seed the Extract/Map/Score prompts; a sketch of the seeding script follows this list.
    * **Refactor**: Update `PHQ8Evaluator` to be callable as a tool/node rather than a standalone class.
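A sketch of what the seeding migration might look like. The connection URI, collection name, document schema, and prompt wording are all assumptions.

```python
# migrations/init_risen_prompts.py -- illustrative seeding script. The
# connection URI, collection name, and prompt wording are assumptions.
from pymongo import MongoClient

PROMPTS = [
    {"name": "risen_extract",
     "template": "Quote every passage relevant to PHQ-8 item {item}:\n\n{transcript}"},
    {"name": "risen_map",
     "template": "Explain how these quotes map onto PHQ-8 item {item}:\n\n{quotes}"},
    {"name": "risen_score",
     "template": "Given this mapping, assign a 0-3 score for item {item}:\n\n{rationale}"},
]


def main() -> None:
    coll = MongoClient("mongodb://localhost:27017")["helia"]["prompts"]
    for doc in PROMPTS:
        # Upsert on name so re-running the migration stays idempotent.
        coll.update_one({"name": doc["name"]}, {"$set": doc}, upsert=True)


if __name__ == "__main__":
    main()
```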
#### Phase 3: Tier Switching & Execution

* **Goal**: Implement dynamic model configuration.
* **Deliverables**:
    * `src/helia/configuration.py`: Ensure `RunConfig` (Tier 1/2/3) propagates to LangGraph `configurable` params; see the sketch after this list.
    * `src/helia/agent/runner.py`: CLI entry point to run batch benchmarks against the MongoDB transcripts.
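One way the tier could flow from the run config into a node, assuming LangChain chat-model clients. The model names, the self-hosted URL, and the `tier` key are placeholders, not the real `RunConfig` contract.

```python
# Sketch of tier-aware model selection inside a node. LangGraph injects the
# run config into any node that declares a `config` parameter. Model names,
# the self-hosted URL, and the "tier" key are illustrative assumptions.
from langchain_core.runnables import RunnableConfig
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI


def get_model(config: RunnableConfig):
    tier = config.get("configurable", {}).get("tier", 1)
    if tier == 1:
        return ChatOllama(model="llama3")  # local: nothing leaves the machine
    if tier == 2:
        return ChatOpenAI(base_url="http://self-hosted:8000/v1",  # assumed vLLM endpoint
                          api_key="not-needed", model="helia-7b")
    return ChatOpenAI(model="gpt-4")       # cloud


def score_node(state: dict, config: RunnableConfig) -> dict:
    model = get_model(config)
    reply = model.invoke(f"Assign a 0-3 score for PHQ-8 item {state['current_item']}.")
    return {"raw_score_text": reply.content}


# runner.py would then select the tier per benchmark run, e.g.:
# graph.invoke(inputs, config={"configurable": {"tier": 1}})
```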
#### Phase 4: Human-in-the-Loop & Persistence

* **Goal**: Enable clinician review and persistence of results.
* **Deliverables**:
    * **Checkpointing**: Configure a MongoDB/Postgres checkpointer for LangGraph.
    * **Review Flow**: Implement the `interrupt_before` logic for the "Finalize" node; a sketch follows this list.
    * **Metrics**: Calculate item-level agreement (MAE / Cohen's kappa) between agent scores and ground truth.
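A sketch of the review gate, reusing the `builder` from the wiring sketch above. `MemorySaver` stands in for the production MongoDB/Postgres checkpointer; the node name and thread id are assumptions.

```python
# Sketch of the HITL gate. MemorySaver stands in for the production
# MongoDB/Postgres checkpointer; "finalize" and the thread_id are assumptions.
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
graph = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["finalize"],  # pause for clinician review
)

config = {"configurable": {"thread_id": "participant-303"}, "recursion_limit": 50}
graph.invoke({"transcript": "stub", "current_item": 1, "scores": {}}, config)

# Execution is now parked before "finalize". After the clinician edits the
# scores, patch the state and resume from the checkpoint:
graph.update_state(config, {"scores": {3: 2}})  # clinician's correction
graph.invoke(None, config)                      # resume: runs "finalize"
```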
## Acceptance Criteria

### Functional Requirements

- [ ] **Stateful Workflow**: The system successfully transitions Ingest -> Assess -> Persist using LangGraph.
- [ ] **Multi-Stage Scoring**: Each PHQ-8 item is scored using the Extract -> Map -> Score pattern.
- [ ] **Model Swapping**: The *exact same graph* runs with `gpt-4` (Tier 3) and `llama3` (Tier 1) by changing only the config.
- [ ] **Benchmarking**: Automatically output a CSV comparing `Model_Score` vs. `Human_Label` for all 8 items; see the metrics sketch below.
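The item-level agreement over that CSV could be computed along these lines; the file name, column handling, and the quadratic weighting are assumptions.

```python
# Sketch of the agreement metrics over a benchmark CSV with Model_Score and
# Human_Label columns. File name and quadratic weighting are assumptions.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("benchmark_results.csv")  # assumed output of the runner
mae = (df["Model_Score"] - df["Human_Label"]).abs().mean()
kappa = cohen_kappa_score(df["Model_Score"], df["Human_Label"],
                          weights="quadratic")
print(f"MAE={mae:.2f}  weighted kappa={kappa:.2f}")
```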
### Non-Functional Requirements

- [ ] **Privacy**: Tier 1 execution sends ZERO bytes to external APIs.
- [ ] **Reproducibility**: Every run logs the exact prompts used and the model version to MongoDB.

## Dependencies & Risks

- **Risk**: Local models (Tier 1) may hallucinate formatting in the "Map" stage.
    * *Mitigation*: Use `instructor` or constrained decoding (JSON mode) for Tier 1; see the sketch after this list.
- **Dependency**: Requires the DAIC-WOZ dataset (loaded into MongoDB).
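A sketch of the mitigation with `instructor` against a local Ollama OpenAI-compatible endpoint. The endpoint, model name, and schema are assumptions; `Mode.JSON` constrains the model to JSON output that is validated (and retried) against the Pydantic schema.

```python
# Sketch of schema-constrained decoding for Tier 1 via instructor. The Ollama
# OpenAI-compatible endpoint and model name are assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class MappedItem(BaseModel):
    item_id: int = Field(ge=1, le=8)
    score: int = Field(ge=0, le=3)
    rationale: str


client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,  # JSON mode: retries until output validates
)
result = client.chat.completions.create(
    model="llama3",
    response_model=MappedItem,  # instructor validates/repairs the output
    messages=[{"role": "user", "content": "Map and score PHQ-8 item 3."}],
)
print(result.score)
```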
## References

- **LangGraph**: [State Management](https://langchain-ai.github.io/langgraph/concepts/high_level/#state)
- **Clinical Best Practice**: [RISEN Framework (2025)](https://pubmed.ncbi.nlm.nih.gov/40720397/)
- **Project Config**: `src/helia/configuration.py`