Bachelor Thesis Exposé - Santiago Martinez-Avial
A Modular Agent Framework for Therapeutic Interview Analysis: Comparing Local, Self-Hosted, and Cloud LLM Deployments
Supervisor / First Examiner: Peter Ruppel
Second Examiner: Adam Roe
Introduction
This thesis aims to bridge the fields of Artificial Intelligence and mental healthcare by enhancing a therapist's workflow rather than replacing it. When conducting diagnostic interviews, therapists often rely on standardized questionnaires or forms. These conversations can be long, detailed, and information-dense, and much of the follow-up work involves organizing what was said into formal assessments. An agentic AI system could support therapists by recording and transcribing sessions, extracting and structuring relevant information, and mapping it onto these questionnaires. It could then offer a second opinion or preliminary analysis that highlights patterns or insights the therapist might want to explore further.
However, implementing such systems proves difficult: most state-of-the-art models are only available through cloud inference APIs, which creates legal and regulatory challenges. Additionally, both patients and therapists are often uneasy about highly personal conversations being transmitted, stored, and processed on remote servers over which they have virtually no control.
This thesis is motivated by that conflict. Instead of simply accepting a "privacy or performance" trade-off, the thesis systematically tests how far a local-first or on-premise architecture can close the gap to state-of-the-art cloud systems on a concrete therapeutic analysis task.
My working hypothesis is: Given an appropriate supporting framework, small quantized language models running locally or on‑premise can provide analytical performance comparable to large cloud-based state-of-the-art models for a specific therapeutic analysis task.
If confirmed, this hypothesis could lay the groundwork for wider adoption of AI in therapy and other areas where data sensitivity is critical.
Approach and preliminary results
To evaluate this hypothesis, I will design, implement, and benchmark a modular software system against a formal specification derived from a realistic clinical use case, based on publicly available datasets. All engineering choices will be tied to a specific analytical task rather than to abstract model comparisons. This yields both a meaningful benchmark and a clinically relevant evaluation setup.
The work is organized into the following components:
Benchmark framework and dataset selection
A useful comparison between local, self-hosted, and cloud-based models requires a benchmark that is both clinically grounded and empirically measurable. For this project, that means basing the evaluation on established questionnaires or rating scales whose standardized scoring procedures can serve as "ground truth" and that are supported by relevant use in clinical or research settings.
Unfortunately, the biggest limiting factor is a lack of access to relevant material: many widely used tools and datasets are either paywalled or restricted to certain institutions. Given these constraints, I have identified three promising options so far:
- NCS-R Interviews (National Comorbidity Survey Replication): a set of structured questionnaires used to assess mental health and diagnose disorders
- Checklist of Cognitive Distortions: a CBT-based instrument used to identify and rate characteristic patterns of distorted thinking
- PHQ-8 (Patient Health Questionnaire-8): an eight-item questionnaire used to screen for and measure the severity of depressive symptoms in the general population
The PHQ-8 will likely be the selected tool, as I have recently been granted access to the DAIC-WOZ database, which contains audio recordings and transcripts of semi-structured clinical interviews and associated PHQ-8 scores for each participant.
This combination of real-world data and standardized questionnaire labels makes DAIC-WOZ particularly well suited as the primary benchmark dataset.
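Since the PHQ-8 would supply the ground-truth labels, the scoring rule the evaluation has to reproduce is simple: eight items rated 0–3 are summed to a 0–24 total, with a total of 10 or more serving as the usual screening cut-off (Kroenke et al., 2009). The sketch below is a minimal version of that scoring step; the severity bands are the commonly cited cut-points and would still be verified against the instrument before use.

```python
# Scoring sketch for the PHQ-8 ground-truth labels: eight items, each rated
# 0-3, summed to a 0-24 total. A total of 10 or more is the usual screening
# cut-off (Kroenke et al., 2009); the bands below are commonly cited cut-points.
from typing import Sequence


def phq8_total(item_scores: Sequence[int]) -> int:
    """Sum the eight item scores after checking they are on the 0-3 scale."""
    assert len(item_scores) == 8 and all(0 <= s <= 3 for s in item_scores)
    return sum(item_scores)


def phq8_severity(total: int) -> str:
    """Map a 0-24 total to a coarse severity label."""
    if total < 5:
        return "none/minimal"
    if total < 10:
        return "mild"
    if total < 15:
        return "moderate"
    if total < 20:
        return "moderately severe"
    return "severe"


# Example: compare the agent's extracted item scores with the dataset label.
predicted_total = phq8_total([2, 1, 3, 0, 1, 2, 1, 0])   # -> 10, "moderate"
```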
If I am unable to obtain access to comparable real interview datasets for other instruments, I will generate synthetic interviews from questionnaires based on proven methods, such as the approach described in "Roleplaying with Structure" (see References), which proposes generating therapist–client dialogues from questionnaire items and responses. This would allow me to construct controlled benchmarks when real data are unavailable, while keeping the evaluation grounded in established academic instruments.
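To illustrate the direction this synthesis could take, the sketch below turns questionnaire items and responses into a generation prompt. It is only loosely modelled on the cited approach rather than a reimplementation of it; the item texts and response scale shown are PHQ-8-style placeholders.

```python
# Rough sketch of questionnaire-driven dialogue generation. The function name
# and prompt wording are illustrative placeholders, not the cited paper's method.
PHQ8_ITEMS = [
    "Little interest or pleasure in doing things",
    "Feeling down, depressed, or hopeless",
    # ... remaining six items would be listed here ...
]


def build_generation_prompt(item_responses: dict[str, int]) -> str:
    """Turn (item, 0-3 response) pairs into a prompt for a dialogue-generating model."""
    lines = [
        "Write a realistic, de-identified therapist-client interview in English.",
        "The client's answers must stay consistent with these questionnaire responses",
        "(0 = not at all, 3 = nearly every day), without quoting the items verbatim:",
    ]
    lines += [f"- {item}: {score}" for item, score in item_responses.items()]
    lines.append("The therapist should ask open, non-leading follow-up questions.")
    return "\n".join(lines)
```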
Software system: a ReAct-style agent
The main technical component will be a modular reasoning agent system that carries out the benchmark task. Following a ReAct-style approach, the models are instructed to break the task into smaller steps, reason through them, and call specific tools as needed.
The software architecture will:
- Expose a clear set of tools (e.g., RAG, web search, short-term memory, access to academic and clinical resources, logging and scoring utilities).
- Allow swapping the underlying models (local vs. self-hosted vs. cloud) with minimal changes (see the sketch after this list).
- Support different data backends (e.g., local files, on-premise storage, or de-identified payloads for cloud calls).
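A minimal sketch of how the model layer could stay identical across tiers, assuming every tier is reachable through an OpenAI-compatible chat-completions endpoint (local servers such as llama.cpp, vLLM, or Ollama can expose one); the class name, URLs, and model names below are placeholders rather than final choices.

```python
# Sketch of a swappable model backend: one client class, with the deployment
# tier selected purely by base_url and model name.
import requests


class ChatBackend:
    def __init__(self, base_url: str, model: str, api_key: str = "none"):
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.api_key = api_key

    def complete(self, system: str, user: str, max_tokens: int = 512) -> str:
        # Standard OpenAI-style chat-completions request; local servers exposing
        # a compatible endpoint accept the same payload shape.
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": user},
                ],
                "max_tokens": max_tokens,
            },
            timeout=120,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]


# Tier 1: a quantized model served on-device; Tier 3: a hosted API.
tier1 = ChatBackend("http://localhost:8080", model="small-quantized-model")
tier3 = ChatBackend("https://api.example.com", model="frontier-model", api_key="<key>")
```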
The system will be highly modular, so each deployment tier reuses as much of the common pipeline as possible while allowing for minor inference optimizations such as prompt engineering and hyperparameter tuning. This should reduce variance and noise when evaluating performance and make later extensions easier.
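To make the ReAct-style control flow concrete, the following sketch shows one possible loop on top of the backend sketched above, using a plain "Thought / Action / Observation" convention in the spirit of Yao et al. (2022). The JSON action format and the tool names are my own placeholders, not a fixed design decision.

```python
# Sketch of a ReAct-style loop: the model alternates reasoning and tool calls
# until it emits a "final" action or the step budget runs out.
import json
from typing import Callable


def react_loop(backend, question: str,
               tools: dict[str, Callable[[str], str]], max_steps: int = 8) -> str:
    system = (
        "Answer by interleaving Thought, Action, and Observation steps.\n"
        f"Available actions: {', '.join(tools)}.\n"
        'To act, reply with JSON: {"action": "<name>", "input": "<string>"}.\n'
        'To finish, reply with JSON: {"action": "final", "input": "<answer>"}.'
    )
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = backend.complete(system, transcript)
        transcript += f"\n{reply}"
        try:
            # Parse the outermost JSON object in the reply as the action.
            step = json.loads(reply[reply.index("{"): reply.rindex("}") + 1])
        except ValueError:
            transcript += "\nObservation: the last reply contained no valid JSON action."
            continue
        if not isinstance(step, dict):
            transcript += "\nObservation: the action must be a JSON object."
            continue
        if step.get("action") == "final":
            return step.get("input", "")
        tool = tools.get(step.get("action", ""))
        observation = tool(step.get("input", "")) if tool else "unknown action"
        transcript += f"\nObservation: {observation}"
    return "no answer within the step budget"


# Hypothetical tool registry: retrieval over the session transcript.
tools = {"search_transcript": lambda query: "(matching transcript excerpts)"}
```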
Comparative experiment and evaluation setup
The main experiment will compare three deployment tiers of the same agent pipeline:
- Tier 1 – Local / on‑device model
- Tier 2 – Self‑hosted / on‑premise model
- Tier 3 – Cloud‑based model
All three will be tested on the same software infrastructure and graded by an evaluation harness on functional correctness (accuracy), performance (latency and throughput), privacy and compliance, and cost.
The aim is not just to see which tier performs best, but to measure how much cognitive and analytical performance is lost when moving from Tier 3 down to Tier 1, and to judge whether that loss is acceptable for a clinical support tool.
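As a first cut, the evaluation harness could aggregate per-interview results as in the sketch below. The concrete metrics shown (mean absolute error on the PHQ-8 total, screening accuracy at the ≥10 cut-off, mean latency, cost per interview) are illustrative and will be finalized together with the benchmark.

```python
# Sketch of per-tier aggregation, assuming each run yields a predicted PHQ-8
# total, the dataset's true total, a latency measurement, and an API cost.
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    predicted_total: int
    true_total: int
    latency_s: float
    cost_usd: float


def evaluate_tier(records: list[RunRecord]) -> dict[str, float]:
    return {
        "mae": mean(abs(r.predicted_total - r.true_total) for r in records),
        "screening_accuracy": mean(
            (r.predicted_total >= 10) == (r.true_total >= 10) for r in records
        ),
        "mean_latency_s": mean(r.latency_s for r in records),
        "mean_cost_usd": mean(r.cost_usd for r in records),
    }
```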
Open Questions
- Which clinical benchmark framework offers the best balance between practical feasibility (data access, licensing) and analytical validity?
- Will there be a need to synthesize interview data and if so, how can I ensure its academic integrity?
- What adaptations and aids will be necessary for less performant models to perform to an acceptable degree?
- Will an agent pipeline designed around small models negatively impact the performance of larger models?
The answers will influence both the experimental setup and the interpretation of results.
Preliminary Structure
The thesis will be structured as follows:
- Introduction
- Methodology
- Benchmark Framework: Selection and Formal Specification
- Data Sourcing and Preparation
- System Architecture and Agent Framework Design
- Grading and Evaluation
- Implementation
- Evaluation and Results
- Discussion
- Conclusion and Future Work
Roadmap
- Weeks 1–2 – Define the project
- Finalize the research question, task (PHQ‑8 on DAIC‑WOZ), and evaluation metrics.
- Decide on data preprocessing and any synthetic data generation.
- Choose model lineup, tiers (cloud / self‑hosted / local), and hardware.
- Draft the Methodology sections for data and overall setup.
- Weeks 3–5 – Build and test the system
- Implement the agent framework and tool interfaces.
- Integrate Tier 3 (cloud) and Tier 2 (self‑hosted) models and run pilot tests.
- Set up Tier 1 (local) models and tune prompts/flows for smaller models.
- Run end‑to‑end pilots and document the system architecture.
- Weeks 6–7 – Run experiments and analyze
- Fix all experiment settings (models, tiers, hyperparameters, etc.).
- Run full experiments on all tiers; collect accuracy, latency, and cost data.
- Perform quantitative and light qualitative error analysis.
- Draft the Evaluation & Results chapter with tables and figures.
- Week 8 – Write core thesis sections
- Write the Discussion (interpretation, trade‑offs, limitations).
- Write Introduction and Related Work
- Update Methodology to match what was actually implemented.
- Weeks 9–10 – Finalize and submit
- Produce a full integrated thesis draft and refine figures, tables, and references.
- Incorporate feedback and ensure consistent terminology.
- Add a technical appendix (models, hardware, hyperparameters, code overview).
- Prepare and submit the final thesis and any required materials.
References
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022, October 6). ReAct: Synergizing reasoning and acting in language models. arXiv.org. https://arxiv.org/abs/2210.03629
- Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires. (2025). https://arxiv.org/html/2510.25384v1
- How real are synthetic therapy conversations? Evaluating fidelity in prolonged exposure dialogues. (2025). https://arxiv.org/html/2504.21800v1
- Nisevic, M., Milojevic, D., & Spajic, D. (2025). Synthetic data in medicine: Legal and ethical considerations for patient profiling. Computational and Structural Biotechnology Journal, 28, 190–198. https://doi.org/10.1016/j.csbj.2025.05.026
- Kroenke, K., Strine, T. W., Spitzer, R. L., Williams, J. B., Berry, J. T., & Mokdad, A. H. (2009). The PHQ-8 as a measure of current depression in the general population. Journal of affective disorders, 114(1-3), 163–173. https://doi.org/10.1016/j.jad.2008.06.026
- National Comorbidity Survey (NCS) series. (n.d.). https://www.icpsr.umich.edu/web/ICPSR/series/00527
- National Comorbidity Survey. (n.d.). https://web.archive.org/web/20250614210732/https://hcp.med.harvard.edu/ncs/replication.php
- Chand, S. P., Kuckel, D. P., & Huecker, M. R. (2023). Cognitive behavior therapy. In StatPearls. StatPearls Publishing. https://www.ncbi.nlm.nih.gov/books/NBK470241/
- Zhang, M., Yang, X., Zhang, X., Labrum, T., Chiu, J. C., Eack, S. M., Fang, F., Wang, W. Y., & Chen, Z. Z. (2024, October 17). CBT-Bench: Evaluating large language models on assisting cognitive behavior therapy. arXiv.org. https://arxiv.org/abs/2410.13218
- Li, Y., Yao, J., Bunyi, J. B. S., Frank, A. C., Hwang, A., & Liu, R. (2025, June 10). CounselBench: A Large-Scale expert evaluation and adversarial benchmarking of large language models in mental health question answering. arXiv.org. https://arxiv.org/abs/2506.08584