PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Clinical Profiles

170 patient profiles from MIMIC-IV / MIMIC-IV-ED / MIMIC-IV-Note

Demographics, Medical History, Visit Details

Persona Axes

37 unique combinations across 4 behavioral dimensions

Personality: 6 types

Language Proficiency: 3 levels

Medical History Recall: High / Low

Cognitive Confusion: Normal / High

PatientSim

Realistic, diverse patient agent for ED consultation dialogues

PatientSim combines real MIMIC clinical profiles with multi-dimensional behavioral personas to generate realistic patient agents for doctor–patient dialogue evaluation and medical education.

Abstract

Training physicians to conduct effective clinical interviews is a critical yet under-supported component of medical education. Existing patient simulators are either too rigid for natural conversation or too costly to scale. We present PatientSim, an open-source, LLM-powered patient simulator that generates realistic and behaviorally diverse patient personas grounded in real clinical data.

PatientSim builds patient profiles from the MIMIC-IV family of datasets and augments them with four behavioral persona axes—personality type, language proficiency, medical history recall, and cognitive confusion—yielding 37 unique patient combinations. We benchmark eight LLMs as the simulator backbone and select Llama 3.3 70B based on clinician-validated quality scores. Four clinicians evaluated the platform and awarded an average overall quality score of 3.89 / 4, with strong inter-rater agreement (Gwet's AC₁ > 0.85). PatientSim is privacy-compliant, reproducible, and publicly released to advance medical dialogue research and clinical training.

Key Contributions

Novel Simulation Framework

Combines structured real-world clinical data (MIMIC) with multi-dimensional behavioral persona axes to produce 37 distinct patient types, enabling large-scale and diverse dialogue simulation.

Comprehensive LLM Evaluation

Benchmarks 8 state-of-the-art LLMs across factual accuracy, persona fidelity, and clinical plausibility, with both automated and expert human evaluation.

Open & Privacy-Compliant

Fully open-source code and a de-identified PhysioNet dataset release, enabling reproducible benchmarks for the medical dialogue and healthcare education communities.

Method

Clinical Profile Construction

Patient profiles are extracted from MIMIC-IV (v3.1), MIMIC-IV-ED (v2.2), and MIMIC-IV-Note (v2.2). Each profile comprises 24 structured fields covering demographics, medical history, and ED visit details. Profiles target five dialogue-amenable diagnoses.

Myocardial Infarction, Pneumonia, Urinary Tract Infection, Intestinal Obstruction, Cerebral Infarction (Stroke)

Persona Axis Design

Four independent behavioral axes define how a patient communicates. Together they yield 37 unique persona combinations that reflect the diversity of real patients.

Personality: Impatient, Anxious, Distrustful, Overly Positive, Verbose, Neutral

Language Proficiency: A (Basic), B (Intermediate), C (Advanced)

Medical History Recall: High Recall, Low Recall

Cognitive Confusion: Normal, Highly Confused
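The four axes above can be written down as a small data structure. The sketch below is illustrative only: the `Persona` class, the `PERSONA_AXES` dictionary, and the `describe` wording are hypothetical names chosen for this example, not PatientSim's actual code or prompts. Note that the unrestricted cross product of the listed values has 6 × 3 × 2 × 2 = 72 entries, while this page reports 37 combinations, so PatientSim evidently uses a subset. The selection rule is not stated here, and the sketch does not attempt to reproduce it.

```python
from dataclasses import dataclass
from itertools import product

# Axis values exactly as listed on this page. The dictionary name and keys
# are hypothetical; key order matches the Persona field order below.
PERSONA_AXES = {
    "personality": ["Impatient", "Anxious", "Distrustful",
                    "Overly Positive", "Verbose", "Neutral"],
    "language_proficiency": ["A (Basic)", "B (Intermediate)", "C (Advanced)"],
    "medical_history_recall": ["High Recall", "Low Recall"],
    "cognitive_confusion": ["Normal", "Highly Confused"],
}


@dataclass(frozen=True)
class Persona:
    """One point in the persona space (hypothetical container)."""
    personality: str
    language_proficiency: str
    medical_history_recall: str
    cognitive_confusion: str

    def describe(self) -> str:
        # Illustrative wording only; not the prompt text used by PatientSim.
        return (
            f"Personality: {self.personality}; "
            f"language proficiency: {self.language_proficiency}; "
            f"medical history recall: {self.medical_history_recall}; "
            f"cognitive confusion: {self.cognitive_confusion}"
        )


def full_grid() -> list:
    """Unrestricted cross product of all axis values (72 personas)."""
    return [Persona(*values) for values in product(*PERSONA_AXES.values())]


if __name__ == "__main__":
    grid = full_grid()
    print(len(grid))           # size of the unrestricted grid
    print(grid[0].describe())  # first persona in enumeration order
```

A frozen dataclass keeps each persona hashable, which makes it easy to deduplicate personas or use them as dictionary keys when aggregating scores per persona.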

LLM Backbone Selection

Eight LLMs were evaluated as the patient simulator backbone. Llama 3.3 70B was selected for its persona fidelity, the highest among the eight (average score 3.68/4), and for its perfect score on cognitive confusion simulation (4.0/4), as validated by four clinicians at Samsung Medical Center.

              
Selected backbone: Llama 3.3 70B (highest persona fidelity across all four axes)
            

Evaluation Framework

Three research questions structure the evaluation, each with dedicated automated and human metrics.

RQ1

Persona Fidelity

4-point scale across personality consistency, language appropriateness, recall accuracy, cognitive coherence, and overall realism

RQ2

Factual Accuracy

Sentence-level NLI evaluation: entailment rate, contradiction rate, information coverage (ICov) and consistency (ICon)

RQ3

Clinical Plausibility

Clinician ratings (4-point) on plausibility of statements not directly supported by clinical records
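The sentence-level rates under RQ2 can be sketched as simple counts over per-sentence NLI labels. This is a minimal sketch under stated assumptions: the label strings (`entailment`, `contradiction`, `unsupported`) and the function name `supported_rates` are hypothetical, and the choice to compute both rates over supported sentences only is an assumption based on this page describing the entailment rate as covering supported statements. ICov and ICon are not defined on this page, so they are not computed here.

```python
from collections import Counter

# Hypothetical label set: each patient sentence is assumed to have been
# labeled against the clinical profile by some NLI step upstream.
ALLOWED_LABELS = {"entailment", "contradiction", "unsupported"}


def supported_rates(labels):
    """Entailment and contradiction rates over supported sentences.

    Assumption: sentences labeled "unsupported" (no matching information in
    the profile) are excluded from the denominator and only counted.
    """
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")

    counts = Counter(labels)
    supported = counts["entailment"] + counts["contradiction"]
    if supported == 0:
        raise ValueError("no supported sentences to score")

    return {
        "entailment_rate": counts["entailment"] / supported,
        "contradiction_rate": counts["contradiction"] / supported,
        "n_unsupported": counts["unsupported"],
    }


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    toy = ["entailment"] * 8 + ["contradiction"] * 2 + ["unsupported"] * 5
    print(supported_rates(toy))
```

Under this assumption the two rates always sum to 1, and the unsupported sentences are the ones that would go to clinicians for the plausibility rating in RQ3.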

Results

3.89/4

Overall Quality Score

Average rating from four expert clinicians; 3.75/4 for educational utility

97.8%

Entailment Rate

Factual accuracy of supported statements, with a contradiction rate of only ~2%

3.96/4

Clinical Plausibility

Average plausibility score for unsupported but clinically reasonable statements

>0.85

Inter-Rater Agreement

Gwet's AC₁ across all four clinicians, indicating strong agreement

Llama 3.3 70B achieved the best persona fidelity (3.68/4 on average) among all 8 evaluated LLMs, with a perfect score of 4.0/4 for cognitive confusion simulation. Most models expressed negative personality traits less strongly than intended, due to LLM safety alignment constraints.
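For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of unweighted Gwet's AC₁ for multiple raters and nominal categories. The function name `gwet_ac1` and the input layout are assumptions made for this example. Whether the PatientSim evaluation used this unweighted form or a weighted variant for its 4-point ratings is not stated on this page, and the numbers in the usage section are toy values, not study data.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Unweighted Gwet's AC1.

    ratings:    one list per rated item, holding one label per rater
                (None marks a missing rating).
    categories: every label a rater could have chosen.
    """
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")

    # Per-item label counts, ignoring missing ratings.
    counts = []
    for item in ratings:
        present = [label for label in item if label is not None]
        if present:
            counts.append(Counter(present))
    if not counts:
        raise ValueError("no ratings provided")

    # Observed agreement: fraction of agreeing rater pairs per item,
    # averaged over items that have at least two ratings.
    pairwise = []
    for c in counts:
        r = sum(c.values())
        if r >= 2:
            pairwise.append(sum(v * (v - 1) for v in c.values()) / (r * (r - 1)))
    if not pairwise:
        raise ValueError("no item was rated by two or more raters")
    p_a = sum(pairwise) / len(pairwise)

    # Chance agreement from the average share of ratings in each category.
    pi = {
        cat: sum(c[cat] / sum(c.values()) for c in counts) / len(counts)
        for cat in categories
    }
    p_e = sum(p * (1 - p) for p in pi.values()) / (k - 1)

    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy example: 2 raters, 4 items, categories 1 and 2.
    toy = [[1, 1], [1, 1], [2, 2], [1, 2]]
    print(round(gwet_ac1(toy, categories=[1, 2]), 4))
```

Unlike Cohen's or Fleiss' kappa, the chance-agreement term here stays small when almost all ratings fall in one category, which is the usual reason for preferring AC₁ when scores cluster near the top of the scale.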

Dataset

PatientSim is built on three complementary MIMIC resources, with 170 patient profiles spanning five emergency department diagnoses.

Dataset	Version	Content
MIMIC-IV	v3.1	Structured inpatient records
MIMIC-IV-ED	v2.2	Emergency department records
MIMIC-IV-Note	v2.2	Clinical narrative notes

Download Dataset

De-identified patient profiles available on PhysioNet under a credentialed access policy.


Citation

@inproceedings{kyung2025patientsim,
  title     = {PatientSim: A Persona-Driven Simulator for
               Realistic Doctor-Patient Interactions},
  author    = {Kyung, Daeun and Chung, Hyunseung and Bae, Seongsu
               and Kim, Jiho and Sohn, Jae Ho and Kim, Taerim
               and Kim, Soo Kyung and Choi, Edward},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year      = {2025}
}