NeurIPS 2025 · Datasets & Benchmarks Spotlight

PatientSim

A Persona-Driven Simulator for Realistic Doctor–Patient Interactions

Daeun Kyung¹, Hyunseung Chung¹, Seongsu Bae¹, Jiho Kim¹, Jae Ho Sohn², Taerim Kim³, Soo Kyung Kim⁴, Edward Choi¹
¹KAIST, ²UCSF, ³Samsung Medical Center, ⁴Ewha Womans University
[Figure: PatientSim overview. Clinical Profiles (170 patient profiles from MIMIC-IV / MIMIC-IV-ED / MIMIC-IV-Note, covering demographics, medical history, and visit details) are combined with Persona Axes (37 unique combinations across 4 behavioral dimensions: Personality, 6 types; Language Proficiency, 3 levels; Medical History Recall, High / Low; Cognitive Confusion, Normal / High) to form PatientSim, a realistic, diverse patient agent for ED consultation dialogues.]

PatientSim combines real MIMIC clinical profiles with multi-dimensional behavioral personas to generate realistic patient agents for doctor–patient dialogue evaluation and medical education.

Abstract

Training physicians to conduct effective clinical interviews is a critical yet under-supported component of medical education. Existing patient simulators are either too rigid for natural conversation or too costly to scale. We present PatientSim, an open-source, LLM-powered patient simulator that generates realistic and behaviorally diverse patient personas grounded in real clinical data.

PatientSim builds patient profiles from the MIMIC-IV family of datasets and augments them with four behavioral persona axes—personality type, language proficiency, medical history recall, and cognitive confusion—yielding 37 unique patient combinations. We benchmark eight LLMs as the simulator backbone and select Llama 3.3 70B based on clinician-validated quality scores. Four clinicians evaluated the platform and awarded an average overall quality score of 3.89 / 4, with strong inter-rater agreement (Gwet's AC₁ > 0.85). PatientSim is privacy-compliant, reproducible, and publicly released to advance medical dialogue research and clinical training.

Key Contributions

Novel Simulation Framework

Combines structured real-world clinical data (MIMIC) with multi-dimensional behavioral persona axes to produce 37 distinct patient types, enabling large-scale and diverse dialogue simulation.

Comprehensive LLM Evaluation

Benchmarks 8 state-of-the-art LLMs across factual accuracy, persona fidelity, and clinical plausibility, with both automated and expert human evaluation.

Open & Privacy-Compliant

Fully open-source code and a de-identified PhysioNet dataset release, enabling reproducible benchmarks for the medical dialogue and healthcare education communities.

Method

1. Clinical Profile Construction

Patient profiles are extracted from MIMIC-IV (v3.1), MIMIC-IV-ED (v2.2), and MIMIC-IV-Note (v2.2). Each profile comprises 24 structured fields covering demographics, medical history, and ED visit details. Profiles target five dialogue-amenable diagnoses.

Myocardial Infarction, Pneumonia, Urinary Tract Infection, Intestinal Obstruction, Cerebral Infarction (Stroke)
2. Persona Axis Design

Four behavioral axes define how a patient communicates, creating 37 unique persona combinations that reflect the diversity of real patients.

Axis | Options
Personality | Impatient, Anxious, Distrustful, Overly Positive, Verbose, Neutral
Language Proficiency | A (Basic), B (Intermediate), C (Advanced)
Medical History Recall | High Recall, Low Recall
Cognitive Confusion | Normal, Highly Confused
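As a rough illustration of how the axes compose, the sketch below enumerates every combination of the options listed above. The dictionary keys and the function name are hypothetical and are not taken from the PatientSim codebase. Note that the full cross product has 6 × 3 × 2 × 2 = 72 entries, more than the 37 combinations reported here, so PatientSim evidently uses a subset; the rule that selects that subset is not stated on this page and is not reproduced in the sketch.

```python
from itertools import product

# Axis options copied from the table above; the key names are illustrative.
PERSONA_AXES = {
    "personality": [
        "Impatient", "Anxious", "Distrustful",
        "Overly Positive", "Verbose", "Neutral",
    ],
    "language_proficiency": ["A (Basic)", "B (Intermediate)", "C (Advanced)"],
    "medical_history_recall": ["High Recall", "Low Recall"],
    "cognitive_confusion": ["Normal", "Highly Confused"],
}


def enumerate_personas(axes):
    """Return every combination of axis options as a list of dicts."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]


all_personas = enumerate_personas(PERSONA_AXES)
print(len(all_personas))  # full cross product, before any filtering
print(all_personas[0])
```

A filter that reduces this list to the 37 released personas would have to come from the paper or the released code rather than from this sketch.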
3. LLM Backbone Selection

Eight LLMs were evaluated as the patient simulator backbone. Llama 3.3 70B was selected for its superior persona fidelity (average score 3.68/4) and its top score on cognitive confusion simulation (4.0/4), validated by four clinicians at Samsung Medical Center.

Selected backbone: Llama 3.3 70B — highest persona fidelity across all four axes
4. Evaluation Framework

Three research questions structure the evaluation, each with dedicated automated and human metrics.

RQ1
Persona Fidelity

4-point scale across personality consistency, language appropriateness, recall accuracy, cognitive coherence, and overall realism

RQ2
Factual Accuracy

Sentence-level NLI evaluation: entailment rate, contradiction rate, information coverage (ICov) and consistency (ICon)

RQ3
Clinical Plausibility

Clinician ratings (4-point) on plausibility of statements not directly supported by clinical records
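To make the RQ2 rates concrete, here is a minimal sketch of how entailment and contradiction rates could be computed once each patient sentence has an NLI label. The label strings, the function name, and the choice of denominator are assumptions: this sketch excludes "neutral" sentences (those the profile neither supports nor contradicts) by default, which may differ from the paper's exact definition. ICov and ICon are not defined on this page, so they are left out.

```python
from collections import Counter


def nli_rates(labels, exclude_neutral=True):
    """Compute entailment and contradiction rates from sentence-level labels.

    labels: iterable of "entailment", "contradiction", or "neutral".
    exclude_neutral: if True (an assumption of this sketch), rates are taken
        over sentences the profile can verify, i.e. non-neutral ones.
    """
    counts = Counter(labels)
    unknown = set(counts) - {"entailment", "contradiction", "neutral"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")

    total = counts["entailment"] + counts["contradiction"]
    if not exclude_neutral:
        total += counts["neutral"]
    if total == 0:
        raise ValueError("no sentences to score")

    return {
        "entailment_rate": counts["entailment"] / total,
        "contradiction_rate": counts["contradiction"] / total,
    }


# Toy labels for illustration only; these are not PatientSim results.
toy_labels = ["entailment"] * 9 + ["contradiction"] + ["neutral"] * 5
toy_rates = nli_rates(toy_labels)
print(toy_rates)
```

Passing `exclude_neutral=False` instead divides by all sentences, so the two rates no longer sum to one whenever neutral sentences are present.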

Results

Overall Quality Score: 3.89/4. Average rating from four expert clinicians; 3.75/4 for educational utility.
Entailment Rate: 97.8%. Factual accuracy of supported statements, with only ~2% contradiction rate.
Clinical Plausibility: 3.96/4. Average plausibility score for unsupported but clinically reasonable statements.
Inter-Rater Agreement: >0.85. Gwet's AC₁ across all four clinicians, indicating strong agreement.

Llama 3.3 70B achieved the best persona fidelity (3.68 avg) among all 8 evaluated LLMs, with a perfect score of 4.0/4 for cognitive confusion simulation. Most models under-expressed negative personality traits, an effect of LLM safety alignment constraints.
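
For readers unfamiliar with the agreement statistic reported above, the following sketch implements the standard multi-rater, unweighted form of Gwet's AC₁: observed agreement is the average fraction of agreeing rater pairs per item, and chance agreement is Σ π(1 − π) / (Q − 1) over the Q categories. The function name and input layout are assumptions, and the ratings in the example are toy values, not the study's data; the authors' exact computation is not shown on this page.

```python
def gwet_ac1(ratings, categories):
    """Gwet's AC1 for multiple raters on a nominal scale.

    ratings: list of items, each a list with one category label per rater.
    categories: list of all possible category labels (at least two).
    """
    q = len(categories)
    if q < 2:
        raise ValueError("need at least two categories")
    if not ratings:
        raise ValueError("need at least one rated item")

    agreement_sum = 0.0
    prevalence = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)
        if r < 2:
            raise ValueError("each item needs at least two ratings")
        counts = {c: item.count(c) for c in categories}
        # Fraction of rater pairs that agree on this item.
        agreement_sum += sum(k * (k - 1) for k in counts.values()) / (r * (r - 1))
        for c in categories:
            prevalence[c] += counts[c] / r

    n = len(ratings)
    p_a = agreement_sum / n                      # observed agreement
    pi = [v / n for v in prevalence.values()]    # mean category prevalence
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)  # chance agreement
    return (p_a - p_e) / (1 - p_e)


# Toy example: four raters, 4-point scale, two items.
toy_ratings = [[4, 4, 4, 3], [4, 4, 4, 4]]
toy_ac1 = gwet_ac1(toy_ratings, categories=[1, 2, 3, 4])
print(round(toy_ac1, 4))
```

This form treats the 4-point scale as nominal, so a 3-versus-4 disagreement counts the same as a 1-versus-4 disagreement.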

Dataset

PatientSim is built on three complementary MIMIC resources, with 170 patient profiles spanning five emergency department diagnoses.

Dataset | Version | Content
MIMIC-IV | v3.1 | Structured inpatient records
MIMIC-IV-ED | v2.2 | Emergency department records
MIMIC-IV-Note | v2.2 | Clinical narrative notes

Download Dataset

De-identified patient profiles are available on PhysioNet under a credentialed access policy.


Citation

@inproceedings{kyung2025patientsim,
  title     = {PatientSim: A Persona-Driven Simulator for
               Realistic Doctor-Patient Interactions},
  author    = {Kyung, Daeun and Chung, Hyunseung and Bae, Seongsu
               and Kim, Jiho and Sohn, Jae Ho and Kim, Taerim
               and Kim, Soo Kyung and Choi, Edward},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year      = {2025}
}