Harvard study: OpenAI's o1 beats ER doctors at triage diagnosis, 67% vs 50-55%
OpenAI's o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors
A Harvard trial published in Science pitted OpenAI’s o1 reasoning model against emergency department physicians using identical electronic health records, and the model identified the correct or near-correct diagnosis in 67% of 76 Boston ER cases versus 50-55% for the physician pairs. With richer chart data, o1 hit 82% accuracy against 70-79% for clinicians, and on longer-term treatment planning across five case studies it scored 89% versus 34% for 46 doctors using conventional tools like search engines. In one example, o1 connected a patient’s lupus history to lung inflammation that doctors had misattributed to failing anticoagulants.
The authors are careful about scope: the test covered only text-based patient data, excluding visual cues, distress signals, and bedside judgment, so the model effectively functioned as a paperwork-based second opinion rather than a replacement clinician. Lead authors Arjun Manrai and Adam Rodman frame the future as a ‘triadic care model’ of doctor, patient, and AI rather than physician displacement, a framing that tracks with adoption data showing roughly 20% of US physicians and 31% of UK doctors already using AI in clinical workflows.
The unresolved problems are accountability and automation bias. There is no formal liability framework for when an AI-assisted diagnosis goes wrong, and one commenting researcher flagged signs that doctors may already be deferring to model output instead of reasoning independently. The study also doesn’t break down where o1 underperforms: elderly patients, non-English speakers, and other subgroups remain unexamined, so this is a strong second-opinion signal, not a green light for unsupervised clinical deployment.