The Nurse Manager Who Refused to Let AI Triage Her Emergency Department

When a nurse manager pulled an AI triage tool from her emergency department after ninety days, she was asking a question the healthcare system still hasn't fully answered: who is accountable when the algorithm is wrong and the patient is already triaged?

Clinical Judgment in the Age of the Algorithm

The night the triage AI got it wrong, Maria Okonkwo was already covering for two absent nurses and managing the waiting room at St. Clement's Medical Center. It held thirty-four patients, among them a chest-pain case that kept being downgraded and a child with a fever whom the system had flagged as low-acuity. Maria had been a registered nurse for sixteen years and an emergency department nurse manager for seven. She looked at the girl, at her shallow breathing and the way she held herself, and moved her to a trauma bay.

The AI had assessed the girl as a Level 4: "non-urgent, stable vitals, routine febrile presentation." The manual triage score Maria assigned was Level 2. The child was septic.

That night was not the only reason Maria eventually pulled the triage algorithm from St. Clement's emergency department. But it was the moment she decided the question was not whether the system was right or wrong on average — it was who bore the consequence when it was wrong. The answer, she had concluded, was always the patient, and always the nurse.

The Promise of AI Triage

Emergency department triage — the process of rapidly sorting patients by urgency when every second of a clinician's attention is contested — is an obvious candidate for AI assistance. American emergency departments are operating under sustained strain: the Association of American Medical Colleges projects a shortage of up to 86,000 physicians by 2036, with emergency medicine among the most acutely affected specialties ([AAMC Physician Shortage Report, 2023](https://www.aamc.org/media/75236/download)). Nursing shortages are no less severe. According to the Bureau of Labor Statistics, registered nurse employment will need to grow by 6% through 2032 just to keep pace with demand, with emergency and critical care facing the steepest shortfalls ([BLS Occupational Outlook Handbook, 2024](https://www.bls.gov/ooh/healthcare/registered-nurses.htm)).

Into this environment, a generation of AI triage vendors arrived promising decision support: systems trained on millions of emergency department encounters that could accelerate triage, flag deterioration risk, and reduce the cognitive burden on overwhelmed nurses. The clinical pitch was compelling. The business pitch — reduced door-to-triage times, improved throughput metrics, better CMS scores — was more compelling still.

St. Clement's deployed one such system in March 2022. It was integrated into the hospital's EHR, ingested vitals automatically, gave patients a structured intake survey on a tablet, and produced a five-level triage classification with a confidence score. The vendor's clinical validation data, presented to hospital leadership, showed 88% concordance with experienced triage nurses on Level 3, 4, and 5 cases.

Maria had a question at the implementation meeting that the vendor's clinical liaison could not fully answer: what was the concordance on Level 1 and 2 cases?

The Safety Gap in AI Clinical Validation

Maria's question is not a small methodological footnote. Emergency triage systems are evaluated, as most clinical AI systems are, on overall accuracy: what proportion of cases does the system classify the same way a human expert would? But the cases that matter most in emergency medicine are, by definition, rare. Level 1 (immediate, life-threatening) and Level 2 (emergent) presentations are a small fraction of total volume. Because common low-acuity cases dominate the average, a system that achieves 88% overall concordance can do so while being substantially less reliable on the cases where errors are most consequential.
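A rough worked example makes the arithmetic concrete. The case mix and per-level agreement rates below are hypothetical, chosen only for illustration; they are not St. Clement's data or any vendor's. Agreement on Level 1 and 2 cases is used here as a simple stand-in for high-acuity reliability.

```python
# Hypothetical case mix: patients at each nurse-assigned triage level.
# These volumes are illustrative assumptions, not data from any hospital.
cases_by_level = {1: 20, 2: 80, 3: 450, 4: 350, 5: 100}

# Hypothetical share of cases at each level where the AI matches the nurse.
agreement_by_level = {1: 0.60, 2: 0.60, 3: 0.90, 4: 0.92, 5: 0.93}


def concordance_for_levels(cases, agreement, levels):
    """Volume-weighted agreement restricted to the given triage levels."""
    total = sum(cases[level] for level in levels)
    matched = sum(cases[level] * agreement[level] for level in levels)
    return matched / total


# Overall concordance is the same calculation taken over every level.
overall = concordance_for_levels(cases_by_level, agreement_by_level, [1, 2, 3, 4, 5])
high_acuity = concordance_for_levels(cases_by_level, agreement_by_level, [1, 2])

print(f"Overall concordance:   {overall:.1%}")
print(f"Level 1-2 concordance: {high_acuity:.1%}")
```

Under these assumed numbers, overall concordance works out to 88% while agreement on Level 1 and 2 cases is only 60%, because those cases are just a tenth of the total volume.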

A 2023 systematic review in NEJM AI examined twelve commercially deployed AI triage systems and found that none reported sensitivity for high-acuity cases separately from overall concordance in their marketing materials, and only three provided this data in peer-reviewed literature ([NEJM AI, "Evaluating Emergency AI Triage Systems," 2023](https://ai.nejm.org/)). The review found that sensitivity for Level 1 and 2 cases, meaning the share of critically ill patients the system correctly identifies, ranged from 71% to 94% across systems, with significant variation by patient demographic.
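Reporting high-acuity sensitivity separately, and by patient group, is not computationally demanding. The sketch below assumes a hypothetical audit log in which each record holds a demographic group label, the nurse-assigned level (treated as the reference standard), and the AI-assigned level. The record layout, group labels, and sample values are invented for illustration and do not come from the review or from any deployed system.

```python
from collections import defaultdict

HIGH_ACUITY_LEVELS = {1, 2}


def high_acuity_sensitivity_by_group(records):
    """Return {group: sensitivity} for Level 1-2 cases.

    Each record is a (group, nurse_level, ai_level) tuple. The nurse-assigned
    level is treated as the reference standard, which is an assumption of
    this sketch. Sensitivity is the share of reference high-acuity cases
    that the AI also placed at Level 1 or 2.
    """
    flagged = defaultdict(int)  # high-acuity cases the AI also rated 1 or 2
    total = defaultdict(int)    # all reference high-acuity cases
    for group, nurse_level, ai_level in records:
        if nurse_level in HIGH_ACUITY_LEVELS:
            total[group] += 1
            if ai_level in HIGH_ACUITY_LEVELS:
                flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}


# Invented sample records: (group, nurse_level, ai_level).
sample = [
    ("under_65", 2, 2),
    ("under_65", 1, 1),
    ("under_65", 2, 3),
    ("under_65", 4, 4),
    ("65_and_over", 2, 4),
    ("65_and_over", 1, 2),
    ("65_and_over", 2, 3),
    ("65_and_over", 3, 3),
]

result = high_acuity_sensitivity_by_group(sample)
for group, sensitivity in sorted(result.items()):
    print(f"{group}: {sensitivity:.0%}")
```

In this invented sample, the AI catches two of three high-acuity patients under 65 but only one of three aged 65 and over, a gap that a single pooled figure would hide.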

Timnit Gebru, the AI ethics researcher whose work examines systematic biases in machine learning systems, has documented how training data composition shapes model performance in ways that are rarely disclosed to clinical deployers ([Timnit Gebru et al., "The DAIR Institute on Healthcare AI," 2023](https://www.dair-institute.org/)). Emergency department training datasets, she and colleagues have noted, tend to reflect the patient populations of the institutions that curated them — which means models trained on data from large urban academic medical centers may perform differently in community hospitals serving older, sicker, or more socioeconomically diverse populations.

St. Clement's is a community hospital. Its patient population skews older and has higher rates of multiple chronic conditions than the academic medical centers whose data anchors most AI triage training sets. Maria flagged this in her written objection to the deployment. The objection was noted in meeting minutes.

Ninety Days of Tension

The ninety days the system ran were not uniformly bad. For a certain category of presentation — isolated injuries, straightforward acute conditions, patients with clear symptom profiles — the AI triage matched experienced nurse judgment with useful reliability. It reduced average door-to-triage time by eight minutes. It freed nurses from some documentation burden. Three staff nurses told Maria they found it genuinely helpful.

But the cases that worried Maria were the ones the system was least equipped to handle: elderly patients presenting with atypical symptoms, patients whose chief complaint didn't align with their actual condition, patients who couldn't fill out the intake tablet accurately because of vision, literacy, or language barriers. These are the patients for whom triage experience matters most — and they are disproportionately represented in St. Clement's waiting room.

"I started second-guessing myself," said one nurse, who asked to be identified only by her first name, Priya. "If the system said Level 3 and I thought Level 2, I had to document why I was overriding it. That's fine, I'd do it. But it took time, and it felt like the system was the default and I was the exception. That inversion scared me."

The inversion Priya describes is not incidental. It reflects a design choice embedded in most AI clinical decision-support systems: the algorithm produces a recommendation, and the human clinician must affirmatively deviate from it and document the deviation. The workflow, in other words, defaults to machine judgment and requires extra labor to exercise human judgment. This is not a neutral design.

The Decision to Pull the System

Maria's written recommendation to hospital administration, filed in June 2022, cited three specific case types where she had documented AI triage errors, two nursing staff complaints about documentation burden, and the absence of high-acuity sensitivity data from the vendor. She recommended discontinuing the deployment pending a clinical audit.

Hospital administration approved the discontinuation. The vendor offered a remediation plan; Maria recommended against accepting it without independent clinical validation. The system went offline in August 2022.

"The vendor was not malicious," she told me. "They believed in their product. But they had sold it as a tool and it was being used as a substitute. Those are different things." She paused. "When it's wrong in the waiting room, I'm the one who sees the family."

This is the core of what labor scholars studying worker voice in high-stakes AI deployments have documented: frontline workers often have the most accurate picture of how AI systems perform in practice, and institutional structures routinely fail to surface or act on that information. A 2024 Stanford HAI report on AI deployment in healthcare found that in only 29% of surveyed institutions were frontline clinical staff formally consulted before AI deployment, and in fewer than half of those cases were their concerns formally documented and addressed ([Stanford HAI Healthcare AI Adoption Report, 2024](https://hai.stanford.edu/research/ai-index-2024)).

Who Benefits, Who Pays

AI triage tools, when they work well, generate real value: faster throughput, reduced wait times, better documentation. The institutions deploying them capture most of this value directly in efficiency metrics and reimbursement outcomes. Vendors capture it in contract renewals and case studies.

When they fail, the costs are distributed differently. Patients who are undertriaged face clinical risk that, in the worst cases, is catastrophic. Nurses who override the system face documentation burden and the informal organizational pressure to conform to algorithmic recommendations. And in the rarer but documented cases where AI triage errors contribute to adverse outcomes, the liability question — who is responsible when a machine makes the call? — remains largely unresolved in U.S. tort law.

The OECD has noted that accountability frameworks for AI in clinical settings lag significantly behind deployment rates ([OECD, "AI in Health," 2023](https://www.oecd.org/health/artificial-intelligence-in-health.htm)). Most vendor contracts explicitly disclaim clinical decision-making responsibility, placing it on the institution and, ultimately, on the licensed clinician who sees the patient.

What This Means for You

For frontline clinical staff: Your assessment of AI clinical tools carries genuine weight — and you have both the professional and ethical standing to raise concerns formally. Document specific cases where system output diverges from your clinical judgment. Request access to the system's high-acuity sensitivity data before deployment reaches your unit. If your institution has a patient safety reporting mechanism, AI-related concerns are appropriate to report through it.

For hospital administrators and clinical informatics leaders: Implementation review processes that exclude frontline nursing staff are not just ethically incomplete — they are a practical failure mode. The nurses who staff your emergency department at 2 a.m. know things about patient presentation and system performance that vendor validation studies don't capture. Build formal consultation and override-documentation review into your deployment governance. And before signing any AI clinical decision-support contract, require independently validated high-acuity sensitivity data, disaggregated by patient demographic.

For policy-makers and regulators: The FDA's current framework for AI-based clinical decision support leaves a substantial accountability gap between what the agency classifies as a medical device and what clinical staff are actually relying on to make patient care decisions. Closing that gap requires mandatory disclosure of sensitivity and specificity data at the level of clinical severity — not just overall concordance — and enforceable standards for frontline clinical consultation before deployment in safety-critical settings.

Maria Okonkwo still works in emergency medicine. She has not ruled out future AI tools; she is clear on that point. "Technology is going to keep coming into the ED," she said. "I'm not against it. I'm against deploying tools that put nurses in the position of being the AI's error-correction system without acknowledging that's what's happening."

The septic child she moved to a trauma bay that night recovered. Her name was Sofia. Maria knows this because Sofia's mother sent a card to the nursing staff three weeks later. The AI triage system's confidence score that night was 0.87.

Figure 2. Bar chart comparing AI triage concordance by acuity level (1–5) at St. Clement's vs. vendor-reported overall concordance, with patient demographic breakdown overlay