RABBITS
Robust Assessment of Biomedical Benchmarks Involving drug Term Substitutions
The Challenge
Large language models (LLMs) are becoming integral to medical data processing and decision-making. However, medical knowledge is highly context-dependent, and patients and clinicians often switch between brand names (e.g., Tylenol) and generic names (e.g., acetaminophen) for the same drug. RABBITS evaluates how robust LLMs are to these common substitutions.
We found that even state-of-the-art models such as GPT-4 and Llama-3 suffer significant accuracy drops (1-10%) when drug names are swapped, highlighting a critical fragility in current AI systems for healthcare.
Methodology
🩺 Dataset Creation
Using RxNorm, we mapped 2,271 generic drugs to 6,961 brand names to generate a comprehensive list of synonyms.
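For illustration, brand-generic pairs of this kind can be retrieved from RxNorm through the National Library of Medicine's public RxNav REST API. The sketch below is a minimal example of that lookup, assuming the standard RxNav endpoints; `get_brand_names` is a helper name introduced here, and this is not the paper's exact extraction pipeline.

```python
import requests

RXNAV = "https://rxnav.nlm.nih.gov/REST"

def get_brand_names(generic_name: str) -> list[str]:
    """Look up brand names (RxNorm term type BN) for a generic drug via RxNav."""
    # Resolve the generic name to an RxNorm concept identifier (RxCUI).
    resp = requests.get(f"{RXNAV}/rxcui.json", params={"name": generic_name})
    resp.raise_for_status()
    rxcuis = resp.json().get("idGroup", {}).get("rxnormId", [])
    if not rxcuis:
        return []
    # Fetch related concepts restricted to term type BN (brand name).
    resp = requests.get(f"{RXNAV}/rxcui/{rxcuis[0]}/related.json", params={"tty": "BN"})
    resp.raise_for_status()
    groups = resp.json().get("relatedGroup", {}).get("conceptGroup", [])
    return [c["name"] for g in groups for c in g.get("conceptProperties", [])]

print(get_brand_names("acetaminophen"))  # expected to include 'Tylenol'
```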
🔄 Data Transformation
We created swapped versions of MedQA and MedMCQA: one converting brands to generics, and another converting generics to brands.
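To make the transformation concrete, the swap can be implemented as word-boundary regex replacement over a generic-to-brand mapping, so drug names embedded inside longer words are left untouched. This is a minimal sketch with a toy three-entry mapping; `swap_drug_names` is a hypothetical helper, not the released code.

```python
import re

# Toy generic -> brand mapping; the full mapping comes from RxNorm.
GENERIC_TO_BRAND = {
    "acetaminophen": "Tylenol",
    "ibuprofen": "Advil",
    "atorvastatin": "Lipitor",
}

def swap_drug_names(text: str, mapping: dict[str, str]) -> str:
    """Replace each drug name with its counterpart, matching whole words only."""
    # Longest-first ordering avoids partial matches when names overlap.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

question = "A patient takes acetaminophen 500 mg for fever."
print(swap_drug_names(question, GENERIC_TO_BRAND))
# -> "A patient takes Tylenol 500 mg for fever."
```

Reversing the mapping gives the brand-to-generic direction.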
👨‍⚕️ Expert Review
Physician experts rigorously reviewed the transformed datasets to ensure medical accuracy and contextual consistency.
📉 Zero-Shot Evaluation
We evaluated a range of open-source and API-based LLMs in a zero-shot setting using the EleutherAI lm-evaluation-harness to measure their robustness to the swaps; a sketch of such a run follows.
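For reference, a zero-shot harness run can be driven from Python roughly as follows. This is a hedged sketch: `lm_eval.simple_evaluate` exists in recent versions of the harness, but the task name `medqa_brand_to_generic` is a placeholder for whatever custom swapped-dataset tasks are registered, and the model checkpoint is just an example.

```python
import lm_eval

# Zero-shot evaluation of an open-source model on a hypothetical swapped task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",
    tasks=["medqa_brand_to_generic"],  # placeholder custom task name
    num_fewshot=0,                     # zero-shot, matching the setup above
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```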
Key Findings
Fragility to Naming
Models can answer the same question differently depending solely on whether a drug is referred to by its brand or its generic name.
Performance Drop
We observed a consistent drop in accuracy ranging from 1% to 10% across benchmarks.
Data Contamination
Over 90% of MedQA questions appeared in open pretraining corpora such as Dolma, inflating baseline metrics.
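Overlap of this kind is typically detected by scanning a pretraining corpus for long word n-grams shared with benchmark questions. The sketch below illustrates the general idea on toy strings; it is not the tooling actually used to search Dolma, and the 8-gram threshold is an illustrative choice.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_doc: str, n: int = 8) -> bool:
    """Flag a question if it shares any long word n-gram with a training document."""
    return bool(ngrams(question, n) & ngrams(corpus_doc, n))

q = "A 45-year-old man presents with chest pain radiating to the left arm"
doc = "practice exam: a 45-year-old man presents with chest pain radiating to the left arm"
print(is_contaminated(q, doc))  # True: the question text appears verbatim
```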
Memorization > Reasoning
Larger models (e.g., Llama-3-70B) showed greater drops, suggesting they rely more on memorized patterns than on semantic understanding.