RABBITS
Robust Assessment of Biomedical Benchmarks Involving drug Term Substitutions
The Challenge
Large language models (LLMs) are becoming integral to medical data processing and decision-making. However, medical knowledge is highly context-dependent, and patients and clinicians often switch between brand names (e.g., Tylenol) and generic names (e.g., acetaminophen) for the same drug. RABBITS evaluates how robust LLMs are to these common substitutions.
We found that even state-of-the-art models such as GPT-4 and Llama-3 suffer significant accuracy drops (1-10%) when drug names are swapped, highlighting a critical fragility in current AI systems for healthcare.
Methodology
🩺 Dataset Creation
Using RxNorm, we mapped 2,271 generic drugs to 6,961 brand names to generate a comprehensive list of synonyms.
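For illustration, brand-generic pairs of this kind can be retrieved from RxNorm through the National Library of Medicine's public RxNav REST API. The sketch below is a minimal example of that lookup, assuming the standard RxNav endpoints; `get_brand_names` is a helper name introduced here, and this is not the paper's exact extraction pipeline.

```python
import requests

RXNAV = "https://rxnav.nlm.nih.gov/REST"

def get_brand_names(generic_name: str) -> list[str]:
    """Look up brand names (RxNorm term type BN) for a generic drug via RxNav."""
    # Resolve the generic name to an RxNorm concept identifier (RxCUI).
    resp = requests.get(f"{RXNAV}/rxcui.json", params={"name": generic_name})
    resp.raise_for_status()
    rxcuis = resp.json().get("idGroup", {}).get("rxnormId", [])
    if not rxcuis:
        return []
    # Fetch related concepts restricted to term type BN (brand name).
    resp = requests.get(f"{RXNAV}/rxcui/{rxcuis[0]}/related.json", params={"tty": "BN"})
    resp.raise_for_status()
    groups = resp.json().get("relatedGroup", {}).get("conceptGroup", [])
    return [c["name"] for g in groups for c in g.get("conceptProperties", [])]

print(get_brand_names("acetaminophen"))  # expected to include 'Tylenol'
```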
🔄 Data Transformation
We created swapped versions of MedQA and MedMCQA: one converting brands to generics, and another converting generics to brands.
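To make the transformation concrete, the swap can be implemented as word-boundary regex replacement over a generic-to-brand mapping, so drug names embedded inside longer words are left untouched. This is a minimal sketch with a toy three-entry mapping; `swap_drug_names` is a hypothetical helper, not the released code.

```python
import re

# Toy generic -> brand mapping; the full mapping comes from RxNorm.
GENERIC_TO_BRAND = {
    "acetaminophen": "Tylenol",
    "ibuprofen": "Advil",
    "atorvastatin": "Lipitor",
}

def swap_drug_names(text: str, mapping: dict[str, str]) -> str:
    """Replace each drug name with its counterpart, matching whole words only."""
    # Longest-first ordering avoids partial matches when names overlap.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

question = "A patient takes acetaminophen 500 mg for fever."
print(swap_drug_names(question, GENERIC_TO_BRAND))
# -> "A patient takes Tylenol 500 mg for fever."
```

Reversing the mapping gives the brand-to-generic direction.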
👨‍⚕️ Expert Review
Physician experts rigorously reviewed the transformed datasets to ensure medical accuracy and contextual consistency.
📉 Zero-Shot Evaluation
We evaluated a range of open-source and API-based LLMs in a zero-shot setting using the EleutherAI lm-evaluation-harness to measure their robustness to the swaps; a sketch of such a run follows.
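For reference, a zero-shot harness run can be driven from Python roughly as follows. This is a hedged sketch: `lm_eval.simple_evaluate` exists in recent versions of the harness, but the task name `medqa_brand_to_generic` is a placeholder for whatever custom swapped-dataset tasks are registered, and the model checkpoint is just an example.

```python
import lm_eval

# Zero-shot evaluation of an open-source model on a hypothetical swapped task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",
    tasks=["medqa_brand_to_generic"],  # placeholder custom task name
    num_fewshot=0,                     # zero-shot, matching the setup above
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```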
Key Findings
Fragility to Naming
Models can answer the same question differently depending solely on whether a drug is referred to by its brand or its generic name.
Performance Drop
We observed a consistent drop in accuracy ranging from 1% to 10% across benchmarks.
Data Contamination
Over 90% of MedQA questions appeared in open pretraining corpora such as Dolma, inflating baseline metrics.
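Overlap of this kind is typically detected by scanning a pretraining corpus for long word n-grams shared with benchmark questions. The sketch below illustrates the general idea on toy strings; it is not the tooling actually used to search Dolma, and the 8-gram threshold is an illustrative choice.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_doc: str, n: int = 8) -> bool:
    """Flag a question if it shares any long word n-gram with a training document."""
    return bool(ngrams(question, n) & ngrams(corpus_doc, n))

q = "A 45-year-old man presents with chest pain radiating to the left arm"
doc = "practice exam: a 45-year-old man presents with chest pain radiating to the left arm"
print(is_contaminated(q, doc))  # True: the question text appears verbatim
```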
Memorization > Reasoning
Larger models (e.g., Llama-3-70B) showed greater drops, suggesting they rely more on memorized patterns than on semantic understanding.