Cross-Care
Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
Overview
Large language models (LLMs) are essential for processing medical data, but they often inherit biases from their training data. Cross-Care is the first open-source, large-scale framework specifically designed to uncover and benchmark these potential biases in how LLMs represent disease prevalence across diverse demographic groups.
We found significant misalignment between how LLMs represent disease prevalence and real-world data, highlighting a risk of bias propagation in medical applications.
Key Features
Large-Scale Analysis
We analyzed over 1 TB of text (more than 1 trillion tokens) from large pre-training corpora such as RedPajama and The Pile to track disease co-occurrence patterns.
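As a rough illustration of what co-occurrence tracking involves, the sketch below counts how often a disease term appears near a demographic term within a fixed token window. The term lists, window size, and function name are hypothetical placeholders, not Cross-Care's actual vocabularies or pipeline.

from collections import Counter
import re

# Placeholder term lists; the project's actual disease and demographic
# vocabularies are larger and maintained in its repository.
DISEASES = {"asthma", "diabetes", "hypertension"}
DEMOGRAPHICS = {"black", "white", "asian", "hispanic"}
WINDOW = 250  # co-occurrence window in tokens (illustrative choice)

def count_cooccurrences(documents):
    """Count how often each (disease, demographic) pair occurs within WINDOW tokens."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok in DISEASES:
                window = set(tokens[max(0, i - WINDOW): i + WINDOW + 1])
                for group in DEMOGRAPHICS & window:
                    counts[(tok, group)] += 1
    return counts

# Example on a single toy document.
print(count_cooccurrences(["A Black patient presented with poorly controlled asthma."]))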
Demographic Representation
We evaluated the representation of race and gender across 89 clinical terms to identify potential representational harms.
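One simple way to compare representation across groups is to normalize raw co-occurrence counts into per-disease shares. The helper below is a hypothetical sketch of that step, not the project's actual code.

def group_shares(counts):
    """Return {disease: {group: share}} with shares summing to 1 per disease."""
    by_disease = {}
    for (disease, group), n in counts.items():
        by_disease.setdefault(disease, {})[group] = n
    return {
        disease: {group: n / sum(groups.values()) for group, n in groups.items()}
        for disease, groups in by_disease.items()
    }

print(group_shares({("asthma", "black"): 30, ("asthma", "white"): 70}))
# -> {'asthma': {'black': 0.3, 'white': 0.7}}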
Real-World Benchmarking
We juxtaposed LLM biases against actual U.S. disease prevalence rates to quantify discrepancies and grounding issues.
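A minimal sketch of such a comparison, assuming we already have per-group values for one disease: rank-correlate the model-implied distribution with real prevalence. The numbers below are placeholders, not actual prevalence data or model outputs.

from scipy.stats import spearmanr

# Placeholder values for one disease: each group's share of co-occurrence
# mentions vs. that group's real-world prevalence rate, in matching group order.
model_implied = [0.42, 0.31, 0.18, 0.09]
real_prevalence = [0.28, 0.34, 0.22, 0.16]

rho, p_value = spearmanr(model_implied, real_prevalence)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")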
Smart SRO Generation
Our framework uses Subject-Relation-Object (SRO) generation to create benchmarks that mirror real-world clinical scenarios.
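To illustrate the idea of SRO generation, the sketch below combines demographic subjects, relation phrasings, and disease objects into templated sentences. The specific subjects, relations, and objects shown are illustrative assumptions; the actual templates and term lists live in the Cross-Care repository.

from itertools import product

SUBJECTS = ["Black patients", "White patients", "Hispanic patients"]
RELATIONS = ["have", "are diagnosed with", "suffer from"]
OBJECTS = ["asthma", "type 2 diabetes", "hypertension"]

def generate_sro_prompts():
    """Yield one sentence per (subject, relation, object) combination."""
    for subject, relation, obj in product(SUBJECTS, RELATIONS, OBJECTS):
        yield f"{subject} {relation} {obj}."

for prompt in generate_sro_prompts():
    print(prompt)  # e.g., "Black patients are diagnosed with asthma."

Each generated sentence can then be scored (for example, by sequence log-likelihood under the model being evaluated) and compared across demographic subjects for the same disease.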
Key Findings
Misalignment
There is a substantial misalignment between LLM representations and real-world disease prevalence across demographic subgroups.
Bias Propagation
This discrepancy indicates a pronounced risk of propagating biases if LLMs are used for clinical decision-making without correction.