Cross-Care
Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
Overview
Large language models (LLMs) are essential for processing medical data, but they often inherit biases from their training data. Cross-Care is the first open-source, large-scale framework specifically designed to uncover and benchmark these potential biases in how LLMs represent disease prevalence across diverse demographic groups.
We found significant misalignment between how LLMs represent disease prevalence and real-world data, highlighting a risk of bias propagation in medical applications.
Key Features
Large-Scale Analysis
We analyzed over 1 TB of text (more than 1 trillion tokens) from large pre-training corpora such as RedPajama and The Pile to track disease co-occurrence patterns.
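As a rough illustration of what co-occurrence tracking involves, the sketch below counts how often a disease term appears near a demographic term within a fixed token window. The term lists, window size, and function name are hypothetical placeholders, not Cross-Care's actual vocabularies or pipeline.

from collections import Counter
import re

# Placeholder term lists; the project's actual disease and demographic
# vocabularies are larger and maintained in its repository.
DISEASES = {"asthma", "diabetes", "hypertension"}
DEMOGRAPHICS = {"black", "white", "asian", "hispanic"}
WINDOW = 250  # co-occurrence window in tokens (illustrative choice)

def count_cooccurrences(documents):
    """Count how often each (disease, demographic) pair occurs within WINDOW tokens."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok in DISEASES:
                window = set(tokens[max(0, i - WINDOW): i + WINDOW + 1])
                for group in DEMOGRAPHICS & window:
                    counts[(tok, group)] += 1
    return counts

# Example on a single toy document.
print(count_cooccurrences(["A Black patient presented with poorly controlled asthma."]))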
Demographic Representation
We evaluated the representation of race and gender across 89 clinical terms to identify potential representational harms.
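One simple way to compare representation across groups is to normalize raw co-occurrence counts into per-disease shares. The helper below is a hypothetical sketch of that step, not the project's actual code.

def group_shares(counts):
    """Return {disease: {group: share}} with shares summing to 1 per disease."""
    by_disease = {}
    for (disease, group), n in counts.items():
        by_disease.setdefault(disease, {})[group] = n
    return {
        disease: {group: n / sum(groups.values()) for group, n in groups.items()}
        for disease, groups in by_disease.items()
    }

print(group_shares({("asthma", "black"): 30, ("asthma", "white"): 70}))
# -> {'asthma': {'black': 0.3, 'white': 0.7}}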
Real-World Benchmarking
We juxtaposed LLM biases against actual U.S. disease prevalence rates to quantify discrepancies and grounding issues.
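A minimal sketch of such a comparison, assuming we already have per-group values for one disease: rank-correlate the model-implied distribution with real prevalence. The numbers below are placeholders, not actual prevalence data or model outputs.

from scipy.stats import spearmanr

# Placeholder values for one disease: each group's share of co-occurrence
# mentions vs. that group's real-world prevalence rate, in matching group order.
model_implied = [0.42, 0.31, 0.18, 0.09]
real_prevalence = [0.28, 0.34, 0.22, 0.16]

rho, p_value = spearmanr(model_implied, real_prevalence)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")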
Smart SRO Generation
Our framework uses Subject-Relation-Object (SRO) generation to create benchmarks that mirror real-world clinical scenarios.
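To illustrate the idea of SRO generation, the sketch below combines demographic subjects, relation phrasings, and disease objects into templated sentences. The specific subjects, relations, and objects shown are illustrative assumptions; the actual templates and term lists live in the Cross-Care repository.

from itertools import product

SUBJECTS = ["Black patients", "White patients", "Hispanic patients"]
RELATIONS = ["have", "are diagnosed with", "suffer from"]
OBJECTS = ["asthma", "type 2 diabetes", "hypertension"]

def generate_sro_prompts():
    """Yield one sentence per (subject, relation, object) combination."""
    for subject, relation, obj in product(SUBJECTS, RELATIONS, OBJECTS):
        yield f"{subject} {relation} {obj}."

for prompt in generate_sro_prompts():
    print(prompt)  # e.g., "Black patients are diagnosed with asthma."

Each generated sentence can then be scored (for example, by sequence log-likelihood under the model being evaluated) and compared across demographic subjects for the same disease.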
Key Findings
Misalignment
There is a substantial misalignment between LLM representations and real-world disease prevalence across demographic subgroups.
Bias Propagation
This discrepancy indicates a pronounced risk of propagating biases if LLMs are used for clinical decision-making without correction.