RESEARCH PROJECTS

Our Projects

Explore our research projects with interactive demos, papers, code, and data.

Proof of Time

Evaluating Scientific Idea Judgments

A semi-verifiable benchmarking framework that assesses scientific idea judgments via time-partitioned evaluation.

Bingyang Ye*, Shan Chen*, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman

2025

RABBITS

Robustness of Biomedical Benchmarks to Drug Term Substitutions

Evaluating how language models handle variability in drug names (brand vs. generic), revealing performance drops attributable to memorization and data contamination.

Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman

2024

Cross-Care

Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

A benchmark framework that uncovers bias by measuring discrepancies between disease prevalence in LLM pre-training data and real-world demographic data.

Shan Chen, Jack Gallifant, Mingye Gao, Pedro Moreira, Nikolaj Munch, Ajay Muthukkumar, Arvind Rajan, Jaya Kolluri, Amelia Fiske, Janna Hastings, Hugo Aerts, Brian Anthony, Leo Anthony Celi, William G. La Cava, Danielle S. Bitterman

2024
