Explore our research projects with interactive demos, papers, code, and data.

A semi-verifiable benchmarking framework that uses time-partitioned evaluation to assess scientific idea judgments.
Bingyang Ye*, Shan Chen*, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman
2025

Evaluating how language models handle variability in drug names (brand vs. generic), revealing performance drops tied to memorization and data contamination.
Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman
2024

A benchmarking framework that measures discrepancies between disease prevalence in LLM pre-training data and real-world demographics to uncover biases.
Shan Chen, Jack Gallifant, Mingye Gao, Pedro Moreira, Nikolaj Munch, Ajay Muthukkumar, Arvind Rajan, Jaya Kolluri, Amelia Fiske, Janna Hastings, Hugo Aerts, Brian Anthony, Leo Anthony Celi, William G. La Cava, Danielle S. Bitterman
2024