Proof of Time
A semi-verifiable benchmarking framework for evaluating how well models judge scientific ideas.
Why Time Matters
Judging the quality of scientific ideas is hard. Current methods rely on immediate proxies, but true impact takes time to reveal itself. Proof of Time (PoT) addresses this by time-partitioning the evaluation: we freeze evidence before a cutoff, ask models to forecast outcomes, and score them once the future arrives.
Key Advantages
⏳ Time-Partitioned
Ground truth arrives naturally as time passes; no manual labeling is needed. Models are evaluated against verifiable future outcomes.
📈 Scalable
Evaluation scales automatically without exhaustive expert annotation. Over 30,000 instances spanning four domains.
🔬 Semi-Verifiable
Benchmarks link to real-world signals (citations, awards, leaderboard updates) that become observable post-cutoff.
How It Works
The PoT workflow: evidence is frozen at a cutoff, models forecast future outcomes, and ground truth arrives on its own, enabling scalable, verifiable evaluation.
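In code, the protocol reduces to a freeze, forecast, score loop. Below is a minimal Python sketch assuming binary outcomes and a hypothetical `model.predict` interface; `Instance`, `evidence`, and `outcome` are illustrative names, not taken from the PoT codebase.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Instance:
    evidence: str            # everything observable before the cutoff
    cutoff: date             # the evidence-freeze date
    outcome: Optional[bool]  # resolves naturally after the cutoff

def evaluate(model, instances) -> float:
    """Score forecasts only on instances whose ground truth has resolved."""
    resolved = [i for i in instances if i.outcome is not None]
    hits = sum(model.predict(i.evidence) == i.outcome for i in resolved)
    return hits / len(resolved)
```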
Task Families
Impact Prediction
Forecasting paper influence (citations) from limited cues. Models identify which papers will have higher impact.
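One concrete instantiation is pairwise: given two papers' pre-cutoff cues, pick the one that will accrue more citations, scored against counts observed after the cutoff. The tuple layout and `model.choose` below are assumptions for illustration, not the benchmark's actual format.

```python
def pairwise_accuracy(model, pairs) -> float:
    """pairs: (paper_a, paper_b, cites_a, cites_b) tuples; the papers
    expose only pre-cutoff cues, while the citation counts are
    measured after the cutoff."""
    hits = 0
    for paper_a, paper_b, cites_a, cites_b in pairs:
        pick = model.choose(paper_a, paper_b)   # returns "a" or "b"
        truth = "a" if cites_a > cites_b else "b"
        hits += pick == truth
    return hits / len(pairs)
```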
Scientific Value
Predicting peer-review awards. Can models align with expert judgments to predict Best Papers?
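Award prediction can be framed as picking the winner from a candidate pool and scoring top-1 accuracy against the committee's decision, which becomes public only after the cutoff. A hedged sketch; `model.pick_best` and the session tuples are illustrative.

```python
def award_accuracy(model, sessions) -> float:
    """sessions: (candidates, winner_id) pairs; winner_id is the Best
    Paper revealed only after the award announcement (post-cutoff)."""
    hits = sum(model.pick_best(candidates) == winner_id
               for candidates, winner_id in sessions)
    return hits / len(sessions)
```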
Research Evolution
Longitudinal reasoning about faculty trajectories: inferring how a researcher's focus shifts over time.
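One plausible scoring rule for this family is exact match on the predicted next focus: feed the model a researcher's pre-cutoff publication timeline and compare its prediction against the dominant topic of the post-cutoff papers. The framing, `model.predict_focus`, and the data layout below are assumptions, not the benchmark's specification.

```python
from collections import Counter

def focus_shift_accuracy(model, researchers) -> float:
    """researchers: (timeline, post_topics) pairs, where timeline lists
    (year, topic) entries published before the cutoff and post_topics
    are the topics of papers published after it."""
    hits = 0
    for timeline, post_topics in researchers:
        predicted = model.predict_focus(timeline)
        actual = Counter(post_topics).most_common(1)[0][0]  # dominant post-cutoff topic
        hits += predicted == actual
    return hits / len(researchers)
```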
Technological Frontier
Extrapolating benchmark progress (SOTA) and forecasting future leaderboard metrics.
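Frontier forecasting lends itself to numeric scoring: given the pre-cutoff SOTA trajectory, predict the leaderboard value at a future date and compare it with what is eventually recorded. A minimal sketch using mean absolute error; `model.extrapolate` and the task tuples are hypothetical.

```python
def forecast_mae(model, tasks) -> float:
    """tasks: (history, target_date, observed) tuples, where history is
    a list of (date, sota_value) points frozen at the cutoff and
    observed is the metric recorded once target_date has passed."""
    errors = [abs(model.extrapolate(history, target_date) - observed)
              for history, target_date, observed in tasks]
    return sum(errors) / len(errors)
```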
Key Findings
Core Results: Test-time compute scaling and zero-shot vs. agentic comparisons.
Do Agents Help? Agentic systems generally outperform zero-shot baselines on tasks requiring evidence exploration.
Scaling Benefits: Increasing interaction budgets yields large improvements for Claude models, while other models plateau.
Performance heatmap across different models and tasks at high message limits.