Google DeepMind has published two research papers demonstrating that Gemini Deep Think can solve professional-level research problems across mathematics, physics, and computer science.
The breakthrough extends far beyond the model's 90% score on the IMO-ProofBench Advanced test. Since reaching gold-medal standard at the International Mathematical Olympiad in July 2025, the system has progressed to tackling PhD-level exercises on DeepMind's internal FutureMath Basic benchmark.
The company built a specialized math research agent called Aletheia, powered by Gemini Deep Think mode. The system features a natural language verifier that identifies flaws in candidate solutions and enables iterative revision processes.
Unlike competition problems, research-level mathematics requires drawing on advanced techniques from a vast body of literature. Foundation models typically struggle because training data is scarce in advanced subjects, which leads to superficial understanding and hallucinations.
Key Technical Advances
Aletheia addresses these limitations through several mechanisms. The agent can admit when it fails to solve a problem, which saves researchers from reviewing unsound solutions. It also integrates Google Search and web browsing to navigate complex research literature, which helps it avoid spurious citations and computational errors.
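The verify-and-revise loop and the ability to admit failure can be illustrated with a minimal Python sketch. This is not Aletheia's actual implementation, which the article does not describe in detail. The names `solve_with_verifier`, `Verdict`, `toy_generate`, and `toy_verify` are hypothetical, and the toy generator and verifier are stand-ins for a model and a natural language verifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Result of checking one candidate (hypothetical structure)."""
    ok: bool
    critique: str = ""


def solve_with_verifier(
    problem: int,
    generate: Callable[[int, str], str],
    verify: Callable[[int, str], Verdict],
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate a candidate, verify it, and revise using the critique.

    Returns None when no candidate passes within max_rounds, which is
    the sketch's way of "admitting failure".
    """
    critique = ""
    for _ in range(max_rounds):
        candidate = generate(problem, critique)
        verdict = verify(problem, candidate)
        if verdict.ok:
            return candidate
        critique = verdict.critique  # feed the flaw back into the next attempt
    return None


# Toy stand-ins (assumptions): the "problem" is an integer n, and a
# solution is a string holding an integer x with x * x == n.
def toy_generate(problem: int, critique: str) -> str:
    if not critique:
        return "1"
    last_guess = int(critique.split()[-1])
    return str(last_guess + 1)


def toy_verify(problem: int, candidate: str) -> Verdict:
    guess = int(candidate)
    if guess * guess == problem:
        return Verdict(ok=True)
    return Verdict(ok=False, critique=f"square does not match, last guess {guess}")


if __name__ == "__main__":
    print(solve_with_verifier(9, toy_generate, toy_verify, max_rounds=5))   # 3
    print(solve_with_verifier(10, toy_generate, toy_verify, max_rounds=5))  # None
```

Returning `None` instead of the last unverified candidate is the design choice that matters here: a caller can tell a checked answer from an abandoned attempt.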
The system demonstrates that higher reasoning quality can be achieved with less inference-time compute than previous versions required. The January 2026 version significantly outperforms the July 2025 gold-medal version on Olympiad-level problems.
Performance scales consistently as inference-time compute increases. On PhD-level exercises, the model reaches approximately 38% accuracy, while Aletheia achieves 46% on the same benchmark.
The research represents a cross-disciplinary effort involving mathematicians, physicists, and computer scientists. Teams focused on solving open-ended challenges that mirror real scientific workflows rather than structured competition problems.
Google DeepMind plans to expand the system's capabilities across additional scientific domains. The company expects this scaling behavior to keep holding as models progress beyond current PhD-level performance thresholds.