Anthropic's Claude Opus 4.6 has become the first model to top 50% on the most rigorous AI benchmark available, Humanity's Last Exam, marking a historic milestone in AI capability, with Google's Gemini 3.1 Pro finishing just short of the mark.
The result confirms Anthropic's position at the top of the global model leaderboard as of April 2026, narrowly ahead of Google, xAI, and OpenAI. The benchmark consists of 3,000 expert-level questions spanning mathematics, physics, biology, philosophy, and computer science.
Anthropic CEO Dario Amodei called the result "a validation of our approach to Constitutional AI and careful scaling." The company's safety-first approach, once seen as a competitive disadvantage, now appears to be a key differentiator in producing reliable, high-capability models.
Analysts note that the gap between Anthropic and its nearest competitors has narrowed significantly: Google's Gemini 3.1 Pro scored 49.8%, xAI's Grok-4 scored 48.2%, and OpenAI's GPT-5 scored 47.1%, putting the top four models within roughly three percentage points of one another. The race is closer than ever, but Anthropic's consistency across benchmarks gives it the edge.
The practical implications are significant. Models scoring above 50% on Humanity's Last Exam are widely regarded as capable of expert-level reasoning in domains previously thought to require decades of human training, with legal analysis, medical diagnosis, and scientific research among the fields most directly affected.
Industry observers predict that the 75% threshold will be crossed within 18 months, based on current improvement trajectories. This has prompted urgent discussions in Brussels and Washington about regulatory frameworks for highly capable AI systems.
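For readers who want to sanity-check that prediction, the short Python sketch below computes the month-over-month improvement rate it implies, assuming roughly linear progress. The starting score is an illustrative stand-in, since the article reports only that the leading model sits just above 50%.

```python
# Back-of-the-envelope check on the "75% within 18 months" prediction.
# This is an illustrative linear extrapolation, not the analysts' actual
# model; current_score is an assumed stand-in for "just above 50%".

current_score = 50.5    # assumed frontier score today (article: above 50%)
target_score = 75.0     # threshold observers expect to be crossed
horizon_months = 18     # predicted time frame

# Improvement rate the prediction implies under a straight-line trend.
implied_rate = (target_score - current_score) / horizon_months
print(f"Implied gain: {implied_rate:.2f} points/month "
      f"({implied_rate * 12:.1f} points/year)")
# Output: Implied gain: 1.36 points/month (16.3 points/year)
```

Real capability curves are rarely linear, so the figure is a rough gauge of what the prediction assumes rather than a forecast in its own right.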