Google’s newest Gemini AI model surpasses OpenAI’s GPT-4o technology
Google’s New Gemini-Exp-1114 Model Shakes Up the AI Benchmarking Landscape
Chatbot Arena has emerged as a prominent open platform for crowd-sourced AI benchmarking. Over the past two years, OpenAI's models have dominated its rankings, consistently holding top positions across evaluations. While Google's Gemini and Anthropic's Claude models have posted strong results in certain categories, OpenAI has largely gone unchallenged at the top of the leaderboard.
Recently, Chatbot Arena unveiled an experimental model from Google, known as Gemini-Exp-1114. The newcomer collected more than 6,000 community votes over the past week, propelling it to a joint No. 1 ranking alongside OpenAI's latest model, ChatGPT-4o-latest (the September 3, 2024 version). Gemini's Arena score rose sharply in the process, climbing from 1301 to 1344 and overtaking OpenAI's o1-preview model in overall performance.
Key Achievements of Gemini-Exp-1114
According to data from Chatbot Arena, Gemini-Exp-1114 is currently leading the Vision leaderboard, and it has also achieved No. 1 rankings in the following categories:
- Math
- Creative Writing
- Longer Query
- Instruction Following
- Multi-turn Interactions
- Hard Prompts
In the coding category, the new model secured the No. 3 position, though it still performs impressively on Hard Prompts with Style Control. For context, OpenAI's o1-preview model continues to lead in both the coding and style-control rankings. The win-rate heatmap shows Gemini-Exp-1114 winning 50% of matchups against GPT-4o-latest, 56% against o1-preview, and 62% against Claude-3.5-Sonnet.
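For intuition on what these rating gaps imply, Chatbot Arena ratings sit on an Elo-style scale fitted with a Bradley-Terry model, where a rating difference maps to an expected head-to-head win rate. The sketch below illustrates that relationship using the standard Elo logistic (scale 400); the ratings are the illustrative figures from this article, and the Arena's actual fitting procedure (including style control) differs in its details.

```python
# Sketch: how Elo-style Arena ratings map to expected head-to-head win rates.
# A rating gap d implies an expected win probability of 1 / (1 + 10**(-d / 400)).
# Ratings below are the illustrative figures cited in this article; the live
# leaderboard numbers shift as new votes arrive.

def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo logistic."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

gemini_exp_1114 = 1344
previous_gemini = 1301

# A 43-point gap corresponds to roughly a 56% expected win rate.
print(f"{expected_win_rate(gemini_exp_1114, previous_gemini):.2%}")
```

Under this formula, the 43-point jump from 1301 to 1344 corresponds to roughly a 56% expected win rate over the prior Gemini release, in line with the scale of the head-to-head figures reported above.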
Recent Enhancements and Performance Metrics
In September, Google released updated Gemini 1.5 models, citing improvements of roughly 7% on MMLU-Pro and about 20% on the MATH and HiddenMath benchmarks, along with 2-7% gains across vision and code-related use cases. Google also emphasized improved overall helpfulness, noting that the new models tend to give more concise answers: default output length is now roughly 5-20% shorter than in their predecessors.
For those interested in exploring the results, detailed information on Gemini-Exp-1114 is available on the Chatbot Arena leaderboard. Developers are encouraged to try the model in Google AI Studio, with availability via API planned.
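For readers who want to experiment once API access arrives, the minimal sketch below shows how an experimental Gemini model is typically called through the google-generativeai Python SDK. The model ID "gemini-exp-1114" is an assumption here, since Google has not yet confirmed the identifier the API will expose.

```python
# Sketch: querying an experimental Gemini model via the google-generativeai
# Python SDK (pip install google-generativeai). Assumes the model is exposed
# under the ID "gemini-exp-1114" (an unconfirmed identifier at time of writing)
# and that you have an API key from Google AI Studio.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key created in Google AI Studio

model = genai.GenerativeModel("gemini-exp-1114")
response = model.generate_content("Write a haiku about leaderboards.")
print(response.text)
```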