A new AI coding benchmark from Datacurve suggests that the leading frontier models may not be as evenly matched as existing public leaderboards make them appear. For months, Scale AI’s SWE-Bench Pro leaderboard has shown OpenAI’s GPT-5 family, Anthropic’s Claude Opus, and Google’s Gemini Pro performing within a relatively close range. That made it difficult for enterprise buyers and engineering leaders to judge which AI coding agent would perform best inside real codebases.

Datacurve’s new benchmark, called DeepSWE, presents a much wider performance gap. The test includes 113 tasks across 91 open source repositories and five programming languages. On this benchmark, OpenAI’s GPT-5.5 led the field with a 70 percent score, 14 points ahead of GPT-5.4 and 16 points ahead of Claude Opus 4.7, the nearest model from another lab.

Datacurve co-author Serena Ge wrote on X that public leaderboards often make top models appear close in capability, while DeepSWE shows where they actually separate in developer work.

Datacurve said DeepSWE was designed to better reflect how developers assign real work to AI coding agents.

Most coding benchmarks, including the SWE-Bench family, build tasks from real GitHub commits. They take a bug fix or feature from a repository’s history, return the code to its earlier state, and ask an AI agent to recreate the fix. The original test suite then checks whether the agent’s patch works.

Datacurve argues that this system creates several problems. The first is contamination. Because the tasks come from public GitHub history, the original issue, discussion, and sometimes the exact solution may already exist in the training data of frontier models.

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.

On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. pic.twitter.com/HCDcjNuTFK

— Serena Ge (Datacurve) (@serenaa_ge) May 26, 2026

The second issue is task size. SWE-Bench Pro tasks require an average of 120 lines of code across five files. DeepSWE reference solutions average 668 added lines across seven files, making them roughly 5.5 times larger.

DeepSWE also gives models shorter prompts. Its prompts average 2,158 characters, compared with 4,614 characters for SWE-Bench Pro. That means DeepSWE gives agents less instruction while expecting more output, which Datacurve says is closer to how developers use AI assistants in practice.
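The size and prompt figures above can be checked with a few lines of arithmetic. The sketch below uses only the averages quoted in this article. The variable names are illustrative, and the "lines per prompt character" ratio is a derived measure for comparison, not a metric Datacurve reports. It also assumes the 120-line and 668-line figures are counted in a comparable way.

```python
# Averages quoted in the article for each benchmark.
swe_bench_pro = {"lines": 120, "files": 5, "prompt_chars": 4614}
deepswe = {"lines": 668, "files": 7, "prompt_chars": 2158}


def lines_per_prompt_char(task_stats):
    """Solution lines expected per character of instruction (derived measure)."""
    return task_stats["lines"] / task_stats["prompt_chars"]


# How much larger the average DeepSWE reference solution is.
size_ratio = deepswe["lines"] / swe_bench_pro["lines"]

# How much shorter the average DeepSWE prompt is.
prompt_ratio = deepswe["prompt_chars"] / swe_bench_pro["prompt_chars"]

# Output expected per unit of instruction, DeepSWE relative to SWE-Bench Pro.
density_ratio = lines_per_prompt_char(deepswe) / lines_per_prompt_char(swe_bench_pro)

print(f"Solution size ratio: {size_ratio:.2f}x")
print(f"Prompt length ratio: {prompt_ratio:.2f}x")
print(f"Lines per prompt character ratio: {density_ratio:.1f}x")
```

On these averages, DeepSWE asks for roughly twelve times as much code per character of instruction, which is the combined effect of the two gaps described above.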

Datacurve also raised concerns about the reliability of automated graders used in SWE-Bench Pro.

The company sampled 30 random tasks from each benchmark, DeepSWE and SWE-Bench Pro. It then ran three rollouts per task for each of 10 frontier model configurations and used an LLM-based judge to check whether each patch actually solved the assigned task.

According to Datacurve, SWE-Bench Pro’s verifiers accepted incorrect solutions 8.5 percent of the time and rejected correct solutions 24 percent of the time. DeepSWE’s verifiers recorded much lower rates, with 0.3 percent accepted wrong solutions and 1.1 percent rejected correct ones.
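To make the two error types concrete, here is a minimal sketch of how such rates could be computed by comparing a benchmark's test-suite verdict with an independent judge's verdict. The article does not specify the denominators Datacurve used, so this sketch assumes each rate is conditional on the judge's verdict. The `Rollout` type, the function name, and the toy data are all hypothetical and are not Datacurve's code or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    verifier_passed: bool  # did the benchmark's test suite accept the patch?
    judge_correct: bool    # did the independent judge consider it correct?


def verifier_error_rates(rollouts):
    """Return (false_accept_rate, false_reject_rate).

    Assumption: false accepts are measured among patches the judge calls
    wrong, and false rejects among patches the judge calls correct.
    """
    wrong = [r for r in rollouts if not r.judge_correct]
    right = [r for r in rollouts if r.judge_correct]
    false_accept = sum(r.verifier_passed for r in wrong) / len(wrong) if wrong else 0.0
    false_reject = sum(not r.verifier_passed for r in right) / len(right) if right else 0.0
    return false_accept, false_reject


# Hypothetical toy data for illustration only.
sample = (
    [Rollout(True, True)] * 6      # correct and accepted
    + [Rollout(False, True)] * 2   # correct but rejected (false negative)
    + [Rollout(True, False)] * 1   # wrong but accepted (false positive)
    + [Rollout(False, False)] * 3  # wrong and rejected
)

false_accept_rate, false_reject_rate = verifier_error_rates(sample)
print(f"Accepted wrong solutions: {false_accept_rate:.1%}")
print(f"Rejected correct solutions: {false_reject_rate:.1%}")
```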

The false negative issue is especially important because it can punish valid solutions that differ from the original author’s implementation. In one case, a SWE-Bench Pro task expected a private helper function from the original pull request. An AI agent solved the task by inlining the same logic, but failed because the test suite tried to import a symbol that only existed in the original solution.
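The failure mode described above can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the actual SWE-Bench Pro task: the function names, the patches, and both tests are invented, and an attribute lookup stands in for the failing import. It contrasts a test tied to the original author's private helper with a test that checks behavior only.

```python
import types


def make_module(name, source):
    """Build an in-memory module from source text (stand-in for a patched repo)."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


# Hypothetical original fix: logic lives in a private helper.
ORIGINAL_PATCH = '''
def _normalize(name):
    return name.strip().lower()

def lookup_key(name):
    return "user:" + _normalize(name)
'''

# Hypothetical agent fix: same logic, inlined, no helper.
AGENT_PATCH = '''
def lookup_key(name):
    return "user:" + name.strip().lower()
'''


def implementation_coupled_test(module):
    # Reaches for a symbol that exists only in the original solution.
    helper = getattr(module, "_normalize")
    assert helper("  Ada ") == "ada"
    assert module.lookup_key("  Ada ") == "user:ada"


def behavioral_test(module):
    # Checks only the public behavior the task asked for.
    assert module.lookup_key("  Ada ") == "user:ada"


def passes(test, module):
    try:
        test(module)
        return True
    except (AssertionError, AttributeError):
        return False


original = make_module("original_solution", ORIGINAL_PATCH)
agent = make_module("agent_solution", AGENT_PATCH)

coupled_original = passes(implementation_coupled_test, original)
coupled_agent = passes(implementation_coupled_test, agent)
behavioral_original = passes(behavioral_test, original)
behavioral_agent = passes(behavioral_test, agent)

print("coupled test:   ", coupled_original, coupled_agent)
print("behavioral test:", behavioral_original, behavioral_agent)
```

The agent's patch behaves identically to the original, yet only the behavioral test accepts it. The coupled test rejects it for a reason unrelated to correctness.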

If Datacurve’s finding is confirmed, it could affect how enterprise buyers, venture capital firms, and AI labs interpret benchmark scores. A benchmark with a high grading error rate may give a misleading view of model progress.

DeepSWE changes the ranking of major AI coding models.

GPT-5.5 led with a 70 percent score. GPT-5.4 followed at 56 percent, while Claude Opus 4.7 scored 54 percent.

After that, performance dropped sharply. Claude Sonnet 4.6 reached 32 percent, Gemini 3.5 Flash scored 28 percent, and GPT-5.4 mini and Kimi K2.6 both scored 24 percent. Other models landed in the teens or single digits.

Claude Haiku 4.5, which scored 39 percent on SWE-Bench Pro, fell to zero on DeepSWE. Datacurve said this suggests some mid-tier models may have performed better on easier or potentially contaminated benchmarks than they do on harder coding tasks.

GPT-5.5 also performed strongly on cost efficiency. The model reached its 70 percent pass rate with a median cost of $5.80 per trial, a median wall clock time of 20 minutes, and a median output of 47,000 tokens.

GPT-5.4 appeared to offer strong overall value, scoring 56 percent with a median cost of $3.30 per trial.
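One rough way to compare value is cost per solved task: the per-trial cost divided by the pass rate. The sketch below applies that to the two models for which this article gives both figures. It is a back-of-the-envelope estimate, not a number Datacurve reports. It treats the median cost as if it were the typical cost of every trial and assumes failed trials cost the same as successful ones.

```python
# Pass rate and median cost per trial, as quoted in the article.
models = {
    "GPT-5.5": {"pass_rate": 0.70, "median_cost_usd": 5.80},
    "GPT-5.4": {"pass_rate": 0.56, "median_cost_usd": 3.30},
}


def cost_per_solved_task(pass_rate, cost_per_trial):
    """Approximate spend per successful trial.

    Assumption: every trial costs about the median, whether it passes or not.
    """
    if pass_rate <= 0:
        raise ValueError("pass_rate must be positive")
    return cost_per_trial / pass_rate


estimates = {
    name: cost_per_solved_task(stats["pass_rate"], stats["median_cost_usd"])
    for name, stats in models.items()
}

for name, estimate in estimates.items():
    print(f"{name}: about ${estimate:.2f} per solved task")
```

Under these assumptions GPT-5.4 comes out cheaper per solved task, while GPT-5.5 solves a larger share of tasks overall.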

Datacurve said Claude Opus 4.7 costs much more per run. It also found that output tokens, runtime, and cost varied widely across the tested agents. However, higher spending, longer runs, or larger outputs did not consistently lead to better results.

Datacurve said DeepSWE is not perfect. Its standardized harness routes all edits through bash, instead of using the model-specific editing tools that each family may have been trained on, such as apply_patch for GPT or str_replace_based_edit_tool for Claude.

The benchmark also uses only open source repositories with more than 500 stars. The results may not fully represent performance on private enterprise codebases. Bug localization and refactoring tasks are underrepresented, and common languages such as C++ and Java are not included.

Datacurve also said its qualitative verdicts come from an LLM analyzer instead of human reviewers, with modest sample sizes of about 90 reviewed rollouts per model per benchmark.

The company has published the dataset, agent trajectories, and evaluation harness on GitHub, which should allow others to inspect and reproduce the results.

DeepSWE arrives as companies are moving quickly to adopt AI coding agents. If its findings about unreliable grading and benchmark contamination hold up, the AI industry may need to rethink how it measures coding performance.
