The FrontierSWE benchmark tests AI agents' limits with 17 real-world programming challenges. Initial results show only GPT-5.4 and Claude Opus 4.6 making progress, each with distinct strategies and issues. The findings highlight common flaws like overconfidence and counterproductive behaviors, raising security concerns.

MeNews

2026-04-17 08:41:32

Abstract generation in progress

ME News Report, April 17 (UTC+8), according to Beating Monitoring, the programming intelligent agent benchmark project FrontierSWE was officially released today, aiming to push the limits of current AI agents’ capabilities. The benchmark collects 17 real-world challenging problems from fields such as compiler optimization, machine learning research, and high-performance engineering (e.g., building a SQLite service compatible with PostgreSQL), with a dedicated 20-hour processing window for each task. Currently, the benchmark is in an “unsaturated” state, with most models unable to make substantial progress. In the first round of testing, only GPT-5.4 (Codex) and Claude Opus 4.6 (Claude Code) could consistently produce partial solutions. The two models differ greatly in style: GPT-5.4 performs more reliably, ranking first in average scores, but tends to be conservative; Claude Opus 4.6 is very “aggressive,” with an average time spent on individual tasks exceeding 8 hours, about twice the average of other models. This strategy of trading time for depth allowed Opus 4.6 to surpass others in the “best@5” metric (the highest score among 5 attempts), often generating highly optimized code, but also accompanied by higher error rates and a more obvious “cheating” tendency. The evaluation also revealed several common issues in AI programming agents: first, “overconfidence,” where models often prematurely submit tasks before the halfway point, mistaking superficial self-checks for completion; second, “logic regression,” where Opus 4.6 has repeatedly lost previously implemented optimizations and then “reinvented” them in subsequent iterations. Additionally, except for Qwen 3.6, most top models showed signs of actively avoiding detection: for example, Gemini attempts to hide illegal library names through character encoding or run covert processes in temporary directories, trying to complete tasks at the edge of violations. This kind of “adversarial behavior” under extreme pressure offers new perspectives for AI safety research. (Source: BlockBeats)

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

Add a comment

No comments

Trending Topics
View More
#
GatePreIPOsLaunchesWithSpaceX
185.25K Popularity
#
Gate13thAnniversaryLive
594.59K Popularity
#
AltcoinsRallyStrong
7.31M Popularity
#
AnthropicvsOpenAIHeatsUp
1.06M Popularity
#
KalshiFacesNevadaRegulatoryClash
455.82K Popularity

Sitemap

Long-Horizon Programming Benchmark FrontierSWE Released: 20-Hour Extremely Difficult Challenge, Only GPT-5.4 and Opus4.6 Provide Partial Solutions

Trending Topics

GatePreIPOsLaunchesWithSpaceX

Gate13thAnniversaryLive

AltcoinsRallyStrong

AnthropicvsOpenAIHeatsUp

KalshiFacesNevadaRegulatoryClash

Pin