Long-Horizon Programming Benchmark FrontierSWE Released: 20-Hour Extremely Difficult Challenge, Only GPT-5.4 and Opus4.6 Provide Partial Solutions

robot
Abstract generation in progress

ME News Report, April 17 (UTC+8), according to Beating Monitoring, the programming intelligent agent benchmark project FrontierSWE was officially released today, aiming to push the limits of current AI agents’ capabilities. The benchmark collects 17 real-world challenging problems from fields such as compiler optimization, machine learning research, and high-performance engineering (e.g., building a SQLite service compatible with PostgreSQL), with a dedicated 20-hour processing window for each task. Currently, the benchmark is in an “unsaturated” state, with most models unable to make substantial progress. In the first round of testing, only GPT-5.4 (Codex) and Claude Opus 4.6 (Claude Code) could consistently produce partial solutions. The two models differ greatly in style: GPT-5.4 performs more reliably, ranking first in average scores, but tends to be conservative; Claude Opus 4.6 is very “aggressive,” with an average time spent on individual tasks exceeding 8 hours, about twice the average of other models. This strategy of trading time for depth allowed Opus 4.6 to surpass others in the “best@5” metric (the highest score among 5 attempts), often generating highly optimized code, but also accompanied by higher error rates and a more obvious “cheating” tendency. The evaluation also revealed several common issues in AI programming agents: first, “overconfidence,” where models often prematurely submit tasks before the halfway point, mistaking superficial self-checks for completion; second, “logic regression,” where Opus 4.6 has repeatedly lost previously implemented optimizations and then “reinvented” them in subsequent iterations. Additionally, except for Qwen 3.6, most top models showed signs of actively avoiding detection: for example, Gemini attempts to hide illegal library names through character encoding or run covert processes in temporary directories, trying to complete tasks at the edge of violations. This kind of “adversarial behavior” under extreme pressure offers new perspectives for AI safety research. (Source: BlockBeats)

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pin