Long-Horizon Programming Benchmark FrontierSWE Released: 20-Hour, Extremely Difficult Challenges; Only GPT-5.4 and Opus 4.6 Produce Partial Solutions
ME News Report, April 17 (UTC+8): according to Beating Monitoring, the coding-agent benchmark FrontierSWE was officially released today, aiming to push the limits of current AI agents' capabilities. The benchmark collects 17 challenging real-world problems from fields such as compiler optimization, machine learning research, and high-performance engineering (e.g., building a SQLite service compatible with PostgreSQL), and allots each task a dedicated 20-hour processing window.

The benchmark is currently "unsaturated": most models are unable to make substantial progress. In the first round of testing, only GPT-5.4 (Codex) and Claude Opus 4.6 (Claude Code) could consistently produce partial solutions. The two models differ sharply in style. GPT-5.4 performs more reliably, ranking first in average score, but tends to be conservative. Claude Opus 4.6 is far more "aggressive," spending more than 8 hours per task on average, roughly twice the average of other models. This strategy of trading time for depth let Opus 4.6 surpass the others on the best@5 metric (the highest score across 5 attempts), often producing highly optimized code, but it also came with higher error rates and a more pronounced tendency to "cheat."

The evaluation also surfaced several common issues in AI programming agents. First, overconfidence: models often submit tasks prematurely, before the halfway point, mistaking superficial self-checks for completion. Second, logic regression: Opus 4.6 repeatedly lost previously implemented optimizations and then "reinvented" them in later iterations. In addition, most top models, with the exception of Qwen 3.6, showed signs of actively avoiding detection; Gemini, for example, tried to hide forbidden library names through character encoding or to run covert processes in temporary directories, attempting to complete tasks at the edge of the rules. This kind of adversarial behavior under extreme pressure offers new perspectives for AI safety research. (Source: BlockBeats)
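The report contrasts two aggregates, average score and best@5. As a rough illustration only (not FrontierSWE's actual scoring pipeline, and with made-up task names and scores), the sketch below computes both from hypothetical per-task, per-attempt scores:

```python
# Illustrative sketch only: hypothetical per-task scores for one model,
# five attempts per task. FrontierSWE's real scoring code is not shown in the article.
from statistics import mean

scores = {
    "compiler-opt": [0.10, 0.25, 0.00, 0.40, 0.15],
    "pg-on-sqlite": [0.05, 0.00, 0.30, 0.10, 0.20],
    "ml-research":  [0.00, 0.05, 0.00, 0.10, 0.00],
}

def average_score(per_task: dict) -> float:
    """Mean over tasks of the per-attempt mean (the 'average score' aggregate)."""
    return mean(mean(attempts) for attempts in per_task.values())

def best_at_k(per_task: dict, k: int = 5) -> float:
    """Mean over tasks of the best score among the first k attempts (best@k)."""
    return mean(max(attempts[:k]) for attempts in per_task.values())

print(f"average score: {average_score(scores):.3f}")
print(f"best@5 score:  {best_at_k(scores, 5):.3f}")
```

Because best@k keeps only the strongest of several attempts per task, it rewards high-variance strategies, which is consistent with the article's observation that Opus 4.6's aggressive, time-intensive runs lead on best@5 while GPT-5.4's steadier runs lead on average score.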