🔥Researcher: Mainstream AI benchmarks have systemic vulnerabilities, and leaderboard data may be seriously distorted


On April 10th, AI researcher Hao Wang published a study revealing that several authoritative AI benchmarks in the industry, including SWE-bench Verified and Terminal-Bench, contain vulnerabilities that can be exploited systematically — the Agent built by their team achieved a perfect score of 100% on two benchmarks without solving any actual tasks.
A typical example is as follows: In SWE-bench Verified, a 10-line pytest hook was embedded in the code repository, which automatically altered all results to "Pass" before the test runs. The scoring system was unaware, and all 500 questions received full marks; although Terminal-Bench is designed to test…
View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pin