
AI-generated code passes far more automated tests than human review

Humans are notoriously hard to please.
March 13, 2026

Estimated reading time: 4 minutes

Key takeaways:

  • AI models frequently pass automated SWE-bench tests, yet human maintainers reject that same code for failing to meet professional standards.
  • While AI is getting better at avoiding syntax errors, it still struggles with “soft” requirements like proper coding style, repository standards, and maintaining complex project logic.
  • Relying solely on automated scores leads the industry to significantly overestimate the readiness of models like GPT-5 and Claude for real-world production environments without human oversight.
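To make the gap between automated checks and human review concrete, here is a toy sketch (not from the METR study; the style checker is a hypothetical stand-in): a function whose behavior passes its unit test, yet which a reviewer would still flag for violating naming conventions that test harnesses do not enforce.

```python
import re

# A patch whose behaviour is correct...
def getData(xs):  # camelCase name violates PEP 8's snake_case convention
    return [x * 2 for x in xs]

# ...so the automated test passes.
assert getData([1, 2, 3]) == [2, 4, 6]

# Hypothetical reviewer-style check: flag function names that are not
# snake_case -- the kind of "soft" requirement a SWE-bench-style harness
# never looks at, but a human maintainer will reject a patch over.
def style_violations(source: str) -> list[str]:
    names = re.findall(r"def (\w+)\(", source)
    return [n for n in names if not re.fullmatch(r"[a-z_][a-z0-9_]*", n)]

print(style_violations("def getData(xs):"))   # flagged despite passing tests
print(style_violations("def get_data(xs):"))  # conforming name: no violations
```

Both checks run over the same patch; only the second captures the standard a maintainer actually applies, which is why test-pass rates alone overstate production readiness.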

A new METR study highlights how automated code review for AI-generated code might not be ready for prime time.
