March 13, 2026
Estimated reading time: 4 minutes
Key takeaways:
- AI models frequently pass automated SWE-bench tests, yet human maintainers reject that same code for failing to meet professional standards.
- While AI is getting better at avoiding syntax errors, it still struggles with “soft” requirements like proper coding style, repository standards, and maintaining complex project logic.
- Relying solely on automated scores leads the industry to significantly overestimate the readiness of models like GPT-5 and Claude for real-world production environments without human oversight.
A new METR study highlights how automated code review for AI-generated code might not be ready for prime time.