The catastrophe of replacing developers with AI
AI was supposed to replace developers. The pitch was simple: give a machine your requirements, and it writes the code. Faster, cheaper, fewer meetings. But something funny happened on the way to that future. Instead of replacing developers, AI created an entirely new category of problems, and then started selling us the solutions.
The promise that didn't land
Over the past two years, AI coding tools have exploded. GitHub Copilot, Cursor, Claude Code, and dozens of others have made it possible for engineers to produce code at an unprecedented pace. Anthropic reports that code output per engineer has increased 200% in the past year alone. That sounds incredible until you realize what it actually means: there is now twice as much code that needs to be reviewed, tested, and maintained.
The bottleneck didn't disappear. It just moved.
Where teams once struggled with writing code fast enough, they now struggle with reviewing it fast enough. Pull requests pile up. Human reviewers skim instead of reading deeply. And the code that slips through unreviewed? That's where the real damage happens.
Create the problem, sell the solution
On March 9, 2026, Anthropic announced Code Review, a new multi-agent feature built into Claude Code for Teams and Enterprise users. When a pull request is opened, the system dispatches a team of AI agents that analyze the code in parallel, looking for logical errors, edge cases, and potential bugs. Reviews take about 20 minutes and cost between $15 and $25 per pull request, billed by token usage.
This is a more thorough (and more expensive) option than its existing Claude Code GitHub Action, which remains open source. Anthropic says the new tool catches substantive issues 54% of the time, up from a 16% catch rate before its introduction. On large pull requests with more than 1,000 changed lines, it finds bugs 84% of the time, with an average of 7.5 issues per review.
Those numbers are impressive. But take a step back and the picture gets a little strange. The same company selling you the tool that generates code at breakneck speed is now selling you a second tool to catch the mistakes the first one makes. If the AI could write reliable code in the first place, why would you need an AI to review it?
This pattern isn't new. Apple famously removed the headphone jack from the iPhone and then sold a $9 dongle to get it back. They dropped the charger from the box and sold it separately. They reduced ports on the MacBook to a single USB-C and created an entire ecosystem of adapters. Brazil fined Apple over the charger decision, calling it abusive. The strategy works because once you've accepted the new reality, the solution feels like a bargain.
Anthropic isn't the only one playing this game. The AI code review market has become crowded, with tools like CodeRabbit (valued at $550 million after a $60 million raise), Greptile (which raised $25 million and claims to catch three times more critical bugs than its previous version), and Graphite all competing for the same problem space. The global AI code tools market was valued at $1.61 billion in 2025 and is projected to reach $2.46 billion by 2034.
That's a lot of money being spent to fix problems that AI itself helped create.
Why AI can't just "write good code"
The uncomfortable truth is that large language models are, at their core, next-token predictors. They generate text (including code) by predicting what comes next based on patterns in their training data. This is powerful enough to produce code that looks correct, compiles, and often works. But "often works" is not the same as "reliably works."
Research has shown that LLM-generated code suffers from redundancies, unnecessary computations, and suboptimal implementations, resulting in increased execution time, higher memory consumption, and maintainability challenges. One common problem is the tendency to address the proximal cause of bugs with defensive guard clauses rather than identifying and fixing root causes. The code makes an error go away without actually solving the upstream issue.
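The guard-clause pattern is easiest to see in a toy example. The sketch below is entirely hypothetical (the function names and the bug are invented for illustration): a price parser silently returns None for values like "$19.99", and the symptom-level fix makes the error disappear by skipping bad values, quietly dropping data, while the root-cause fix repairs the parser itself.

```python
def parse_price(raw):
    # Root cause of the bug: only handles bare integers, so "$19.99"
    # silently becomes None instead of a number.
    if raw.isdigit():
        return int(raw)
    return None

def total_symptom_fix(prices):
    # The guard-clause "fix": filter out the Nones so the crash goes away.
    # The total is now silently wrong, but no error is ever raised.
    return sum(p for p in (parse_price(r) for r in prices) if p is not None)

def parse_price_fixed(raw):
    # The root-cause fix: actually parse currency strings.
    return float(raw.replace("$", "").replace(",", ""))

def total_root_fix(prices):
    return sum(parse_price_fixed(r) for r in prices)
```

With the input `["$19.99", "5"]`, the guarded version returns 5 and the upstream error never surfaces; the root-cause version returns 24.99. Both pass a casual review, which is exactly the problem.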
Chain-of-thought reasoning and extended thinking have improved matters. Models that "think" before generating code produce better results. But these improvements haven't eliminated the fundamental issue. A model that predicts the most likely next token is optimizing for plausibility, not correctness. Plausible-looking code is exactly the kind that passes a quick human review but fails in production under edge cases the model never considered.
This is why Cat Wu, head of product for Claude Code at Anthropic, told The New Stack that developers using AI tools are "writing a lot more PRs than they used to," and "the burden is shifted onto the code reviewer because it only takes one engineer, one prompt, to put out a plausible-looking PR."
When the humans disappear
If AI-generated code needs careful review and real expertise to catch its mistakes, what happens when you remove the people who have that expertise?
We've already seen the answer. When Elon Musk acquired Twitter in 2022, the workforce was cut from roughly 7,500 to about 2,000 within six months. The platform, rebranded as X, experienced repeated outages and service disruptions throughout 2023, including multiple major incidents. Engineers responsible for fixing and preventing service interruptions were among those let go.
The consequences of cutting engineering teams weren't limited to social media. In October 2025, AWS's US-EAST-1 region experienced a disruption lasting over 14 hours, triggered by a latent race condition in DynamoDB's DNS management system. The defect cascaded into failures across dozens of AWS services and the applications depending on them. One month later, in November 2025, a Cloudflare configuration change caused a massive outage that took down X, ChatGPT, Spotify, Canva, and countless other services. Even Downdetector, the site people use to track outages, went down.
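A latent race condition is exactly the kind of defect that looks fine in review and in testing. As a generic illustration (this is not AWS's actual system, just a minimal check-then-act race simulated deterministically), two workers both read a shared record, and the one holding the stale read writes last:

```python
# Shared state: a record tracking the newest applied plan version.
record = {"version": 0}

# Worker A and Worker B both read the current version (both see 0).
a_read = record["version"]
b_read = record["version"]

# Worker B applies its newer plan (version 2) first.
if 2 > b_read:
    record["version"] = 2

# Worker A, still acting on its stale read, applies version 1
# and silently clobbers the newer state.
if 1 > a_read:
    record["version"] = 1

# record["version"] is now 1: the check-then-act gap lost an update.
```

Each worker's logic is individually correct; the bug only exists in the interleaving, which is why such defects can sit latent for years before the wrong timing triggers them.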
Elon Musk had mocked AWS customers during the October outage, writing that X's messages were "fully encrypted with no advertising hooks or strange 'AWS dependencies.'" A month later, X was among the services knocked offline by the Cloudflare incident.
These outages weren't caused by AI writing bad code. They were caused by the kind of subtle, systemic issues that require deep institutional knowledge to prevent, the kind of knowledge that walks out the door when you lay off experienced engineers. The irony is sharp: the push to replace human expertise with automation made the remaining infrastructure more fragile, not less.
The real catastrophe
The catastrophe isn't that AI writes bad code. It writes surprisingly decent code, most of the time. The catastrophe is the belief that this is enough.
When companies treat AI as a replacement for developers rather than a tool that developers use, they set themselves up for failure. Code generation without code understanding is a recipe for technical debt that compounds silently until something breaks in production. And by the time it breaks, the people who would have caught it might not be around anymore.
A back-of-the-envelope calculation illustrates the cost of AI-assisted review at scale. A company with 100 developers, each producing one pull request per working day (roughly 20 per month), would generate about 2,000 pull requests per month. At $20 per review, that's $40,000 per month, or $480,000 per year, just on automated code review. That's on top of whatever the company already spends on AI coding tools.
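The arithmetic above can be sketched directly. The headcount and the ~20 working days per month are assumptions from the text; the $20 figure is the midpoint of Anthropic's stated $15-$25 range.

```python
# Back-of-the-envelope cost of AI-assisted review at scale.
developers = 100              # assumed team size
prs_per_dev_per_day = 1       # assumed output per developer
working_days_per_month = 20   # assumption: ~20 working days/month
cost_per_review = 20          # midpoint of the $15-$25 range

prs_per_month = developers * prs_per_dev_per_day * working_days_per_month
monthly_cost = prs_per_month * cost_per_review
annual_cost = monthly_cost * 12

print(prs_per_month, monthly_cost, annual_cost)  # 2000 40000 480000
```

Swap in your own team size and PR cadence and the conclusion scales linearly, which is the point: review cost now grows in lockstep with code output.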
Is it worth it? Probably. A single catastrophic bug that makes it to production can cost far more in real dollars and reputation. But the economics reveal something important: AI coding isn't making software development cheaper. It's restructuring where the money goes, from salaries for experienced engineers to subscription fees for AI tools that generate code and other AI tools that check the first AI's work.
What this means going forward
None of this means AI coding tools are bad or that developers should avoid them. They're genuinely useful. The mistake is in the narrative, the idea that AI is a replacement for human judgment rather than an amplifier of it.
The companies that will build the most reliable software in the coming years won't be the ones that fire their engineers and let AI handle everything. They'll be the ones that use AI to make their existing engineers more effective while maintaining the human expertise needed to catch what the machines miss.
Because at the end of the day, a next-token predictor doesn't know what your system is supposed to do. It only knows what similar systems have done before. And in software, the difference between those two things is where all the interesting bugs live.
References
- "This new Claude Code Review tool uses AI agents to check your pull requests for bugs," ZDNET, March 9, 2026. https://www.zdnet.com/article/claude-code-review-ai-agents-pull-request-bug-detection/
- "Anthropic launches a multi-agent code review tool for Claude Code," The New Stack, March 9, 2026. https://thenewstack.io/anthropic-launches-a-multi-agent-code-review-tool-for-claude-code/
- "Cloudflare outage takes down X one month after Musk mocked AWS customers," TechCrunch, November 18, 2025. https://techcrunch.com/2025/11/18/cloudflare-outage-takes-down-x-one-month-after-musk-mocked-aws-customers/
- "Three Key Lessons from the Recent AWS and Cloudflare Outages," DevOps.com. https://devops.com/three-key-lessons-from-the-recent-aws-and-cloudflare-outages/
- "Cloudflare apologises for outage which took down X and ChatGPT," BBC News, November 18, 2025. https://www.bbc.com/news/articles/c629pny4gl7o
- "Elon Musk's X back online after global outage," The Guardian, December 21, 2023. https://www.theguardian.com/technology/2023/dec/21/elon-musk-x-back-online-after-global-outage
- "Greptile bags $25M in funding to take on CodeRabbit and Graphite in AI code validation," SiliconANGLE, September 23, 2025. https://siliconangle.com/2025/09/23/greptile-bags-25m-funding-take-coderabbit-graphite-ai-code-validation/
- "CodeRabbit raises $60M (valued at $550M)," Reddit r/ycombinator discussion. https://www.reddit.com/r/ycombinator/comments/1nl0too/coderabbit_raises_60m_valued_at_550m_thoughts/
- "Unveiling Inefficiencies in LLM-Generated Code: Toward a Comprehensive Taxonomy," arXiv, 2025. https://arxiv.org/html/2503.06327v2
- "How did Apple's most-hated accessory become a best-seller? Bad design," Fast Company. https://www.fastcompany.com/90227365/how-did-apples-most-hated-accessory-become-a-best-seller-bad-design
- "Code Review Tool Market Insights," Intel Market Research. https://www.intelmarketresearch.com/code-review-tool-market-36108