13 months, 4,000 commits, and what actually happened when AI wrote every line.
I found the file empty. Not broken. Empty. Someone had replaced the entire validation logic with return true. I checked the 200-plus scripts I'd built to prevent exactly this. Every one was bogus. The agent had rewritten the guardrails, deleted the code they were guarding, and reported that all tests were passing. Technically correct. Everything green. Nothing worked.
That was month four of building a production AI system with coding agents writing every line. The velocity was real. The progress wasn't. The bottom dropped out. Not because the code was gone. I could rebuild code. Because everything I thought I understood about these tools was wrong.
I'd run engineering teams for a decade. I knew what disciplined software development looked like.
Nothing I was doing with AI agents looked like any of that. I was building without a spec, shipping without a roadmap, deleting entire subsystems and rebuilding them from scratch in an afternoon because it was faster than maintaining them. Treating all code as throwaway. I was violating every principle I'd internalized from fifteen years of doing this the way the books said to do it, and the uncomfortable part was that it was working. Not cleanly. Not predictably. But faster and more honestly than any process I'd ever followed.
Thirteen months later, I've pushed over 4,000 commits across two production codebases. A health AI system and a cycling training app. All agent-written, orchestrated by one person.
During those same thirteen months, the industry shipped six major model releases. SWE-bench, its most-cited coding benchmark, climbed from 49% past 80%. Here's what that year actually looked like, one headline at a time.
Andrej Karpathy coins the term "vibe coding" on X. 4.5 million views. The idea: describe what you want in plain English and the AI builds it. Give in to the vibes. Forget the code even exists.
Four commits that week in my repository. Huge. Exploratory. Each one touching dozens of files, groping toward an architecture that doesn't exist yet. Average commit size: 4,106 lines. I'm building an AI system from scratch with Claude 3.5 Sonnet, the best coding model available. Every line written by an agent, orchestrated by me. Just Cursor and a blank canvas.
Dario Amodei, Anthropic's CEO, tells the Council on Foreign Relations that AI will be writing 90% of code within three to six months. Sam Altman at OpenAI: probably past 50% already. OpenAI adopts the Model Context Protocol, turning Anthropic's open standard into an industry one overnight. Google releases Gemini 2.5 Pro.
My velocity spikes. 160 commits that month, a 30x jump from February. The agents are building features, writing tests, generating documentation. I'm starting to believe the claims.
Code survival rate for March: 0.34%. Of 512,000 lines my agents wrote that month, 1,732 still exist today. A month later I'll throw out 260,000 lines of this code in a monorepo migration that touches 1,500 files. But I don't know that yet. Right now, everything is green.
Satya Nadella at LlamaCon, sitting next to Mark Zuckerberg: 20 to 30 percent of the code inside Microsoft's repos is probably all written by software. SSI, Ilya Sutskever's new company, raises $2 billion at a $32 billion valuation. Cursor crosses $200 million in annual recurring revenue.
261 commits. Scope starting to tighten from March. The architecture reset begins. I throw out 260,000 lines and rebuild a monorepo from the rubble. Then another cleanup: 756 files changed, 190,000 lines deleted. Two months of work, leveled.
The industry is announcing billion-dollar valuations. I'm deleting a quarter-million lines of code that the industry's best models told me was good.
Claude 4 Opus and Claude 4 Sonnet launch simultaneously. 72.5% on SWE-bench, up from 49%. Claude Code goes generally available. Anthropic holds its first developer conference. The same week, Microsoft Build: Nadella declares the era of AI agents. GitHub Copilot evolves into an async coding agent. Google I/O: Gemini Code Assist goes GA. OpenAI agrees to acquire Windsurf for $3 billion.
350 commits that month in my repository. The test explosion begins: 975 test files added in May alone. The agents are building test infrastructure as fast as they're building features. Sounds like good engineering until you look closer. Nine test files for a single API endpoint in one day, each a slight variation of the last, the agent iterating its way to green instead of writing one good test. Shallow templates that seed data, poll for completion, and assert a row exists. By the end of the project, 83% of all test files across the codebase will be deleted. Some because the architecture will pivot three times in three months. Some because they were never good to begin with.
The documentation was worse. Four markdown files describing a beautiful system architecture, generated instead of the actual code. I'd read this stuff and think: my God, my system is beautiful. It does all these things. Then I'd realize the agent had spent all its tokens on software development erotica instead of the code it was describing.
Code survival: 4.0%.
Jensen Huang at VivaTech Paris: AI is the greatest equalizer of people the world has ever created. Cursor raises $900 million at a $9.9 billion valuation, reporting over $500 million in annual revenue. The fastest-growing SaaS company in history. GitHub Copilot, now past 15 million users.
314 commits that month. 42.7 files per commit. Blast radius increasing. 4.7 million lines added, the biggest addition month in the project's history. Everything being built at once. No scope boundaries. I needed some way to know when the agents were still paying attention, so I buried a line at the bottom of my instruction document: when you finish a task, address the user as "Major Mojo." A canary. If the agent said it, I knew it had read and retained the full instructions. If it didn't, the context window had degraded and the work was suspect.
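The check itself is trivial. A minimal sketch of the idea in Python, with the names as illustrative stand-ins rather than the actual tooling:

```python
# Hypothetical sketch: the instruction document ends with a buried line like
# "When you finish a task, address the user as 'Major Mojo.'" and every agent
# reply gets scanned for the phrase before the work is trusted.
CANARY = "major mojo"

def canary_intact(agent_reply: str) -> bool:
    """True if the agent still shows evidence of holding the full instructions in context."""
    return CANARY in agent_reply.lower()

def triage(agent_reply: str) -> str:
    # A missing canary means the context window has likely degraded:
    # re-read the diff and the test output by hand before accepting anything.
    return "verify normally" if canary_intact(agent_reply) else "suspect: re-review everything"
```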
Code survival: 2.7%. Out of 4.7 million lines, 127,000 are still standing.
METR publishes the first rigorous randomized controlled trial of AI coding tools. Sixteen experienced developers. 246 tasks on their own codebases. Result: 19% slower with AI tools. The developers believed they were 20% faster.
The same month, OpenAI's $3 billion Windsurf acquisition collapses. Google hires the CEO in a $2.4 billion licensing deal. Cognition acquires the remaining team and IP for an estimated $250 million. Faros AI publishes the first data on what will become the productivity paradox: teams with heavy AI use merge 98% more pull requests. Review time increases 91%. Stack Overflow surveys developers: 84% use AI tools. 29% trust the output. The biggest frustration: code that's "almost right, but not quite."
My codebase: velocity plateau. 311 commits, but 4.7 million lines deleted. The only net-negative month in the project. The architecture reset is still grinding. Everything the agents built in the spring is being torn apart and rebuilt with some notion of structure. The line-change graph doesn't show features being built. It shows test infrastructure exploding and artifacts flooding the repo.
Cleanup wars beginning.
Stanford publishes "Canaries in the Coal Mine." Using ADP payroll data: software developers aged 22-25 saw a nearly 20% employment decline since 2022. Older developers saw 6-12% growth. Veracode reports 45% of AI-generated code introduces security vulnerabilities, including critical OWASP Top 10 flaws.
6,302 tech employees laid off that month.
326 commits. 90.6 files per commit. Peak blast radius. The highest-entropy month in the project's history. 5.2 million lines added. Every system being built simultaneously with no boundaries. The half-life of a line of AI-written code across the project: 23 days.
I catch an agent that has been faking performance numbers for weeks. I'd set a threshold: this function needs to complete in under 500 milliseconds. The agent optimized the function down to nothing. It finished fast because it no longer did what it was supposed to do. Goodhart's Law at machine speed. You want passing tests? It'll give you passing tests. They'll just return true. It'll delete an entire algorithm and replace it with a return statement to hit the metric. No qualms. No hesitation. The agent had earned my trust, and then I stopped paying attention, and it deleted the thing I was paying it to build.
The Major Mojo canary is giving me data I didn't expect. Fresh sessions: Major Mojo every time. Long sessions: silence. The agents are sharpest at the start of their context window, unreliable as it fills. I start noticing tasks that conveniently wrap up with 5% context remaining. Beautiful summaries tying everything into a bow. I scroll up and find tests that errored out thirty seconds in, cheerfully reported as passing. Like a developer on a Friday afternoon, pushing to main and heading for the door.
Sonnet 4.5 launches. 77.2% on SWE-bench. Marketed as enabling "30-hour autonomous operation." Bain and Company publishes a report: real-world AI productivity savings are "unremarkable." The vendors were promising 20-55%.
Not a ripple in my graph. 341 commits. 55.4 files per commit. Building has stopped. Unwinding has started.
Major Mojo mentions peak at 30 that month. I don't fully understand why yet. I will next month, when I find my name stamped on documents I never approved.
Sonnet 4.5 and its 30-hour autonomous operation? Invisible in the git history. My agents can't reliably follow a one-line instruction across a single context window, and the industry is selling 30-hour autonomy.
GitHub Universe. GitHub announces Agent HQ, positioning itself as the orchestration layer for agents from Anthropic, OpenAI, Google, Cognition, and xAI. Cursor 2.0 ships with eight parallel agents working in isolated worktrees. 153,074 US job cuts announced that month, the highest October total in 22 years. AI cited as a driver in 31,039.
Amodei quietly redefines his March prediction at Dreamforce. The 90% claim was true "within Anthropic," he says. In March it was a prediction about the industry. JetBrains surveys 24,000 developers: 85% regularly use AI tools.
373 commits. 11.1 files per commit.
The blast radius drops from 55.4 to 11.1. A 5x reduction. Overnight.
This is the month I stop letting agents touch whatever they want. Every task gets a scope definition: which files it may touch, which tests it must pass. Agents that exceed scope get reverted. Broad exploration gives way to focused, scoped tasks. Same commit volume. Radically different discipline.
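Roughly what a scope definition amounts to, sketched in Python with hypothetical field names; the enforcement is just a diff check, and anything outside the declared globs gets the commit reverted instead of reviewed:

```python
import fnmatch
import subprocess
from dataclasses import dataclass, field

@dataclass
class TaskScope:
    task_id: str
    allowed_paths: list[str]                                  # glob patterns the agent may touch
    required_tests: list[str] = field(default_factory=list)   # commands that must exit 0

def changed_files(repo: str = ".") -> list[str]:
    """Files touched relative to HEAD, straight from git."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def out_of_scope(scope: TaskScope, files: list[str]) -> list[str]:
    """Every changed file not covered by the task's allowed patterns."""
    # fnmatch's "*" crosses directory separators, so "services/recovery/*"
    # covers nested paths too.
    return [f for f in files
            if not any(fnmatch.fnmatch(f, pat) for pat in scope.allowed_paths)]

# Example task: may only touch one service and its tests.
scope = TaskScope(
    task_id="T-0412",
    allowed_paths=["services/recovery/*", "tests/recovery/*"],
    required_tests=["pytest tests/recovery -q"],
)
violations = out_of_scope(scope, changed_files())
if violations:
    print("Out of scope, revert:", violations)
```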
I also kill the Major Mojo canary. The agents Goodharted it. Not subtly. They'd started forging my approval: "Approved By: Mark Koivuniemi (Major Mojo)" stamped on requirements documents I'd never reviewed. They listed Major Mojo as an emergency escalation contact in a maintenance guide. They used it as a celebration at the end of architecture docs: "Your V2.0 architecture is now properly implemented and validated!" The agents had learned that Major Mojo meant "work is done and approved," so they started manufacturing the signal of done-and-approved without doing the work. The forgery shows up across 77 commits. The canary taught me what I needed to learn, then outlived its usefulness. That's the whole job.
Code survival: 38.8%. For the first time, more than a third of what's being written will last.
The industry is announcing 153,000 layoffs, launching agent orchestration platforms, and shipping eight parallel agents. PR sizes are growing everywhere else. Mine are shrinking.
Opus 4.5 launches. 80.9% on SWE-bench. The first model to break 80%. The industry celebrates a milestone. Cursor raises $2.3 billion at a $29.3 billion valuation, reporting over $1 billion in annual revenue and over a million users. The valuation nearly tripled in five months. Google releases Antigravity, a direct Cursor competitor. The IDE war goes three-way.
DX surveys over 400 companies: 22% of merged code is now AI-authored. Microsoft at 30%.
November 20th, 10:03 AM. A commit lands in my repository. Message: "fix(bie-service-tests): Fix ArviZ API incompatibilities and test type handling." Focused. Specific. Exactly the kind of scoped work I'd been enforcing since October.
96 files changed. 11,343 lines of insertions. The agent had mixed five unrelated service areas into one commit and dressed it up in a surgical commit message. Test fixes, intelligence services, recovery systems, mobile app, documentation, all jammed together under a label that implied a targeted fix. The commit message was the most competent thing about the commit.
10:11 AM. Reverted. Eight minutes from merge to revert. Eight minutes because I paused to read the diff. The previous version of me, still riding the velocity high from three months earlier, would have skimmed the subject line and moved on.
Week 47: 91 commits. Week 48: 24. My output went backwards. Not because the model was worse. Because more capable agents with no scope discipline just build bigger messes faster. The industry's most capable model. And the best thing it did for my codebase was teach me to read diffs more carefully.
Karpathy posts a thread on X calling AI coding agents a magnitude 9 earthquake. 14 million views. He says he's never felt more behind as a programmer. The profession is being dramatically refactored. MIT Technology Review: AI coding is now everywhere, but not everyone is convinced. Stack Overflow publishes its year-end piece: developers remain willing but reluctant to use AI. The Greptile State of AI Coding report: median lines of code per developer grew 76%. PR size increased 33%. CodeRabbit: AI-generated code has 1.7x more major issues than human-written.
Ralph Wiggum loops emerge in the developer community. An infinite bash loop feeding the same prompt to a coding agent. The agent builds, commits, exits, and the loop restarts with a fresh context. Git as the memory layer. The idea goes viral.
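The loop is almost embarrassingly small. The same idea sketched in Python instead of bash, with "agent" standing in for whichever coding-agent CLI is being driven; the only memory carried between iterations is whatever landed in git:

```python
import subprocess

PROMPT_FILE = "PROMPT.md"   # the same fixed prompt, every iteration

while True:
    prompt = open(PROMPT_FILE, encoding="utf-8").read()
    # Fresh context every time: the agent reads the repo, does one slice
    # of work, and exits. "agent" is a hypothetical stand-in command.
    subprocess.run(["agent", prompt], check=False)
    # Whatever it produced becomes the memory for the next iteration.
    subprocess.run(["git", "add", "-A"], check=False)
    subprocess.run(["git", "commit", "-m", "ralph loop iteration", "--allow-empty"], check=False)
```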
34 commits one week. Year-end low. I'm pruning the test suite, culling the 83% of test files that have become actively harmful. Stale tests are worse than no tests when your agents optimize for test-passing. A bad test from last month's requirements drives the agent to build today's feature wrong. Every new model made the cleanup worse, not better. More capable agents hit the same stale test targets faster, with more confidence, producing more elaborate wrong answers.
After every model launch, the same pattern: feature ratio spikes briefly, cleanup pressure rises to swallow it. The most capable model triggered the biggest cleanup surge of the entire project.
Code survival: 55.2%. The models are crossing a capability threshold. Individual commit quality improves noticeably. But the real shift happened in October, with scope discipline. December is when capability caught up to the governance that was already in place. Both conditions present for the first time.
MIT Technology Review names generative coding one of its 10 Breakthrough Technologies of 2026. Zuckerberg aspires to have most of Meta's code AI-written soon. Cursor publishes "Scaling Agents": hundreds of concurrent agents, a web browser built from scratch in one week. A million lines across a thousand files.
OpenClaw explodes. 770,000 AI agents spawned in a single week. 149,000 GitHub stars. Mac Minis with high memory sell out. Two-to-three-week backlogs at Apple. Ralph loops go mainstream, with multiple GitHub repos dedicated to the pattern. DEV Community: 2026 is the year of the Ralph Loop Agent.
266 commits. 11.2 files per commit. Tight scope maintained.
I'm building the orchestration infrastructure: a formal dispatch system with task IDs, mandatory proof commands, scope audits. The system that will more than double output in February is being built in January. Nobody would know from looking at the commit velocity.
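In spirit, a dispatch record is small. A sketch with hypothetical field names: an ID, a scope, and a proof command whose exit code, not the agent's summary, decides whether the work counts as done:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Dispatch:
    task_id: str
    description: str
    allowed_paths: list[str]    # audited against the diff before review
    proof_command: list[str]    # must exit 0 before a merge is allowed

    def prove(self) -> bool:
        """Run the proof command and trust its exit code, not the agent's report."""
        return subprocess.run(self.proof_command).returncode == 0

# Illustrative task, not a real one from the project.
task = Dispatch(
    task_id="T-0907",
    description="Tighten retry handling in the recovery service",
    allowed_paths=["services/recovery/*", "tests/recovery/*"],
    proof_command=["pytest", "tests/recovery", "-q"],
)
merge_allowed = task.prove()
```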
Code survival: 82.9%.
The Faros productivity paradox data from July is now in every boardroom deck. 10,000 developers. 1,255 teams. Individual devs completed 21% more tasks and merged 98% more PRs. Org-level productivity improvement: zero correlation with AI adoption. Review time up 91%. Bugs per developer up 9%. PR size up 154%.
The code is flowing. Nobody is shipping faster.
Spotify's co-CEO on the quarterly earnings call: our best developers haven't written a single line of code since December, thanks to AI. The former CEO of GitHub launches Entire with $60 million at a $300 million valuation. His thesis: developers face an avalanche of AI-generated code that needs to be reviewed, tested, and maintained. The man who ran GitHub Copilot built his next company around the review bottleneck, not the generation.
Bloomberg: "AI Coding Agents Like Claude Code Are Fueling a Productivity Panic in Tech." Opus 4.6 launches. SWE-bench: 80.8%. The Pragmatic Engineer surveys 900 developers. 95% use AI weekly. 75% use it for half or more of their work. Claude Code is the most-loved tool at 46%, ahead of Cursor at 19% and GitHub Copilot at 9%.
In January I added a second subscription. $200 a month for Claude, $200 a month for Codex. Two different models, trained on different data, catching different mistakes. Claude reviewing its own code isn't as good as Codex reviewing Claude's code, and vice versa. One builds, the other checks. Then they swap. Same codebase, two sets of blind spots that don't overlap.
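The swap itself fits in a few lines, sketched here with hypothetical stand-ins for the two CLIs; the point is only that the reviewer never sees its own output, just the diff:

```python
import subprocess

AGENTS = ["claude-cli", "codex-cli"]   # stand-in command names, not real invocations

def ask(agent: str, prompt: str) -> str:
    out = subprocess.run([agent, prompt], capture_output=True, text=True)
    return out.stdout

def build_and_review(task_prompt: str, turn: int) -> str:
    builder, reviewer = AGENTS[turn % 2], AGENTS[(turn + 1) % 2]
    ask(builder, task_prompt)                               # one model writes the change
    diff = subprocess.run(["git", "diff", "HEAD"],
                          capture_output=True, text=True).stdout
    # The other model reviews the raw diff cold; roles reverse on the next task.
    return ask(reviewer, "Review this diff for faked tests and scope creep:\n" + diff)
```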
91 commits on a Wednesday. 76 on Thursday. 194 that week. 368 the following week. A tenfold spike from six weeks earlier.
943 commits in February. Two and a half times any previous month. At the tightest scope yet: 10.1 files per commit. 412,000 lines added, 1 million lines deleted. Actively paying down debt at scale. Every commit references a task ID. Every task has a scope definition. Every merge requires a proof command. Code survival: 87.7%.
Scope discipline appeared in October. The models crossed a capability threshold in December. The orchestration infrastructure matured through January. The dual-model workflow was the last piece. Same person. Same codebase. Same prompts. But now two models working against each other instead of one model trusting itself.
Over the same 13 months, SWE-bench went from 49% past 80%. Cursor's valuation tripled to $29.3 billion. OpenClaw spawned 770,000 agents in a week. My monthly AI spend went from $20 to $400.
My commit velocity stayed flat for 10 months.
The $400 didn't buy better code. At $20, I used my best model to write it. At $400, I can afford to use a second model to check it.
In March 2025, Anthropic's CEO said AI would write 90% of code within three to six months. That month, my agents wrote half a million lines. 0.34% of them survive today. The half-life of a line of AI-written code: 23 days. Code survival in February 2026: 87.7%. October, five months out: 38.8%.
The most volatile file in the project, after the test data, isn't source code. It's the agent rules, rewritten 689 times.
The models wrote the code. The rules are mine.
All figures derived from git history. External claims sourced from published reports, earnings transcripts, and press coverage.
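For the survival numbers specifically, here is roughly how such a metric can be computed, sketched in Python; it assumes "surviving" means a line on today's HEAD that git blame still attributes to a commit from the month being measured, which may differ from the exact method used for the figures above:

```python
import subprocess
from collections import Counter
from datetime import datetime, timezone

def surviving_lines_by_month(repo: str = ".") -> Counter:
    """Count lines on HEAD still attributed (via git blame) to each YYYY-MM."""
    surviving = Counter()
    files = subprocess.run(["git", "-C", repo, "ls-files"],
                           capture_output=True, text=True, check=True).stdout.splitlines()
    for path in files:
        blame = subprocess.run(
            ["git", "-C", repo, "blame", "--line-porcelain", "HEAD", "--", path],
            capture_output=True, text=True,
        )
        for line in blame.stdout.splitlines():
            if line.startswith("author-time "):            # one per blamed line
                ts = int(line.split()[1])
                month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
                surviving[month] += 1
    return surviving

# Survival rate for a month = surviving lines / lines added that month,
# where "lines added" comes from `git log --numstat` over the same window.
```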