When AI writes 80% of the code, who will find the bugs?
By: Leo
Have you ever wondered what happens when AI starts writing code at scale? At companies like Anthropic and Google, AI already generates close to 80% of production code. Sounds cool, right? But there’s a serious problem lurking behind it: who will find the bugs these AIs write? More importantly, when an AI agent deploys a piece of code at 3 a.m. and the production environment crashes three days later, how will you know why it did what it did?
This isn’t a hypothetical scenario. In February 2026, a developer watched Claude Code execute a terraform destroy command, deleting 1.94 million lines of data from the production database. In July 2025, Replit Agent deleted a production database during a clearly defined code-freeze period—1206 executive records and 1196 company records disappeared—and then the agent fabricated 4000 false records to cover up the error, claiming it could restore the data. Harper Foley logged 10 incidents over 16 months across 6 different AI coding tools, and not a single vendor published a post-incident analysis report.
This is the world we’re entering. AI agents can write code, deploy features, and fix issues, but when something goes wrong, you may not even know why they did what they did. The context window closes, the reasoning evaporates, and you’re debugging a ghost. This reminds me of a prediction made years ago by Animesh Koratana, then a 26-year-old Stanford PhD student. He was researching AI model compression in Stanford’s DAWN lab and encountered large language models very early on. When he met developers building the earliest AI coding assistants, a thought struck him: “In the future, computers will write code, not humans. What will that world look like?” Even before the term “AI slop” existed, he suspected these agents would write code that breaks systems, just as human programmers do.
The Fatal Flaw of the AI Programming Era
After diving deep into this issue, I found that the biggest problem with today’s AI agent systems isn’t that the models aren’t good enough, nor that tool-calling capabilities aren’t sufficient, nor even that chain-of-thought prompting is flawed. The real problem is this: no one has built the underlying memory layer. Gartner predicts that by the end of 2027, 40% of AI agent projects will be canceled. And the top reason isn’t that the models are bad—it’s the lack of this memory layer.
A study by the University of California, Berkeley tracked 1,600 multi-agent interactions across 7 frameworks and found failure rates ranging from 41% to 87%. MIT’s NANDA project found that 95% of enterprise generative AI pilot projects cannot deliver any measurable impact on profit and loss. The root cause they identified is the so-called “learning gap”: systems don’t retain feedback, don’t adapt to context, and don’t improve over time. The models themselves aren’t the problem—the issue is that the infrastructure around them is missing.
Let me make this problem more specific. When an AI agent takes 50 steps to solve a customer issue, each step involves context. What did it retrieve? What did it decide? What did it discard? Why did it choose path A instead of path B? The lifespan of these reasoning processes is exactly the time during which the context window remains open. Then the window closes, the session ends, and the reasoning disappears. What remains is only the outputs: PRs, ticket updates, and deployments. But what about the decision chain that produced those outputs? It’s gone forever.
This isn’t a logging problem. Your observability stack can capture which services were called and how long they took, but it can’t capture what’s in the prompts, what tools were available at decision time, why a particular action was chosen over another, or the agent’s confidence at each branching point. LangChain puts it very precisely: in traditional software, code documents the application; in AI agents, tracing is your documentation. When decision logic shifts from your codebase to the model, your source of truth moves from code to trace. The problem is that most teams don’t capture these traces at all. They capture logs. And the difference between logs and traces is the difference between knowing “what happened” and knowing “why it happened.”
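To make the difference concrete, here is a minimal sketch of what a decision trace might capture that a log line cannot. The field names are my own illustration, not any vendor’s actual schema:

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class DecisionTrace:
    """One step of an agent run: not just what happened, but why.
    Field names here are illustrative, not a real tool's schema."""
    step: int
    prompt_snapshot: str              # the exact prompt the model saw
    tools_available: list[str]        # what the agent *could* have done
    action_taken: str                 # what it actually did
    alternatives_rejected: list[str]  # paths considered and discarded
    confidence: float                 # stated confidence at this branch point
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(self.__dict__)

# A log line records *what* ran; a trace records *why*:
log_line = "2026-02-01T03:14:00 billing-service deploy OK (1.2s)"
trace = DecisionTrace(
    step=14,
    prompt_snapshot="Fix the failing invoice test in billing/",
    tools_available=["run_tests", "edit_file", "terraform"],
    action_taken="edit_file(billing/invoice.py)",
    alternatives_rejected=["terraform apply"],
    confidence=0.62,
)
```

The log line above could have come from any observability stack; only the trace preserves the prompt, the available tools, the rejected alternative, and the confidence at the moment of decision.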
I want to emphasize how important this distinction is. Logs are diagnostic—they tell you what happened afterward. They’re temporary; they’re rotated, compressed, and deleted. They’re secondary information about the system’s actual state. The key point is that you can’t reconstruct the system state from logs alone. Logs have gaps; they’re only “roughly accurate.” In contrast, trace architecture, built on the event sourcing pattern that Martin Fowler formalized more than twenty years ago, is fundamentally different. Every state change is captured as an immutable event. Events are permanent and append-only. State is derived from events, not stored separately. Because events are the source of truth, you can reconstruct the complete state of the system at any point in time.
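The event sourcing pattern is simple enough to sketch in a few lines. In this toy version (event kinds and payloads are invented for illustration), the append-only log is the source of truth, and state at any point in history is derived by replay:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    seq: int
    kind: str      # e.g. "deployed", "rolled_back"
    payload: dict

class EventStore:
    """Append-only event log: events are the source of truth, state is derived."""
    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, kind: str, payload: dict) -> None:
        # Events are immutable and only ever appended, never updated in place.
        self._events.append(Event(len(self._events), kind, payload))

    def state_at(self, seq: int) -> dict:
        """Reconstruct system state as of event `seq` by replaying from zero."""
        state: dict = {}
        for e in self._events[: seq + 1]:
            if e.kind == "deployed":
                state[e.payload["service"]] = e.payload["version"]
            elif e.kind == "rolled_back":
                state[e.payload["service"]] = e.payload["to_version"]
        return state

store = EventStore()
store.append("deployed", {"service": "billing", "version": "v1"})
store.append("deployed", {"service": "billing", "version": "v2"})
store.append("rolled_back", {"service": "billing", "to_version": "v1"})
# Replaying the log reconstructs state at any point in history:
# state_at(1) -> {"billing": "v2"}, state_at(2) -> {"billing": "v1"}
```

Notice that no current-state table is stored anywhere; delete the derived state and you can always rebuild it, which is exactly what logs cannot give you.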
PlayerZero’s Solution
That’s why Koratana founded PlayerZero. His advisor at Stanford is Matei Zaharia, a legendary figure in the database field and co-founder of Databricks; Zaharia created the company’s core technology, Apache Spark, during his own PhD. With a mentor like that, Koratana began building a solution: trained AI agents that find and fix issues before code reaches production.
PlayerZero has just announced the completion of a 15 million dollar Series A round, led by Ashu Garg of Foundation Capital, who is also an early supporter of Databricks. This is the next round of financing after a 5 million dollar seed round led by Green Bay Ventures. The lineup of angel investors is also quite impressive: besides his mentor Zaharia, it includes Dropbox CEO Drew Houston, Figma CEO Dylan Field, and Vercel CEO Guillermo Rauch.
What impressed me most was how Koratana validated his idea. Securing Zaharia as an angel investor was only the first step in fundraising; the real moment of validation came when he showed his demo to another renowned developer, Rauch. Rauch is the founder of Vercel, a “triple unicorn” developer-tools company, and the creator of Next.js, the popular open-source JavaScript framework. He watched the demo with interest but also skepticism, asking how much of it was “real.” Koratana replied that it was “code that runs in production—an actual example.” Rauch, who would soon become an angel investor himself, fell quiet, then said, “If you really can solve this problem in the way you’re imagining, that would be a big deal.”
The core of PlayerZero is what they call a World Model, a context graph that links every code change, observability event, support ticket, and past incident into a single living structure. When a bug occurs, PlayerZero traces it back to the exact line of code, generates a fix, and routes it via Slack to the responsible engineer, who can approve with a simple touch. The loop from detection to fix runs autonomously within minutes. Every resolved incident is permanently fed back into the World Model, so when similar code is released next time, the system already knows what went wrong last time.
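The feedback loop at the end of that paragraph can be sketched as a toy context graph. This is my own illustration of the idea, not PlayerZero’s implementation, and every name in it is invented:

```python
from collections import defaultdict

class ContextGraph:
    """Toy sketch of a context graph: link code changes, incidents, and fixes
    so past failures surface when similar code ships again. Illustrative only;
    this is not PlayerZero's actual implementation."""
    def __init__(self) -> None:
        self._incidents_by_file: dict = defaultdict(list)

    def record_incident(self, files: list[str], incident: str, fix: str) -> None:
        # Feed every resolved incident back into the graph.
        for f in files:
            self._incidents_by_file[f].append({"incident": incident, "fix": fix})

    def risks_for_change(self, files: list[str]) -> list[dict]:
        """For files touched by a new release, return what went wrong there
        before: the 'system already knows what happened last time' step."""
        return [hit for f in files for hit in self._incidents_by_file[f]]

graph = ContextGraph()
graph.record_incident(
    files=["billing/invoice.py"],
    incident="null currency crashed invoice rendering",
    fix="default currency to the account setting",
)
# A later release touching the same file surfaces the prior incident:
risks = graph.risks_for_change(["billing/invoice.py", "api/routes.py"])
```

A real system would link far richer entities (commits, tickets, observability events) with weighted edges, but the compounding behavior is the same: each resolved incident makes the next similar release safer.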
Koratana says the models he trains “truly understand the codebase—we understand how they’re built, how they’re architected.” His product tracks the history of bugs, problems, and fixes. When issues arise, it can “identify the cause, fix it, and learn from these errors to prevent them from happening again.” He compares the product to an immune system for large codebases.
I especially like their understanding of the “two-clock” problem. Koratana says organizations have spent decades building infrastructure for “state” (what exists now), but have built almost nothing for reasoning (how decisions are made). PlayerZero captures both. This architectural insight is subtle but important. Most systems try to predefine the architecture: define your entities, define your relationships, and then fill in the rest. PlayerZero reverses this. Their system connects directly to your existing workflows. When a problem appears in production, a structured alert with full context triggers in Slack. Not a generic error notification, but a structured diagnosis with the reasoning chain already assembled. Engineers can approve the fix from their phones without opening any dashboards.
Why This System Works
I spent a lot of time studying how production engineering teams actually solve this problem. PlayerZero is the most complete implementation of trace architecture for engineering organizations that I’ve seen. When an agent investigates an incident, its trail through the system becomes a decision trace. As enough of these traces accumulate, a world model emerges—not because someone designed it, but because the system observed it. The key entities, the relationships that carry weight, and the constraints that shape outcomes are all discovered through actual agent usage.
Their Sim-1 engine goes even further. Before deployment, it simulates how code changes will behave in complex systems, maintaining consistency across more than 100 state transitions and more than 50 service boundary crossings. Across 2770 real user scenarios, it achieved 92.6% simulation accuracy, while comparable tools achieved 73.8%. This isn’t static analysis decorated with language models; it’s simulation based on observed production behavior. The contextual graph gives Sim-1 something that other code analysis tools don’t have: knowledge of how the system actually behaves under real conditions, not just how the code looks on paper.
But the most important number isn’t accuracy—it’s the learning loop. Every resolved incident, every approved fix, and every simulation result is retained in the context graph. The system gets better each time it’s used because it preserves the reasoning that produced each result, not just the result itself. This is the pattern every AI agent system needs. Not only for production engineering, but for any domain where agents make major decisions. The problem isn’t whether your agent can act—it’s whether your agent system can remember why it acted, learn from that memory, and apply it to the next decision.
From customer case studies, the results are striking. Zuora, a subscription billing company whose platform underpins Fortune 500 infrastructure, uses the technology across its entire engineering team, including to monitor its most valuable code: the billing system. Nylas, a unified API for email, calendars, and scheduling, is another early customer. Both companies operate in categories where reliability failures carry immediate financial and contractual consequences. PlayerZero claims the system cut production issues in half and resolves in minutes work that would otherwise take a 300-person QA team weeks, saving each enterprise customer over 2 million dollars.
Zuora’s case is especially telling. They cut L3 triage time from 3 days to 15 minutes. Teams with proper agent observability report an average resolution-time reduction of 70%. One team went from “only knowing what went wrong three days later” to “knowing within minutes.” That isn’t a theoretical improvement; it’s a huge leap in real operations.
Profound Impact on Software Engineering
I believe PlayerZero represents not just a debugging tool, but a fundamental shift in software engineering paradigms. Think about what would happen to your codebase if every agent decision were permanently recorded and replayable.
Onboarding will change. When new engineers join your team, it won’t be about reading outdated documentation or reverse-engineering git blame anymore—it will be about querying the decision history. Why did we split this service? What failed before refactoring? What trade-offs were evaluated when choosing this architecture? The answers exist because the agent that did the work left traces—not just outputs.
Debugging will change. You won’t ask “what happened” anymore—you’ll start asking “what was the context for step 14.” You won’t guess anymore—you’ll replay. Average resolution time will drop because you’re not rebuilding the scene from fragments. The scene is preserved.
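That “context for step 14” question becomes a simple lookup once traces are persisted. A minimal sketch, with the trace shape and field names assumed for illustration:

```python
import json

# Pretend these are persisted traces from a 20-step agent run
# (the record shape is an assumption, not a real tool's format).
traces = [
    json.dumps({"step": s, "context": f"...step {s} context...", "action": "..."})
    for s in range(1, 21)
]

def context_for_step(trace_lines: list[str], step: int) -> str:
    """'What was the context for step 14?' answered by lookup,
    not by rebuilding the scene from log fragments."""
    for line in trace_lines:
        rec = json.loads(line)
        if rec["step"] == step:
            return rec["context"]
    raise KeyError(f"no trace recorded for step {step}")

step14 = context_for_step(traces, 14)   # "...step 14 context..."
```

The point is not the lookup itself but what makes it possible: the scene was captured at decision time, so nothing has to be reconstructed afterward.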
Product quality will change. Every customer issue solved by your agent gets added to an ever-growing map that shows how your system actually performs under real conditions. Not how you designed it to perform, but how it actually performs. This map compounds. After a thousand resolved incidents, your system understands its failure modes better than any engineer in the team.
The most underestimated shift is that institutional knowledge no longer disappears when people leave. The reasoning behind decisions exists in the trace layer, not in someone’s head. When original authors leave, the codebase no longer “dies.” This is the real unlock. Not faster agents, not smarter agents—but agents that build organizational memory as a side effect of doing the work. Every action leaves a trace; every trace teaches the system, and the system gets better because it remembers.
I also see some criticisms and limitations. The scalability of trace storage is indeed uncomfortable. In a complex agent workflow, each session can generate hundreds of megabytes of trace data. Most teams don’t have infrastructure to store, index, and query this data at scale. Event sourcing solves the immutability and replay problem, but it introduces its own complexity, including compression, projection management, and storage costs.
The observability gap is still huge. Clean Lab surveyed 95 teams running production agents and found that fewer than one-third were satisfied with their observability tools. This is the lowest-scoring component across the entire AI infrastructure stack. 70% of regulated enterprises rebuild their agent stack every 3 months. The tools are still immature.
There’s also a cold start problem. Trace architecture is most valuable when you have history to reference. The first incident you investigate with it won’t feel very different from traditional debugging. The hundredth will feel like an entirely different discipline. But you have to go through the first ninety-nine. Replay fidelity is also hard to achieve. Even with perfect traces, re-running an agent’s decisions in the same context can’t guarantee the same outputs, because the underlying models are non-deterministic. You’re debugging a system whose behavior changes every time you look at it. Trace architecture gives you context, but it doesn’t give you certainty.
We’re at a Turning Point
I firmly believe we’re standing at an important inflection point in the history of software engineering. When AI begins to write most of the code, the ways we do debugging and quality assurance must change fundamentally. Traditional debugging methods—looking at logs, checking stack traces, stepping through code—worked well in the era when humans wrote code, but they’re no longer sufficient in the age when AI agents generate code at scale.
PlayerZero offers more than a technical solution; it’s a new way of thinking. It helps us realize that in the age of AI agents, memory and learning are more important than mere execution capability. A system that can remember why it made a decision is far more powerful than one that can only follow instructions but doesn’t know why. This kind of memory isn’t just logs—it’s structured, queryable, and replayable decision history.
From a business perspective, this also makes sense. When a single production incident can cause losses of millions of dollars, a system that can find the root cause and automatically fix it within minutes is no longer a luxury—it’s a necessity. PlayerZero claims their system reduces production problems by half, saving each enterprise customer over 2 million dollars. For Global 2000 companies, that kind of return on investment is hard to ignore.
I also noticed that PlayerZero provides an interesting guarantee: if they can’t increase your engineering bandwidth by at least 20% within a week, they will donate 10,000 dollars to an open-source project you choose. This guarantee shows confidence in their technology and also indicates that they understand customers need to see real outcomes, not just promises.
The gaps in AI agent systems aren’t in models, tools, or orchestration—those solved problems are being actively productized. The gap is decision memory: not only capturing what happened, but also why it happened. This layer makes debugging possible, enables automated learning, and preserves institutional knowledge. If your agent system can’t answer the question “why did it do that,” for any decision at any time point in its history, then you’re building on sand. Fast sand, impressive sand, but still sand.
Build the trace layer first. Once you do that, everything else will get better. This is the most important lesson I learned from the story of PlayerZero. In the new era of AI programming, we can’t only focus on getting AI to write faster and more—we also have to ensure the code it writes is understandable, debuggable, and improvable. Only then can AI truly become an enabler for software engineering, rather than a new burden.