Your AI Agent Says All Tests Pass. Your App Is Still Broken
How Knight Rider Testing Gave Me My Nights Back
There is a moment every developer using AI coding agents knows well. You wake up, check your terminal, and see the beautiful green wall: 47 tests passed, 0 failed. You open the app. The button does nothing. The layout is sideways. The feature you asked for doesn't exist. The agent rewrote half the codebase, generated tests that validate its own hallucinations, and declared victory. You are back to square one, except now you also have to understand 2,000 lines of code you didn't write.

I call this the Vibe Coding Death Spiral. You prompt, the agent codes, the agent tests, the agent passes, and nothing actually works. You correct it, and it "fixes" things by rewriting what was already working. The tests still pass because the tests were written by the same agent that wrote the broken code. It is like hiring someone who grades their own homework and wondering why they always get an A.

After months of this cycle, I stopped asking how to write better tests. I started asking a different question entirely: what if the thing that validates my app isn't a test suite at all?

The fundamental issue is not that AI agents write bad tests. The issue is structural. When the same agent writes the implementation and the tests in the same session, the tests become a mirror of the agent's understanding, not a mirror of your intent. If the agent misunderstands the feature, it writes code that does the wrong thing and tests that verify the wrong thing does it correctly. Everything is green. Everything is wrong.

There is a second problem that compounds the first, and it is one the industry is only beginning to reckon with. AI coding agents do not just write wrong tests. They write too many tests. An agent asked to "add comprehensive test coverage" will happily generate 1,000 unit tests in a single session. Each test is syntactically correct. Each test passes. And you now own 1,000 tests that you did not write, do not fully understand, and must maintain for the lifetime of the project.
A thousand tests means a thousand things that break when you refactor a component, rename a prop, or change an API response shape. Every broken test demands investigation: is this a real regression, or did the test couple itself to an implementation detail that no longer exists? When the answer is the latter, and with AI-generated tests it almost always is, you face a choice. Fix the test manually, which defeats the purpose of having an agent. Ask the agent to fix it, which starts the spiral again. Or delete the test, which means the coverage number you were so proud of was never real.

The problem is worse for end-to-end tests. A flaky E2E test is not merely useless. It is actively destructive. It trains the team to ignore failures. When the suite reports 847 passed and 3 failed, and those 3 have been failing intermittently for weeks, nobody investigates. The signal disappears into noise. The one real regression that slips in next Tuesday looks identical to the three false positives the team has been dismissing since last month. A flaky test suite is arguably worse than no test suite at all, because no test suite at least has the honesty to tell you that nothing is being validated.

Traditional test suites compound all of this in the AI era. They are brittle, selector-dependent, and expensive to maintain. Every time an agent refactors a component, the tests break not because the feature broke, but because a CSS class changed or a div moved. The agent then "fixes" the tests by updating them to match the new broken state. The spiral tightens.

What we actually need is something that answers one question: does the app work the way a human would expect it to? Not "does function X return value Y" but "when I type a message and press Enter, does a response appear on screen?"
Knight Rider is a testing pattern where an autonomous AI agent drives your live application through a persistent harness, reads real application state, takes screenshots, analyzes what it sees, and reports what is actually broken. There are no test scripts. There are no selectors to maintain. The agent explores the app the way a human QA tester would, except it does it at 3 AM while you sleep.

The name is deliberate. The AI is KITT: it has eyes on the road (screenshots), instruments on the dashboard (store state), and the ability to steer (send commands). You are Michael Knight; you set the destination (the test suite), hand over the wheel, and go to bed.

Here is the architecture:

```
┌─────────────────────────────────────────────────────────┐
│                   YOU (Michael Knight)                  │
│         Define test suite · Launch crew · Sleep         │
└──────────────────────────┬──────────────────────────────┘
                           │ spawns
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
  │   Agent 1    │  │   Agent 2    │  │   Agent 3    │
  │  Stability   │  │   UX Code    │  │    Polish    │
  │ (read-only)  │  │   Changes    │  │   Changes    │
  │              │  │              │  │              │
  │ 5 iterations │  │   Modifies   │  │   Modifies   │
  │  × 23 tests  │  │   files A    │  │   files B    │
  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
         │ socket          │ socket          │ socket
  ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐
  │  Electron 1  │  │  Electron 2  │  │  Electron 3  │
  │  (isolated)  │  │  (isolated)  │  │  (isolated)  │
  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
         │                 │                 │
         └────────┬────────┴─────────────────┘
                  │ all complete
        ┌─────────▼──────────┐
        │      Agent 4       │
        │  Final Validator   │
        │                    │
        │  1. pnpm build     │
        │  2. Full 23-test   │
        │  3. Screenshot     │
        │  4. Ship / Block   │
        └────────────────────┘
```
The system has three pieces.

The Harness. A lightweight server, roughly 400 lines of TypeScript, that keeps your application alive between commands. It exposes a Unix socket that accepts simple text commands: click a button, fill an input, press a key, take a screenshot, read a value from the application's state store. It wraps Playwright's Electron support but strips away everything except the raw primitives. No page objects, no fixtures, no assertion libraries. Just: do this thing, tell me what happened.

The Test Suite. A plain-English list of what to verify. Not 1,000 tests that pass. Twenty-three tests that mean something. "Create a new tab, send a message, confirm a response arrives within 25 seconds." "Open settings, verify it renders, close it." "Create 6 tabs rapidly, confirm no crash." The AI agent reads this list and figures out how to execute each test using the harness commands. You define what to check. The agent figures out how.

The Agent. Any LLM with tool-calling capability: Claude, GPT-4, Gemini, or similar. It reads the test suite, drives the harness, interprets results, takes screenshots when something looks wrong, and writes a report. Because the agent did not write the application code, it has no bias toward leniency. It is a genuinely independent validator.

The pattern becomes powerful when you run multiple agents in parallel. Each agent gets its own instance of the application through a separate socket. They do not interfere with each other. One agent runs stability validation: five iterations of the full test suite, looking for intermittent failures. Another agent makes UX improvements to the codebase. A third handles cosmetic polish. A fourth agent, which depends on the first three completing, rebuilds the application with all changes and runs a final validation pass.

If you want to try this yourself, the barrier is lower than you think. The harness is the only piece you need to build. Everything else is prompt engineering.
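To make the harness concrete, here is a minimal sketch of its command loop. This is not the article's actual implementation; the `Driver` interface is a stand-in for whatever automation layer you wrap (in the article's setup, Playwright's Electron page object), and all names are illustrative:

```typescript
// A stand-in for the automation layer. In a real harness each method
// would delegate to a Playwright page driving the live Electron app.
interface Driver {
  click(id: string): Promise<void>;
  fill(id: string, text: string): Promise<void>;
  press(key: string): Promise<void>;
  screenshot(): Promise<string>;         // returns path to the saved image
  store(path: string): Promise<unknown>; // read app state by dotted path
}

// One text command in, one text reply out.
// No page objects, no fixtures, no assertion libraries.
async function handleCommand(driver: Driver, line: string): Promise<string> {
  const [cmd, ...args] = line.trim().split(/\s+/);
  switch (cmd) {
    case "click":
      await driver.click(args[0]);
      return "ok";
    case "fill":
      await driver.fill(args[0], args.slice(1).join(" "));
      return "ok";
    case "press":
      await driver.press(args[0]);
      return "ok";
    case "screenshot":
      return await driver.screenshot();
    case "store":
      return JSON.stringify(await driver.store(args[0]));
    case "sleep":
      await new Promise((resolve) => setTimeout(resolve, Number(args[0])));
      return "ok";
    default:
      return `error: unknown command "${cmd}"`;
  }
}
```

In the full harness, a `node:net` server listening on a Unix socket path would feed each incoming line to `handleCommand` and write the reply back; the agent talks to it with any socket client.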
Start with a persistent process that launches your application and keeps it alive. Expose a socket or HTTP endpoint that accepts commands. You need roughly ten commands to cover most applications:

```
┌──────────────────────────────────────────────┐
│               Harness Commands               │
│                                              │
│  Interact        Observe         Control     │
│  ─────────       ───────         ───────     │
│  click <id>      screenshot      press <key> │
│  fill <id> <t>   store <path>    sleep <ms>  │
│  hover <id>      text <id>       quit        │
│  select <id>     count <id>                  │
│  type <text>     list-testids                │
│                  eval <js>                   │
│                  dom                         │
└──────────────────────────────────────────────┘
```
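The `store <path>` command above can be backed by a few lines that walk a dotted path through the application's state. A sketch, assuming the store is readable as a plain object (for example, Zustand's `getState()`):

```typescript
// Resolve a dotted path like "tabs.1.messages.length" against app state.
// Returns undefined (rather than throwing) when a segment is missing,
// so the agent gets a readable reply instead of a stack trace.
function resolveStorePath(state: unknown, path: string): unknown {
  return path.split(".").reduce<unknown>(
    (node, key) =>
      node == null ? undefined : (node as Record<string, unknown>)[key],
    state,
  );
}
```

Array indices and property names both work as path segments, since JavaScript arrays accept string indices; `"tabs.1.messages.length"` reads the message count of the second tab.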
If your application uses a state management library like Zustand, Redux, or MobX, expose the store on the window object so the agent can read it directly. This is the key insight that makes Knight Rider work. The agent does not just look at the DOM. It reads the actual data structures the UI renders from. When the DOM says one thing and the store says another, you have found a real bug.

The test suite is a markdown document. Each test has a name, a description of what to verify, and the success criteria. Do not write implementation steps. Write intent. "Send a message and confirm a response arrives," not "call fill on message-input with text hello, then call press Enter, then poll tabs[1].messages.length until it equals 2." The agent is better at figuring out the implementation steps than you are at predicting them.

The agent prompt needs three things: the command reference for your harness, the test suite, and instructions to write a report. Tell it to take screenshots at key moments. Tell it to read store state to verify results, not just check whether elements exist in the DOM. Tell it that tab indices shift when tabs are created or closed, so it should always verify positions before acting. These are the kinds of lessons that take a human tester hours to learn and an AI agent one sentence to internalize.

Knight Rider is not a replacement for a skilled QA engineer. A human tester will find bugs that no AI agent would think to look for. It is not free: running multiple LLM agents overnight against a live application costs real money in API calls, and you should budget for that before committing to the pattern. And the agent itself can hallucinate. It can skip a test because it could not figure out the harness command and report 23 out of 23 when it actually ran 22. You need to read the reports, not just the summary line.

What Knight Rider does is fill a specific gap. It runs while you do not.
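To illustrate "write intent, not implementation steps," a few entries from such a markdown suite might read like this (the tests are the article's own examples; the heading format is an assumption):

```markdown
## 7. Message round-trip
Create a new tab, send a message, confirm a response arrives within
25 seconds. Verify via store state (message count on the active tab),
not just the DOM.

## 8. Settings
Open settings, verify it renders, close it.

## 9. Rapid tab creation
Create 6 tabs rapidly, confirm no crash.
```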
Before this pattern, my workflow was: prompt the coding agent, wait, review the code, open the app, find the bugs, explain the bugs, wait, review again. Each cycle took 30 to 60 minutes of my active attention.

After Knight Rider, my workflow is: prompt the coding agent, define the validation suite, launch the crew, go to sleep. In the morning I read the report. If it says 23 out of 23, I review the diff and ship. If it says 19 out of 23, I read the four failures and decide whether to fix them myself or send another agent.

Not better tests. Not more tests. Fewer tests that actually mean something, validated by an agent that is autonomous, independent, and tireless. You define what "working" means. The agent checks whether it is true. You sleep.