Vibe Coding Isn't Software Development
Vibe coding feels magical in demos, then breaks on contact with real systems.
17.04.2026, By Stephan Schwab
Everyone wants AI coding agents to follow the rules. So they write long markdown files full of instructions, conventions, and warnings. The agents read them, sort of understand them, and then drift anyway. There is a better approach, and it has been around for over two decades: test-driven development. Tests are executable constraints that fail instantly when an agent goes off track. No interpretation needed. No probabilistic compliance. Red or green.
Every AI coding tool now has its own instruction file format. Cursor has .cursorrules. GitHub Copilot has copilot-instructions.md. Windsurf has its own conventions. The idea is the same everywhere: write down your project’s rules in prose, and the agent will follow them.
Developers pour hours into these files. “Always use factory functions, never constructors.” “Prefer composition over inheritance.” “Never modify the database schema without a migration.” “Use our custom error handling pattern from src/errors/.” The files grow. They become internal style guides, architecture documents, and wish lists rolled into one.
And the agents drift anyway.
Not always. Not immediately. But eventually. The AI produces code that technically satisfies the letter of one instruction while violating the spirit of three others. It invents a new helper function instead of using the existing one. It refactors a module that was working fine because it “noticed” an improvement opportunity. It changes an interface signature because the new approach seemed cleaner.
This isn’t a bug in the model. This is what non-determinism means. LLMs process instructions probabilistically. Every token is a weighted choice. “Always use factory functions” doesn’t become a hard constraint in the model’s reasoning. It becomes a nudge, one signal among thousands competing for attention during generation. Sometimes the nudge wins. Sometimes it loses to a pattern the model saw more frequently in training data.
So developers react by adding more instructions. More specific ones. “When creating a new service class, ALWAYS check src/services/ first for existing patterns.” “NEVER rename existing public methods.” “If you’re unsure about the architecture, ASK.” The file grows to 500 lines. Then 1,000. The agent now has so many competing directives that it can’t consistently satisfy all of them. The instructions contradict each other at the edges, the way all sufficiently detailed prose contradicts itself.
Here’s the fundamental problem: you’re using the most ambiguous communication medium (natural language) to constrain a system that operates on probability distributions. Every sentence you write has multiple valid interpretations. “Prefer composition over inheritance” sounds clear until the agent encounters a case where inheritance genuinely simplifies the code. What does “prefer” mean then? 70% of the time? 90%? Always except when the agent decides otherwise?
Instructions in markdown are hopes. They’re wishes expressed in human language, fed into a system that doesn’t reason about them the way a human colleague would. A human colleague reads “prefer composition” and builds a mental model of your architecture, your team’s history, your past design decisions, the specific codebase context. An LLM reads it and adjusts token probabilities.
The more instructions you add, the more you’re playing a game you can’t win. You’re trying to enumerate every possible situation an agent might encounter and pre-specify the correct behavior. That’s the same trap that BDUF (Big Design Up Front) fell into decades ago. The world is too complex to specify completely in advance. Every new instruction creates new edge cases where instructions conflict.
Now consider what happens when you have a comprehensive test suite.
The AI agent writes code. The tests run. They pass or they fail. There is no interpretation. There is no probabilistic compliance. There is no “I followed the spirit of the instruction.” The build is red or it’s green.
When the agent invents a new pattern instead of using the existing one, the integration tests catch the inconsistency. When it refactors a public interface, the contract tests break. When it changes database behavior, the data integrity tests fail. When it alters business logic, the specification tests scream.
Tests are specifications that execute. They define what the system does, and they verify it continuously. An AI agent can’t sweet-talk its way past a failing assertion.
This is why test-driven development turns out to be the most effective governance tool for AI coding agents in 2026. The practice dates back to 1957, when Daniel D. McCracken described it in Digital Computer Programming: prepare the expected output first, then write code until the actual output matches. Kent Beck rediscovered and formalized it in the early 2000s. Neither of them was thinking about LLMs. But TDD produces the exact artifact that constrains non-deterministic code generators: a dense, executable specification of intended behavior.
Think about what a well-maintained test suite actually represents. It’s a precise, unambiguous, machine-verifiable description of what your software is supposed to do. Not what it should look like. Not what patterns it should follow. What it should do.
A unit test that says `expect(calculateTax(100, 'DE')).toBe(19)` is a specification. It says: for an amount of 100 in Germany, the tax is 19. No LLM can misinterpret that. No amount of probabilistic token selection can make 19 equal 21. The test passes or it fails.
Multiply that by hundreds or thousands of tests. You get a specification so dense and so precise that the AI agent is operating inside a corridor. It can be creative about how it implements something. It can choose different variable names, different control flow structures, different internal patterns. But it can’t change what the code does without immediately triggering failures.
Here’s the part most people miss: tests don’t just catch mistakes after the fact. Modern coding agents read the tests before they write a single line of implementation. The test suite is context. When an agent sees this:
`expect(createUser({role: 'admin'})).toHavePermission('delete')`
It doesn’t just know what to validate. It knows what to build. The tests become the clearest, most unambiguous specification the agent can find in the codebase. Prose instructions compete with training data and context window noise. A test with concrete inputs and expected outputs cuts through all of that. The agent reads it, understands the contract, and generates code to satisfy it. Tests are simultaneously the blueprint and the inspector.
That’s the actual governance you want. You don’t care if the agent uses a factory function or a constructor, as long as the behavior is correct. You don’t care if it restructures internal modules, as long as all the contracts hold. Tests govern outcomes, not style. And outcomes are what matter.
Compare that to a markdown instruction file. “Always use factory functions” governs style, not outcomes. It constrains the wrong thing. The AI might follow it perfectly while producing code that breaks your business logic, and you won’t know until production.
Here’s what happens in practice. Someone with little or no experience in TDD starts using an AI coding agent. The agent produces code that works for the first few prompts. Then it starts drifting. It changes patterns. It renames things. It invents abstractions. The coder notices the drift and does the only thing they know: they write more instructions.
“Don’t rename existing functions.” Added. The agent stops renaming but starts wrapping functions in unnecessary adapters. “Don’t add adapter layers without asking.” Added. The agent now asks before every change, which slows everything down to the point of uselessness. “Only ask for confirmation on architectural changes.” Added. The agent interprets “architectural” differently than the developer does. More drift.
This is the vibe coder’s treadmill. Every instruction is a patch for the previous instruction’s failure. The instruction file becomes a growing pile of special cases, contradictions, and increasingly desperate ALL-CAPS warnings. Some developers end up with files that read like legal contracts, full of “MUST”, “SHALL NOT”, “UNDER NO CIRCUMSTANCES”, as if threatening an LLM with contractual language would make it comply.
It won’t. The model doesn’t understand threat or obligation. It processes text and predicts tokens. The uppercase words might slightly increase the probability of compliance, the same way writing “IMPORTANT” in an email slightly increases the chance someone reads it. Slightly.
Meanwhile, a developer who practices TDD works differently. In practice, the agent writes both test and implementation together. That’s fine. Speed matters. But here’s the critical difference: those tests exist before the next change. When the agent touches that code again, the tests from the previous round act as guardrails. They define what must not break. The agent drifts? The test fails. Immediately. No instruction file needed. No negotiation with a probability machine.
The human’s job shifts. You don’t write every test by hand anymore. You review the tests the agent wrote, make sure they capture the right behavior, and tighten them where they’re too loose. Then you move on. The next time the agent modifies that module, it runs into a wall of specifications it can’t ignore. Each round of development leaves behind constraints for the next round. The test suite grows into an increasingly dense safety net, built collaboratively between human and machine.
And here’s something that makes this workflow even more powerful: you don’t have to read every test and every line of code yourself. You can ask the agent about them. “Does any test verify that expired tokens are rejected?” “What happens if the payment amount is zero?” “Show me which tests cover the user deletion flow.” You’re interrogating the codebase through conversation, verifying your assumptions without manually tracing through files. It’s like having a witness on the stand who has read every line and can answer instantly. The tests become queryable documentation, and the agent becomes the interface to it.
But asking the right questions is where actual experience matters. Someone who has built systems, shipped them, watched them fail at 3 AM, dealt with race conditions and data corruption and cascading timeouts knows what to probe for. They ask about edge cases because they’ve been burned by edge cases. They ask about cleanup on failure because they’ve debugged orphaned resources. They ask about concurrency because they’ve seen what happens when two requests hit the same row. The agent can answer any question you throw at it. It can’t tell you which questions you’re forgetting to ask. That’s the gap between a developer and someone who just learned to prompt.
The reasonable objection is: tests don’t govern everything. They don’t enforce coding style. They don’t prevent the agent from using tabs instead of spaces, from writing overly clever one-liners, from choosing unfortunate variable names.
True. But you know what does? Linters. Formatters. Static analysis tools. Ruff, ESLint, Prettier, Checkstyle, whatever your stack provides. These tools enforce style deterministically. They run, they flag violations, the build fails. Just like tests.
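For instance, a minimal ESLint flat config (`eslint.config.js`) might look like this. The rules are real ESLint core rules, but the selection is illustrative, not a recommendation:

```javascript
// eslint.config.js — a minimal sketch. Violations fail the build,
// deterministically, the same way a failing test does.
module.exports = [
  {
    rules: {
      'no-unused-vars': 'error',              // dead bindings are flagged
      'eqeqeq': 'error',                      // require === over ==
      'max-lines-per-function': ['warn', { max: 80 }], // keep functions short
    },
  },
];
```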
Architecture? If your architecture matters, express it in tests. Write tests that verify module boundaries. Write tests that check dependency directions. Write tests that ensure certain packages don’t import from certain other packages. ArchUnit for Java does this explicitly. Other languages have equivalents. If an architectural rule exists only in a markdown file, it’s a suggestion. If it exists in a test, it’s a constraint.
The pattern is always the same: things that run and fail are constraints. Things that sit in a file and hope to be read are suggestions. AI agents are bad at following suggestions consistently. They’re very good at making tests pass.
There’s an irony here that’s worth sitting with. The most hyped technology of the decade, AI-assisted coding, is best governed by a practice older than most programming languages. McCracken described test-first programming the same year Fortran shipped. Kent Beck rediscovered it more than four decades later. The core idea never changed: define what correct looks like before you write the code. That discipline gives you small steps, fast feedback, and verified behavior. It turns out those exact properties are what you need when a non-deterministic agent is writing your code.
Small steps mean the agent can’t go far off track before hitting a test boundary. Fast feedback means drift is caught in seconds, not days. Verified behavior means you have an objective, automated judge of whether the agent’s output is acceptable.
Vibe coding skips all of this. The vibe coder prompts, the agent generates, the coder eyeballs the output and pushes to production if it “looks right.” That works for throwaway prototypes. It fails catastrophically for anything that needs to work reliably, be maintained over time, or be modified by someone other than the original prompter.
This is also why AI isn’t replacing developers. It’s replacing the illusion that you could get by without actually understanding what you’re building. The people who feared replacement were often the ones who couldn’t articulate what their code should do beyond “it works on my machine.” AI didn’t create that problem. It exposed it. A developer who knows how to write tests, ask the right questions, and spot structural weaknesses will use AI to move three times faster. A developer who was coasting on copy-paste and Stack Overflow will find that the AI can do that part too, and cheaper. The threat isn’t artificial intelligence. The threat is having entered the profession for the salary instead of the craft. Good developers with AI will outperform mediocre teams without it. That’s not a prediction. It’s already happening.
The vibe coder who discovers their agent keeps drifting has two choices. Learn TDD. Or keep writing instructions that will never be enough. Most choose the second option because it feels like progress. The file gets longer. The agent keeps drifting. But at least they’re doing something.
If you’re using AI coding agents and don’t have a test suite, start building one. Not as an afterthought. Before you prompt the agent.
Write a failing test that describes what you want. Let the agent propose an implementation. Run the test. If it passes, write the next test. If it fails, the agent’s proposal was wrong. Tell it so. Let it try again, with the failing test as the specification.
This workflow is faster than writing detailed instructions, because you’re giving the agent an unambiguous success criterion instead of hoping it interprets your prose correctly. And every test you write stays. It becomes part of the permanent specification. The next time the agent touches that code, the test prevents regression. Your instruction file? The agent forgets it exists the moment the context window fills up with something else.
If you already practice TDD, you have your answer. Keep doing what you’re doing. Your test suite is the best AI governance tool money can’t buy. The agents will get better. The models will get smarter. But they will remain non-deterministic probability machines, and they will always need hard constraints. Your tests are those constraints.
None of this means instruction files are useless. They just shouldn’t try to do the job that tests and tooling do better. A short instruction file that establishes how the agent should work beats a long one that tries to specify every coding decision. Here’s a starting point. Put it in your copilot-instructions.md, claude.md, or whatever your tool expects.
## Development Workflow
Follow test-driven development strictly:
1. Write a failing test first
2. Implement the minimum code to make it pass
3. Refactor while keeping all tests green
4. Never write production code without a corresponding test
Run the full test suite after every change. Do not consider
a task complete until all tests pass.
## Code Principles
- Don't Repeat Yourself. Extract shared logic into functions
or modules. If you see duplication, fix it.
- Keep functions short. One responsibility per function.
- Prefer composition over inheritance.
- No dead code. If it's not tested and not called, delete it.
## What Not to Do
- Never modify or delete an existing test to make your
implementation work. If a test fails, your code is wrong.
- Never skip tests to "fix later."
- Never add dependencies without verifying they're already
in use in this project.
## Style and Formatting
Defer to the project's linter and formatter configuration.
Do not override them.
That’s roughly 30 lines. It tells the agent how to work, not what to build. The “what” lives in the tests. And unlike a 500-line instruction file full of contradictory architecture edicts, this one is short enough that the agent will actually keep it in context.
Is 30 lines enough? Try it. Add rules only when you see the agent repeatedly doing something that tests and linters can’t catch. Most of the time, you won’t need to.
The rest is vibes.