Stop Arguing From Memory About Code
Teams fight over remembered code long after the code changed. The fix is evidence, not louder opinions.
10 min read
19.06.2026, By Stephan Schwab
AI coding tools are excellent at continuing a pattern. They are much weaker at deciding when the pattern should end. I built a multi-mode web component with agentic coding and watched the models extend it cheerfully into a hostage situation: mode selection, prompt handling, option rendering, workflow branching, and four sets of special-case behavior all crammed into one body that used to be simple. No model stopped to say the architecture was wrong. That part still required human judgment. The tools did not fail — they did exactly what they are designed to do. The problem is that people keep expecting continuation engines to volunteer the refactor, and they don't. Design judgment is not being automated. It is being quietly handed back to whoever is paying attention.
The setup was not exotic.
No giant framework. No dependency carnival. Just web components, deliberately chosen to keep the surface area under control and avoid dragging in half the JavaScript ecosystem for a problem that did not need it.
That choice aged well. The component did not.
The more features I asked for, the more the models kept doing what these tools are very good at: extending the local pattern in front of them. New mode? Add another branch. New option? Add another conditional. New workflow? Thread another state through the same component and hope nobody notices the smell.
They were not being stupid. They were being statistically obedient.
That is the part many people still refuse to understand. Claude Code, Codex, and similar tools are strong at continuing the work. They are much weaker at deciding that the current shape of the work is wrong.
A growing component can look productive for quite a while.
Each change works locally. The demo survives. The new button appears. The new mode does the thing. The diff looks plausible. The branch count rises politely in the background.
Then one day the component is no longer a component. It is a hostage situation.
That was happening to neo-chat-box. The component had become the place where too many decisions met each other:
None of those concerns is absurd by itself. Shoving all of them into the same body is how you grow a very modern monster.
The models did not protest. Why would they? The repository showed them a component that already handled multiple modes, so they kept reinforcing that shape. If you ask for image controls in a multi-mode chat box, the tool does not naturally ask whether those controls belong in a specialized child component. It usually asks how quickly it can add the next conditional.
That is continuation. Not design.
The way out was not another grand specification. It was a sequence of targeted questions.
What is genuinely shared across all modes?
What changes only because the user is creating an image instead of sending a plain message?
Which pieces represent stable chat behavior, and which pieces are really mode-specific configuration UIs pretending to be chat behavior?
Where is the component branching because the business concept differs, not because the rendering detail differs?
Which decisions should be visible in the UI as explicit composition rather than hidden in conditionals?
That investigation led to the obvious answer that the models had not volunteered on their own. The generic chat shell should stay generic. The mode-specific behavior should move into specific components. The user should select mode-specific options through dedicated UI elements instead of forcing one swollen component to impersonate four products badly.
So the refactor direction became composition.
A regular chat can stay regular chat.
An image workflow can expose image-specific choices through a focused component.
A video workflow can do the same for video.
An email workflow can surface the controls and constraints that belong to email work instead of pretending those are just another branch of a generic conversation box.
Same app. Fewer lies.
A lot of people now watch demo videos and conclude that human design judgment is on borrowed time.
The demos are persuasive because the tools are real. They can inspect a repository, generate code quickly, thread changes through several files, and recover from obvious mistakes. That part is no longer hypothetical.
The mistaken leap comes right after that.
People see the fluent output and infer that the machine will also recognize when the structure itself needs to change. They assume it will volunteer the uncomfortable refactor, reject the convenient branch, and simplify the design before complexity hardens into maintenance cost.
Usually it will not.
Anthropic’s own guidance on building effective agents is much more sober than the public fantasy. They argue for simple, composable patterns over unnecessary framework complexity, warn about compounding errors in autonomous agents, and say coding agents work especially well where solutions are verifiable through automated tests. They add the sentence that matters most here: human review remains crucial for ensuring solutions align with broader system requirements.
That is not a minor footnote. That is the whole argument.
Broader system requirements are exactly where the trouble lives:
Those are not autocomplete problems.
OpenAI’s paper on why language models hallucinate makes the adjacent point from the model side. These systems are still rewarded too often for confident guessing instead of calibrated restraint. The practical consequence in coding is familiar: if you leave the door open for a plausible continuation, the model will often choose continuation over hesitation.
That does not always show up as a fabricated API call. Sometimes it shows up as fabricated confidence in a design direction.
The component already has four modes? Fine, let us add a fifth branch.
The class already owns UI rendering, state transitions, workflow options, and mode-specific behavior? Fine, let us keep all of that in one place and make the names longer.
The tool is not evil. It is not lazy. It is just not carrying the cost of the future structure the way an experienced developer does.
If you have spent years cleaning up systems that got “just one more branch” added to them for two quarters straight, you can feel the damage early. The tool usually cannot. Or more precisely, it will not act on that feeling unless you force the conversation there.
The belief that code generation naturally leads to cleaner systems was always lazy. External evidence has become less patient with it.
GitClear’s report Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality analyzed roughly 153 million changed lines of code and found rising churn, more added and copy-pasted code, and less evidence of reuse. Their summary is blunt enough on its own: AI-heavy code changes increasingly resembled the work of an itinerant contributor rather than a careful long-term maintainer.
That does not mean AI coding tools are useless. It means speed is real and design discipline is still required.
The messy truth is more interesting than the hype.
These tools can help experienced developers move much faster. They can also industrialize mediocre structure at shocking speed if nobody is actively steering architecture, boundaries, and tests.
My neo-chat-box example is exactly that in miniature. The assistant was useful for implementing changes. It was not spontaneously protecting the component from accumulated design debt.
This part is worth watching because it cuts through the marketing fog.
The labs selling coding magic are also hiring large numbers of software developers for deeply judgment-heavy business and enterprise work.
Anthropic has advertised a Software Engineer, Enterprise Foundations role for its Claude for Work initiative. The description is not about typing speed. It is about enterprise-grade features, security, compliance, scalability, identity management, governance, role-based access control, and industry-specific solutions for healthcare, finance, and education.
OpenAI’s career listings now read the same way. Their public careers page includes roles across Forward Deployed Engineering, B2B Applications, Internal Applications, Finance, and enterprise deployment work. A published Backend Software Engineer, Enterprise AI Platform role described work on secure and compliant systems, enterprise data access, authentication, reliability, and customer-managed infrastructure so that large organizations can actually run agents safely in production.
That is the tell.
The frontier labs are not staffing as if software development has collapsed into prompting. They are staffing as if the valuable work is moving into the hard middle where business constraints, security, enterprise identity, deployment reality, compliance, and product judgment all collide.
Because that is exactly what is happening.
I did not need the model to replace judgment. I needed it to help me apply judgment faster.
That is a much saner expectation.
Once the design questions were clear, the tool became useful again:
That is strong leverage.
But notice the ordering.
The leverage arrived after the design judgment, not instead of it.
The valuable human contribution was recognizing that the problem was no longer “add another feature” but “stop one component from pretending to be four different products.” Once that decision had an owner, the AI could accelerate execution.
Without that decision, it would have kept smiling and adding branches.
When a component starts swelling under AI-assisted development, these questions help more than another heroic prompt:
The last question matters because it turns design taste into evidence.
A good refactor is not just prettier. It should make later change cheaper, testing clearer, and mode-specific behavior less entangled.
The most damaging AI myth in software right now is not that the models are useless.
It is that good demos and early wins prove human judgment is becoming optional.
They do not.
They prove the clerical parts of software development are getting cheaper. They prove that continuation is faster. They prove that one capable developer can now move much more quickly from idea to implementation.
They do not prove that the machine will decide when the architecture has started lying.
My web component did not need a more enthusiastic assistant. It needed someone willing to say the current shape was wrong, ask targeted questions, and choose composition over branch accumulation.
That is why Claude Code and Codex are powerful tools and still not substitutes for judgment.
They can help build the thing.
Someone still has to decide what the thing should be.
Let's talk about your real situation. Want to accelerate delivery, remove technical blockers, or validate whether an idea deserves more investment? I listen to your context and give 1-2 practical recommendations. No pitch, no obligation. Confidential and direct.
Need help? Practical advice, no pitch.
Let's Work TogetherA senior developer for your team
Our Developer Advocate writes production code with your team, improves the pipeline, and accelerates delivery. 60-70% coding, 30-40% coaching. A temporary teammate who ships from day one.