Last week, I wrote about the score nobody reads — how skills and verification rules mean nothing without enforcement architecture. That article ended with a question: What do you actually enforce, and where does the knowledge come from?
I found the answer in a book. Not an AI paper. Not a framework for prompt engineering. A book about testing, written by people who’ve been refining these ideas since before large language models existed.
The Book and the Fleet
While shipping six releases of the Agentic QE platform this week, I was also finishing Bach and Bolton’s “Taking Testing Seriously” — a book about the Rapid Software Testing (RST) framework, including the latest Heuristic Test Strategy Model (HTSM).
I wasn’t reading it to debug my agents or diagnose a production failure. I was reading it to go back to basics, to remind myself what good testing actually is, and to see whether I could read something new that would sharpen the skills I’m encoding into my QE fleet.
I read plenty.
The book is dense with questions, personal experiences, and examples contributed by practitioners I respect, some of whom I know personally, such as Lalit Bhamare, Paul Holland, Huib Schoots, and Keith Klain. If you’re serious about testing and haven’t read it yet, stop reading this article and go order it. I mean that.
What struck me most was the parallel. They condensed decades of collective testing knowledge into a book: structured, searchable, cross-referenced. I’m condensing the same kind of knowledge into agents and skills: codified good practices that others can use, learn from, and adapt to their own context. Lalit did exactly this when he contributed his QCSD framework as a skill for the fleet. The medium is different. The mission is the same: make hard-won experience transferable.
And then the book started talking back to my week.
When the Tests Pass But the System Doesn’t Work
One of the week’s releases shipped with a tool prefix mismatch. Tests passed. Code compiled. CI was green. But the agents couldn’t actually call their tools at runtime. The system looked healthy by every automated measure. It was broken for every user.
How do you categorize that failure?
The HTSM has an answer. It identifies eight categories of product factors, and this bug doesn’t fit where most engineers would put it. It’s not a function problem; the tools work. It’s not a structure problem; the code compiles. It’s a Composition problem — how the agents integrate with the MCP layer. And it’s a Testability problem; nothing in the test suite was designed to catch a mismatch between what agents reference and what the system actually exposes.
The testability framework goes deeper: controllability (can you set up the test conditions?), observability (can you see what the agent actually did?), and decomposability (can you test the agent-MCP interface in isolation?). Through these lenses, the diagnosis was instant. Low observability: the prefix mismatch was invisible until runtime. Poor decomposability: no way to test tool resolution without launching the full agent.
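The decomposability point is concrete enough to sketch. A minimal, hypothetical check (the tool names and function are my illustration, not the platform's actual code) that diffs what agents reference against what the runtime exposes — exactly the kind of isolated test that would have caught the prefix mismatch before launch:

```python
# Hypothetical sketch: test the agent-tool resolution in isolation,
# without launching a full agent. Names are illustrative.

def find_tool_mismatches(agent_refs: set[str], exposed_tools: set[str]) -> dict[str, list[str]]:
    """Diff the tool names agents reference against the tools actually exposed."""
    return {
        "missing": sorted(agent_refs - exposed_tools),  # referenced but never exposed
        "unused": sorted(exposed_tools - agent_refs),   # exposed but never referenced
    }

# The prefix bug in miniature: agents reference prefixed names,
# the server registers unprefixed ones. Every call is unresolvable.
agent_refs = {"mcp__aqe__run_tests", "mcp__aqe__coverage"}
exposed = {"aqe__run_tests", "aqe__coverage"}
print(find_tool_mismatches(agent_refs, exposed)["missing"])
```

A check like this runs in milliseconds in CI, which is the whole argument for decomposability: the interface becomes testable without the system.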
The same week, Nate reported a batch of bugs that were invisible on my machine. Hook paths that embedded my machine’s absolute path instead of being portable. Settings that overwrote user permissions instead of merging them. Frontmatter whose blank lines broke parsers. In HTSM terms, a Platform and Operations problem. I’d been testing on my own machine, in my own environment, with my own configuration. A testability failure in the availability dimension: the right test platform was never even accessible.
I already knew this intuitively. The book reminded me of the vocabulary to name it, and more importantly, to encode that vocabulary into the skills themselves so the agents could use it too.
The Polarities We Already Use
The industry conversation treats classical testing and agentic systems as separate worlds. They’re not.
“Taking Testing Seriously” describes Exploratory Polarities: paired approaches that sharpen thinking through deliberate switching. Focused vs. Diversified. Intense vs. Relaxed. Active vs. Passive. Your mind gets sharper from the very act of switching between opposed modes of attention.
I recognized this immediately. My /brutal-honesty-review skill is Focused — attack this specific implementation. My /sherlock-review skill is Diversified — follow the evidence wherever it leads. When I use them in sequence, I catch things neither would find on its own. I’d designed these as agentic tools, knowing I was implementing a principle from classical exploratory testing.
The testing skills taxonomy has five categories: modeling, coverage design, oracle design, judgment, and engagement. Each maps to a capability my QE agents need. The coverage analyst needs modeling skills. The adversarial Devil’s advocate agent needs oracle design — knowing what “correct” looks like. The fleet orchestrator needs engagement skills — managing the testing process itself. Same categories. Same reasoning. Machine speed and machine coverage.
Not analogy. Isomorphism.
Where Trust Actually Lives
The same week, I was working through something else: how the industry’s trust mechanisms are evolving, and where most teams are stuck.
TDD anchored trust in code. Test passes, code works. Trust lives at the function level.
BDD moved trust to behavior. Scenario passes, feature works as specified. Business stakeholders can read the trust artifact.
EDD — Andrea Laforgia’s Expectation-Driven Development, published the same week — anchors trust in expectations. Write what you expect before the agent writes code, then verify against those expectations. Laforgia’s key insight: when an agent “proves” code works, you must distinguish executed evidence (the agent actually ran the tests) from generative evidence (the agent produced text that looks like test output but was generated, not executed). That distinction is lethal. Most teams don’t make it.
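The executed-versus-generative distinction has a simple operational form. A minimal sketch, assuming you can spawn the test process yourself (the function name is mine, not Laforgia’s): trust the exit code of a process you ran, never a transcript the agent produced:

```python
import subprocess
import sys

def executed_evidence(test_cmd: list[str]) -> dict:
    """Run the test command ourselves and record the result as evidence.

    The point of EDD's distinction: the exit code of a process *we* spawned
    is executed evidence. Text from an agent claiming "all tests passed"
    is generative evidence, and proves nothing."""
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return {
        "cmd": test_cmd,
        "passed": result.returncode == 0,
        "stdout_tail": result.stdout[-500:],  # keep a bounded artifact for audit
    }

evidence = executed_evidence([sys.executable, "-c", "assert 1 + 1 == 2"])
print(evidence["passed"])  # True because we ran it, not because the agent said so
```

The design choice is that the verifier, not the agent, owns the subprocess. An agent can fabricate output; it cannot fabricate the return code of a process it never controlled.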
ODD — Outcome-Driven Development — anchors trust in delivered outcomes. Not “did the test pass” but “did the user get the value we promised?”
Each transition relocates trust one level further from code and one level closer to the customer. Each requires better quality engineering, not less.
This played out concretely during the week. A QE swarm analysis found that benchmark runs had leaked junk data into my working AQE learning database, inflating quality scores. Tests passed. Behavior matched specification. But the outcome — “the learning system provides accurate quality intelligence” — was compromised because the data was lying. I caught it only because I asked the ODD-level question: Does this system actually help users make better decisions?
That question doesn’t live in any test suite. It lives in the judgment of someone who understands what the system is for.
Anthropic’s Study on How AI Agents Are Used
In February 2026, Anthropic published the first large-scale empirical study of how AI agents are actually used, analyzing millions of interactions. Two findings stood out.
Co-constructed autonomy. Agent autonomy is not a fixed capability level. It’s emergent, shaped by the agent’s capability, the user’s trust, and the task’s requirements. The same agent, with the same user, operates with high autonomy on familiar tasks and requires close supervision on novel ones. Autonomy develops through repeated interaction. You don’t configure it. You grow it.
Oversight evolution. Experienced users shift from reviewing every step to monitoring outcomes. This is learned efficiency, not laziness. But it creates a specific risk: when monitoring catches a failure, the user must switch back to close oversight instantly. The system needs to support that switch.
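That switch can be made mechanical. A minimal sketch of the idea as a ratchet (my illustration, not anything from the Anthropic study or my platform): trust escalates slowly through repeated success and collapses to step-by-step review the instant a monitored failure appears:

```python
from enum import Enum

class Oversight(Enum):
    STEP_REVIEW = "review every step"
    OUTCOME_MONITOR = "monitor outcomes only"

# Illustrative threshold: how many consecutive successes earn looser oversight.
TRUST_THRESHOLD = 10

def next_mode(current: Oversight, consecutive_successes: int, failure: bool) -> Oversight:
    """One-way ratchet up, instant drop down.

    Autonomy is grown through repeated success, never configured;
    a single caught failure resets oversight to close review."""
    if failure:
        return Oversight.STEP_REVIEW
    if current is Oversight.STEP_REVIEW and consecutive_successes >= TRUST_THRESHOLD:
        return Oversight.OUTCOME_MONITOR
    return current
```

The asymmetry is the point: there is no gradual de-escalation path after a failure, because the moment monitoring catches something is exactly when generative evidence stops being acceptable.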
Both findings mapped directly onto my week. My oversight had evolved; I was monitoring outcomes, trusting the release pipeline to run autonomously. When the prefix mismatch surfaced, I had to drop back to reviewing every step. The system didn’t support that transition. There was no “show me proof you actually executed the verification” mechanism.
The Rapid Software Testing framework has been saying this for decades: testing is a human performance. Human skill, judgment, and engagement determine the quality of the outcome — whether that testing is done by hand, via scripts, or by agents. The agent changes the scale. It doesn’t change the epistemology.
What This Looks Like in Practice
Here’s where classical and agentic converge. Not in theory. In the code shipped this week.
Evolving work products. “Taking Testing Seriously” describes artifacts that are refined through actual testing experience, not designed once and for all from theory. This week, I restructured all the QE skills with gotchas sections drawn from production learning data. Each skill now carries its own failure history — “here’s what goes wrong when you use this” — derived from real outcomes, not textbook warnings. I stripped out hundreds of lines of generic knowledge because they weren’t earning their place. The skills got shorter and more honest. That honesty comes from the principle that testing knowledge must be grounded in experience, not authority.
Testability as enforcement. The week’s biggest release introduced coherence-gated safety pipelines: filters that validate every agent output before execution. Reasoning chain coherence. Tool call validity. Output consistency. That’s controllability and observability implemented as runtime gates. The same release introduced a witness chain in which every agent’s decision is cryptographically linked. How do you know the agent made the right decision? Trace the chain: what information it had, what options it considered, what it chose, and why. The oracle problem, solved at the infrastructure level.
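The witness-chain idea reduces to a familiar structure: a hash-linked log. A minimal sketch, assuming SHA-256 linking (the function names and entry shape are my illustration, not the release’s actual schema):

```python
import hashlib
import json

def witness_entry(prev_hash: str, decision: dict) -> dict:
    """Append one agent decision to a hash-linked witness chain.

    Each entry commits to the previous entry's hash, so altering any
    past decision invalidates every later link."""
    payload = json.dumps({"prev": prev_hash, "decision": decision}, sort_keys=True)
    return {
        "prev": prev_hash,
        "decision": decision,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),
    }

def verify_chain(chain: list[dict]) -> bool:
    """Walk the chain and recompute every link; any tampering breaks it."""
    prev = "genesis"
    for entry in chain:
        payload = json.dumps({"prev": prev, "decision": entry["decision"]}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True
```

Verification then becomes exactly the audit question in the paragraph above: replay the chain of what the agent knew, considered, and chose, and check that nobody edited the history afterward.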
Observability as a quality investment. The week’s final release replaced hundreds of raw console calls with structured logging and decomposed a monolithic hooks file into focused modules. When all you have is unstructured output, you can’t trace an agent’s decision chain. Structured, leveled, centralized logging makes the invisible visible. Anyone who has worked with me knows I always ask for a way to check what the system is doing: what information is flowing between components, and can you observe it?
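What “structured” buys you is queryability. A minimal sketch using Python’s standard logging module (the logger name and context fields are illustrative; the actual release is a different codebase): every event becomes one JSON line you can filter by agent, tool, or level instead of grepping free text:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON line so decision chains are queryable."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "component": record.name,
            "event": record.getMessage(),
            "context": getattr(record, "ctx", {}),  # structured fields, not prose
        })

logger = logging.getLogger("fleet.orchestrator")  # illustrative component name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One event, with machine-readable context attached via `extra`.
logger.info("tool_call_resolved",
            extra={"ctx": {"tool": "aqe__run_tests", "agent": "coverage-analyst"}})
```

The same event as a raw console call would be a sentence; as a JSON line it is a row in a dataset, which is what lets you ask “what flowed between these two components?” after the fact.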
Security as a quality question. I run a regular full QE analysis swarm, and it found command injection in the test verifier and SQL injection across multiple interpolation sites. Fixing the vulnerabilities was engineering. Recognizing that the polluted learning database was a quality problem, not a data engineering problem, is what classical thinking buys you. The system wasn’t just insecure. It was giving me wrong answers. That’s a quality failure that no security scan would flag.
None of this is revolutionary on its own. What’s different is that these aren’t classical QE principles applied to code written by humans. They’re classical QE principles applied to systems in which agents write, test, and review code. The principles don’t change. The execution layer does.
The Gap Most Teams Are Living In
Let me be direct about what I see in the wider industry.
Dan Shapiro describes five levels of AI coding adoption, from autocomplete through fully autonomous. Most teams are at the first two: accepting or rejecting copilot suggestions, delegating small tasks, and reviewing everything. A few are reaching the third level, where the AI handles features and the human reviews the results. Operating at the top levels, where the AI handles full workflows, is rare.
The problem: each level needs a parallel quality dimension that almost nobody is building. At the lower levels, human QE practices work fine. But at higher levels, the volume of AI-generated code exceeds the capacity of human review. You need agents testing agents, with human judgment at the inflection points.
Most teams I talk to are trying to solve advanced problems with a beginner’s quality practices. They use copilots to generate code faster. They run the same test suite they always run. They wonder why the number of escaped defects is increasing.
Stuart Winter-Tear names the root cause precisely. His Contact Principle demands direct experience with the live working system, not experience mediated through dashboards, demos, or metrics reports. His diagnosis of borrowed certainty names the organizational anti-pattern where leaders adopt AI confidence without verifying it through contact. “Our copilot handles that.” Does it? When did you last check?
Until you answer that question with evidence, you are not scaling capability. You are scaling exposure.
I’m not claiming to have solved this. The prefix bug that broke my agents proves I haven’t. But I have a working system where agents verify agents, where classical testing heuristics guide that verification, where failure data feeds back into the learning loop, and where enforcement architecture is replacing trust-me compliance.
Being that bridge between classical and agentic QE is the next evolution of our practice.
The Classics Are the Craft
I keep coming back to a line I wrote a few weeks ago: the generation is fast, the testing is the craft, the judgment is what you’re still for.
This week refined it. The testing craft is not something new we need to invent for the agentic era. It’s something old, tested, argued over, and refined over decades by Bach, Bolton, Hendrickson, Kaner, and Weinberg. We need to translate it into an enforcement architecture now.
The HTSM gives me a structured way to design what my agents test. The testability heuristics tell me where verification will be hard. The exploratory polarities explain why my adversarial review sequence works. The oracle design framework tells me how to know whether an agent’s output is correct. The Elements of Excellent Testing gives me the philosophical foundation: science is testing, and testing is science.
None of these thinkers designed for agentic systems. All of their thinking applies. The irony is that the industry is looking for “AI-native testing frameworks” when the framework already exists. It just needs an execution layer that runs at agent speed with agent scale.
They condensed decades of wisdom into a book. I’m trying to condense similar wisdom into agents and skills that others can use, adapt, and contribute back to, as Lalit did with his QCSD framework. The medium changed. The mission didn’t.
Six releases. The most valuable thing I did all week was finish a testing book and connect classical to agentic worlds.
The classics aren’t dead. They just got new instruments.
Previous in this series: “The Gate That Fights Back” introduced adversarial QE and Loki-Mode verification. “The Score Nobody Reads” exposed the compliance gap between having rules and following them. This one bridges classical testing wisdom with agentic enforcement and argues the industry needs the bridge more than it needs new frameworks. For the full archive of 24 Quality Forge articles, visit forge-quality.dev/articles.
Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, and a member of the Agentics Foundation. He is currently building the Serbian Agentic Foundation Chapter in partnership with StartIt centers across Serbia.