Last Tuesday, my quality assessment tool reported 95% test coverage on a module for which I had not written a single test.
Not low coverage. Not partial coverage. 95%. For a file with zero tests. The number was fabricated, stitched together from fragmented key conventions, stale snapshots, and a pipeline that had never been wired to read real data. Five different key formats existed for the same concept. None of them talked to each other. The quality score consumed whichever fragment it found first and called it truth.
I did not discover this through a dashboard. I discovered it because I asked a question that classical testers have been asking for decades: where is the evidence?
The Industry’s Trust Problem
The conversation around trust in AI agent outputs follows a familiar script. Teams deploy agents, agents produce results, someone asks, “Can we trust this?” and the answer is usually some variation of guardrails, prompt engineering, or human-in-the-loop review.
This is not a trust architecture. It is a hope architecture.
The testing community solved this problem years ago. Not perfectly, not universally, but the thinking tools exist. An oracle is any mechanism you use to recognize a problem. A test is a performance, not a document. Coverage is not a percentage; it is a claim about what you verified, and claims require evidence.
Bach and Bolton dedicated an entire section of Taking Testing Seriously to the distinction between checking and testing. Checking is confirmatory. Testing is investigative. Most agent evaluation frameworks do checking: they confirm the agent produced output in the expected format, passed the expected assertions, and returned the expected status code. Almost none of them do testing: they do not investigate whether the output actually means what it claims.
My coverage pipeline was checking. It found a number. It reported the number. It never tested whether that number represented reality.
Mapping the Playbook
The classical testing vocabulary maps onto agentic systems with almost uncomfortable precision. Here is how I have been applying it this week across seven releases.
The Oracle Problem stays the Oracle Problem. In classical testing, an oracle is any mechanism you use to recognize a problem: a specification, a comparable product, a historical result, or a consistency check. The oracle problem is knowing what “wrong” looks like. In agentic systems, that problem gets harder because agent outputs are generative, varied, and arrive with false confidence. But the classical oracle heuristics still apply. When my coverage pipeline reported 95% on a file with zero tests, the oracle I was missing was trivial: cross-reference the claim against the evidence. If there are no test files, coverage cannot be 95%. That is a consistency oracle: the same heuristic testers have used for decades. I did not need a new theory. I needed to apply the old one.
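That consistency oracle is simple enough to sketch. The following is a minimal illustration, not the author's actual pipeline — the `consistency_oracle` helper, the file layout, and the `test_<module>` naming conventions are my assumptions:

```python
from pathlib import Path

def consistency_oracle(module: str, reported_coverage: float,
                       test_dir: str = "tests") -> str:
    """Cross-reference a coverage claim against the evidence on disk.

    Hypothetical helper: if a pipeline claims nonzero coverage for a
    module, at least one test file targeting that module must exist.
    """
    stem = Path(module).stem
    # Evidence: any test file that plausibly targets the module.
    candidates = (list(Path(test_dir).glob(f"test_{stem}*")) +
                  list(Path(test_dir).glob(f"{stem}_test*")))
    if reported_coverage > 0 and not candidates:
        return (f"INCONSISTENT: {reported_coverage}% coverage claimed, "
                f"but no test files found for {module}")
    return "consistent"
```

The check is deliberately dumb. It does not parse coverage data or run anything; it only asks whether the claim could possibly be true given what exists on disk — which is exactly the question the fabricated 95% would have failed.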
Testability applies as-is. HTSM treats testability as a product factor, not whether you tested it, but how easy it is to test. Controllability, observability, and decomposability. My knowledge base scored low on all three. You could not query what it knew. You could not control it independently of the rest of the system. Its persistence was scattered across separate databases with governance references pointing to paths that no longer existed. The CLI query commands I added this week — what the hypergraph knows, what is untested, where the gaps are — are controllability. The degradation events published on initialization failure are observability. Unifying the persistence layer is decomposability. No new theory required. Bach’s checklist told me exactly where to look.
Evidence integrity becomes cryptographic integrity. In classical testing, evidence must be traceable. A test result you cannot reproduce, a log you cannot verify, a screenshot you cannot timestamp — none of these count as evidence. They count as claims. My coherence gate had the same problem. It was producing witness records hashed with djb2, a function designed for hash table lookups, not for proof. Any process could recreate a matching hash and backdate the record. I replaced it with SHA-256. Now, a witness record is tamper-evident. The same principle testers apply to defect evidence, applied to machine-generated proof.
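The principle is easy to demonstrate. Here is a minimal sketch of a hash-chained witness record — the record schema (`payload`, `prev`, `ts`) is invented for illustration, not the author's actual format. Each record's SHA-256 digest covers the previous record's hash, so editing or backdating any record invalidates every hash after it:

```python
import hashlib
import json
import time

def witness_record(payload: dict, prev_hash: str) -> dict:
    """Create a tamper-evident witness record chained to its predecessor."""
    body = {"payload": payload, "prev": prev_hash, "ts": time.time()}
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {**body, "hash": digest}

def verify_chain(records: list) -> bool:
    """Recompute every hash; any tampering breaks the chain."""
    prev = "0" * 64  # genesis hash
    for rec in records:
        body = {k: rec[k] for k in ("payload", "prev", "ts")}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if rec["prev"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True
```

With djb2, a forger only had to reproduce a 32-bit lookup hash. With a chained SHA-256 digest, forging one record means recomputing the entire chain after it — and the chain head can be pinned somewhere the forger cannot reach.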
Enforcement, Not Aspiration
The standard approach to agent trust is aspirational. Write better prompts. Add more guardrails. Review more outputs. Train the model on better data. These are reasonable activities. They are also insufficient for the same reason that testing without tool-assisted verification is insufficient. Not because human judgment is wrong. Because human attention does not scale, and it does not persist.
Classical testing learned this decades ago. The check must be automated. It must run on every change. It must produce evidence, whether or not a human is watching. That is why CI pipelines exist. That is why test suites run on commit, not on request.
This week, I shipped deterministic YAML pipelines — declarative quality gates that execute without consuming a single token of LLM context. No agent decides whether to run the check. No prompt asks “Did you verify coverage?” The pipeline is defined once and enforced every time. The same principle as a CI gate, applied to agent coherence.
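A deterministic gate needs no agent in the loop. The sketch below shows the shape of the idea, with the gate definition written as the Python dict it might parse to from YAML — the gate names, metrics, and thresholds are illustrative, not the author's schema:

```python
# A declarative pipeline, as it might look after parsing a YAML gate
# definition. Hypothetical schema for illustration only.
PIPELINE = {
    "gates": [
        {"name": "coverage", "metric": "line_coverage", "min": 80.0},
        {"name": "witness", "metric": "chain_valid", "min": 1.0},
    ]
}

def enforce(pipeline: dict, metrics: dict) -> list:
    """Run every gate deterministically; no LLM decides whether to check.

    A gate fails loudly when its metric is missing: the honest answer
    is 'unavailable', never a fabricated pass.
    """
    failures = []
    for gate in pipeline["gates"]:
        value = metrics.get(gate["metric"])
        if value is None:
            failures.append(f"{gate['name']}: metric unavailable")
        elif value < gate["min"]:
            failures.append(f"{gate['name']}: {value} < {gate['min']}")
    return failures  # empty list means the merge may proceed
```

Note what the executor does when data is missing: it reports "unavailable" rather than inventing a passing number — the exact failure mode the fabricated coverage bug exposed.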
Aspiration says: “We should check coverage before merging.” Enforcement says: “The merge is blocked until coverage is verified, and here is the pipeline that does it.”
The CUSUM drift detector I added to the coherence gates extends this further. It watches for statistical shifts in coherence scores across all gate types, continuously, without anyone having to ask. A quality metric that only gets checked during quarterly reviews is not a quality metric. It is a lagging indicator dressed up as governance. Continuous monitoring is not a new idea. It is an old idea that most agent-building teams have not yet applied to their own systems.
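CUSUM itself is a few lines of arithmetic. A one-sided variant over coherence scores — with illustrative parameters, not the author's production tuning — looks roughly like this:

```python
def cusum_drift(scores, target, k=0.05, h=0.5):
    """One-sided CUSUM: flag sustained downward drift in a score stream.

    target is the expected score, k the allowable slack per observation,
    h the decision threshold. Returns the index where drift is declared,
    or None if the stream stays in control.
    """
    s = 0.0
    for i, x in enumerate(scores):
        # Accumulate how far each score falls below target, minus slack;
        # isolated dips decay back to zero, sustained drift accumulates.
        s = max(0.0, s + (target - x) - k)
        if s > h:
            return i
    return None
```

The point of CUSUM over a simple threshold is memory: a single bad score resets toward zero, but a run of slightly-degraded scores accumulates until the alarm fires — which is what "going stale" actually looks like.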
What the Industry Preaches vs. What It Practices
I read a lot of articles about AI trust. Most of them describe the problem well and propose solutions that sound reasonable on paper. Few of them describe what the enforcement architecture actually looks like in production.
Here is what I see in practice: teams add an evaluation framework on top of their agent system. The framework runs a benchmark suite. The benchmark reports a score. The team publishes the score. Almost nobody asks where the score came from, whether it represents the production distribution, or whether the evaluation framework itself has been validated.
This is the same failure mode my coverage pipeline had. A number that looks authoritative. A pipeline that looks rigorous. A result that does not correspond to reality.
What I do differently is not revolutionary. It is what testers have been doing since before agents existed. I ask: what evidence supports this claim? I ask: can this evidence be forged? I ask: does this check run when nobody is watching?
What I Tell the Room
At meetups and in the Serbian Agentic Foundation sessions we run at StartIt, someone always asks the same question: how do you build trust in agent outputs? My answer is always the same. Stop building trust. Start building evidence.
Trust is a feeling. Evidence is a fact. The testing community has always known the difference. A passing test is not proof that the software works. It is proof that a specific scenario, under a specific configuration, at a specific moment, did not produce an observable failure. That precise, humble, useful framing is exactly what agent evaluation needs.
Laforgia’s distinction between executed evidence and generative evidence applies here directly. Executed evidence comes from running the system. Generative evidence comes from the model reasoning about what the system would do. Most agent evaluation is generative — the model predicts its own performance. My coherence gates, my coverage pipeline, my witness chain: these produce executed evidence. The system ran. Here is what happened. Here is the cryptographic proof.
The Holistic Testing Model, which I have been evolving with PACT principles, was never designed specifically for agentic systems. It was designed for quality systems generally. Anticipate failures before production. Execute verification with explainability. Let humans, agents, and systems each contribute what they are good at. Allocate effort where the risk is highest.
Every one of these principles predates the agentic era. Every one of them maps directly onto it.
What Actually Changed This Week
Seven releases. Not a list, a direction.
The fabricated coverage fix taught the system honesty. If you do not have the data, say so. Do not invent a number that makes the dashboard green.
The SHA-256 witness chain gave the coherence proofs a cryptographic backbone. A claim without verifiable evidence is an opinion. An opinion from a machine is worse; it arrives with the authority of computation but the reliability of a guess.
The YAML pipelines enforced quality gates without agent involvement. The gate does not need to understand why it exists. It needs to execute.
The MCP-free migration made every agent operation independent of infrastructure assumptions. If your testing tool only works when everything else is working, it is not a testing tool. It is a dependency.
The RuVector Phase 5 additions — hyperdimensional computing fingerprints, CUSUM drift detection, and Modern Hopfield pattern recall — gave the learning system the ability to detect when its own knowledge was going stale. A system that cannot detect its own drift cannot be trusted to advise on drift in others.
None of this required inventing new principles. All of it required applying principles that the testing community has been articulating for decades.
The Uncomfortable Part
I should be clear about what I am not claiming. I am not claiming that classical testing completely solves the agent trust problem. The oracle problem, knowing what “correct” looks like for a creative, generative system, remains genuinely hard. There is no heuristic for “did the agent make a good architectural decision?” That requires human judgment and will continue to do so.
What classical testing provides is the infrastructure around that judgment. The evidence collection. The drift detection. The enforcement of minimum standards. The honesty about what has been verified versus what has been assumed. The discipline of saying “unavailable” instead of fabricating a comforting number.
That infrastructure is what most agent-building teams are missing. Not because they lack the capability. Because the industry conversation about trust gravitates toward the visible — the benchmarks, the leaderboards, the evaluation frameworks with impressive names — and skips the foundational work underneath: witness integrity, key convention unification, deterministic enforcement.
Classical testing teaches you to start with the foundation. That is where trust actually lives.
This is the twenty-fifth article in The Quality Forge series, documenting the practice of building and operating agentic quality systems. Previous: “The Book That Talked Back” bridged classical testing wisdom with agentic enforcement. This one puts agent outputs on the witness stand and asks: where is the evidence? For the full archive of 25 Quality Forge articles, visit forge-quality.dev/articles.
Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, and a member of the Agentics Foundation. He is currently building the Serbian Agentic Foundation Chapter in partnership with StartIt centers across Serbia.