Last Monday, my fleet was searching its own memory and finding the wrong things. Not failing. Not crashing. Finding the wrong things, on time, and reporting them as right.
A user asks the agent, “Have I seen this kind of problem before?” The agent calls into its pattern store. The pattern store calls into the vector index. The vector index returns ten results that appear to be nearest neighbors. The agent acts on them. The dashboard stays green.
For most of the past week, those ten results were essentially random.
That is the story I want to tell, because it is the most expensive failure mode in agentic systems and almost nobody is testing for it.
What I Promised Last Week
In “The Witness Stand,” I said trust lives in the foundation. I argued that classical testing already gives you the vocabulary you need to interrogate agent outputs — oracles, testability, evidence integrity — and that the work is to apply it. I closed with: “Classical testing teaches you to start with the foundation. That is where trust actually lives.”
This week, the foundation was lying.
I want to be precise about what that means in user terms, not implementation terms.
The pattern store is the part of the fleet that lets every agent ask, “Have I solved something like this before?” It is what makes the system get smarter as it runs. If it returns the right past patterns, the next answer is grounded in real experience. If it returns wrong ones, the next answer is grounded in nothing, and nothing in the output ever announces itself as nothing. The agent will write a confident plan based on three patterns that were never relevant to the question, and a user reading the output cannot tell the difference.
That is what was happening in the AQE fleet for most of the week, and the only reason I caught it is that I was running it on real work in another workspace and refused to accept the easy fixes.
A Cascade of Almost-Fixes
The week started with a major release I was proud of. Then it turned into five hotfixes in two days.
I shipped v3.9.0 on Wednesday: production hardening, plugins, a background quality daemon, retry and circuit-breaker behavior, session durability across reconnects, eight kill-switches so users could disable any of it. The kind of release I write a value-and-benefits explainer for.
Two days later, I shipped v3.9.1: persistent vector storage so the fleet’s memory survived restarts, per-agent isolated workspaces, and faster pattern lookups. Benchmarks looked great. Tests were green.
That evening, I switched to a different workspace: a separate devpod where I had been building another project with Claude Code, Ruflo, and AQE running together against real work. I tried to install the new version on top of the old one. It hung. A lock file appeared on disk at the moment everything froze. I had a deadlock that none of my own tests had caught, surfaced the moment I tried to use my own tool on a real project I was building.
v3.9.2 fixed one deadlock. I switched back to the other workspace, tried again on a different project I had been building there, and the hang came back.
v3.9.3 was a five-fix release: I removed a duplicate process that was racing the parent for file locks, made the CLI startup lazy so that simple commands stopped opening the pattern store, and added a watchdog that was supposed to interrupt any indexing operation that took too long. I verified it on five scenarios. All green. Shipped. Switched workspaces, tried again on the same real project, hang came back.
v3.9.4 was the moment I understood the watchdog could not help me. When a native library blocks the entire program at the operating system level, no JavaScript timer will save you. I added per-file logging so the last line you saw before the hang would name the exact file that was being processed, and I added an emergency escape hatch so I — and anyone else hitting the same stall — could keep working while I figured it out. Then I switched workspaces again and let the indexer run until it stopped. The last log line named the exact file.
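The v3.9.4 change can be sketched in a few lines. Everything here is illustrative: `indexFile` and the call shape are hypothetical stand-ins, not the actual AQE internals. The only idea that matters is logging before the potentially blocking call, so that a hang leaves the culprit’s name as the last line of output.

```typescript
// Hypothetical sketch; `indexFile` stands in for the real (native) indexer.
function indexAll(
  files: string[],
  indexFile: (path: string) => void,
  log: (line: string) => void = console.error,
): void {
  for (const file of files) {
    // Log BEFORE the call that might never return. If the native module
    // deadlocks inside indexFile, this is the last line anyone sees,
    // and it names the exact file being processed.
    log(`indexing: ${file}`);
    indexFile(file);
  }
}
```

If the run stalls, the final `indexing:` line identifies the file to pull out and reproduce against in isolation, which is exactly how the deadlocking file was found.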
I downloaded that file from its public repo, reproduced the hang on my machine, and examined the process’s kernel state. Six worker threads from a Rust async runtime, all waiting on a lock with no timeout, inside a native module my fleet was loading. Wait forever.
v3.9.5 disabled the deadlocking path by default and routed pattern lookups through a slower-but-correct fallback. The hang was gone. The install worked. The dashboard was green again.
That should have been the end. It was actually the beginning.
The Question That Kept the Investigation Alive
The discipline that kept the investigation alive was a single internal rule I refuse to break: no shortcuts, no workarounds, find the proper solution. The same rule I have been giving every Claude Code session I run on the AQE codebase, in every workspace, for months. Disabling the broken path was a workaround. The deadlock was gone, but the deadlock was never the real problem — it was just the most visible symptom of whatever was actually wrong inside that library.
So I went back to the disabled path and asked a different question. Not “why does it deadlock?” Not “is it fast enough?” The question I should have asked in v3.9.1: Does it return the right answer?
I wrote a small fixture: a thousand random vectors with known nearest neighbors, the kind of textbook test every vector-search library is benchmarked against. I asked the library to find the nearest neighbor of each vector and compared its answers against the ground truth.
It got one in ten right.
Then I tried the most basic possible test: store a vector, then search for it. The top result should be itself. The library returned the wrong vector.
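In sketch form, the fixture is nothing more than brute force used as ground truth. The names and signatures below are mine, not the library’s: `search` is whatever index you are auditing, reduced to “query vector in, id of the nearest stored vector out.”

```typescript
type Vec = number[];

// Squared Euclidean distance; any metric works as long as the index
// under audit and the ground truth agree on it.
function dist2(a: Vec, b: Vec): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return s;
}

// Brute force IS the ground truth: slow, but impossible to get wrong.
function bruteNearest(store: Vec[], q: Vec): number {
  let best = 0;
  for (let i = 1; i < store.length; i++) {
    if (dist2(store[i], q) < dist2(store[best], q)) best = i;
  }
  return best;
}

// Recall@1: for each stored vector, nudge it slightly and ask the index
// under audit for the nearest neighbor. Count how often it agrees with
// brute force. A healthy exact index scores 1.0.
function recallAt1(store: Vec[], search: (q: Vec) => number): number {
  let hits = 0;
  for (const v of store) {
    const q = v.map((x) => x + (Math.random() - 0.5) * 1e-4);
    if (search(q) === bruteNearest(store, q)) hits++;
  }
  return hits / store.length;
}
```

Wiring `search` to the library under test and running this on a random fixture is the whole benchmark; “one in ten right” was this number coming back as roughly 0.1.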
The thing my fleet had been calling “recall the most similar past pattern” had been, for as long as I had been using it, equivalent to “return ten patterns at random and call them similar.”
Five hotfixes were chasing the symptom. The disease was sitting one layer down, calmly returning wrong answers to every question I asked it.
I replaced the library. v3.9.6 went out the same evening with a new vector-search backend. The fixture went from one-in-ten to ten-in-ten. Self-query returned the actual self. The fleet’s memory now finds what it is supposed to find.
Every “have I seen this before?” question inside the fleet gets a real answer again.
The Failure Mode Nobody Names
I want to dwell on this part because it is the part most agentic-systems conversations skip.
The library was not crashing. It was not slow. It was not throwing errors. It was returning values of the right type, with the right shape, in the right latency budget, and the values were wrong.
This is the failure mode that defines agentic systems in production. It almost never looks like a hard crash. It looks like quietly succeeding with incorrect values. A pilot run against this kind of failure looks excellent. The integration tests pass. The dashboards stay green. The customer demo goes well. And the system is silently building decisions on top of fabricated retrievals, hour after hour, until somebody asks the one question nobody asked.
There is a sentence I keep returning to. It came out of a learning I logged this week into the self-learning system I keep running alongside my own work — a personal knowledge base that captures patterns from my consultancy and weekly reading so they are searchable next time I need them. The sentence is: most AI pilots fail by succeeding just enough to become difficult to question. That was the AQE fleet’s vector store for at least a week, and probably longer than I want to admit. It worked. It worked wrong.
What Classical Testing Already Knew
There is nothing new I had to invent to find this. Everything I needed was already in the testing vocabulary I have been using for years.
Oracles. James Bach and Michael Bolton define an oracle as any mechanism by which a tester can recognize a problem. The oracle for a vector search is the simplest possible heuristic: the closest match for a stored vector is the vector itself. A self-query returns the self. I had not written that test. I had written tests for shapes, dimensions, insertion success, and latency. I had measured the function’s behavior. I had not asked an oracle whether the answer was right. The Heuristic Test Strategy Model (HTSM) has been telling me to do this, and I shipped a feature flag, enabled by default, without doing it.
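That oracle fits in a dozen lines. The `add`/`search` interface below is a hypothetical reduction of a vector store’s API, not any specific library’s:

```typescript
// Minimal shape of the thing under audit; names are illustrative.
interface VectorIndex {
  add(id: number, v: number[]): void;
  search(q: number[], k: number): number[]; // ids of the k nearest vectors
}

// The simplest oracle for any vector store: after storing a vector,
// searching for that exact vector must return its own id first.
function selfQueryHolds(index: VectorIndex, vectors: number[][]): boolean {
  vectors.forEach((v, id) => index.add(id, v));
  return vectors.every((v, id) => index.search(v, 1)[0] === id);
}
```

Written on day one, a check like this would have failed the build long before any user saw a hang.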
Testability. The reason it took five hotfixes to find the deadlock is that the indexer had near-zero observability. It was a black box that either returned or did not. The first time I could see which file the indexer was processing when it stopped was v3.9.4 — the per-file log line — and that single line of output made every release after it possible. There is no clever debugging trick that substitutes for being able to see what your system is doing. HTSM treats observability as a product factor for a reason.
Risk-based focus. The risk that a vector-search library returns wrong results is not exotic. It is the central risk of vector search. It belonged in the headline test for the release that flipped the feature flag. It was nowhere in that release. It surfaced only in the follow-up drilldown after the deadlock — the second-best time to find it. The first-best is before the flag flips.
Holistic Testing under PACT. Proactive means the test for “does the vector store return correct results” is written before the vector store is enabled by default, not after a user reports a hang. Autonomous means agents that run the benchmark on every merge without me having to ask. Collaborative means the user fixture that taught me my benchmark suite was insufficient is treated as a contribution, not a complaint. Targeted means the test lives where the risk is, not where the convenience is.
I had the Proactive part wrong this week. I caught up.
None of this is a new theory. All of it is the work my profession has been articulating since before language models existed. The agentic moment is the moment when the cost of skipping it gets paid in public.
The User Value That Came Back
I want to end with what users actually get from v3.9.6, in plain words.
When you ask an AQE agent, “Have I solved a problem like this before?” the answer is now grounded in your real history again. The “similar patterns” the agent surfaces are actually similar. The retrievals that feed into a quality assessment are real retrievals. The learning loop is actually learning, instead of randomly sampling from your stored work and calling it experience.
The install on top of an older version no longer hangs. The CLI exits cleanly. The pattern store survives a restart. The fleet’s memory is finally what I was claiming it was a week ago.
And if any of that ever stops being true again, the per-file logging from v3.9.4 stays in the codebase, the textbook-fixture test from v3.9.6 runs on every merge, and the next person who hits something strange will see exactly what file caused it. The visibility I had to add to fix this cascade is now permanent.
That is the part I am most grateful for. Not that I shipped six fixes in two days. That the discipline that drove the investigation — no shortcuts, find the proper solution — is now wired into how the next release will be built and verified.
What I Tell the Room
At meetups, in Serbian Agentic Foundation sessions, and on one-on-one calls, people ask me the same question: how do you know the agents are doing what they say they are doing?
This week, the honest answer became sharper. You do not, by default. You have to build the predicate. You write the textbook fixture before you flip the flag. You ship the verbose log line before you optimize it away. You treat the dependency you pulled from the registry as a code-review subject, not as infrastructure. You ask the simplest possible oracle question about every component whose correctness is a property of what it returns rather than whether it returns.
And you keep someone in the room who is willing to be inconvenient. Someone who refuses to accept the first thing that relieves the symptom. This week, that person was me, in another devpod, dogfooding my own tool against a real project I cared about. Most weeks, it has to be the engineer doing exactly that, running their own work through their own tool, on something that matters, until the tool either earns its claims or admits it cannot.
The compass pointed randomly for a week. We re-magnetized it. Next week’s release ships with the recall test that should have been in v3.9.1 and a benchmarking step that runs against a known-ground-truth fixture before any vector-search code path goes live.
We are in turbulent times for agentic systems. Vendors are shipping faster than they can characterize what they ship. Native modules with foreign runtimes are now normal inside everyday processes. A small version number on a published library is a polite way of saying we are still figuring this out, and almost no one reads version numbers. The risk is not that your system fails loudly. The risk is that it succeeds quietly with wrong values, while your dashboards stay green and your pilots look excellent.
The way out is older than the problem. Build the predicate. Ask the oracle. Make the system tell you what it is doing. Then keep someone in the room who will not accept the easy answer.
That is the whole craft. It is also the whole job.
This is the twenty-sixth article in The Quality Forge series. Previous: “The Witness Stand” put the agent’s outputs on the witness stand and asked where the evidence was. This one is what happens when the foundation that produces the evidence turns out to be guessing. The releases described are public on github.com/proffesor-for-testing/agentic-qe. Reproductions of the bugs in the previous library have been documented and shared upstream; the authors are building hard things in the open, and I am grateful that the path forward exists.
Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, and a member of the Agentics Foundation. He is currently building the Serbian Agentic Foundation Chapter in partnership with StartIt centers across Serbia.