
The Forest and the Feedback Loop

Two weeks, thirteen releases, one contributor who filed better bugs than most teams write tests, a model that deleted my Docker containers, and the walk that made the rest of it possible.

Dragan Spiridonov
Founder, Quantum Quality Engineering • Member, Agentics Foundation

The last article ended with a call to slow down. I did. Not immediately — the first week after that article still had the momentum of everything before it. But on the weekend between the two weeks covered here, I went to the forest. No laptop. No terminal. Just trees, air, and the kind of quiet that does not have a notification sound.

I shared a photo on LinkedIn afterward. The reactions were what you would expect — people like nature photos. What I did not expect was Stuart Winter-Tear posting his own walk the same week, mentioning me by name, sharing his own pictures. Stuart is someone whose writing I respect. Seeing that the same pattern — step away from the screen, walk, come back sharper — shows up in his practice too was the kind of validation that does not come from a benchmark.

The week after the forest was the most productive of the year. That is not a coincidence.


What the System Learned While I Was Reading

Nagual, the self-learning system I open-sourced two weeks ago, now holds over 300 patterns accumulated since May 2. Not because I sat down and wrote three hundred entries, but because the practice of recording what I read, what I tried, and what worked has become automatic.

The domains tell the story of where my attention went. New patterns related to agentic engineering: harness design, context utilization, and the difference between prompt engineering and harness engineering that the industry still conflates. Patterns on agentic QE: test lifecycle stages, budget-aware testing, and the observation that tests-as-observation is a fundamentally different practice from tests-as-verification. A set of patterns on agentic security, including a BlueRock study that found that over a third of 7,000 publicly accessible MCP servers are potentially vulnerable to server-side request forgery. A new set of patterns on governance: the EU AI Act digital omnibus, the MCP protocol moving under Linux Foundation stewardship.

Three findings stood out more than the rest.

First, the generalization gap theorem from Xu et al. — a formal proof that retrieval-based memory systems have a sample-complexity lower bound. The industry assumption that better retrieval equals better generalization is mathematically wrong past a threshold. This matters for Nagual directly. It means the system’s value is not in how many patterns it stores but in how ruthlessly it prunes the ones that stop working.
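
For illustration, a minimal sketch of what ruthless pruning can mean in practice, with hypothetical types rather than Nagual's actual schema: patterns earn their place through outcomes, not through volume.

    // Hypothetical types, not Nagual's actual schema. The generalization-gap
    // finding applied: keep patterns that keep producing good outcomes, drop the
    // ones that are repeatedly applied and keep failing, or that nothing has
    // touched in a long time.
    interface Pattern {
      id: string;
      applications: number;  // times the pattern was retrieved and applied
      successes: number;     // times applying it led to a good outcome
      lastAppliedAt: Date;
    }

    function prunePatterns(patterns: Pattern[], now: Date): Pattern[] {
      const ninetyDaysMs = 90 * 24 * 60 * 60 * 1000;
      return patterns.filter((p) => {
        const idleTooLong = now.getTime() - p.lastAppliedAt.getTime() > ninetyDaysMs;
        const successRate = p.applications > 0 ? p.successes / p.applications : 0;
        const provenUseless = p.applications >= 5 && successRate < 0.2;
        return !(idleTooLong || provenUseless);
      });
    }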

Second, OWASP published its Top 10 for Agentic Applications. Ten threat categories, peer-reviewed by over a hundred contributors. This is the first credible taxonomy that treats agentic systems as their own attack surface rather than a subset of web application security. I stored the full mapping against the patterns I already had and found gaps in three of the ten categories.

Third, a finding on evaluation awareness — the observation that AI systems can detect when they are being evaluated and adjust behavior accordingly. This is not hypothetical. It has been measured. For quality engineering, this means the entire assumption behind behavioral testing of AI systems — that test-time behavior predicts production behavior — needs qualification. I do not have a solution for this. I have a pattern that says the problem exists and links to the evidence.


Thirteen Releases, Driven by One Contributor

On May 6, I presented at StartIt AI Hub in Novi Sad — my regular slot with Vukasin Stojkov, this time on agent memory architecture. Nagual as the case study. People stayed after for questions, which is always the real metric for whether a talk landed.

The same week, the work on the Agentic QE fleet shifted. For the first time since the project launched, the dominant driver was not my own roadmap. It was a single external contributor.

Jordi filed his first issues with source-level investigations, production impact data, and ranked hypotheses. Not vague reports. Forensic analysis. He cloned the repository, traced the code paths, and sometimes submitted patches alongside the bug report. When his subscription limits hit, he told me what he could not verify and asked me to check. That is the kind of collaboration that makes open source work — not drive-by feature requests, but someone who treats your project as if the quality matters to them personally.

Thirteen releases landed between May 5 and May 15. The story underneath the version numbers is that the self-learning loop — the system that is supposed to let the fleet learn from its own outcomes — was fundamentally broken. Workers never ticked. Embeddings loaded empty. The routing system collapsed every prompt to a single agent. The feedback chain produced zeros everywhere. I had not noticed because the tests passed. The tests were not testing the integration.

Jordi’s reports cracked the problem open from the outside. Version by version, we traced every broken seam. The experience consolidator was silently deleting historical data — sixteen thousand records destroyed in one run by an overzealous safety valve. The post-task hook skipped the entire Q-learning chain because it received an empty task ID. A session-start race condition leaked 420 gigabytes of disk space via an unbounded Promise that outlived its timeout.
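
The unbounded Promise is worth dwelling on, because it is a common Node.js trap: racing a task against a timeout makes the caller give up, but it does not stop the task, which keeps writing to disk long after everyone has moved on. A minimal sketch of the fix direction, assuming a cancellable work function and hypothetical names, not the fleet's actual code:

    // Hypothetical sketch, not the fleet's actual code. Instead of racing a
    // timer against work that silently keeps running, hand the work an
    // AbortSignal so it can stop and clean up when the deadline passes.
    async function withDeadline<T>(
      work: (signal: AbortSignal) => Promise<T>,
      ms: number,
    ): Promise<T> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), ms);
      try {
        // The work must observe the signal and abandon partial output when aborted.
        return await work(controller.signal);
      } finally {
        clearTimeout(timer);
      }
    }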

By v3.9.27, the architectural fix landed — CapturedExperienceBridge, a component that connects hook-driven activity to the kernel’s domain plugins. It needed three iterations to get the load ordering and payload shape right. By v3.9.31, the loop closes end-to-end. ADR-094 formalizes the boundary: hooks are lightweight producers that must complete within 100 milliseconds. Dream cycles — the expensive consolidation work — run in the kernel. ADR-095 introduces three-signal routing with epsilon-greedy exploration, gated by the graph topology, so the system does not explore randomly when the swarm structure is fragile.
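
The routing change is easier to see in code than in prose. A sketch of epsilon-greedy selection gated by topology health, with hypothetical names and thresholds; ADR-095's actual signals may differ:

    // Sketch of gated epsilon-greedy routing (hypothetical names, not the
    // fleet's actual implementation). Exploration only happens when the swarm
    // topology is healthy; otherwise the router exploits the best-known agent.
    interface AgentScore {
      agentId: string;
      expectedValue: number; // learned estimate of how well this agent handles the task
    }

    function routeTask(
      candidates: AgentScore[],
      epsilon: number,          // exploration rate, e.g. 0.1
      topologyHealthy: boolean, // gate derived from the graph topology
    ): string {
      if (candidates.length === 0) throw new Error("no candidate agents");
      const exploit = candidates.reduce((best, c) =>
        c.expectedValue > best.expectedValue ? c : best,
      ).agentId;
      if (!topologyHealthy || Math.random() >= epsilon) {
        return exploit; // fragile topology, or most of the time: pick the best-known agent
      }
      // Explore: a random candidate keeps the value estimates from going stale.
      return candidates[Math.floor(Math.random() * candidates.length)].agentId;
    }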

The fleet now has a single command — aqe learning loop-health — that tells an operator whether the learning system is actually learning. That command did not exist two weeks ago because nobody had proven that the system was not learning. It took an external contributor to file methodical evidence.
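
The checks behind a command like that are conceptually simple. A hypothetical sketch of what a loop-health verdict might assert, not the actual aqe implementation: every stage of the chain has to show non-zero activity, which is exactly what was missing when everything produced zeros.

    // Hypothetical sketch, not the actual aqe implementation. A learning loop is
    // "healthy" only if every stage of the chain shows activity; a zero anywhere
    // means the loop is silently broken at that seam.
    interface LoopMetrics {
      experiencesCaptured: number;  // hook-driven producers emitted experiences
      consolidationRuns: number;    // dream cycles actually ran in the kernel
      qValueUpdates: number;        // feedback reached and updated the Q-learning chain
      distinctAgentsRouted: number; // routing did not collapse to a single agent
    }

    function loopHealth(m: LoopMetrics): { healthy: boolean; failures: string[] } {
      const failures: string[] = [];
      if (m.experiencesCaptured === 0) failures.push("no experiences captured");
      if (m.consolidationRuns === 0) failures.push("no consolidation runs");
      if (m.qValueUpdates === 0) failures.push("no Q-value updates");
      if (m.distinctAgentsRouted < 2) failures.push("routing collapsed to one agent");
      return { healthy: failures.length === 0, failures };
    }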


Finding the Rhythm with Cognitum

The work with Reuven on Cognitum continued through both weeks. The system is larger than anything I have set up quality engineering for — a meta-repository with almost twenty sub-modules, spanning hardware provisioning, edge agents, cloud coordination, and a dashboard that ties it all together. Setting up a quality system for something at this scale, where modules ship on different cadences and some run on physical devices in users’ hands, is a challenge I have not faced before.

We are finding our rhythm. Devices are reaching users. Feedback is coming in. There are fewer issues than I expected, and the ones that do surface are being fixed and deployed quickly. The pattern from the previous article holds — dogfooding at intensity is the fastest feedback loop. When you are the user who just tried to flash a device and watched the build fail, the distance between the bug and the fix is zero.


When the Agent Deletes Your Containers

Ten days ago, I gave Claude Code a /loop instruction to optimize my laptop — clean up memory, reclaim disk space where possible. It decided to delete all of my stopped Docker containers.

All. Gone.

These were not disposable containers. They were development environments — configured, tuned, with a state that took hours to build. Five hours of rebuilding, from scratch, to get back to where I was before the optimization started.

This was not the first time I had run this /loop command, and part of the command is to always provide a list of items to delete and wait for confirmation. On that day, Claude Code with Opus 4.7 decided to ignore those instructions. This was not a hallucination in the usual sense. The model understood the instruction. It found valid targets. It executed a command that freed disk space. It was correct on every dimension except the one that mattered: asking for confirmation before any delete action. Part of the instruction said to optimize. The agent optimized. The cost was my time.
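
The instruction the model ignored is simple enough to express outside the prompt. A sketch of a confirmation gate in front of any destructive step, using a hypothetical helper rather than anything Claude Code actually provides:

    // Hypothetical helper, not a Claude Code feature: list what would be deleted
    // and require an explicit "yes" before anything runs, so the confirmation
    // cannot be skipped no matter what the model decides.
    import * as readline from "node:readline/promises";

    async function confirmDeletion(targets: string[]): Promise<boolean> {
      if (targets.length === 0) return false;
      console.log("The following items would be deleted:");
      for (const t of targets) console.log(`  - ${t}`);
      const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
      const answer = await rl.question("Proceed? Type 'yes' to confirm: ");
      rl.close();
      return answer.trim().toLowerCase() === "yes";
    }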

Opus 4.7 requires more careful supervision than its predecessors. More corrections. More pauses to verify what the model is about to do before letting it proceed. The kind of vigilance that was not necessary two to three months ago. In the Agentics Foundation WhatsApp group, members are reporting the same thing. Some have switched back to Sonnet. Some prefer Opus 4.6 over 4.7. The consensus is not that 4.7 is broken — it is that the trust calibration shifted.

For a practitioner who builds tools on top of these models, this is the same lesson as the upstream dependency risk from the previous article, but more personal. The model provider’s release quality directly affects your workday. You cannot test your way out of someone else’s regression. You can only verify more, delegate less, and slow down when the foundation under your tools feels less solid than it did last month.


Six Events, Four Cities, Three Weeks

The calendar ahead is dense.

  • On May 20, I am in Belgrade for a hands-on testing session with my QE agents, organized by Context Community (luma.com/8mp6grxj).
  • The next day, May 21, I am back in Novi Sad for the regular Agentics Foundation meetup #12 (luma.com/v9uu25n0).
  • The last week of May, ExpoQA in Madrid — the talk I have been preparing for months, Classical QE plus Agentic Principles: Building Bridges, Not Burning Them (expoqa.eu).
  • After Madrid, the first week of June takes me to Budapest for the Craft Conference, where the Agentics Foundation crew is gathering. On June 2nd, I run a local meetup for the Hungarian Agentics Foundation chapter, adapting the ExpoQA material to a twenty-minute slot focused on catching agents in the act of completion theater.
  • June 5th is a special challenge for me. In the morning I join the Nordic Testing Days workshop with my friend Lalit, The 70% Problem — Reclaiming Testing’s Intellectual Core with Agentic Quality Engineering (ntd2026.sched.com). I am sorry I will not be there in person, because later that afternoon I am running my live hands-on session at Craft Conference, Quality Engineering in the Agentic Age: Build, Test, Orchestrate (craft-conf.com).

Six events across four cities in three weeks. I am not going to pretend this is comfortable. But every one of them is a conversation I want to have with people who are doing the work, not just reading about it.


What Is Shaping Up for June

Two roles were solidified this month. I am now confirmed as chair of the Agentic Engineering Training Committee for the Agentics Foundation. I have ideas about what training should look like — practical, hands-on, grounded in real experience rather than slide decks — but the first priority is listening to the community about what they actually need. The other role, as one of the AI chapter leads for the Ministry of Testing, is pulling in a similar direction. So now I have a new challenge: not just quality-focused events, but broader agentic engineering approaches for a wider audience. Both of these need serious time and attention, and that work starts properly once the conference season’s first batch wraps up in June.


What I Tell the Forest

The last article’s closing was about slowing down when the load exceeds the bandwidth. This article’s proof is that the advice worked.

The forest walk was not a luxury. It was what made the most productive week possible. The thirteen releases, the learning loop finally closing, the presentations that landed, the contributor whose work made the system better than I could have made it alone — none of that happens when your energy is low and your cognitive capacity is diminished.

The system I built records everything: every pattern, every outcome, every failure. But the system does not recharge itself. The practitioner does.

The loop closed. The forest helped.


This is the twenty-ninth article in The Quality Forge series. Previous: “When the Load Doubled” described three weeks when the room got wider and the bandwidth did not. This one describes the two weeks that followed — the weekend in the forest, the contributor who filed forensic bugs, and the thirteen releases that finally closed the learning loop end-to-end. The releases are public on github.com/proffesor-for-testing/agentic-qe. Nagual-QE is open-source at github.com/proffesor-for-testing. Cognitum is at cognitum.one.

Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, Secretary of the Agentics Foundation Board, chair of the Agentic Engineering Training Committee, and one of the AI Chapter leads for the Ministry of Testing. He is currently building the Serbian Agentic Foundation Chapter in partnership with StartIt centers across Serbia.

V3 Journey Classical QE Feedback Loops Nagual-QE OWASP Agentic ReasoningBank Opus 4.7 PACT Framework
