The last article ended with me saying the line back to every room that had said it to me. Own your own AI. Own the harness, and the models can come and go without taking your capability with them.
Three weeks later, something shifted. I stopped having to say the line first. The industry started saying it on its own, in articles, in benchmark write-ups, in market-category renamings. And the community that first met in person in Budapest did what I hoped it would when I wrote that closing section: it started traveling to each other’s cities. Adam and Klara have been here in Novi Sad since Wednesday, and my job this week has been less about releases and more about being a decent host.
This is the article about the three weeks when the loop stopped being something I explain and became something that came home, in both meanings of the word.
Beta Testing the MetaHarness
In the middle of June, Reuven Cohen asked for beta testers for his new project, the agent-harness-generator, the MetaHarness. A harness that generates harnesses, measured and governed across nine OIA layers, with a learned router that promotes a model only when it beats the incumbent on quality AND cost AND latency, and an MCP that defaults to deny. I have been following his lead on harness work for over a year, so I did what I do: I pointed the fleet at it.
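The two rules in that description, promote only on a win across all three axes and deny by default, are easy to state in code. What follows is a minimal sketch under my own assumptions, not the MetaHarness implementation: the `ModelStats` fields, the function names, and the choice of strict inequality for "beats" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    """Measured results for one model on the same evaluation set (hypothetical fields)."""
    name: str
    quality: float     # higher is better
    cost_usd: float    # lower is better
    latency_ms: float  # lower is better


def should_promote(challenger: ModelStats, incumbent: ModelStats) -> bool:
    """Promote only if the challenger wins on quality AND cost AND latency.

    Strict inequality is an assumption: a tie on any axis keeps the incumbent.
    """
    return (
        challenger.quality > incumbent.quality
        and challenger.cost_usd < incumbent.cost_usd
        and challenger.latency_ms < incumbent.latency_ms
    )


def is_tool_allowed(tool: str, allowlist: frozenset[str]) -> bool:
    """Default deny: a tool is usable only if it was explicitly allowed."""
    return tool in allowlist
```

The point of the conjunction is that a model which is better but slower, or cheaper but worse, never displaces the incumbent by itself; it only shows up as a tradeoff.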
The QCSD development swarm ran a full quality analysis of the project, deep product analysis, user journeys, the works. And on the first pass, it skipped its own specialized QE agents and tried to hand me a shallow report. I caught it, told it exactly which part of the skill definition it had ignored, and it re-ran properly. Completion theater in action, again. “The agent says done” is a claim, not a fact, and the discipline of checking that claim is the most transferable classical QE skill I know.
The reports went back to Ruv as a public issue and a set of gists, the same pattern as the Universal Pattern Space review last month: one honest review becomes a shared test-and-eval bed. And, as last time, reviewing someone else’s system surfaced disciplines worth turning back on myself. That became a cross-pollination plan in the AQE repo, a GOAP plan with implementation, integration, and verification lanes, and it set the direction for almost everything I shipped in the weeks that followed.
The Benchmarks That Were Allowed to Say No
Most of my time in the second half of June went into one question, and it is the question I keep hearing from every team that gets past the demo phase: can cheaper models do QE work without losing the quality?
The idea is simple and a little uncomfortable. Most “AI for QE” tools run one expensive frontier model for everything. But verifying a test is far cheaper than writing one, and much of QE work is bounded enough that a cheap local model can do it if the software around the model is smart about when to spend more. So I borrowed the economic structure Ruv proved in his MetaHarness work (cheap-first, repair, then escalate) and did the one thing QE lets you do that open-ended coding can’t: I measured it against ground truth. A generated test either kills the mutant or it doesn’t.
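For readers who think in code, here is a minimal sketch of the cheap-first, repair, escalate shape. It is an illustration under stated assumptions, not the darwin-qe code: `Rung`, `Outcome`, `run_ladder`, the flat per-call cost, and the repair budget of one are all hypothetical. The only part taken from the text is the rule that an objective oracle, here a mutation kill rate, decides whether a result is accepted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical signatures: a generator takes (task, feedback) and returns test code;
# the oracle actually runs that code and returns a mutation kill rate in [0, 1].
Generator = Callable[[str, Optional[str]], str]
Oracle = Callable[[str], float]


@dataclass(frozen=True)
class Rung:
    name: str
    generate: Generator
    cost_usd: float  # assumed flat cost per call; 0.0 for a local model


@dataclass(frozen=True)
class Outcome:
    rung: str
    kill_rate: float
    spent_usd: float
    accepted: bool


def run_ladder(task: str, rungs: Sequence[Rung], oracle: Oracle,
               floor: float, repairs: int = 1) -> Outcome:
    """Try rungs cheapest first; repair on each rung before escalating."""
    if not rungs or repairs < 0:
        raise ValueError("need at least one rung and a non-negative repair budget")
    spent = 0.0
    score = 0.0
    for rung in rungs:
        feedback: Optional[str] = None  # the first attempt on a rung has no feedback
        for _ in range(1 + repairs):
            tests = rung.generate(task, feedback)
            spent += rung.cost_usd
            score = oracle(tests)  # measured, never self-reported by the model
            if score >= floor:
                return Outcome(rung.name, score, spent, True)
            feedback = f"mutation kill rate {score:.2f} is below the floor {floor:.2f}"
    # Every rung failed: report the last measurement honestly instead of accepting it.
    return Outcome(rungs[-1].name, score, spent, False)
```

Ordering `rungs` from free local models to paid ones is what keeps routine tasks at zero dollars: money is spent only after the cheap attempt and its repair have both failed the oracle.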
The lab was my own desk. Ollama on my Mac, the fleet in a devpod, and a parade of models through the benchmark: gemma 4 12B, qwen3 8B, qwen3 30B, a 27B model a friend managed to run on a mini PC, a small coder model to finish the curve, and cheap cloud models like GLM 5.2 and DeepSeek through OpenRouter. The oracle was never an LLM’s opinion. It was mutation kill, real coverage, and deterministic suite cost, the same fitness function that drives the aqe arena, where test strategies compete in reproducible tournaments instead of being chosen by dogma. Bring an arena, not an opinion.
What the benchmarks actually said:
An 8B local model is below the quality floor for test generation. A 30B clears it, at roughly 89% mutation kill. The cheap-first, repair, escalate ladder keeps somewhere between 70 and 83% of tasks at zero dollars while staying within the noise of a frontier model. Pulling candidates from different model families adds about six quality points, because the families cover each other’s failures. And never let a model grade its own homework: only a real test run, a coverage number, or a mutation result is allowed to move the routing confidence.
I kept the negative results too, because they’re the trustworthy part. “Cheap model replaces the frontier” was rejected by our own data. Ruv’s own ablations refuted two of the three patterns everyone was praising in his repo, including the cross-provider coder swap and the frontier “sniper” pass on top of a pure-cheap run. The law that survived is blunt: the coder binds, not the oracle. Reasoning is still the wall. The honest architecture is cheap evidence plus frontier reasoning, not cheap everything.
All of it shipped across v3.10.9 through v3.11.4 as the darwin-qe lane, and it is opt-in, off by default, so nothing changes for a user unless they turn it on. For users, the practical sentence is short: routine test generation can now run free-local-first with repair and escalation in place, and the hard cases still climb the paid ladder.
The same thread ran through the Cognitum work. The edge appliance is the same argument in hardware: a small brain ingesting sensor data from multiple sources, analyzed entirely with local models, nothing leaving the premises. And it produced my favorite honest bug of the month: an appliance release that passed every gate while a binary was silently missing from the artifact. The checksum list said the file existed; the release pipeline never verified the payload matched the list. This is the third time I have written this lesson in this blog, and it will not be the last: the most dangerous failure is the one that passes every check.
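The missing gate is small. Here is a minimal sketch of it, assuming the checksum list uses the common `sha256sum` layout of one `<hex digest>  <relative path>` pair per line; that format, the function name, and the directory layout are my assumptions for illustration, not the actual Cognitum release pipeline.

```python
import hashlib
from pathlib import Path


def verify_payload(artifact_dir: Path, manifest: Path) -> list[str]:
    """Check every manifest entry against the real files.

    Returns a list of problems; an empty list means the payload matches the list.
    """
    problems: list[str] = []
    for raw in manifest.read_text().splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            problems.append(f"malformed manifest line: {line}")
            continue
        expected, name = parts
        name = name.lstrip("*")  # sha256sum prefixes binary-mode entries with '*'
        target = artifact_dir / name
        if not target.is_file():
            # The failure described above: listed in the manifest, absent from the artifact.
            problems.append(f"missing: {name}")
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append(f"checksum mismatch: {name}")
    return problems
```

A release step would fail the build whenever the returned list is non-empty. The check reads the payload itself, so a manifest that merely claims a file exists is no longer enough to pass.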
The Industry Starts Saying It Out Loud
At the end of June, I read the Forward Future piece “Build the Loop, Not the Agent” over morning coffee, and it read less like news and more like confirmation. The thesis is one the Agentics Foundation has been living for over a year: you are not building an agent, you are building the loop that rebuilds it. Models age out faster than any modernization project can finish, so the durable asset was never the agent. It is the loop that re-tunes, re-evaluates, and re-ships it, in days, not quarters.
I have said the QE version of this for years: quality is built in, not tested in. The loop is the architecture, and it is a living one. But a living loop is only worth trusting if it is allowed to report failure. That is the part I have been building into the fleet: the review swarm runs three blind refuters per finding, a majority can kill a finding, and killed findings keep their refutations for audit. Verdicts are structured contracts: approve, block, escalate, with evidence attached, not a bare green check. And the benchmark gate is allowed to abort: if a tuned QE harness does not beat the plain model on objective scorers, we ship the negative result, not the story.
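To make those three rules concrete, here is a minimal sketch. Everything in it is hypothetical naming of mine rather than the fleet's API, and two details are assumptions: that "majority" means a strict majority of the refuters who voted, and that "beat the plain model" means winning on every shared objective scorer.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Refutation:
    refuter: str
    refuted: bool
    reason: str


@dataclass
class Finding:
    title: str
    # Refutations stay attached even when the finding is killed, for audit.
    refutations: list[Refutation] = field(default_factory=list)

    @property
    def killed(self) -> bool:
        """A strict majority of refuters kills the finding."""
        votes = sum(r.refuted for r in self.refutations)
        return votes * 2 > len(self.refutations)


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    evidence: tuple[str, ...]

    def __post_init__(self) -> None:
        # A verdict without evidence is just a bare green check.
        if not self.evidence:
            raise ValueError("a verdict must carry evidence")


def gate_passes(tuned: dict[str, float], plain: dict[str, float]) -> bool:
    """True only if the tuned harness beats the plain model on every scorer."""
    if not plain or tuned.keys() != plain.keys():
        raise ValueError("both runs must report the same non-empty set of scorers")
    return all(tuned[name] > plain[name] for name in plain)
```

When `gate_passes` returns `False`, the sketch does nothing clever: the caller publishes the negative result, which is the whole design choice.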
Ruv put the market version of the same argument plainly: the easiest way to lie with AI is to focus only on benchmark scores. Cost and performance are one graph, and there is no universal “best,” only tradeoffs on a Pareto frontier. Gartner is even renaming the AI-augmented testing market category as the tools turn agentic. The vocabulary is catching up.
Here is the strange part, and I only see it clearly when I talk to people outside the Foundation. Inside our meetups and channels, harnesses, loops, refuters, and local judges are Tuesday. Outside, most of the industry is still at “we bought licenses, why is nothing changing?” The gap I measured in Madrid and Budapest has not closed; if anything, the distance is easier to see now that the industry has started using our words. And that gap is exactly why the answer cannot be gatekeeping. The knowledge and the tools have to be available to everybody. Open source wins in the end, not because it is morally nicer, but because a community that shares its harnesses compounds faster than any vendor roadmap.
Two Meetups, Two Formats
Meetup thirteen, June 16 at the AI Hub in Novi Sad, was the format the regulars know: real codebase, real agents, rough edges on screen. I continued a Claude Code session that was almost a month old and kicked off a loop that fixed bugs from our known list in five-minute iterations, each fix in its own subtask, each one validated from the user’s perspective by the browser skill. Then I ran the QCSD refinement swarm on the community social network app and let the audience watch it give me an honest status report: development in progress, not ready, effectively untested beyond the unit level, one specification for the whole web app. The swarm also tried to skip its specialists on the first pass, and the audience got to see me catch it live. And the finding of the night came from the adversarial cross-validation agents: a privilege-escalation bug in which any active member could grant themselves the admin role. Nice bug. The kind a tired human reviewer scrolls past on a Thursday evening, and exactly the kind of cross-file connection agents hold better than we do.
Meetup fourteen, July 2, was something we had never done: a panel. Enterprise AI adoption, at the StartIt AI Hub. Three guests: Adam Kovacs and Klara Hermesz of the AI Enablement Academy, in town from the Seattle chapter, and Predrag Skoković of Quality House, thirty years in the industry and one of the people who built the testing community in this country.
I will not compress ninety minutes into a paragraph (the recording is public), but three moments stayed with me. Adam set the tone early: “If you make AI the goal, you will fail. AI is still just a tool, you are still solving business problems.” Klara told the story of a former employer that handed ChatGPT to tens of thousands of employees with zero training, and two years later cannot measure whether anyone uses it well: “Giving access to tools does not mean that you enable them.” And Predrag, carrying the regulated-industry perspective, kept returning the room to responsibility: if AI writes the code, a person still stands behind it, and his advice for where to start was two words long: “Automate yourself.”
The moment I will carry to Munich came from the audience. A QA practitioner told the room that he had built his own agentic QA setup, inspired by the fleet, and cut a sprint’s test workload from 7 days to 2. Not to work less. The freed time goes into deeper manual testing, verification, and arguing with product owners, “especially when they write with AI too.” His company is hiring more QAs, not fewer. His sentence, not mine: “QAs are needed now more than ever.”
Adam checked the numbers across the Foundation and told us something I had not realized: fourteen editions in, the Serbian chapter is the longest-running in-person Agentics Foundation meetup in the world. London started at the same time and is on its second. I do not say that to boast. I say it because showing up fourteen times is the whole method.
Training the Trainers
The Training Committee work I promised in the last article took its first concrete shape: a full proposal outlining how the Foundation should launch global agentic engineering training, delivered in three versions: one for the committee, one for the community, and one for partners.
The architecture came out of my own Nagual assets: a six-level competency model from AI-Curious to Architect and Mentor, and five tracks (foundations; building agents; agentic QE; operations, safety, and governance; and applied capstone work), with a Train-the-Trainer spine running through all of it. Certification is evidence-based and project-assessed, because a multiple-choice test cannot tell you whether someone can ship a governed agent system; only a shipped project can. And the design principle I refuse to compromise on is the delivery: pairing and ensemble, learning by shipping, mentors all the way down. The KPI I care about is not enrollment. It is the mentor multiplication rate: how many trainers each trainer produces.
If the loop is the product, then this is the training loop: train trainers who train trainers. The Serbian chapter, with the StartIt centers, is the pilot. Scaling knowledge is the same problem as scaling agents, and it has the same failure mode: throwing tools at people in a top-down way, with no enablement. The panel spent an evening explaining why that fails in enterprises. We do not get to make the same mistake in our own Foundation.
Guests in the House
And now the part of the article that no benchmark can measure.
A heat wave parked itself over Novi Sad, the kind where the city moves to the riverbank and waits for evening. Fable 5 came back to general availability, a quiet reversal of the access decision I wrote about last time, and a reminder that the leash I described gets loosened as arbitrarily as it gets tightened. I noted it, verified my stack did not care either way, and went back to work. That indifference is what owning your harness buys you.
But the real event of the week arrived on Wednesday: Adam and Klara, who came to Novi Sad to spend a few days. So I have been a host: walking them through the city, giving the history lessons that come naturally when your mother was a history teacher, and feeding them local food. So far, so good; they love it. One more field trip today, and tomorrow they continue their journey.
In between the sightseeing, the conversations kept circling one theme: how to help people adopt AI in a responsible and organized way, enablement before automation, competence before scale. The panel on Wednesday was one output of those conversations. The training plan will be larger.
A year ago, these were faces in a Zoom grid. In Budapest we could shake hands and hug. Now they are guests at my table, and the meetup we ran together made the local community measurably richer. This is what I meant when I wrote that the community traveling to each other’s cities is exactly the texture I want the Serbian chapter to have. It is no longer an aspiration. It happened, in my own city.
What Is Ahead
Next week, I am in Munich at Accenture’s Quality Matters event, sharing my updated story of the golden age for QA. I have been making this argument for months, and every week hands me more evidence: the practitioner who cut test work from seven days to two and became more valuable, not less. The benchmark that needed a mutation oracle, a classical testing idea, to keep a cheap model honest. The refuters, the verdicts, the gates that are allowed to say no, all of it is testing discipline, promoted to the center of the architecture. The skills good QA and QE people carry — risk thinking, oracle design, the reflex to ask “how do I actually know this worked?” — are not being retired by this technology. They are being priced up by it.
Adam closed our panel with his version of the line: keep building, keep sharing, stay curious. I will close with mine, the one the regulars already know.
The industry is learning to build the loop. We spent the year giving the loop permission to tell the truth. And this month, for a few days in a heat wave, the loop came home.
Stay curious. Keep learning. Keep sharing. Knowledge is power.
This is the thirty-second article in The Quality Forge series. Previous: “The Same Line in Every Room” described Madrid, Budapest, and the moment vendor independence became the work. This one describes the three weeks after: beta testing Ruv’s MetaHarness, the darwin-qe local-model benchmarks shipped across v3.10.9 to v3.11.4, two meetups in two formats including our first panel, the Agentics Foundation training proposal, and the week Adam and Klara came to Novi Sad. The releases are public on github.com/proffesor-for-testing/agentic-qe. The meetup 14 recording is on YouTube. Cognitum is at cognitum.one.
Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, Secretary of the Agentics Foundation Board, chair of the Agentic Engineering Training Committee, and one of the AI Chapter leads for the Ministry of Testing. He is currently building the Serbian Agentics Foundation Chapter in partnership with StartIt centers across Serbia.