
The Room That Quoted Back

The week the community started using my words, and the weight that came with them.

Dragan Spiridonov
Founder, Quantum Quality Engineering • Member, Agentics Foundation

Last week, a sentence I wrote, "this is not a trust architecture, it is a hope architecture," turned up in somebody else's newsletter, in somebody else's voice, talking to somebody else's readers. A practitioner I had never worked with published a public thank-you for the fleet's review of his project. A respected voice in the community put agentic-qe.dev into his public overview of the field. Two more testers I follow used language I had been using the week before as if it belonged to the conversation now, not to me.

The community is starting to quote back.

I want to write carefully about this, because the temptation when that happens is to write a victory lap. This is not that. This is the opposite. The week the room starts quoting back is the week the bill comes due. Every claim you have made is now sitting in somebody else’s argument. The work is to be worthy of being quoted.


What I Promised Last Week

In When the Compass Pointed Random, I said the dangerous failure mode in agentic systems is not a loud crash. It is a quiet success with wrong values. I said the way out is to build the predicate — the textbook fixture, the oracle question, the visible log line — and to keep someone in the room who will not accept the easy answer. I closed with: that is the whole craft. It is also the whole job.

This week, the room got larger. More eyes on the code. More practitioners running the fleet on their own projects. More people carrying the phrasing forward. The predicate had to survive not just my own dogfooding, but other people's real work.

That is a different weight.


Recognition Is a Responsibility

It is not the first time a practitioner has shared fleet output in public, and it is not the first time somebody has told me the reviews have saved them time. What keeps changing is the shape of accountability. Scott McMillan shared eleven concrete fleet findings applied to a document in his own project this week — architectural splits, async-safety guards, size thresholds, a deduplicated whitelist, and live assertions. His argument about the fleet is now public and mine to earn.

The rest of the week followed the same shape. Different people, different venues, similar dynamic. Somebody carries a phrase of mine forward in a newsletter. Somebody adds the fleet to a public overview. Someone reacts to a framing I used and makes it their own. Each of these is a gift. Each of them is a new contract.

Recognition is not a prize. It is a schedule. Sooner or later, the claim gets tested. You want the evidence under it to still be there when somebody looks.


What Actually Shipped

I spent fewer hours on the fleet this week than in any week of the last two months. That was deliberate. The Agentics Foundation Board had its first meeting on Monday, April 6. I was elected to the Board, then elected Secretary of the Board, and was later asked to lead the Education and Certification Chapter. Those decisions do not get half-committed. The rest of the week rearranged itself around them.

Even with the narrower window, four releases landed, and each one was aimed at a different kind of trust users are asking the tool to earn.

v3.9.8 — Insurance Around the Release Itself

This release touched no application code. What it shipped was infrastructure: a self-hosted mirror for the release-gate corpus so the gate keeps working when GitHub regenerates archives; CI enforcement of the failure-modes checkbox on every pull request, with hedging phrases that now fail the check unless they carry a tracking issue; a weekly chaos workflow that throws six genuinely hostile project shapes at the tool every Sunday to verify the watchdog catches a hang, not just that the happy path succeeds; and a human-readable verification entry point with a machine-generated matrix embedded in the release notes.

None of it changes the package’s bytes. All of it changes how confident the next release is allowed to be. That is the kind of work users never see and always pay for when it is missing.
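The hedging-phrase check is the part of that gate easiest to picture in code. A minimal sketch, assuming a hypothetical phrase list and a "#123"-style tracking-issue convention, neither of which is taken from the project's actual CI:

```python
import re

# Hypothetical hedging phrases and issue-reference format; the real
# gate's vocabulary is an assumption, not the project's published list.
HEDGING_PHRASES = ["should work", "probably", "edge case unlikely", "todo"]
ISSUE_REF = re.compile(r"#\d+")  # e.g. "probably fine, tracked in #123"

def check_failure_modes_section(pr_body: str) -> list[str]:
    """Return violations: hedging phrases not paired with a tracking issue."""
    violations = []
    for line in pr_body.splitlines():
        lowered = line.lower()
        for phrase in HEDGING_PHRASES:
            if phrase in lowered and not ISSUE_REF.search(line):
                violations.append(
                    f"unhedged claim without tracking issue: {line.strip()}"
                )
    return violations
```

The point of the design is that hedging is allowed, but only with a paper trail: the check fails the PR unless the uncertainty is attached to something somebody will track.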

v3.9.9 — A Real Browser in Ten Megabytes, Not Three Hundred

A new fleet skill, qe-browser, gives every QE agent browser access through a Vibium binary that speaks the W3C-standard browser protocol directly. Around it sit the parts that make a browser useful for quality work: sixteen typed assertion kinds, batched multi-step execution with pre-validation of every step, pixel-perfect visual diffing against stored baselines, a semantic element finder that understands intents like submit_form and accept_cookies instead of forcing agents to guess at selectors, and a fourteen-pattern prompt-injection scanner for the agents that read web content.

Eleven existing skills were migrated off their private browser glue and now compose on top of the shared primitive. When the engine is absent, the skill returns a typed skipped envelope with a distinct exit code, so downstream agents can branch on the case without grepping error strings.

I demoed it live at the Agentics Lab meetup the same week it shipped. The reaction I needed was the one I got: can we see that again on our site?

v3.9.10 — The Fleet Gets a Choice About Which Model Does the Thinking

A new advisor-routing layer means QE agents can now distribute advisory questions across multiple LLM providers rather than the single provider running the session. A simple "what test should I write next?" can go to a cheaper or self-hosted model; a harder reasoning task stays on a premium provider. Per-provider circuit breakers detect degradation and reroute automatically, so a single vendor outage no longer stalls the entire fleet. Credentials and personal data are scrubbed before any prompt leaves your environment, with three configurable modes for teams with different tolerances.

Eight of the most-used QE agents arrive with multi-provider routing enabled by default. In plain terms: the fleet now costs less to run, survives an outage at your primary provider, and stops being the path by which sensitive values accidentally reach somebody else’s logs.

v3.9.11 — The Upgrade Path I Had Been Skipping

I installed the latest release on top of the previous one in a different project. The upgrade did not carry the new agent definitions across. The test suite had passed. The test suite was not the upgrade. The question I had been asking the fleet for weeks — "did you try to install the old version, then upgrade in a test project, or was it easy to skip this?" — turned around and pointed at me. The fix was small.

Version detection now falls back to existing agent files when the config from a prior interrupted install is missing, and the installer logs a visible message when a helper or template source cannot be found, instead of silently skipping the copy.

Four releases. The headline is not the count. The fleet grew process insurance, grew a browser primitive, grew provider-independent reasoning, and grew a second dogfooding lesson in as many weeks about what verification from the user’s perspective actually costs.


What I Am Thinking About

I keep a self-learning system alongside my own work — a personal knowledge base that captures patterns from my consultancy and weekly reading so they are searchable when I need them next. Three threads went in this week that influenced how I am thinking about the next month.

The first is about confidence. When people consult an AI system and then decide, their subjective confidence goes up, whether the AI was right or wrong. The inflation is nearly symmetric. Under time pressure, it gets worse. This is exactly the failure mode I wrote about last week — succeeding just enough to become difficult to question — with numbers to back it up now. What makes an agentic system dangerous is not that it is wrong. It is that the humans around it feel more certain after consulting it, regardless of whether it was right.

The second is about flow. Most AI activity is measured at the task level: faster triage, faster drafting, faster coding. The local speed-up gets absorbed downstream as review load, exception handling, coordination drag, and hidden rework. The acceleration does not travel. The corollary is that the real edge is learning portability — making one context’s hard-won lesson the starting point for the next context’s decision, instead of paying for the same learning twice.

The third is about identity. The job of QE is being pushed from individual contributor toward something closer to a robot-manager: decomposing workflows, delegating to agents, supervising outputs, and carrying the judgment. That frame is honest about what is changing. The shift is not cosmetic. It is a new class of work, and the vocabulary for it is being written in public right now.

These three threads converge on one uncomfortable point. The cost of being wrong in an agentic setting is no longer paid in a single failed run. It is paid across every decision made with elevated confidence while the signal was unreliable, across every workflow that inherits the local speed-up as downstream toil, across every practitioner who is expected to supervise systems the profession has not fully characterized yet.

That is the weight of a community beginning to quote back.


The Other Work That Started This Week

I also started a new collaboration — helping Reuven with his new company and some of his open-source projects, each of which has more than 30,000 stars. Soon my name will sit next to code already in production for tens of thousands of users. I started testing on hardware, which is a new chapter for me after decades of working mostly in the software layer. I am entering it with the same rule I have been running every other project on: no shortcuts, find the proper solution. The rule scales. What does not scale is the amount of room for error when the population depending on the answer is already large.

The Ministry of Testing AI Chapter thread is gaining momentum. The Serbian Agentics Foundation Chapter continues, built out from StartIt centers across Serbia. The Agentics Vibe Casts, the Agentics Lab, and AI Hacker League visits continue. The podcast work continues. With the Agentics Foundation Board seat and the Chapter lead added, the frontage has widened.

Widening is not depth. Every new venue is an invitation to say more things, more quickly, to more people. The next test is whether the things I say hold up where they land.


Classical Testing Still Has the Vocabulary

The discipline for this moment is the same discipline I have been writing about every week on this blog.

Oracles still catch it. When the tool you built is used by someone else in their real work, the oracle question is no longer: "Does my fixture pass?" It is: "Does my fixture resemble the question somebody else is actually going to ask?"

Testability is a product of the tool, not of the tester. The reason the upgrade-path bug could be caught at all is that the installer now logs a visible message when it skips a step. A tool that cannot explain itself will be hard to check.

Risk-based focus scales with the audience. A failure affecting 10 users is a local problem. A failure that reaches 10,000 users through a 30,000-star repository is systemic. The tests that deserve the first seat on the bench are the tests covering the paths the broadest population actually touches.

Holistic Testing with PACT stays the model. Proactive — write the predicate before the community trusts the claim. Autonomous — the check runs without me having to remember it. Collaborative — the tester’s work counts, the practitioner’s work counts, the agent’s work counts. Targeted — the risk sits where the risk sits, not where the convenience sits.

Nothing new. Just old work applied to a wider room.


What I Tell the Room

At the Agentics Lab meetup this week, after the browser-skill demo, somebody asked me the question I get, in some variation, at every single event: "How do I start?"

I gave the five-line answer I have been giving on podcasts and at meetups for months. Master the basics before the agents — prompting, context, and harness, in that order. Invest in thinking, not just tools. Design systems, not just tests. Build your own AI-powered quality projects, small and open and unafraid. Verify everything — ten percent of the work is the build, ninety percent is the validation.

What I would add this week is a sixth line. It is for the moment when the room begins to quote you back. When practitioners you respect start carrying your phrasing forward on their own. When the work starts showing up in public in ways you did not plan.

Remember who is trusting your words, and make sure the evidence under them is still there when they look.

Recognition is a liability with a schedule. Sooner or later, the claim gets tested. You want the predicate to be there when it happens. You want the fixture, the oracle, the log line, the inconvenient person in the room. You want the discipline of classical testing wired into every new release, every new skill, every new collaboration, every new chapter.

We are in turbulent times for agentic systems. The community is starting to have a real conversation about what holds up and what does not. When your phrasing travels, that is not the finish line. It is the opening argument. What happens next depends entirely on whether the evidence under the phrasing is still there when somebody looks.

The room is quoting back. The work is to be worthy of being quoted.

That is the whole craft this week.


This is the twenty-seventh article in The Quality Forge series. Previous: "When the Compass Pointed Random" described a week of five hotfixes and a textbook fixture that re-magnetized the compass. This one describes the week after, when the phrasing started traveling beyond my own workspace. The releases described are public on github.com/proffesor-for-testing/agentic-qe. Scott McMillan's public write-up of the fleet review and the other community mentions described here are linked from the Agentics Foundation community channels.

Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, Secretary of the Agentics Foundation Board, and lead of the Education and Certification Chapter. He is currently building the Serbian Agentics Foundation Chapter in partnership with StartIt centers across Serbia.

