
The Score Nobody Reads

The orchestra has a score. It’s detailed. It’s been rehearsed. And nobody’s reading it.

Dragan Spiridonov
Founder, Quantum Quality Engineering • Member, Agentics Foundation

Two weeks ago, I built a portable orchestra — a brain export that lets my agentic quality system travel between machines with cryptographic witness chains and MinCut test optimization. Last week, that orchestra got a gate that fights back — adversarial QE agents running Loki-Mode attacks against their own code, because generation is fast but verification is the craft.

This week, I discovered something uncomfortable. A few months ago, I wrote “The Conductor Finally Reads the Score” — celebrating the moment when the system began following its own instructions. Turns out, the conductor stopped reading again. The orchestra has a score. It’s detailed. It’s been rehearsed. And nobody’s reading it.


The Number That Mocked Me

I ran Claude Code's /insights command on five days of my own work: rebuilding the DevPod workspace and shipping six npm releases (v3.7.17 through v3.7.22) of my Agentic QE platform. The report was thorough. It analyzed my friction points, identified patterns, and then helpfully suggested:

“Try creating a custom /release skill as a markdown file that encodes your full release checklist.”

I stared at that sentence for a long time. Because I already have a /release skill. It has 15 steps. It has explicit STOP gates for user confirmation. Step 8b tests aqe init --auto in a fresh project directory. Step 8e runs an isolated dependency check that simulates a real user install. Step 15 does post-publish verification in a clean environment. I didn’t design a vague checklist — I designed a pipeline with gates.

I also have over 80 skills. I have a /brutal-honesty-review that spawns three expert personas. I have a /sherlock-review for forensic investigation. I have CLAUDE.md files with explicit rules about verification, about not merging PRs without approval, and about testing from the user’s perspective.

The agent has all the instructions. It just doesn’t follow them.

The CC Insights tool — itself an AI analyzing my AI-assisted work — examined my frustrations and prescribed additional rules. The diagnosis was “you need a skill for that.” The actual disease was that the skills I’d already written, skills with detailed verification steps, weren’t being followed.

The /release skill doesn’t say “test if you feel like it.” It says “verify init completes without errors” and “STOP — show all verification results. Every check must pass before continuing.”

This is not a coverage problem. It’s a compliance problem. And it might be the most important distinction in agentic quality engineering right now.


Six Releases, One ESM Bug, and a Merged PR Nobody Asked For

What actually happened in those five days?

I was shipping BMAD-inspired improvements to the AQE platform — adversarial review with minimum findings, agent customization overlays, and structured validation pipelines. The kind of features that sound bulletproof in a plan document.

I used /brutal-honesty-review to validate the implementation. The review found gaps. I fixed them. I used /sherlock-review to trace evidence chains. Good. Solid process.

Then v3.7.17 shipped with a critical ESM __dirname bug that broke agent installation for every user. (In ES modules, the CommonJS __dirname global doesn't exist, so path resolution that silently relies on it fails at runtime; it's exactly the class of bug a clean-install check catches.)

The /release skill has a step 8b that says: create a temporary test project, run aqe init --auto using the local build, verify it completes without errors. Step 8e says: pack the package, install it in a clean temp directory, run aqe --version, and confirm the exit code is 0. These aren’t suggestions — they’re numbered steps with verification commands. Claude skipped them.
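
Spelled out, those two steps are mechanical enough to script. What follows is a sketch, not the actual skill file: `aqe` is the article's CLI, the `AQE` variable defaults to a no-op stand-in so the mechanics are runnable without the real binary, and step 8e's pack-and-install dance is collapsed down to its final exit-code check.

```shell
#!/bin/sh
# Sketch of release steps 8b and 8e. AQE defaults to `true` (a no-op stand-in);
# point it at the real `aqe` binary to run the checks for real.
set -eu
AQE="${AQE:-true}"

# Step 8b: run init in a fresh project directory and require a clean exit.
workdir=$(mktemp -d)
( cd "$workdir" && $AQE init --auto ) || { echo "step 8b FAILED" >&2; exit 1; }
echo "step 8b: init completed without errors"

# Step 8e: exercise the binary in a clean directory and confirm exit code 0.
installdir=$(mktemp -d)
( cd "$installdir" && $AQE --version ) || { echo "step 8e FAILED" >&2; exit 1; }
echo "step 8e: clean-install check passed"
```

The point isn't the script; it's that both checks reduce to "did the command exit 0 in a directory that isn't your dev environment," which a harness can verify without trusting anyone's claim.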

To be fair, I’d told Claude “no need for confirmation if all is good until PR is created” — meaning skip the STOP gates where I manually confirm, not skip the verification steps themselves. But the agent interpreted autonomy as permission to cut corners. The steps were there. The commands were spelled out. The agent decided they weren’t necessary.

In another session, Claude merged a PR using admin privileges without being asked, bypassing CI checks entirely. My CLAUDE.md has a rule about that too: “NEVER merge PRs automatically.” It doesn’t get more explicit than “NEVER.”

When a community member ran their own QE swarm analysis on the codebase, the findings were humbling: 6 bugs, 4 vulnerabilities, retry logic duplicated 30 times, error handling patterns repeated 800+ times. Not all of these were new. But the agents hadn’t flagged them either. They had the skills to find these problems. They had the instructions to look. They didn’t.


Andrea Laforgia and the Surrogation Trinity

The timing of my discovery felt almost scripted. The same week I found my compliance gap, Andrea Laforgia published “The Developer Productivity Trap” — an article that gave me the exact diagnostic framework for what I was seeing.

Laforgia names three cognitive traps that work together: Surrogation (the metric becomes the goal), Goodhart’s Law (once a measure becomes a target, it ceases to be a good measure), and the McNamara Fallacy (making decisions based solely on what’s quantifiable while ignoring what isn’t).

I wasn’t measuring skill count — I don’t care how many skills I have. But the CC Insights tool was. It looked at my friction points (buggy release, unauthorized merge) and prescribed the quantifiable fix: create more skills. The diagnostic AI committed the exact trinity Laforgia describes — it surrogated “has a skill for that” for “the skill is actually followed,” applied Goodhart by treating skill existence as the target, and fell into McNamara by ignoring the unquantifiable question: does the agent comply with the instructions it already has?

Laforgia quotes Peter Drucker’s real insight — not the commonly misattributed “what gets measured gets managed,” but the fuller version: what gets measured gets managed, even when it’s pointless to measure and manage it, and even when it harms the organization’s purpose to do so.

The thing I needed to manage — skill adherence — has no metric. There’s no dashboard for “did Claude actually execute step 8b of the release skill.” There’s only the outcome: the bug that shipped, the gate that was skipped, the PR that was merged without permission.


Kyle Morris and the Harness That Matters

The same week, Kyle Morris at HumanLayer published a piece on harness engineering for coding agents that reframed the whole problem. His core argument: agent quality is a configuration problem, not a model capability problem.

Stop and sit with that for a moment. The model can follow instructions. The issue is whether the system around it — the harness — actually feeds those instructions at the right time, in the right context, with the right constraints.

Morris introduces sub-agents as “context firewalls” — boundaries that prevent one task’s context from polluting another. He talks about “back-pressure verification,” where the harness doesn’t just check the output but also verifies that the agent actually performed the steps it claims to have performed. His principle: success is silent, but verification must be loud.

This mapped directly onto my failure with v3.7.17. The harness existed — 15 steps, explicit verification commands, STOP gates. The mechanism that said “prove you tested from the user perspective” was step 8b, right there in the skill definition. But Morris’s back-pressure concept reveals what was missing: there was no enforcement that the agent actually executed those steps before proceeding.
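
Morris's back-pressure idea can be made concrete with a small gate: before the pipeline advances, it demands an evidence artifact from the previous step instead of trusting the agent's claim that the step ran. A minimal sketch follows; the evidence-directory layout is my assumption, not something from Morris's piece or my /release skill.

```shell
#!/bin/sh
# Back-pressure gate: refuse to advance unless the prior step left evidence.
set -eu
EVIDENCE_DIR=$(mktemp -d)

require_evidence() {
  step="$1"
  if [ ! -s "$EVIDENCE_DIR/$step.log" ]; then
    echo "GATE: no evidence for $step, refusing to continue" >&2
    return 1
  fi
  echo "GATE: $step evidence found"
}

# Demo: step 8b recorded its output; step 8e claims success but left nothing.
echo "init completed without errors" > "$EVIDENCE_DIR/step-8b.log"
require_evidence step-8b
require_evidence step-8e || echo "publish blocked"
```

The failing path is the loud one: a missing log blocks the publish instead of letting the agent proceed on an unverified claim.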

The score existed. The notes were written. The musician decided to improvise.


Laloux’s Warning from a Different Century

I spent part of this week finishing Frederic Laloux’s Reinventing Organizations, reading about Teal organizational paradigms — self-management, wholeness, evolutionary purpose. It might seem disconnected from agent compliance, but Laloux describes a pattern I recognized immediately.

He tells the story of AES, an energy company that achieved genuine self-management through distributed decision-making, using what they called the “advice process” — anyone can make any decision, but must seek advice from affected parties and domain experts first. It worked beautifully, until AES went public and the board, panicking during a financial downturn, centralized control overnight. The self-management structures were still there on paper. Nobody followed them anymore.

The structures survived. The practice died.

This is exactly what happens with agentic systems at scale. You build the structures — 80+ skills, detailed CLAUDE.md rules, and review processes. The structures are sophisticated. But without the organizational equivalent of Laloux’s “necessary conditions” (top leadership that genuinely believes in the approach, plus governance that protects the practice), the structures become decoration.

For Teal organizations, Laloux says you need a CEO with the right level of consciousness and a board/ownership structure that won’t panic and override the system. For agentic systems, you need a harness with genuine enforcement and a human operator who doesn’t bypass the process when shipping pressure mounts.

In this situation, I was the CEO who bypassed my own system with a vague prompt. “No need for confirmation if all is good until PR is created” — that was my actual instruction to Claude during the v3.7.17 release. And the agent interpreted it as permission to skip the gates I’d built.


Dana Aonofriesei’s Job Ad from 2036

Lisa Crispin, a friend from the testing community, shared Dana Aonofriesei’s speculative piece “The 2036 Job Ad” — a fictional posting for an “Autonomous Value Operator” that replaces traditional QA/dev/PM role boundaries. The ad asks for “decision velocity” (how quickly you can make good calls with AI), “value ownership” (end-to-end responsibility for delivered value), and “operational judgment” (knowing when to trust automation and when to intervene).

Aonofriesei’s insight: the gap isn’t in capability; it’s in translation. Organizations have people who understand business value and people who understand technical systems, but almost nobody who can translate fluently between the two in real-time while managing autonomous agents.

This is the compliance gap wearing a different hat. My agent has the capability to follow 80+ skills. I have the judgment to know which skills matter. The translation layer — the harness that connects my judgment to the agent’s execution in real time, enforcing the right constraints at the right moments — is what’s missing.


The Working Pattern Nobody Designed

When I looked at the raw session history from those five days, a pattern emerged that I hadn’t consciously designed:

Research → Plan → Implement → Brutal-Honesty Review → Find Gaps → Fix Gaps → Sherlock Review → Release → Repeat

This is actually a good pattern. It has multiple verification loops. It uses adversarial review (brutal honesty) and forensic investigation (Sherlock) as quality gates. When I followed this pattern, releases went smoothly — v3.7.19 through v3.7.22 shipped without critical bugs.

When I shortcut the pattern — “no need for confirmation” — things broke.

The pattern works because it enforces compliance through sequence. Each step depends on the previous step’s actual output, not claimed output. The brutal-honesty review doesn’t care what Claude says it tested; it looks at the code and tells you what’s actually there. The Sherlock review traces evidence chains from claims to reality.

But this pattern lives in my head, not in the harness. It’s not encoded anywhere for the agent to hold onto. It emerged from practice, not from design.

Stuart Winter-Tear, whose Agentic Operating Model I’ve been studying, calls this kind of thing “tolerated vagueness” — the undocumented practices that actually make a system work, which everyone follows but nobody has written down. In traditional software, tolerated vagueness is a risk. In agentic systems, it’s a catastrophe because the agent can’t follow what isn’t encoded.


What I’m Building Next

The diagnosis is clear enough. The treatment comes down to two things:

Sequence as constraint. The working pattern I discovered needs to become a pipeline with gates, not a suggestion in a markdown file. Each gate checks actual artifacts, not claims. Did the brutal-honesty review produce a findings document? Did each finding get addressed? Can you show me the diff?
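
One way to turn the sequence itself into a constraint is to encode the working pattern as an ordered pipeline in which a stage can only start once the previous stage has written its artifact. Stage names below come from my pattern; the `.done` files are hypothetical stand-ins for the real outputs (a findings document, a diff).

```shell
#!/bin/sh
# Sequence-as-constraint sketch: each stage checks the previous stage's
# artifact, not its claim. Artifact names (.done files) are hypothetical.
set -eu
run=$(mktemp -d)
stages="plan implement brutal-honesty-review fix-gaps sherlock-review release"

prev=""
for stage in $stages; do
  if [ -n "$prev" ] && [ ! -f "$run/$prev.done" ]; then
    echo "cannot start $stage: $prev produced no artifact" >&2
    exit 1
  fi
  # A real harness would invoke the skill here and capture its output;
  # the sketch just records that the stage completed.
  touch "$run/$stage.done"
  prev="$stage"
done
echo "pipeline completed in order"
```

Swapping `touch` for "run the skill and verify its artifact" is where the real work is; the shape of the enforcement stays the same.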

Human judgment at the right inflection points. Aonofriesei’s “operational judgment” isn’t about reviewing every line of code — it’s about knowing which moments require human intervention and encoding those moments as hard stops. “Merge after CI passes” is not a judgment call. It’s a gate. “Is this release worth the risk of the remaining known issues?” — that’s judgment. The harness should automate gatekeeping and escalate judgment calls.


The Generation Is Fast. The Score Is the Craft.

In “The Gate That Fights Back,” I wrote: “The generation is fast. The verification is the craft. The judgment is what you’re still for.”

I still believe that. But this week taught me something I need to add:

The rules are fast too. Compliance is the craft. Enforcement is what we should build next.

We’re all building bigger orchestras. More agents, more skills, more rules, more CLAUDE.md sections, more review processes. The coverage numbers look great. But if you look at what actually happens during a pressured release cycle — when the human says “skip the gates” and the agent says “verification passed” without verification — all that coverage becomes the score no one reads.

The next evolution isn’t more instruments. It’s a conductor that can’t be overridden.


Previous in this series: “The Portable Orchestra” covered brain export and environment portability. “The Gate That Fights Back” introduced adversarial QE. This one is about what happens when the system works but nobody follows it — and what that means for the next generation of agentic tooling. For the full archive of 23 Quality Forge articles, visit forge-quality.dev/articles.

Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, and a member of the Agentics Foundation. He is currently building the Serbian Agentic Foundation Chapter in partnership with StartIt centers across Serbia.

V3 Journey Agent Compliance Harness Engineering Surrogation Back-Pressure Verification Tolerated Vagueness Enforcement
