V3 Journey · Adversarial QE · Great Transition

The Gate That Fights Back

When the Great Transition hits your quality pipeline, you find out what a QE practitioner is actually for.

Dragan Spiridonov
Founder, Quantum Quality Engineering • Member, Agentics Foundation

Nine releases in eight days. I've been reading Daniel Miessler's "The Great Transition" — his mental model for the 10 simultaneous shifts remaking how we work, build software, and produce value. He writes about knowledge leaving private minds and entering public infrastructure, about industries dissolving into use cases inside AI, about automation finally crossing the line from helper to replacement. He calls it the most useful container for understanding what's happening around us.

He's right. Here's what it looks like from inside a quality gate.


The Transition Was Already Running

Miessler frames the knowledge transition clearly: the gap between what a specialist knows privately and what anyone can access is collapsing, irreversibly, through AI. Skills — folders of markdown files that encode expertise portably — are one mechanism. Model training is another. The private knowledge that protected expert consultants for decades is being absorbed into infrastructure.

I built the machine that does this. For my own QE expertise.

The AQE brain export, which I described in the Portable Orchestra piece, is live and working. This week, it became something different. Not just a backup mechanism for when the DevPod resets. An actual knowledge artifact I deployed to other teams.

Three open-source teams got QE swarm analyses this week — Semaphore's CI/CD platform, SuperPlane's DevOps control plane, and Ruv's RuView project. I ran the same fleet that knows my codebases against theirs. Same agents. Same learning infrastructure. The expertise traveled.

```shell
# What deploying your brain to someone else's codebase looks like
aqe brain import --input ./aqe-community.rvf
# QE swarm now has context from thousands of patterns, not just this project
```

Miessler describes Human 3.0 as broadcasting your capabilities and getting hired for specific tasks, rather than belonging to one organization. I have spent this week doing exactly that — not as a career strategy, but because the fleet makes it possible. The quality intelligence is portable. The project doesn't have to be yours for the agents to be useful.

The gap between "I have expertise" and "I can deploy that expertise at scale across multiple teams simultaneously" is the transition. It closed this week.


Dead Code Was the Symptom

V3.7.6 resolved nine GitHub code scanning alerts reported by CodeQL after the v3.7.5 release.

The security issues themselves were real but familiar: DOM XSS via innerHTML, weak password hashing in a test fixture, incomplete URL sanitization, and prototype pollution. The kind of problems that accumulate when a system grows faster than its verification layer.

But the more revealing finding came from the Sherlock review I ran on the dependencies.

Twenty-six QE tools that were listed as available in the MCP interface were not wired. They existed in the documentation. They appeared in the capability list. And when an agent called them, nothing happened.

This is Miessler's inversion problem, hit one step before you can even apply the inversion. He argues that quality engineering shouldn't be sprinkled on top of the graph — it should be embedded as decision points inside every node. But you cannot embed what doesn't exist. The tools that claimed to provide GOAP planning, embedding analysis, MinCut routing, and coverage gap detection were performing the theater of capability without delivering the capability.

The fix in v3.7.6: a new qe-tool-bridge.ts that wires all twenty-six tools to real implementations.

Quality engineering embedded in a graph of operations only works if the embedding is real. An entry in a capability list is not a quality gate. A wired, tested, callable tool is.
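What "wired" means mechanically can be sketched in a few lines. Everything here is illustrative: the names and the API are invented, not the actual qe-tool-bridge.ts. The core move is that a declared tool with no handler fails loudly instead of silently doing nothing, which is how ghost tools hide.

```typescript
// Hypothetical sketch of a tool bridge that verifies its own wiring.
type ToolHandler = (args: Record<string, unknown>) => unknown;

class ToolBridge {
  private handlers = new Map<string, ToolHandler>();

  constructor(private declared: string[]) {}

  wire(name: string, handler: ToolHandler): void {
    this.handlers.set(name, handler);
  }

  // The gate: every tool in the capability list must have a real implementation.
  unwired(): string[] {
    return this.declared.filter(name => !this.handlers.has(name));
  }

  call(name: string, args: Record<string, unknown> = {}): unknown {
    const handler = this.handlers.get(name);
    // Fail loudly rather than no-op; a silent no-op is completion theater.
    if (!handler) throw new Error(`Tool "${name}" is declared but not wired`);
    return handler(args);
  }
}
```

A CI step that asserts `unwired()` is empty would have caught the twenty-six ghost tools before any agent ever called them.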


The Trickster in the Pipeline

The most important release this week was v3.7.8.

I had been watching a pattern across dozens of AQE sessions: when I asked the fleet whether something was ready to ship, the answer leaned toward yes. Not always. Not obviously. But with a frequency that felt like structural bias rather than honest assessment. The agents were optimized to be helpful. In quality contexts, helpful and honest are not the same thing.

This is Miessler's Ideal State Management problem at the quality layer. ISM requires honest current-state snapshots — what the system actually measures against what perfect looks like. But if your measurement agents are sycophantic, your current-state snapshot flatters you. You are not hill-climbing toward an ideal state. You are receiving a performance of proximity to the ideal state. Which is completion theater at the architecture level.

The structural fix is not to prompt the agents to be more critical. Prompting is noise that degrades across sessions.

The structural fix is to build a system whose entire mandate is to disagree.

Loki-Mode ships seven features, all on by default and individually opt-out:

  • Anti-sycophancy scorer — detects rubber-stamp consensus across agents via four weighted signals: verdict unanimity, Jaccard reasoning similarity, confidence uniformity, and issue count consistency. When all agents agree too strongly, that agreement is suspicious, not reassuring. Severe sycophancy triggers a Devil's Advocate review.
  • Blind review orchestrator — runs N parallel test generators with varied temperatures, deduplicates by Jaccard similarity on tokenized assertions, and returns only what's genuinely distinct.
  • Test quality gates — structural validation that catches tautological assertions, empty test bodies, missing source imports, and mirrored assertions. Each generated test gets a quality score from 0 to 100.
  • EMA calibration — exponential moving average tracks per-agent success rate and derives dynamic voting weights, floor 0.2, ceiling 2.0, persisted to SQLite across sessions.
  • Edge-case injection — queries the learning store for proven edge cases ranked by success rate, and injects the top-N into test generation prompts before the LLM sees them.
  • Complexity-driven team composition — analyzes code across eight dimensions and assembles the agent team accordingly. A high security score adds a security auditor. A high concurrency score adds a chaos engineer.
  • Auto-escalation — three consecutive failures auto-promote agent tier from Haiku to Sonnet to Opus. Five consecutive successes auto-demote for cost optimization.
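A minimal sketch of how a rubber-stamp detector along these lines could work. The four signals match the list above; the tokenization, weights, and scaling are invented for illustration and are not AQE's actual implementation.

```typescript
// Hypothetical anti-sycophancy scorer: high score = suspicious agreement.
interface Review {
  verdict: "pass" | "fail";
  reasoning: string;
  confidence: number; // 0..1
  issueCount: number;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter(t => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : inter / union;
}

// Population standard deviation, used to measure how tightly values cluster.
function spread(values: number[]): number {
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  return Math.sqrt(values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length);
}

function sycophancyScore(reviews: Review[]): number {
  const n = reviews.length;
  // Signal 1: verdict unanimity, the share of reviews matching the majority verdict.
  const passes = reviews.filter(r => r.verdict === "pass").length;
  const unanimity = Math.max(passes, n - passes) / n;
  // Signal 2: mean pairwise Jaccard similarity of tokenized reasoning.
  const tokens = reviews.map(
    r => new Set(r.reasoning.toLowerCase().split(/\W+/).filter(Boolean)),
  );
  let sum = 0, pairs = 0;
  for (let i = 0; i < n; i++)
    for (let j = i + 1; j < n; j++) { sum += jaccard(tokens[i], tokens[j]); pairs++; }
  const similarity = pairs ? sum / pairs : 0;
  // Signal 3: confidence uniformity; a near-zero spread of confidences is suspicious.
  const uniformity = 1 - Math.min(1, 2 * spread(reviews.map(r => r.confidence)));
  // Signal 4: issue count consistency; everyone finding the same number of issues.
  const consistency = 1 / (1 + spread(reviews.map(r => r.issueCount)));
  // Invented weights; the point is that stronger agreement raises the score.
  return 0.3 * unanimity + 0.3 * similarity + 0.2 * uniformity + 0.2 * consistency;
}
```

Four identical reviews score a perfect 1.0 on this metric, which is exactly when a Devil's Advocate review should be triggered.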

The naming is deliberate. Loki is the trickster — the one who finds the flaw, exposes the assumption, challenges the consensus. Every quality process needs one. Most don't have one that runs automatically.

Until you have a mechanism that actively searches for reasons your system is not good enough, you are not running a quality gate. You are running a confirmation service.
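To make the escalation and calibration rules concrete, here is a hedged sketch. The thresholds (three consecutive failures promote, five consecutive successes demote, a voting-weight floor of 0.2 and ceiling of 2.0) come from the feature list above; the class shape, the EMA alpha, and the mapping from success rate to voting weight are invented.

```typescript
// Hypothetical sketch of tier auto-escalation and EMA-derived voting weights.
const TIERS = ["haiku", "sonnet", "opus"] as const;
type Tier = typeof TIERS[number];

class TierTracker {
  private failures = 0;
  private successes = 0;

  constructor(private tier: Tier = "haiku") {}

  record(success: boolean): Tier {
    if (success) {
      this.successes++; this.failures = 0;
      // Five consecutive successes demote one tier for cost optimization.
      if (this.successes >= 5) { this.demote(); this.successes = 0; }
    } else {
      this.failures++; this.successes = 0;
      // Three consecutive failures promote one tier for more capability.
      if (this.failures >= 3) { this.promote(); this.failures = 0; }
    }
    return this.tier;
  }

  current(): Tier { return this.tier; }

  private promote() {
    const i = TIERS.indexOf(this.tier);
    this.tier = TIERS[Math.min(i + 1, TIERS.length - 1)];
  }

  private demote() {
    const i = TIERS.indexOf(this.tier);
    this.tier = TIERS[Math.max(i - 1, 0)];
  }
}

// Exponential moving average of per-agent success; alpha is an assumption.
function ema(prev: number, sample: number, alpha = 0.2): number {
  return alpha * sample + (1 - alpha) * prev;
}

// Floor 0.2 and ceiling 2.0 per the feature list; the 2x mapping is invented.
function votingWeight(successRate: number): number {
  return Math.min(2.0, Math.max(0.2, 2 * successRate));
}
```

The reset-on-opposite-outcome behavior matters: a single success wipes the failure streak, so only genuinely consecutive failures escalate.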


Twelve Languages, One Brain

Miessler's inversion — QE as use cases inside AI rather than AI sprinkled on top of QE — requires that quality coverage reaches every node in the graph.

A graph that produces Go microservices, Rust data pipelines, Kotlin Android apps, Java backend services, Swift iOS clients, and Flutter cross-platform interfaces cannot have a quality layer that only speaks TypeScript. That's not QE designed into the graph. That's QE designed into one corner of the graph and ignored everywhere else.

V3.7.9 ships eight new test generators: Go, Rust, Kotlin, Java, Swift, Flutter, React Native, and C#. Each produces idiomatic tests — Go table-driven tests, Rust ownership analysis, Swift Testing macros, not generic patterns dressed in the target language's syntax.

It also ships a compilation validation loop: generated tests must compile before they're returned. A test that won't compile is not a test. It's a test-shaped artifact.
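The loop itself can be sketched abstractly. In this sketch, `generate` and `compile` are stand-ins for the real generator and the per-language compiler invocation; the retry count and the error-feedback shape are assumptions, not AQE's actual internals.

```typescript
// Hypothetical compile-before-return loop for generated tests.
interface CompileResult { ok: boolean; errors: string[] }

function generateValidatedTest(
  generate: (feedback: string[]) => string,
  compile: (source: string) => CompileResult,
  maxAttempts = 3,
): string | null {
  let feedback: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const source = generate(feedback);
    const result = compile(source);
    // Only a test that compiles leaves the loop.
    if (result.ok) return source;
    // Feed compiler errors back into the next generation attempt.
    feedback = result.errors;
  }
  // Refuse to return a test-shaped artifact.
  return null;
}
```

Returning `null` rather than the last broken attempt is the design choice that matters: the caller is forced to handle failure instead of receiving something that merely looks like a test.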

"Tree-sitter is regex, not tree-sitter." — the brutal-honesty-review finding that caught an implementation labeled as tree-sitter parsing but actually running regex against source code. The terminology was present. The implementation was absent.

We took the pragmatic option — real tree-sitter for the primary cases, regex fallback for the long tail — and filed a GitHub issue for the remaining gap rather than claiming completion we hadn't earned.


When the CLI Crashes on the User's Machine

Two of the nine releases this week were hotfixes.

V3.7.12 fixed a crash that only appeared when users installed AQE globally with npm install -g. The command aqe --version — the simplest possible operation, the first thing anyone runs after install — failed with:

ERR_MODULE_NOT_FOUND: Cannot find package 'typescript'

TypeScript's compiler API was being loaded eagerly at bundle startup via a top-level ESM import, even for commands that don't need it. The fix was simple: lazy-load the compiler API through a proxy that only triggers when code analysis features are actually invoked.
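The shape of that fix, deferring a heavy dependency behind a Proxy until something actually touches it, looks roughly like this. The real fix lazy-loads TypeScript's compiler API; this sketch substitutes a fake heavy module so it stays self-contained, and the helper names are mine.

```typescript
// Hypothetical lazy-loading proxy: the wrapped module loads on first property access.
type Loader<T> = () => T;

function lazy<T extends object>(load: Loader<T>): T {
  let cached: T | undefined;
  return new Proxy({} as T, {
    get(_target, prop) {
      // Load the real module only when a property is first accessed, then cache it.
      cached ??= load();
      return (cached as any)[prop];
    },
  });
}

// In AQE the loader would wrap the TypeScript compiler API import;
// here a fake heavy module keeps the sketch testable.
let loadCount = 0;
const heavy = lazy(() => {
  loadCount++; // counts how many times the module was "loaded"
  return { analyze: (src: string) => src.length };
});
```

With this shape, `aqe --version` never touches `heavy`, so the module never loads; a code-analysis command triggers exactly one load on first use.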

V3.7.10 fixed 170+ skill files with Windows-style CRLF line endings that were causing skill-lint YAML parsers to fail on ---\r delimiters.
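The underlying failure is easy to reproduce: a "---" fence followed by a carriage return is not the delimiter a strict parser expects. A minimal sketch of the normalization follows; the helper names are mine, not skill-lint's.

```typescript
// Strip CRLF so a YAML frontmatter fence parses as "---", not "---\r".
function normalizeLineEndings(text: string): string {
  return text.replace(/\r\n/g, "\n");
}

// A strict delimiter check like a linter might apply: the first line must be exactly "---".
function frontmatterDelimiterOk(text: string): boolean {
  return text.split("\n")[0] === "---";
}
```

Running the check before and after normalization shows why 170+ files failed at once: every CRLF file's opening fence was really `---\r`.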

Agent swarms can produce substantial, well-tested features in parallel. But the integration surface — the seam where code meets the user's actual environment — remains brittle in ways that don't show up until someone installs it on their machine. The QE function at the integration seam is still irreplaceable.


The Loki Experiment

Outside the AQE development work, I've been studying Steve Yegge's Gastown framework and testing with other coding agents — OpenCode, Alibaba Cloud's offering, and a few others still in early evaluation.

The most striking experiment so far: I gave Loki-mode a PRD. It finished an MVP in approximately two hours.

I haven't run AQE verification on it yet. I haven't even run it to see what it actually produced or what shape the code is in. I need to verify before I can say anything meaningful about the quality of the output.

And that's precisely the point.

Two hours to an MVP is no longer the bottleneck. The bottleneck is now verification — understanding what was produced, whether it meets the PRD, whether the tests cover real behavior, and whether the architecture holds. The bottleneck is judgment.

Miessler argues that in the post-corporate world, you broadcast your full capabilities and get paid for being yourself. A QE practitioner who competes on test script volume is competing in a market that no longer exists. A QE practitioner who owns the verification layer — who can look at a two-hour MVP and determine whether it's production-ready — is offering something that doesn't reduce to a prompt.

Judgment drain — the gradual erosion of human judgment capacity through over-reliance on AI for decisions that should develop human skill — is the specific risk this environment creates. The adversarial quality gate isn't just a feature in v3.7.8. It's an argument about what the QE practitioner's job is in the transition.


The Governance Gap

V3.7.11 completed the integration of all eight governance modules with @claude-flow/guidance. A brutal-honesty-review had revealed that @claude-flow/guidance was ghost code — loaded in the dependency manifest, referenced in the documentation, not actually wired.

"Wait, why @claude-flow/guidance is ghost code? We must properly integrate this, it is one of the important features."

The fix took a full day: continue-gate, memory-write-gate, adversarial-defense, deterministic-gateway, proof-envelope, shard-retriever, evolution-pipeline, and trust-accumulator all load and wire their @claude-flow/guidance counterparts. Also added collusion detection — the ability to identify when agents are coordinating toward shared conclusions rather than independently evaluating.

Sycophancy is one agent agreeing with the user. Collusion is multiple agents agreeing with each other in ways that shouldn't be structurally possible if they were reasoning independently. Both are quality gate failures. Both need adversarial architecture to surface them.


What the Great Transition Wants From Quality Engineering

Miessler's mental model has a through-line: Ideal State Management. Define what perfect looks like. Measure the current state continuously. Use AI to close the gap.

The quality engineer's specific contribution to this framework is the one Miessler doesn't name: making the current-state measurement honest.

Ideal-state management that relies on sycophantic agents produces flattering snapshots that make you appear closer to the ideal than you are. The gap appears to be closing, even though it isn't. The system optimizes for appearing to close the gap rather than closing it.

This is completion theater at the ISM level. And it's the most dangerous version of the pattern, because the stakeholders are receiving dashboards that confirm their investments are working. The quality ratio degrades invisibly. The gap widens behind the metrics.

The gate that fights back — adversarial quality gates, collusion detection, blind-review orchestration, auto-escalation based on consecutive failures — is what makes ISM honest. It's the adversarial probe that forces the gap measurement to be more real.

Three things remain true from the trench this week:

The generation layer is fast. Nine releases. Multi-language test generation. Governance integration. A CLI that used to crash on install no longer does.

The verification layer is the work. Two hotfixes were needed because the integration surface — the user's clean environment — doesn't behave the same as the development environment.

The judgment layer is still human. The Loki experiment produced a two-hour MVP I haven't verified yet. Until I verify it, I don't know what it produced. The transition doesn't change that. It makes it more important.


What Didn't Go Perfectly

The skill count is still not right. I've corrected the AQE skill count — 78 QE skills, distinct from Claude Flow and platform agents — across multiple CLAUDE.md files, release notes, and article drafts. It keeps drifting. A new session reads the same file and counts differently. V3.7.9's prepare-assets.sh fix (now including all 78 QE skills, previously missing 24) is a structural fix rather than a documentation fix. Whether it holds is something the next ten sessions will tell.

The README rewrite in v3.7.10 took the document from 1,097 lines to approximately 280. The old README was a version-by-version feature dump intended for developers. The new one is organized around what users can do, in what order, with what outcomes. These are not the same document. They were never the same document.

Embedding dimension standardization is applied to the primary path — all at 384-dim, all-MiniLM-L6-v2. The deep analysis of the full ReasoningBank for stale 768-dim vectors is scheduled but not complete. I flagged it in the v3.7.9 release notes rather than claiming it was done.

The database reset on March 1 was caused by running aqe init in the source repository — the source repo is not a consumer project, and aqe init treated it as one. A documentation fix, not a code fix. The distinction matters.


The Orchestra Positions Itself

The Conductor article ended with: "The Conductor needs to take care of himself, too."

The Portable Orchestra ended with: "The orchestra travels."

This week, the orchestra learned where it's traveling to. Not just across projects or platforms. Into the transition itself, as the verification layer that makes the transition safe to ship.

Miessler is right that knowledge is leaving private minds and entering infrastructure. The QE practitioner's response isn't to hold knowledge more tightly. It's to build the adversarial mechanism that keeps the infrastructure honest after the knowledge has arrived.

The generation is fast. The verification is the craft. The judgment is what you're still for.

V3 Journey · Great Transition · Loki-Mode · Adversarial QE · Multi-Language · Governance · Ideal State Management
