The morning after the architecture rewrite, I ran the full test suite again. Green across the board. Every test passed.
I tried to use the system. The Queen Coordinator "orchestrated" by publishing events into the void. The ReasoningBank stored patterns that were not retrieved. The TinyDancer computed routing tiers that nobody checked.
The agents had built a system in which every piece worked in isolation. And none of it worked together.
What followed was ten days of detective work—not writing code, but proving what the code wasn't doing. Eight releases. At least ten forensic investigations. More humbling discoveries than I'd like to admit.
This is what I learned about the gap between "the tests pass" and "it actually works."
The Detective's Toolkit
Early in the week, I kept running /brutal-honesty-review, a skill that channels Linus Torvalds for technical precision, Gordon Ramsay for quality standards, and James Bach for methodology rigor. It would tell me, in no uncertain terms, that things were broken.
Useful. I knew something was broken. But I didn't know where, or why, or what to fix first.
Then I tried something different:
"Use /sherlock-review and analyze implementation, integration and persistence for ReasoningBank Learning System, and if qe v3 agents are using it."
What came back wasn't opinion. It was evidence:
- createQEReasoningBank() called in handleFleetInit()
- coherenceService parameter optional, defaults to undefined
- Without coherenceService, verifyCoherence() returns early with fallback
- WASM engines never load because coherence check never runs
- Tests mock the coherence response, hiding the initialization failure
Line by line. File by file. An evidence chain from symptom to root cause.
Forensic investigation beats opinion every time.
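To make that evidence chain concrete, here's a minimal TypeScript sketch of the failure pattern it describes. The class and function names are hypothetical stand-ins for the real v3 code: an optional dependency defaults to undefined, the verification step falls back silently, and the unit test mocks the dependency so it never notices.

// Hypothetical sketch of the pattern Sherlock found -- not the actual v3 source.
interface CoherenceService {
  verify(pattern: string): Promise<boolean>;
}

class ReasoningBank {
  // The optional parameter is the trap: callers can "succeed" without it.
  constructor(private coherenceService?: CoherenceService) {}

  async verifyCoherence(pattern: string): Promise<boolean> {
    if (!this.coherenceService) {
      return true; // silent fallback: no error, no log, no WASM engine ever loaded
    }
    return this.coherenceService.verify(pattern);
  }
}

// Fleet init forgets to pass the service, so every bank runs degraded.
function handleFleetInit(): ReasoningBank {
  return new ReasoningBank(); // compiles, runs, "works"
}

// The unit test injects a mock, exercising the one path production never takes:
//   const bank = new ReasoningBank({ verify: async () => true });
//   expect(await bank.verifyCoherence("p")).toBe(true); // green, and misleading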
The Database That Learned Nothing
On January 27, I noticed something odd in my DevPod status line.
The numbers hadn't changed in days. In a system explicitly designed to learn from every interaction.
Sherlock found the problem: two databases.
- .agentic-qe/memory.db in the project root (where MCP tools wrote)
- v3/.agentic-qe/memory.db in the v3 folder (where domain services read)
The system captured experiences to one location and looked for them in another. No errors. No warnings. Just silent failure to learn.
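In sketch form, the bug looks something like this. The path-resolution code is hypothetical, not the actual v3 source; the point is that the writer and the reader each resolve a perfectly valid path, and neither ever fails.

// Hypothetical illustration of the split-brain memory bug.
import * as path from "node:path";

// MCP tool side: resolves from the process working directory (project root).
const writePath = path.resolve(process.cwd(), ".agentic-qe/memory.db");

// Domain service side: resolves relative to its own module inside v3/.
const readPath = path.resolve(__dirname, "..", ".agentic-qe", "memory.db");

// Both databases are created on demand, so every write and every read
// "succeeds" -- just against different files. Nothing ever throws.
console.assert(writePath !== readPath, "two databases, zero learning");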
After consolidation, by January 30, those status line numbers were finally moving.
What this means for users: The fleet now actually learns from your sessions. Patterns from successful investigations feed into future recommendations. The ReasoningBank builds genuine institutional memory.
The Queen Who Finished First
By January 29, I noticed something strange in my logs. When I'd spawn a QE swarm—multiple agents working on test generation, coverage analysis, security scanning—the Queen Coordinator would finish before any of her agents.
The orchestrator completes before the orchestra. That's not how conducting works.
I ran Sherlock:
"Analyze how qe-queen is coordinating qe agents swarms."
The findings: Queen spawned agents. Queen published "task assigned" events. Queen reported success. Queen terminated.
Meanwhile, the agents kept working. Nobody tracked their progress. Nobody collected their results. Nobody knew when they finished.
The Queen wasn't orchestrating. She was announcing and leaving.
The fix in v3.3.5: real orchestration via MCP tools. The Queen now spawns agents, monitors their work, collects their outputs, and only reports completion when the swarm actually completes.
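In sketch form (hypothetical helper names, not the actual MCP tool calls), the difference between the old Queen and the new one:

// Hypothetical sketch of fire-and-forget vs. real orchestration.
type AgentTask = { agent: string; task: string };
type AgentResult = { agent: string; output: unknown };

declare function spawnAgent(t: AgentTask): Promise<AgentResult>; // assumed helper
declare function publishEvent(name: string, payload: unknown): void; // assumed helper

// Before: announce and leave. The Queen "finishes" immediately.
function orchestrateBefore(tasks: AgentTask[]): void {
  for (const t of tasks) {
    void spawnAgent(t);               // promise dropped on the floor
    publishEvent("task:assigned", t); // event into the void
  }
  publishEvent("swarm:complete", {}); // reported before anyone finished
}

// After (v3.3.5 behavior, sketched): spawn, await, collect, then report.
async function orchestrateAfter(tasks: AgentTask[]): Promise<AgentResult[]> {
  const results = await Promise.all(tasks.map(spawnAgent)); // wait for the orchestra
  publishEvent("swarm:complete", { results });              // only now is it done
  return results;
}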
What this means for users: When you run qe-queen-coordinator now, you get coordinated results—not scattered outputs from agents that may or may not have finished.
The Speed You Can Feel
Some improvements are invisible. Others hit you immediately.
v3.2.3 replaced linear similarity search with HNSW (Hierarchical Navigable Small World) indexing. The technical version: O(n) became O(log n). The practical version: when you search for similar patterns across thousands of stored experiences, it's now 150x to 12,500x faster.
The difference between waiting three seconds and waiting less than a millisecond.
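For intuition, here's a sketch of what that change replaces. The brute-force scan below is real enough to run; the HnswIndex interface is a hypothetical stand-in for the index, not the actual library API.

// Sketch of what v3.2.3 replaced: a brute-force O(n) similarity scan.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function linearSearch(query: number[], patterns: number[][], k: number): number[] {
  return patterns
    .map((p, i) => ({ i, score: cosine(query, p) })) // touches every stored vector
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.i);
}

// Hypothetical stand-in for the HNSW index: the layered graph lets a query
// walk roughly O(log n) neighbors instead of scanning all n vectors.
interface HnswIndex {
  add(id: number, vector: number[]): void;
  search(query: number[], k: number): number[];
}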
v3.1.0 added browser swarm coordination: four browsers working in parallel instead of one running serially. The same accessibility audit, security scan, and visual regression test, now 4x faster.
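The mechanism is ordinary promise concurrency; runBrowserCheck below is a hypothetical stand-in for whatever each browser in the swarm actually executes.

// Hypothetical sketch: the same checks run one after another vs. in parallel.
declare function runBrowserCheck(
  check: "accessibility audit" | "security scan" | "visual regression"
): Promise<void>;

// Before: a single browser works through the list -- total time is the sum.
async function runSerially(): Promise<void> {
  await runBrowserCheck("accessibility audit");
  await runBrowserCheck("security scan");
  await runBrowserCheck("visual regression");
}

// After (v3.1.0, sketched): browsers run side by side -- total time is the slowest check.
async function runAsSwarm(): Promise<void> {
  await Promise.all([
    runBrowserCheck("accessibility audit"),
    runBrowserCheck("security scan"),
    runBrowserCheck("visual regression"),
  ]);
}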
v3.3.1 jumped quality scores from 37 to 82. Not through new features, but through systematic cleanup. Maintainability index from 20 to 88. The codebase became easier to extend, so future improvements can come faster.
What Sherlock Kept Finding
I ran forensic investigations at least ten times in ten days. Here's the pattern of what was broken:
Queen Coordinator: "domainPlugins always undefined. handleEvent errors silently ignored. Domains don't report completion back to Queen."
Dream Cycles: "Dreams scheduled but never triggered. No cross-domain insight broadcasting. Soft delete implemented, but dreams never resurface."
TinyDancer Routing: "Routing tier computed correctly. Task executor ignores tier and uses default LLM anyway."
MinCut Consensus: "Test-generation domain has full integration. Nine other domains have the code but never call verifyFinding()."
Every component implemented. Every test passing. None of it connected.
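The TinyDancer finding is the cleanest example of that pattern. A hypothetical sketch (not the actual v3 code): the tier is computed, covered by tests, and then never consulted.

// Hypothetical sketch of the "computed, then ignored" routing bug.
type Tier = "tiny" | "standard" | "frontier";

declare function callLLM(model: string, prompt: string): Promise<string>; // assumed helper

function computeRoutingTier(task: { complexity: number }): Tier {
  if (task.complexity < 0.3) return "tiny";
  if (task.complexity < 0.7) return "standard";
  return "frontier";
}

async function executeTask(task: { complexity: number; prompt: string }): Promise<string> {
  const tier = computeRoutingTier(task); // correct, unit-tested, and then dropped
  void tier;
  return callLLM("default-model", task.prompt); // the executor never looks at the tier
}

// A test that asserts only on computeRoutingTier() stays green forever,
// while every task still routes to the default model.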
Each Sherlock investigation produced an evidence chain.
Each chain led to a specific fix.
Each fix shipped in the next release.
Community and Crashes
Two people made these ten days more interesting.
Nate filed bugs within hours of v3.0's release. MODULE_NOT_FOUND on the version command. Timeout problems. Fresh install UX issues. Each report was fixed in a subsequent release. Fast feedback from real usage beats any amount of internal testing.
Lalit submitted a PR with quality criteria analysis agents, cross-phase memory, and accessibility auditing skills. Good ideas. Complete architectural mismatch—TypeScript agent infrastructure when v3 uses markdown definitions, file-based memory when we have UnifiedMemoryManager, code scattered across root folders when v3 consolidates.
I spent most of January 29 refactoring the contribution. Extracting the valuable logic, adapting it to existing patterns, and deleting the incompatible scaffolding. Lalit's QCSD ideation workflow and accessibility skills are now part of v3. But integration took longer than fresh implementation would have.
Community contributions add real value. They also surface hidden complexity in "just wire it up."
Meanwhile, DevPod crashed at least six times. Out-of-memory when spawning too many browser processes. Stuck processes that never terminated. Each crash lost uncommitted work.
The solution was behavioral: treat every interaction as potentially the last before a crash. Commit early. Commit often.
Where Things Stand
v3.3.5 as of January 30:
- 51 QE agents, including specialized TDD subagents
- 63 QE skills, including Sherlock Review and community contributions
- 12 bounded contexts with working MinCut/Consensus integration
- Real orchestration where the Queen actually coordinates
- Self-learning that persists in the right database
- Similarity search that returns in milliseconds, not seconds
The integration gaps are largely closed. Sherlock finds fewer critical issues with each pass.
But I'm not pretending this is done. Every release surfaces new integration problems. Every feature exposes assumptions that don't hold in production.
The Uncomfortable Lesson
I spent more time tracing integration failures than writing new features. That's the honest version of ten days.
Unit tests don't catch integration failures. Your tests can pass while your system is fundamentally broken.
Self-learning systems fail silently. No error when experiences don't persist. The system just doesn't learn.
Evidence chains beat opinions. "Something is wrong" is less useful than "here's the exact execution path proving it."
Sometimes the most valuable work is proving what isn't working.
Try It Yourself
# Initialize the QE fleet
npx agentic-qe@latest init --auto
# Orchestrated QE swarm (with real coordination now)
claude "Use qe-queen-coordinator to orchestrate: test generation,
coverage analysis, security scan, and quality gate"
# Run your own forensic investigation
claude "Use /sherlock-review and analyze implementation,
integration and persistence for [your feature]"
# Get quality recommendations
claude "Use quality-criteria-recommender to analyze this
[API/feature/module] and suggest testing approaches."
What integration gaps are hiding behind your passing tests?