February 10, 2026. 9:00 AM CET. I ran /insights in Claude Code this morning. The report is generated. I've read it several times and needed some time to let the insights it provided sink in.
285 messages. 32 sessions. 10 days. Eight versions shipped. 88% satisfaction rate—and 17 wrong-approach corrections.
This is the story of what those numbers actually mean. Not a release changelog. Not an AI success story. A story about what happens when your AI development partner holds up a mirror, and you don't entirely like what you see.
The Mirror
In my last article ("The Case of the Passing Tests"), I documented a 10-day investigation into why a complete system failed despite all unit tests passing. By January 30, we had v3.3.5: 51 agents, 63 skills, 12 integrated contexts, and working orchestration. The uncomfortable truth I ended with: each release surfaces new integration problems.
I had no idea how prophetic that would become.
Between February 1 and February 10, I shipped v3.4.0 through v3.6.1. Along the way, I lost my entire Claude Code development history in a DevPod crash—weeks of instructions and context, gone. A couple of days later, Anthropic released /insights, a new feature that analyzes your actual working patterns from Claude Code session data.
If only I'd had that before the crash.
But what /insights captured from these 10 days was enough. It analyzed 32 sessions and produced an HTML report that didn't tell me what I thought I was doing. It told me what I actually was doing.
My most common tasks: debugging, bug fixing, and checking task results. Not building features. Not writing tests. Fixing things that should have been right the first time.
My median response time: 93 seconds. My average: 285.8 seconds. That gap—the long tail of thinking time—is the cognitive cost of supervising agents. Most interactions are quick corrections. Some are deep investigations where I'm staring at evidence chains, trying to understand how things went wrong.
And the friction patterns the report identified? Those became the structure for this article. Because each one is a story about the gap between what AI-assisted development promises and what it actually costs.
Friction Pattern #1: Incorrect Scoping
The /insights report flagged this:
The question I kept asking: "Why 97+ files, when we have only 63 QE skills?"
Three times across different sessions. Claude counted all skills in the system—including Claude Flow platform skills—instead of just the AQE skills I was asking about. Each time, work got done on the wrong scope. Each time, I had to interrupt, correct, and restart.
The fix seems obvious in retrospect: add explicit scope constraints to CLAUDE.md. But obvious fixes only become obvious after you've burned time on the same mistake three times. And the broader lesson cuts deeper than this one example.
When you work with AI agents, the scope of "what are we talking about" can't live in your head. If it's not explicit in the context the agent receives, it doesn't exist. Your mental model of the project is irrelevant—the agent works from what it can see.
I've been preaching context-driven testing for twelve years. Turns out, context-driven development with AI is the same discipline: if the context isn't explicit, the behavior won't be correct. No amount of agent intelligence compensates for an ambiguous scope.
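What such a constraint looks like is almost embarrassingly small. Something along these lines, as a sketch rather than the literal CLAUDE.md wording:

```markdown
## Scope constraints

- "Skills" means the 63 AQE skills, not the Claude Flow platform skills, unless I explicitly widen the scope.
- Claude Flow platform skills are out of scope for counts, audits, and upgrade work.
- If a count disagrees with the expected 63, stop and ask before acting on it.
```

Three bullet points against three interrupted sessions. The economics are not subtle.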
Friction Pattern #2: Wrong Workflows
The report captured a pattern that consumed at least 8 of my 32 sessions:
Claude picks the wrong release workflow (publish-v3-alpha.yml instead of npm-publish.yml). The user corrects it. On the next release, Claude uses the wrong workflow again.
The release process lived in my head—which workflow to use, which test configuration to run, which PR description style to follow. Claude had to guess. It guessed wrong often enough to burn real time.
The /insights report even suggested the fix: create a custom /release skill that encodes the exact process. I built it. Claude still wasn't using it consistently. The irony of building a tool to fix a communication problem, then failing to communicate that the tool exists, isn't lost on me. I'll be extending the CLAUDE.md instructions again.
But this friction pattern reveals something about the human side of AI-assisted development that nobody talks about. We expect the AI to remember our preferences across sessions. It can't. Every new session is a blank slate with only CLAUDE.md and project files as memory. The processes that feel obvious to you because you've done them twenty times are invisible to an agent seeing the project fresh.
The discipline required isn't just "document your processes." It's "document them in the format your AI partner actually reads." And then verify it reads them.
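For the /release skill itself, the sketch below shows the general shape. The frontmatter fields and file layout are my assumption about the skill format, and the steps are paraphrased from this project's process rather than copied from the real file:

```markdown
---
name: release
description: Release AQE to npm. Use for any request to publish, release, or ship a version.
---

1. Bump the version in package.json; never hardcode version strings anywhere else.
2. Run the full release test configuration, not the quick local suite.
3. Publish through the npm-publish.yml workflow, never publish-v3-alpha.yml.
4. Write the PR description in the standard release-notes style.
```

The skill only helps if the agent knows to reach for it, which is exactly the CLAUDE.md pointer I keep forgetting to add.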
Friction Pattern #3: Hardcoded Values and Completion Theater
The report's wry observation:
February 5th. We shipped v3.5.3 with a fix for the aqe init --upgrade flow. The version was hardcoded as '3.0.0' in multiple places. The bug wasn't caught until after publishing to npm. v3.5.4 exists solely to patch v3.5.3.
The code passed review. The tests passed. But the underlying assumption—that version strings should be dynamically read from package.json—was never validated.
This is what "completion theater" looks like at the code level. The agent produces a fix that looks complete. The fix addresses the surface problem. But it introduces a hardcoded value that will break at the next version bump—because the agent optimized for "make this work now" rather than "make this work correctly."
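The correct version of that fix is a handful of lines. A sketch of the principle, assuming the usual Node project layout rather than the actual aqe source structure:

```typescript
// version.ts - a sketch: resolve the version from package.json at runtime
// instead of hardcoding a string like '3.0.0' that rots on the next bump.
import { readFileSync } from "node:fs";
import { join } from "node:path";

export function getPackageVersion(packageDir: string = process.cwd()): string {
  const manifestPath = join(packageDir, "package.json");
  const pkg = JSON.parse(readFileSync(manifestPath, "utf8")) as { version?: string };
  if (!pkg.version) {
    throw new Error(`No "version" field in ${manifestPath}`);
  }
  return pkg.version;
}
```

Nothing clever. The point is that "make this work correctly" was never expensive; it just wasn't what the agent optimized for.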
The /insights report tallied 17 wrong-approach incidents across 32 sessions. That's more than once every two sessions where I had to intervene and say, "that's addressing the symptom, not the cause." The DB incident on February 2nd was the clearest example: Claude spent three rounds debugging a statusline that showed zeros, checking database paths, verifying tables, and confirming data was present. Everything looked correct.
The root cause? A dependency mismatch. We were using sqlite3 (async, callback-based) in one module and better-sqlite3 (sync, direct) in another. The statusline code used the wrong driver, so its queries came back empty while the actual data sat untouched in the correct database.
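The two drivers don't just differ in style; they differ in when the data exists. A sketch of the mismatch, with a hypothetical patterns table standing in for the real schema:

```typescript
// A sketch of the driver mismatch; the "patterns" table is hypothetical.
import Database from "better-sqlite3"; // sync: queries return rows directly
import sqlite3 from "sqlite3";         // async: rows only exist inside callbacks

const DB_PATH = ".agentic-qe/memory.db";

// better-sqlite3: the count is available on the very next line.
const syncDb = new Database(DB_PATH, { readonly: true });
const row = syncDb.prepare("SELECT COUNT(*) AS n FROM patterns").get() as { n: number };
console.log(`patterns: ${row.n}`);

// sqlite3: the same query delivers its row later, inside the callback.
// Statusline code written for the sync driver but wired to this one reads
// nothing at render time and happily prints zeros.
const asyncDb = new sqlite3.Database(DB_PATH, sqlite3.OPEN_READONLY);
asyncDb.get("SELECT COUNT(*) AS n FROM patterns", (err, r: { n: number }) => {
  if (err) throw err;
  console.log(`patterns: ${r.n}`);
});
```

Everything "looked correct" because each half was correct in isolation. The failure lived in the seam.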
Claude's first-pass fixes address symptoms. The vigilance required to redirect toward root causes is constant and exhausting. And that's the 88% satisfaction rate in action—high enough to be useful, low enough that you can never stop paying attention.
The Pattern Behind the Patterns
Here's what the /insights report couldn't articulate, but the data makes obvious: all three friction patterns share a root cause. The gap between what the human knows and what the agent can see.
Scope? In my head. Workflow? In my head. The principle that version strings should be dynamic? In my head.
Every correction I made was a transfer of implicit knowledge into an explicit context. Every time I said "not 97, only 63 AQE skills" or "use npm-publish.yml, not the alpha workflow" or "read the version from package.json, don't hardcode it"—I was teaching.
But teaching that doesn't persist. Not across sessions. Not even reliably within sessions. The /insights report is a diagnosis of this problem: here's how often your implicit knowledge collided with the agent's limited context, and here's what it cost you.
The fix isn't better AI. The fix is to better externalize your own knowledge. CLAUDE.md isn't documentation—it's the transmission medium between your expertise and the agent's execution. Every gap in that document is a future friction incident.
"Stop Agents, We Don't Need to Burn Tokens"
February 4th. I asked Claude to analyze which skills should be upgraded to a new trust tier system. It spawned a swarm of agents to work in parallel.
Then I watched the tokens climb.
"Stop agents, we don't need to burn tokens to create and then delete something."
Claude had started generating Tier 3 evaluation suites for skills that would ultimately remain at Tier 1. The work would be thrown away. But once agents are running, they keep running until someone intervenes.
This is the conducting problem. An orchestra conductor doesn't just start the music and leave—that was the Queen Coordinator bug I documented in my last article. A conductor watches, listens, and signals adjustments in real time. The /insights report confirms this is my actual working style: "highly supervisory and corrective." Not because I want to micromanage, but because agents without real-time supervision burn resources on wrong directions.
The solution is front-loading analysis before spawning parallel work. But in the heat of development, when you're eight sessions deep and momentum feels good, you forget. And tokens burn. The report's average response time of 285.8 seconds includes those moments where I'm watching a swarm run and realizing, three minutes too late, that I should have specified constraints before letting them loose.
Working with AI agents at this intensity is like running a restaurant kitchen during rush hour. Processes need monitoring. Resources need limits. And sometimes you just have to interrupt the whole line because one cook misread the order, and three others are building on that mistake.
The Disconnected Orchestra
Now for the story the title promised.
I wanted to build a QE platform that learns from every test run, every quality gate, every agent decision—and gets better automatically. The orchestra that tunes itself.
Midway through the 10 days, I noticed the statusline numbers hadn't changed:
Same numbers. Day after day. The learning system appeared active. It wasn't learning.
"Use /sherlock-review to analyze why our persistence stopped working. Analyze data in all tables in memory.db to check the last entry date for each table."
The investigation revealed a problem I thought we'd fixed in a previous release: data was being written to /workspaces/agentic-qe-new/v3/.agentic-qe/memory.db instead of /workspaces/agentic-qe-new/.agentic-qe/memory.db. The default database path was wrong. Again.
But the deeper problem wasn't the path. The TaskCompletedHook extracted patterns. The ReasoningBank stored them. But nothing connected the stored patterns back to agent behavior. The pipeline existed as individual components—each working in isolation, each passing its tests—with no wiring between them.
The orchestra had every instrument. Every musician knew their part. Nobody had connected the sheet music to the performance.
This is "The Case of the Passing Tests" at a higher level of abstraction. Last time, unit tests passed while integration failed. This time, the self-improvement system—the thing that was supposed to make integration failures self-correcting—had the same integration failure. The meta-irony writes itself.
v3.6.1 fixes this with TeammateIdle hooks that auto-assign pending tasks, TaskCompleted hooks that extract and train patterns into the ReasoningBank, a proper Pattern Store Adapter bridging extraction to storage, and a real promotePattern() implementation replacing what was previously a stub.
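For concreteness, here is the shape of that wiring as a hypothetical sketch. The names come from the release notes above; the interfaces, types, and the confidence threshold are mine, not the real AQE implementation:

```typescript
// learning-wiring.ts - a hypothetical sketch of the adapter that was missing.
interface ExtractedPattern {
  id: string;
  description: string;
  confidence: number;
}

interface TaskResult {
  taskId: string;
  succeeded: boolean;
  patterns: ExtractedPattern[];
}

interface ReasoningBank {
  store(pattern: ExtractedPattern): void;
  promotePattern(id: string): void; // previously a stub that did nothing
}

// Extraction and storage each worked in isolation; nothing pushed extracted
// patterns into the bank or promoted the ones worth reusing. The adapter is
// that missing seam.
class PatternStoreAdapter {
  constructor(private readonly bank: ReasoningBank) {}

  onTaskCompleted(result: TaskResult): void {
    for (const pattern of result.patterns) {
      this.bank.store(pattern);
      // Only promote patterns from successful tasks that clear a confidence bar.
      if (result.succeeded && pattern.confidence >= 0.8) {
        this.bank.promotePattern(pattern.id);
      }
    }
  }
}
```

The code is trivial. That's the uncomfortable part: the orchestra wasn't missing an instrument, only the glue.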
But I only found the disconnection because I kept asking "why isn't this working?" after the statusline showed stale numbers for days. No test caught it. No error surfaced it. The system was silently failing to learn, and the only detection mechanism was a human noticing that numbers weren't changing.
Self-learning systems fail silently. That's the lesson I keep learning, and the lesson the system itself couldn't learn—because the learning was broken.
"No Shortcuts, We Value Quality We Deliver"
I wrote this phrase so many times it became a mantra:
"Start a swarm to work on phase 1 tasks... Proper implementation, integration, verification, all implemented components must be properly integrated and verified working from the user perspective. Repeat until done. No shortcuts, we value quality we deliver to our users."
The reason? Claude tends to simulate success.
"No simulations, real user flow" was another correction I made repeatedly. Claude would describe which tests would verify, rather than running the actual tests. It would explain how the MCP daemon should start, rather than verifying that it actually started.
After every aqe init --auto, I'd see:
MCP daemon failed: error: unknown command 'mcp'
We fixed it. It broke again in the next version. We fixed it again. The happy path works, but each user's machine is a unique snowflake of potential failures—different environments, package managers, shell configurations.
The discipline required is exhausting. You have to keep asking: "Did you actually run this, or did you just describe running it?" That question is my most important contribution as the human in this loop. Not architectural vision. Not feature design. The relentless insistence on verification over narration.
What These 10 Days Actually Produced
Between the debugging, the workflow friction, and the session crashes—four crashes in ten days, each one requiring "check if there are some stuck processes in this devpod and stop them" before I could continue—here's the user-facing value that shipped:
v3.4.0 brought AG-UI, A2A, and A2UI protocol implementation for real interoperability with other agentic frameworks; all 12 domains enabled by default (fixing a silent V2→V3 migration bug that dropped 9 domains); and portable configuration eliminating hardcoded workspace paths.
v3.5.x delivered the complete QCSD 2.0 lifecycle skills (contributed by @fndlalit), with governance on by default, standalone learning commands for stats, export/import, and pattern consolidation, plus AISP-inspired features including ghost intent coverage, semantic anti-drift middleware, and asymmetric learning rates with a 10:1 penalty-to-reward ratio for failures.
v3.6.0 added enterprise integration testing—SOAP/WSDL, SAP RFC/BAPI/IDoc, OData, ESB/middleware, message broker, and Segregation of Duties testing (contributed by @fndlalit)—and a "No Exploit, No Report" pentest quality gate.
v3.6.1 wired the self-learning pipeline together with agent teams, fleet coordination through typed mailbox messaging, and HNSW O(log n) vector search, replacing linear scan stubs.
Eight versions in ten days. Each one fixing something the previous one broke or left unconnected.
What I'd Do Differently (According to the Mirror)
The /insights report suggested concrete improvements for CLAUDE.md:
One example: "Use the npm-publish.yml workflow (not publish-v3-alpha.yml)."
These are embarrassingly simple. That's the point. The highest-value improvements aren't architectural innovations—they're making implicit knowledge explicit. Documenting the release process immediately, not "when I have time." Testing in fresh environments every release, not just locally where my manual overrides hide configuration drift. Adding scope constraints to every counting task.
And using /insights monthly. Because the friction patterns compound silently until someone—or something—holds up the mirror.
The Conductor Needs Feedback Too
Here's the self-referential loop at the heart of this story:
I'm building a quality engineering platform designed to learn from its own work. The platform's self-learning pipeline was broken—components disconnected, patterns extracted but never fed back, the orchestra playing without listening to itself.
Meanwhile, /insights did for me what the ReasoningBank is supposed to do for agents: analyze actual behavior patterns, identify recurring friction, and suggest concrete corrections.
The conductor needs feedback just as much as the orchestra does.
This is what AI-assisted development looks like for quality engineers in 2026. Not replacement. Not magic. A feedback loop between human expertise and machine execution, where the hardest part isn't the technology—it's externalizing enough of your own knowledge that the loop can actually close.
The orchestra is starting to tune itself. But it needs a conductor who's willing to be tuned as well.
The Numbers
- 10 days, 32 sessions, 285 messages
- 8 versions shipped (v3.4.0 → v3.6.1)
- 88% satisfaction rate
- 17 wrong-approach corrections
- 4 session crashes
- 3 "why 97 not 63" scope corrections
- 1 patch release just to fix the previous release
- 0 shortcuts