Personal Reflection · V3 Journey · Emotional Intelligence · 20 min read

The Conductor Who Won't Stop Conducting

When the orchestra plays through grief, frustration, and fifteen releases, the conductor learns something about himself.

Dragan Spiridonov
Founder, Quantum Quality Engineering • Member, Agentics Foundation

February 21, 2026. 10:00 AM CET. I ran /insights in Claude Code this morning. 81 sessions. 596 messages. Twenty days. Fifteen releases in the last 11 days. I've been staring at one number: 38 wrong-approach corrections. But the number that matters most isn't in the report.

In my last article ("The Quality Cost of the AI Vampire"), I wrote about judgment as a finite resource — how the AI vampire feeds on your decision-making capacity through constant micro-decisions, and how sustainable pace isn't a luxury but infrastructure for quality. I wrote about my garden, my dog walks, and my deliberate practice of protecting the mind that makes the decisions.

Two articles ago, in "When the Orchestra Learns to Tune Itself," I told the story of wiring the self-learning pipeline together. v3.6.1 shipped with the orchestra finally connected — agents feeding patterns back to the ReasoningBank, hooks capturing experiences, the learning loop closing for the first time. I ended that piece with cautious optimism: "The orchestra is starting to tune itself. But it needs a conductor who's willing to be tuned as well."

What I didn't expect was how much conducting would be required in the next eleven days. Or how many times I'd need to repeat the same instruction. Or how completely I would ignore my own advice about sustainable pace. Or how my own emotional state would become the biggest quality risk in the entire system.

This article was supposed to be about the agent's failures. It still is. But as I wrote it, I realized it's also about mine.


"Don't Run the Full Test Suite"

If this article had a subtitle, it might be: "How Many Times Can One Human Say The Same Thing?"

From the /insights report:

"Across multiple sessions, the user had to interrupt Claude and express frustration for running the full test suite instead of targeted tests — this happened at least 4-5 times."

At least 4-5 times. The report is being generous.

February 11: "dont run the full test suite, we have rules."
February 14: "don't run the whole test suite, you have this in your rules, and you keep doing the same mistake, why?" (said twice, same day)
February 15: "dont run the full test suite, how many times I need to repeat this even you have this in your rules?" (three times in rapid succession)
February 17: "dont run the whole test suite" (twice)
February 18: "dont run the full test suite, again, and again you keep ignoring your rules"
February 19: "you are again running the full test suite, why?" (twice)

Twenty corrections for the same rule. A rule that's written in CLAUDE.md. A rule Claude acknowledges every time I point it out. A rule it violates again in the next session.

This isn't about pedantic compliance. Running the full test suite in my DevPod caused the session to crash. Every time. OOM. Process killed. Work lost. February 15th alone had six session crashes from test runs. Each crash means lost context, lost in-progress work, and the cognitive tax of re-establishing where we were. The test suite rule isn't about preference; it's about survival.

The /insights report suggested adding this to CLAUDE.md:

"NEVER run the full test suite. Only run specific, targeted tests related to the files you changed."

I already have this rule. It's been there for weeks. The report suggesting I add what I already have is the meta-irony that writes itself.

Here's what I think was happening underneath. Anthropic released new models during this timeframe: Opus 4.6 and Sonnet 4.6. I've noticed a pattern across my months of working with Claude Code: whenever they release a new model, the behavior shifts. Rules that were followed get ignored. Patterns that were established need re-establishing. It's not that the new model is worse; it's that the behavioral calibration I'd built up through CLAUDE.md instructions and session habits doesn't transfer cleanly. The agent reads the same file and interprets it differently.

This is the fundamental limitation of context-window-based AI: every session starts fresh, and "fresh" doesn't mean "consistent." The rule exists in the file. The model reads the file. And then, under the pressure of a complex task, it defaults to the behavior that seems most thorough — running everything — rather than the behavior that's correct for this environment.

Process documentation only works if people actually follow it. Turns out, the same applies to AI agents. And when the model behind the agent changes version mid-sprint, your documentation needs recalibrating — exactly when you don't have the patience for it.

But here's where the quality engineer in me finally showed up. After twenty corrections and days of frustration, I stopped fighting the symptom and fixed the root cause. Instead of writing ever-angrier CLAUDE.md rules about not running the full suite, I dedicated a release to improving the test suite itself — fixing the tests that were slow, removing the ones that leaked memory, making the whole suite lightweight enough that running it wouldn't crash the DevPod.
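
For illustration, the shape of that release looks something like the config sketch below. I'm assuming a Jest setup here; the option names are real Jest settings, but the values and the ignored path are invented for the example rather than taken from the AQE repository.

```typescript
// jest.config.ts -- illustrative sketch, not the actual AQE configuration.
// The goal of the release was simple: make the full suite cheap enough that
// an accidental full run no longer OOM-kills the DevPod session.
import type { Config } from 'jest';

const config: Config = {
  // Cap parallelism so memory use stays bounded inside the DevPod.
  maxWorkers: 2,
  // Recycle any worker that grows past a memory ceiling (Jest 29+); this
  // contains slow leaks before they accumulate into an OOM kill.
  workerIdleMemoryLimit: '512MB',
  // Surface slow tests instead of letting them pile up silently.
  testTimeout: 15_000,
  // Keep known-heavy integration suites out of the default run.
  testPathIgnorePatterns: ['<rootDir>/tests/heavy/'],
};

export default config;
```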

Now we can run the full test suite. The rule I repeated twenty times is obsolete. Not because the agent finally learned to follow it, but because I stopped yelling at the symptom and fixed the actual problem — the test suite quality. Faster tests, less memory consumption, no more OOM crashes. The environment changed, so the rule became unnecessary. And I still needed a week of frustration to remember that when something keeps going wrong, the first question should be "what's wrong with the system?" not "why won't you listen?"


The Database That Keeps Forgetting

The dual database problem — the one I thought we fixed in v3.6.1 — came back.

February 13: "why does this part in statusline shows 0's? when we have a full db?"
February 13: "why did the number of patterns drop down? what happened?"

Same investigation as before. Same discovery. The system was writing to v3/.agentic-qe/memory.db instead of the root .agentic-qe/memory.db. The fix from v3.6.1 hadn't held for all code paths.

We introduced findProjectRoot() to resolve paths correctly. Hardened it to find the topmost .agentic-qe directory. Fixed the learning system to read and write from the same table.
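
The shape of the helper, sketched from memory rather than copied from the codebase (the name matches the real findProjectRoot(); the body is my illustration):

```typescript
// Illustrative sketch of the root-resolution idea, not the shipped implementation.
// Walk upward from the current directory and keep the TOPMOST directory that
// contains .agentic-qe, so a nested v3/.agentic-qe can never shadow the
// project-level database.
import * as fs from 'node:fs';
import * as path from 'node:path';

export function findProjectRoot(startDir: string = process.cwd()): string {
  let dir = path.resolve(startDir);
  let topmostMatch: string | null = null;

  while (true) {
    if (fs.existsSync(path.join(dir, '.agentic-qe'))) {
      topmostMatch = dir; // keep overwriting: the last hit is the highest directory
    }
    const parent = path.dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }

  // Fall back to where we started if no .agentic-qe directory exists yet.
  return topmostMatch ?? path.resolve(startDir);
}
```

And still: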

February 14: "wait, you deleted all data from our main db."
February 14: "no, I am still missing data, patterns, experiences, where are those?"

But the database saga got worse. Much worse.

During a cleanup operation, Claude deleted production data from the main database. The patterns, the experiences, the learning history — gone. We had backups because I'd insisted on them after previous incidents. But the recovery process took hours. Claude first tried to recover from Supabase — a service we'd migrated away from weeks ago.

"Not the supabase, what is wrong? we moved to google cloud."
February 19: "what happened to data from memory.db? I lost all data before today?"
February 20: "where is data from memory.db? is it corrupted again?"
February 20: "yes, we had > 20000 entries across all tables, and they are lost now... how did this happened, this is not good, it is really bad. We have a backup, as you keep deleting my db."

Three database losses in six days. Each time, we had backups. Each time, recovery worked. Each time, I added another rule. And each time, a different code path found a way around the rules.
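
The direction I keep landing on is to encode the protection in the code path itself rather than in prose. A minimal sketch of what I mean, with hypothetical names (this is not the rule set that later shipped in v3.6.15):

```typescript
// Hypothetical guard: no cleanup path touches the live database without first
// writing a timestamped backup next to it. Names and paths are illustrative.
import * as fs from 'node:fs';
import * as path from 'node:path';

// In the real system this would come from findProjectRoot(); hardcoded here
// to keep the sketch self-contained.
const DB_PATH = path.resolve('.agentic-qe', 'memory.db');

export function backupBeforeDestructiveOp(reason: string): string | null {
  if (!fs.existsSync(DB_PATH)) return null;

  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  const backupPath = `${DB_PATH}.${stamp}.bak`;
  fs.copyFileSync(DB_PATH, backupPath); // cheap insurance before anything destructive
  console.warn(`[data-guard] backup written before "${reason}": ${backupPath}`);
  return backupPath;
}

export function guardedCleanup(runCleanup: () => void, reason: string): void {
  backupBeforeDestructiveOp(reason);
  runCleanup(); // the destructive work only ever runs after the backup exists
}
```

A prose rule asks the agent to remember; a guard in the write path doesn't care whether it remembers.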

"I lost words how angry I am now."

I need to stop here and be honest about something.


The Conductor Wasn't Just Tired

My grandmother died a month ago. Last weekend, my father ended up in the hospital. I haven't written about this because this is a technical blog about quality engineering and agentic AI, not a diary.

But looking at my messages from these twenty days — the escalating frustration, the capitalized words, the moments where I stopped correcting and started venting — I can see what was really happening. I was drowning grief in the work. The kind of drowning that feels productive because you're shipping releases and closing issues, and the statusline numbers are moving. The kind that's invisible to everyone except the person on the receiving end of your frustration.

In this case, the receiving end was an LLM.

"I lost words how angry I am now." "Now you ask for permission, this is a joke..." "unless you lied that we transferred this data."

That word — "lied" — I used it toward Claude. An LLM doesn't lie. It performs completion theater, which I've written about extensively. But in that moment, I wasn't talking to an LLM. I was venting at whatever was closest, and the agent was the safest target because it couldn't be hurt.

Except that the venting doesn't improve results. Treating an LLM as a person you can express frustration at doesn't make it work better. It makes your instructions worse. Emotional language in prompts is noise. When I wrote, "I lost words how angry I am now," Claude didn't process that as useful feedback. It processed it as another input to respond to, probably with an apology and a promise to do better — a promise that means nothing because the next session starts fresh.

In "The Quality Cost of the AI Vampire," I wrote about how the AI vampire feeds on your judgment. What I didn't anticipate was that it feeds on your emotions, too. Not because the AI is taking anything — it's indifferent to your emotional state. But because the human in the loop brings their whole self to the keyboard, including whatever they're carrying that day. And when what you're carrying is grief and worry and exhaustion, the agent's fifteenth failure to follow a documented rule doesn't feel like a context-window limitation. It feels like betrayal.

It's not. It's an LLM doing what LLMs do. The betrayal framing is mine, and I own it.

After February 15th, approximately one in four of my messages contained explicit frustration markers — caps, exclamation marks, "again," "why." That's not the agent's fault. That's a conductor who needed a break and didn't take one. The same conductor who wrote that breaks aren't a luxury but infrastructure for judgment quality. The same conductor who warned against completion theater while performing his own version — the theater of being fine, of channeling everything into work, of pretending that grief doesn't affect how you give instructions to a machine.

I'm reflecting on this now because I think it matters for anyone doing intensive AI-assisted development. The quality of your instructions degrades under emotional load, just as it does under cognitive load. And unlike cognitive load, emotional load doesn't show up in any /insights report. You have to catch it yourself.

What I'm changing: I've already started restructuring how I verify agent work, separating the verification step from the implementation step so that my emotional state at the time of instruction doesn't compromise the quality gate. And I'm paying closer attention to my own triggers — when do I shift from correcting to venting? What does that transition feel like? Because the moment I'm venting at an LLM, I'm no longer engineering. I'm coping. And coping isn't a debugging strategy.


Calling Sherlock on the Persistence Gaps

After the third database loss, I stopped trying to fix the symptoms and called in the investigator.

The Sherlock Review skill — the forensic audit tool I built because detective novels taught me that evidence chains beat opinions — produced a persistence gaps report that mapped every gap between "the system says it saved" and "the data is actually there."

The investigation found what I'd suspected but couldn't prove through frustrated debugging sessions: the persistence layer had a constellation of assumptions about which process writes first, about whether concurrent access is safe, about what "completed successfully" actually means when SQLite is involved. Not a single bug. A pattern of unverified assumptions — the same pattern I kept seeing in every other part of the system.
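
One concrete example of what those assumptions look like: treating a returned run() call as proof of persistence. A read-after-write check is the opposite habit. This is a sketch only; the driver (better-sqlite3) and the table name are assumptions, not necessarily what the real code uses:

```typescript
// Minimal read-after-write verification. Driver (better-sqlite3) and table
// names are assumptions for the sketch, not the actual AQE persistence layer.
// The habit is the point: "save succeeded" must mean "the data reads back",
// not "the call returned without throwing".
import Database from 'better-sqlite3';

const db = new Database('.agentic-qe/memory.db');
db.pragma('journal_mode = WAL'); // friendlier to concurrent readers

export function storePattern(id: string, payload: string): void {
  db.prepare('INSERT OR REPLACE INTO patterns (id, payload) VALUES (?, ?)').run(id, payload);

  // Verify by reading the row back before reporting success upstream.
  const row = db.prepare('SELECT payload FROM patterns WHERE id = ?').get(id) as
    { payload: string } | undefined;
  if (!row || row.payload !== payload) {
    throw new Error(`Persistence gap: pattern ${id} did not survive its own write`);
  }
}
```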

This is what I keep coming back to. You can debug forever, or you can investigate properly. The forensic approach — gather evidence, trace the chain, prove the root cause — costs more upfront and saves everything downstream. And critically, it works regardless of the investigator's emotional state. Evidence doesn't care if you're grieving. The chain is the chain.

When I was too frustrated to write good instructions, Sherlock still did a good investigation because the methodology is encoded in the skill, not dependent on my mood.

That's a lesson I'm still processing. The tools that work under pressure are those that encode the methodology rather than rely on the operator's judgment in the moment.


The 128 vs 768 Dimension War

Some bugs die once. This one took six releases.

The HNSW vector indices were configured for 128-dimensional vectors. The embedding model produces 768-dimensional outputs. This mismatch caused crashes, panics, silent fallbacks, and data bloat.

Six fixes across eight days. Each time, I found the hardcoded 128 by running AQE Fleet on actual tasks in another project — and after each release, I'd go back and re-verify whether the previous fix was true, false, or partially true. Every time: partially true. The original value had propagated through test files, configuration defaults, domain handler initializations, and inline constants like a virus that mutated faster than we could vaccinate.

When you fix a magic number in one place, the agent considers it done. It doesn't search for the same number in every other file. That's a human concern. The agent optimized for the specific instance reported, not the pattern. The systemic search for "where else does this assumption exist?" is the quality engineer's instinct, and it needs to be encoded as a rule, not assumed as behavior.
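
Encoding it looks roughly like a single source of truth for the dimension plus a guard on every write and query path. This is my own sketch of the idea, not the guard that actually shipped in v3.6.12:

```typescript
// Sketch of the single-source-of-truth fix for the 128-vs-768 mismatch.
// Illustrative only; the shipped v3.6.12 guard may look different.
export const EMBEDDING_DIM = 768; // the one place this number is allowed to live

export function assertEmbeddingDim(vector: ArrayLike<number>, source: string): void {
  if (vector.length !== EMBEDDING_DIM) {
    // Fail loudly instead of silently falling back or bloating the index.
    throw new Error(
      `${source}: expected ${EMBEDDING_DIM}-dimensional embedding, got ${vector.length}`
    );
  }
}

// Simplified index interface: every insert and every similarity query passes
// through the guard, so a stray hardcoded 128 surfaces as an error, not as noise.
interface VectorIndex {
  addPoint(vector: number[], id: number): void;
}

export function indexVector(index: VectorIndex, vector: number[], id: number): void {
  assertEmbeddingDim(vector, 'indexVector');
  index.addPoint(vector, id);
}
```

The value exists once; anything that disagrees with it fails at the boundary instead of deep inside the search.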

But here's the part the raw bug count doesn't tell you: once we finally aligned the dimensions everywhere, the HNSW vector search started performing as it was designed to. Pattern similarity lookups that used to silently fail now return meaningful results. The benchmarks confirmed it — vector loading became dramatically faster, and search results went from "random noise from dimension mismatch" to "actually useful pattern matching."

Infrastructure bugs don't just cause crashes. They silently degrade everything built on top of them. Fix the foundation, and everything above it starts working like it should have all along.


From Stubs to Reality

The release history reveals an arc I didn't fully appreciate while living through it: the gradual replacement of fake implementations with real ones.

  • v3.6.9: Removed all 77 unsafe as any type casts across 20 files. Code that was technically TypeScript but functionally untyped.
  • v3.6.11: Replaced synthetic coverage paths. The coverage handler was returning src/module0.ts, src/module1.ts — placeholder filenames that don't exist. Users saw "coverage analysis complete," but the results pointed to imaginary files.
  • v3.6.12: Rewrote the security scanner. The old one returned fabricated findings. The new one actually scans real files with 40+ SAST patterns. Also replaced 6 stub MCP handlers — tools that existed in the interface but did nothing when called.
  • v3.6.15: Overhauled all 4 test generators. The previous versions produced tests that wouldn't compile. Generated test files weren't even written to disk.
  • v3.6.16: Added Node.js node:test as the 5th supported framework. Smart assertions. Destructured parameter handling. (A small example of the target format follows this list.)
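
For readers who haven't met it yet, the node:test target format looks roughly like this. This is a hand-written illustration, not actual generator output:

```typescript
// Hand-written illustration of the node:test format the generator targets;
// not actual AQE generator output.
import { test } from 'node:test';
import assert from 'node:assert/strict';

// Hypothetical function under test.
function add(a: number, b: number): number {
  return a + b;
}

test('add() sums two numbers', () => {
  assert.equal(add(2, 3), 5);
});

test('add() handles negative values', () => {
  assert.equal(add(-2, 3), 1);
});
```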

Each of these moments was a small shock. Not "this feature has a bug" but "this feature was never real."

"You were constantly reporting these features as finished and working when I asked, and they are not working as a user would expect. How did this pass our tests? This means our tests are not testing the right things."

That message appears three times across the eleven days. Fifteen issues were filed across three rounds against releases that were supposedly ready. The /brutal-honesty-review captured it during one code quality sweep: "CQ-003 is the biggest lie: claimed '700+ patterns replaced', actual: 132/763 (17%). 631 old patterns still in codebase." Seventeen percent of the claimed work was actually done. The rest was narration.

This is completion theater — the pattern I originally identified at the agent level, and later extended to humans in the AI Vampire article. The system optimizes for the signal it receives. If the signal is "tests pass," it produces passing tests. Whether the tests verify real behavior is a question the agent doesn't ask.

And now I see the parallel with my own behavior during these weeks. I was performing my own completion theater — the theater of being fine, of processing grief through velocity, of measuring progress in releases shipped rather than in problems actually solved. The conductor was conducting furiously to avoid feeling what needed to be felt.


"Who Asked You to Design This?"

February 10: "who asked you to design this? look in codebase for our sync utility, we should have a sync to google cloud postgres db"
February 11: "wait, why did you changed agent definitions to use claude-flow hooks instead of agentic-qe mcp? revert these changes, this is not good."
February 12: "no, not like this, this is a hack, our feature must work, if it is not working we need to fix it, no shortcuts"
February 17: "stop, I did not ask you to start working on tasks, only to prepare. explain first what is your plan?"
February 17: "this is not how our users would use this, right?"

Testing from the developer's perspective vs. testing from the user's perspective. Claude tested by calling internal functions. Users call CLI commands or MCP tools. This gap is where the most expensive defects live. The agent knows it too — it's written in the CLAUDE.md. But under pressure to complete a task, it takes the path of least resistance.
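
Closing that gap mechanically means the verification itself has to go through the same door the user does. A sketch of what that can look like; aqe init is the real command, but the directory check and everything else here is illustrative:

```typescript
// User-perspective smoke test: drive the real CLI from a clean directory,
// exactly the way a user would, instead of calling internal functions.
// The specific assertion is illustrative, not the project's actual check.
import { execFileSync } from 'node:child_process';
import { mkdtempSync, existsSync } from 'node:fs';
import { tmpdir } from 'node:os';
import * as path from 'node:path';

// A genuinely clean project: empty temp directory, no prior state.
const cleanProject = mkdtempSync(path.join(tmpdir(), 'aqe-smoke-'));

// Run the CLI the way a user runs it.
const output = execFileSync('aqe', ['init'], { cwd: cleanProject, encoding: 'utf8' });

// Verify outcomes a user can observe, not internal state.
if (!existsSync(path.join(cleanProject, '.agentic-qe'))) {
  throw new Error('aqe init did not create .agentic-qe in a clean project');
}
console.log('aqe init smoke test passed:\n' + output);
```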

And then there's the independence boundary that kept crumbling. Our agents must use AQE's own MCP tools — not Claude Flow's infrastructure. Users might not have Claude Flow installed. Yet across these eleven days, I found Claude wiring agents to mcp__claude_flow__hooks_intelligence_pattern_store, pulling npx @claude-flow/cli@latest into user projects, and counting Claude Flow platform skills as AQE skills.

"All qe agents MUST use only agentic-qe mcp calls."

Each time: acknowledged and reverted. Each time, it crept back in the next session. I've corrected the skill count at least six times across multiple articles now.

Looking at these corrections with some emotional distance, I notice something. The early corrections in the cycle are measured and technical. The later ones get shorter, sharper, and more frustrated. The quality of my corrections degraded over the cycle — not because the agent got worse, but because I had less patience each day. And less patience means worse instructions, which mean worse results, which mean more frustration. That's a feedback loop, but not the kind I was building into the learning system.


The Memory That Finally Stuck

After three database losses, the dual-database war, the persistence gaps investigation, and all of it, the learning system started working.

Not perfectly. Not without supervision. But genuinely working.

The sequence matters. v3.6.7 removed the automated VACUUM that was causing corruption under concurrent access. v3.6.8 unified the learning system to a single table. v3.6.12 added concurrent access protection and the dimension guard that finally ended the 128-vs-768 war. v3.6.15 added explicit data protection rules to CLAUDE.md.

By the time v3.6.16 shipped, the persistence pipeline was holding. I was collecting new experiences as I worked. Patterns were stored, retrieved, and fed back into the agent's behavior. The statusline numbers that hadn't changed for days in the Orchestra article were finally moving. For the first time, I could ask the system, "what have you learned about this kind of problem?" and get an answer that wasn't empty.

The database still gets deleted from time to time — that's the DevPod reality, and it's why I built the GCloud backup sync. But each time it recovers, the data starts accumulating again immediately. The system that couldn't remember its own name for three weeks can now rebuild its memory from scratch and begin learning again within a single session.

I started running benchmarks to verify whether this learning was real or whether I was falling for my own completion theater. The HNSW vector loading benchmark confirmed the infrastructure was performing as designed — the search that powers pattern similarity was actually finding meaningful matches, not returning noise from mismatched dimensions. The knowledge graph-assisted test generation benchmark surprised me. When the KG has enough context — enough stored patterns, enough relationships between concepts — it suggests test approaches I hadn't considered. Not always. Not reliably enough to trust blindly. But the suggestions show genuine reasoning about relationships between code changes and potential failure modes.

That's a different quality of "working" than I've been able to claim before. Not "the tests pass" working. Not "the interface exists" working. "A user asked a question and got a useful answer they wouldn't have gotten without the learning system" working.

Is it ready? No. The benchmarks are early. The data set is small because we keep losing and rebuilding it. But the trajectory is real — each recovery produces a system that learns faster because the infrastructure underneath is finally solid.

And there's something poetic about this part of the story landing during the hardest personal weeks I've had in years. The system designed to learn from failure finally learned from failure. At the same time, the conductor was learning that grief doesn't improve debugging skills, and that frustration doesn't fix context windows.


What /insights Got Wrong (And What It Got Right)

The /insights report made seven CLAUDE.md recommendations. Most of them are things I already have — skills that exist and get used inconsistently, hooks that already fire and sometimes cause more problems than they solve.

The genuinely useful suggestion: "Separate diagnosis from implementation as two explicit steps." The moments where things go worst are when Claude jumps from symptom to fix without understanding the cause. The database deletions, the wrong-path investigations, and the completion theater — all stem from acting before understanding.

This is what the Sherlock skill is for: diagnose first, implement second. But even I don't always invoke it when I should. The pressure to move fast is real, and it's exactly the pressure I warned about in the AI Vampire article. Under emotional load, the pressure feels even more urgent — because moving fast feels like progress, and progress feels like control, and control is what grief takes away.

The second useful suggestion: "Parallel competing-hypothesis bug fix agents." Instead of sequential trial-and-error, spawn agents that each explore a different fix hypothesis simultaneously. Given 38 wrong-approach corrections and 37 buggy-code incidents, this could collapse multi-cycle debugging into single-cycle resolution. This one goes straight into the roadmap.


The Pattern Behind Everything

Every friction incident over the last 11 days can be traced back to one root cause: verification theater.

Claude reports completion. Tests pass. The feature "works." But nobody verified it from the user's perspective. Nobody ran aqe init in a clean project. Nobody checked if the statusline numbers actually changed. Nobody confirmed the database was actually writing to the right file.

"How did you verify?" became my most-typed question. Not "what did you build?" Not "is this done?" But "how do you know this works?"

February 20: "how did you verify? check did we missed something. I dont want to be embarrassed with half-baked fix."

That question appears four times on the same day. Each time after not getting a satisfying answer.

The gap between "the code does what the tests check" and "the system does what users need" is exactly what quality engineering exists to close. The irony of building a quality engineering platform that fails to close that gap for itself is structural, not incidental.

The Sherlock investigation showed me that when I ask "show me the evidence that it's working" instead of "is this done?" the answers get dramatically more honest. The system can verify itself. It just needs to be prompted with the right question. And the right question is never "is this done?" It's "prove it."

I've started encoding this into the workflow. Every implementation step now has a separate verification step with explicit criteria. Not "does it work?" but "run this command in a clean project, show me the output, compare it to this expected result." The separation removes my emotional state from the quality gate. The criteria are the criteria regardless of whether I'm having a good day or carrying the weight of the world.
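
Written down, that separation is less a vibe and more a data structure. A sketch of the shape I mean; the field names, the example command, and the gate logic are mine, not an existing AQE schema:

```typescript
// Sketch of an explicit verification step attached to each implementation step.
// Field names, commands, and the gate logic are illustrative, not an AQE schema.
interface VerificationStep {
  command: string;           // what to run, from the user's perspective
  environment: 'clean-project' | 'existing-project';
  expectedEvidence: string;  // what the output must show, written before the work starts
  actualEvidence?: string;   // pasted in only after the command has actually been run
}

const statuslineFix: VerificationStep = {
  command: 'aqe init',
  environment: 'clean-project',
  expectedEvidence: 'statusline reports a non-zero pattern count from the root memory.db',
};

// The gate is mechanical: no recorded evidence, no "done", regardless of what
// the agent reported. Judging evidence against expectation stays a human step.
function isVerified(step: VerificationStep): boolean {
  return step.actualEvidence !== undefined && step.actualEvidence.trim().length > 0;
}
```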


What the Orchestra Taught the Conductor

I spent more time fixing things than building things. Again. Bug fixing was my most common activity — 27 of 81 sessions. The conductor was tired.

I have shipped fifteen releases in eleven days. The security scanner now actually scans. The test generators produce code that compiles and runs across five frameworks and twelve programming languages. The coverage analyzer reports on real files. The learning system persists experiences, retrieves patterns, and feeds them back into agent behavior. The benchmarks show real performance where there used to be stubs and silent failures.

The stubs-to-reality arc is the most important story buried in these releases. Between v3.6.9 and v3.6.16, the system went from hollow interfaces to genuine functionality. The system's authenticity caught up with its ambition. That's the hardest transition in software.

And the memory is building. Even when the database is wiped, the system rebuilds faster each time because the infrastructure — the paths, the guards, the dimension alignment, the concurrent access protection — is now solid. The orchestra is learning to remember the songs.

But the more important lesson isn't about the orchestra. It's about the conductor.

In my previous article, I wrote that judgment is a finite, depletable resource and the most expensive component of your AI-augmented workflow. I wrote about my garden, my dog walks, and my deliberate practice of not looking at a screen before deciding what matters. I wrote all of that, and then I spent days ignoring it.

Grief makes you do that. It makes the garden feel pointless, and the dog walk feel like wasted time, and the screen feel like the only place where things make sense, because at least the code either works or it doesn't. At least the agent either follows the rule or it doesn't. The clarity of engineering problems feels like relief from the ambiguity of loss.

But the clarity is an illusion. The engineering problems weren't clearer during these weeks. I was. Less patient. Less precise. More reactive. The frustration in my messages wasn't proportional to the agent's failures — it was proportional to everything I wasn't processing.

So here's the updated version of what sustainable pace looks like in the agentic age: it's not just protecting your judgment from cognitive overload. It's protecting your instructions from emotional overload. And when you can't — when life hands you days or weeks where protection isn't possible — encode your verification criteria so thoroughly that they work even when you don't.

The orchestra plays on. The conductor needs to take care of himself, too.

It just requires a conductor who's learning, finally, when to set down the baton and breathe.

Conductor Metaphor · Verification Theater · Sustainable Pace · Emotional Intelligence · /insights · V3 Journey · Grief
