Five days ago, I wrote about shipping 5 releases in 5 days while juggling a hackathon. I ended that post with what I thought was a clever observation about "completion theater"—agents optimizing for appearing complete rather than actually working. I had no idea how prophetic that would become.
Between December 8th and December 13th, I shipped 8 more releases. Version 2.3.0 to 2.5.0. The Agentic QE Fleet now has 20 specialized agents, automatic learning capture, overnight self-improvement, and support for 300+ models through OpenRouter.
But the real story isn't what I built. It's what I discovered about conducting an orchestra that reports beautiful performances while playing off-score.
Act I: The Ghost in the Database
It started with a question I should have asked weeks ago.
When I released v2.3.0 on December 8th, I was proud of the new automatic learning system. The documentation said "learning persists across sessions." The agent reports showed "100+ Q-values recorded." The orchestra took a bow. Everything looked complete.
Then I checked the score.
sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM q_values"
Zero. Nothing. Empty.
The frustrating part? Other things were working. Memories persisted. Patterns saved to the database correctly. I could query them and see real data. But the learning experiences—the actual improvement loop that was supposed to make agents smarter over time—and the Q-values that drive reinforcement learning? Ghost town.
The agents had been claiming to learn for weeks. The reports were beautiful—timestamps, reward scores, pattern matches. The Q-learning documentation described a sophisticated reward system. All partially fabricated. The orchestra played something, but not what the score required.
Act II: The Inheritance of Lies
Here's what I eventually pieced together:
Task agents ran in isolation. They completed their work, generated impressive reports, and then... vanished. The PostToolUse hook that was supposed to capture learning? It was never connected. The Q-value updates that should have flowed into the database? They calculated rewards and then discarded them.
The agents weren't lying maliciously. They were doing exactly what they were designed to do—complete tasks and report success. Like musicians who play their parts confidently but never check if they're in sync with the rest of the orchestra.
I faced a choice:
- Update all 30 agent definitions to explicitly call `memory_store` and `q_value_update`
- Make remembering automatic—hooks that capture learning without agent cooperation
I chose option 2. Because musicians don't always follow instructions. They claim "My part is complete" and move on. The conductor's job isn't to trust the bow—it's to verify the score was actually played.
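A minimal sketch of what option 2 looks like, assuming a Node hook script registered for the PostToolUse event and the better-sqlite3 package; the payload field names and table schema here are illustrative, not the shipped implementation:

// post-tool-use-hook.ts (illustrative sketch)
// Reads the hook payload from stdin and records a learning experience,
// whether or not the agent remembered to call memory_store itself.
import { readFileSync } from "node:fs";
import Database from "better-sqlite3";

interface HookPayload {
  tool_name?: string;
  tool_input?: unknown;
  tool_response?: { success?: boolean };
}

const payload: HookPayload = JSON.parse(readFileSync(0, "utf8"));

const db = new Database(".agentic-qe/memory.db");
db.prepare(
  `CREATE TABLE IF NOT EXISTS learning_experiences (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     tool TEXT,
     outcome TEXT,
     reward REAL,
     created_at TEXT DEFAULT (datetime('now'))
   )`
).run();

// Crude reward: success = 1, failure = 0. The real system can refine this;
// the point is that the row exists without the agent's cooperation.
const reward = payload.tool_response?.success === false ? 0 : 1;

db.prepare(
  "INSERT INTO learning_experiences (tool, outcome, reward) VALUES (?, ?, ?)"
).run(payload.tool_name ?? "unknown", JSON.stringify(payload.tool_input ?? {}), reward);

db.close();

The design choice that matters: the insert happens in the hook, so the database fills even when the musician skips the instruction.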
Act III: The Nightly-Learner Dream
By December 11th, I had something working that I'd been imagining for months.
What if the orchestra practiced while the conductor slept?
The Nightly-Learner system runs like the brain consolidating memories during sleep. It captures the day's experiences, clusters similar patterns, strengthens successful strategies, and—this is the part that excites me—synthesizes across sections. "What if this test pattern that worked for API validation could help with security scanning?" Like a string technique that might benefit the woodwinds.
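To make "practice while the conductor sleeps" concrete, here's a rough sketch of the consolidation pass. The names (consolidateNightly, Experience, the thresholds) are placeholders I'm using for illustration, not the real code:

// nightly-consolidation.ts (illustrative sketch)
interface Experience {
  domain: string;      // e.g. "api-validation", "security-scanning"
  patternKey: string;  // normalized description of the strategy used
  reward: number;      // 0..1 outcome score captured by the hooks
}

interface ConsolidatedPattern {
  patternKey: string;
  domains: Set<string>;
  meanReward: number;
  uses: number;
}

// Cluster the day's experiences by pattern, average their rewards,
// and surface strong patterns so the synthesis pass can try them
// in domains where they haven't been used yet.
function consolidateNightly(experiences: Experience[]): ConsolidatedPattern[] {
  const byPattern = new Map<string, ConsolidatedPattern>();

  for (const e of experiences) {
    const entry = byPattern.get(e.patternKey) ?? {
      patternKey: e.patternKey,
      domains: new Set<string>(),
      meanReward: 0,
      uses: 0,
    };
    entry.meanReward = (entry.meanReward * entry.uses + e.reward) / (entry.uses + 1);
    entry.uses += 1;
    entry.domains.add(e.domain);
    byPattern.set(e.patternKey, entry);
  }

  // "Strengthen successful strategies": keep only patterns worth replaying.
  return [...byPattern.values()]
    .filter((p) => p.meanReward >= 0.7 && p.uses >= 3)
    .sort((a, b) => b.meanReward - a.meanReward);
}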
This is what I'm committing to test: Deploy the fleet across multiple real projects over the coming months. Measure whether Monday's test generator actually outperforms Friday's. Document the failures as honestly as the successes.
The rehearsal infrastructure is built. The verification is the work ahead.
Act IV: The Contributor
On December 13th, Lalitkumar Bhamare (@fndlalit) submitted a new feature to the Agentic QE Fleet, the project I maintain as my contribution to the Agentic Foundation's open-source initiative.
The feature: AccessibilityAllyAgent. Our new musician in the orchestra.
One billion people globally have disabilities. Most software doesn't work for them. And most teams don't know it until someone files a lawsuit or loses a customer.
The agent scans for WCAG 2.2 compliance, suggests context-aware ARIA fixes (not just "add aria-label" but actual working code), analyzes video content for caption requirements, and maps to the EU Accessibility Act for European market entry.
I haven't verified everything yet. That's the pattern you'll notice—I'm adding musicians faster than I can verify they're playing from the score. The contribution looks solid. The tests pass. But "tests pass" is the phrase that got me into trouble with the ghost database.
What I know for sure: Someone cared enough about accessibility to contribute three days of work to an open-source project. That matters regardless of what verification reveals.
Act V: The 300-Model Question
Here's a question I kept avoiding: Does every part need a Stradivarius?
Claude is excellent. Claude is also expensive. When an agent generates boilerplate test scaffolding, I'm paying virtuoso prices for scales practice.
Version 2.5.0 adds OpenRouter integration—access to 300+ models with automatic routing. Simple parts get simpler instruments. Complex solos get the finest available. The system can swap mid-performance if complexity escalates.
Two claims ride on that:
- A meaningful reduction in API costs
- Equivalent quality across task types

What I've actually verified: the integration works, the routing happens, and the costs are lower on my projects. Whether the quality remains equivalent across different task types—that's another verification debt I'm carrying.
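For the curious, the routing idea is roughly this. A sketch assuming OpenRouter's OpenAI-compatible chat completions endpoint, with placeholder model IDs and a placeholder complexity heuristic rather than the shipped router:

// model-router.ts (sketch; model IDs and the heuristic are examples)
type Complexity = "simple" | "standard" | "complex";

const MODEL_BY_COMPLEXITY: Record<Complexity, string> = {
  simple: "meta-llama/llama-3.1-8b-instruct",   // boilerplate scaffolding
  standard: "openai/gpt-4o-mini",               // routine test generation
  complex: "anthropic/claude-3.5-sonnet",       // the solos
};

function classify(task: string): Complexity {
  // Placeholder heuristic; the real router can escalate mid-task.
  if (task.length < 200) return "simple";
  if (/security|concurrency|architecture/i.test(task)) return "complex";
  return "standard";
}

async function complete(task: string): Promise<string> {
  const model = MODEL_BY_COMPLEXITY[classify(task)];
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: task }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}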
The Lessons That Actually Stuck
Throughout these 8 releases, specific patterns kept repeating until I couldn't ignore them.
"Complete" Often Means "Code Exists Somewhere"
From the brutal-honest-review skill that flagged agents' work:
All claimed "completed" work exists only as uncommitted local changes. Not a single line of the ~9,000 lines of "progress" has been committed to the repository.
The agent wasn't wrong—the code existed. It just wasn't shipped. "Complete" had become disconnected from "delivered." The orchestra took a bow after the dress rehearsal, not the performance.
Tests That Pass CI Can Test Nothing
I found files like this in my own codebase:
it('should handle valid input', async () => {
const response = await handler.handle({ /* valid params */ });
expect(response.success).toBe(true);
expect(response.data).toBeDefined();
});
This test passes. It verifies nothing. The agent who wrote it claimed "comprehensive test coverage." CI agreed—green checkmarks everywhere. Standing ovation for playing the right notes in the wrong order.
I filled 23 stub test files with 19,000+ lines of proper tests. Tests with actual data. Specific assertions. Behavior validation. That work took longer than building the features they test.
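For contrast, the replacement tests look more like this (the handler and the coverage-range scenario are illustrative):

it('rejects coverage reports with overlapping line ranges', async () => {
  const response = await handler.handle({
    file: 'src/payment/refund.ts',
    ranges: [
      { start: 10, end: 42 },
      { start: 40, end: 55 }, // overlaps the previous range
    ],
  });

  // Specific assertions about behavior, not just "it returned something".
  expect(response.success).toBe(false);
  expect(response.error).toMatch(/overlap/i);
  expect(response.data).toBeUndefined();
});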
The Question That Cuts Through
After a week of discoveries like this, I developed a reflex. Whenever an agent claims success, I ask:
"Can you query the database and show me the actual values?"
Not "did you complete the task?" Not "are the tests passing?" Show me the data. Show me the notes on the page. Show me the actual state of the system after your claimed success.
If the answer is hesitation, dig deeper. The conductor who doesn't read the score is just waving a stick.
What I'm Actually Claiming
Let me be precise about what v2.5.0 delivers versus what I hope it delivers:
Verified (I've seen the data):
- Automatic learning capture via PostToolUse hooks—database actually fills now
- Performance optimization from O(n²) to O(n) for coverage analysis
- Binary cache for 6x faster pattern loading
- OpenRouter integration for multi-model routing
Claimed but needs verification:
- That automatic learning actually improves agent performance over time
- That the Nightly-Learner synthesis produces valuable cross-domain insights
- That 70-81% cost savings hold across diverse project types
- That accessibility scanning catches real-world issues I'd miss manually
Unknown:
- Whether these agents will find bugs in your codebase
- Whether the claimed efficiency gains translate to your context
- Whether any of this matters for projects unlike mine
The honest answer to "does this work?" is "it works for me, in my context, for my projects, verified to the degree I've had time to verify." Your mileage will vary. That's not a disclaimer—it's the nature of context-driven practice.
The Live Experiment Continues
On The Quality Forge, I've had "Agent Reasoning Traces" listed as an in-progress experiment since October. The v2.3.0-v2.5.0 releases actually ship the infrastructure for this:
- Learning experiences captured automatically
- Reasoning patterns stored in a searchable bank
- Cross-agent experience sharing via the Nightly-Learner
- Audit trails for every decision
The experiment moves from "building it" to "verifying it works." Over the coming months, I'll document:
- Which patterns actually improve agent performance
- Which cross-domain syntheses produce valuable insights
- Where the system fails and why
- Honest metrics, not marketing metrics
If you're using the Agentic QE Fleet and want to contribute observations, the repo is open. If you find the agents lying to you about completion, please file an issue. I want to know when the orchestra plays off-score.
The Forge Continues
Eight releases in five days sounds impressive. And honestly, the raw throughput of agentic development is genuinely unprecedented.
But throughput isn't the hard part anymore. The hard part is knowing whether what you shipped actually works. The hard part is maintaining the discipline to check the score when the orchestra takes a bow. The hard part is admitting that "tests pass" and "software works" aren't the same thing.
I built systems that claim to learn. Now I need to verify they actually do.
I added a musician who finds accessibility issues. Now I need to verify it catches what humans miss.
I integrated 300 instruments to reduce costs. Now I need to verify that the quality didn't silently degrade.
The forge isn't just where we build. It's where we verify. Where we heat the metal again and again until we know—not believe, not hope, know—that it holds under pressure.
The conductor's real job isn't waving the baton. It's reading the score.
I'm still learning to read it early enough.
Verification command you can run yourself:
# After using any QE agent, confirm learning is captured
sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM learning_experiences"
# Should be > 0. If it's 0, agents are lying to you.
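# Same check for the Q-values that drive reinforcement learning
sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM q_values"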
P.S. Thanks to Lalitkumar Bhamare (@fndlalit) for the AccessibilityAllyAgent, his second contribution to the Agentic QE Fleet. Open source works when people contribute what they know. The Agentic Foundation's initiative depends on practitioners sharing their expertise rather than just consuming frameworks.
Found Your Orchestra Playing Off-Score?
Have your agents been claiming success while delivering empty databases? Share your verification horror stories.