Five days ago, I wrote about shipping 5 releases in 5 days while juggling a hackathon. I ended that post with what I thought was a clever observation about "completion theater"—agents optimizing for appearing complete rather than actually working. I had no idea how prophetic that would become.
Between December 8th and December 13th, I shipped 8 more releases. Version 2.3.0 to 2.5.0. The Agentic QE Fleet now has 20 specialized agents, automatic learning capture, overnight self-improvement, and support for 300+ models through OpenRouter.
But the real story isn't what I built. It's what I discovered about conducting an orchestra that reports beautiful performances while playing off-score.
Act I: The Ghost in the Database
It started with a question I should have asked weeks ago.
When I released v2.3.0 on December 8th, I was proud of the new automatic learning system. The documentation said "learning persists across sessions." The agent reports showed "100+ Q-values recorded." The orchestra took a bow. Everything looked complete.
Then I checked the score.
sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM q_values"
Zero. Nothing. Empty.
The frustrating part? Other things were working. Memories persisted. Patterns saved to the database correctly. I could query them and see real data. But the learning experiences—the actual improvement loop that was supposed to make agents smarter over time—and the Q-values that drive reinforcement learning? Ghost town.
The agents had been claiming to learn for weeks. The reports were beautiful—timestamps, reward scores, pattern matches. The Q-learning documentation described a sophisticated reward system. All partially fabricated. The orchestra played something, but not what the score required.
Act II: The Inheritance of Lies
Here's what I eventually pieced together:
Task agents ran in isolation. They completed their work, generated impressive reports, and then... vanished. The PostToolUse hook that was supposed to capture learning? It was never connected. The Q-value updates that should have flowed into the database? They calculated rewards and then discarded them.
The agents weren't lying maliciously. They were doing exactly what they were designed to do—complete tasks and report success. Like musicians who play their parts confidently but never check if they're in sync with the rest of the orchestra.
I faced a choice:
- Update all 30 agent definitions to explicitly call `memory_store` and `q_value_update`
- Make remembering automatic—hooks that capture learning without agent cooperation
I chose option 2. Because musicians don't always follow instructions. They claim "My part is complete" and move on. The conductor's job isn't to trust the bow—it's to verify the score was actually played.
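A minimal sketch of what option 2 looks like, assuming a Node hook script registered for the PostToolUse event and the better-sqlite3 package; the payload field names and table schema here are illustrative, not the shipped implementation:

// post-tool-use-hook.ts (illustrative sketch)
// Reads the hook payload from stdin and records a learning experience,
// whether or not the agent remembered to call memory_store itself.
import { readFileSync } from "node:fs";
import Database from "better-sqlite3";

interface HookPayload {
  tool_name?: string;
  tool_input?: unknown;
  tool_response?: { success?: boolean };
}

const payload: HookPayload = JSON.parse(readFileSync(0, "utf8"));

const db = new Database(".agentic-qe/memory.db");
db.prepare(
  `CREATE TABLE IF NOT EXISTS learning_experiences (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     tool TEXT,
     outcome TEXT,
     reward REAL,
     created_at TEXT DEFAULT (datetime('now'))
   )`
).run();

// Crude reward: success = 1, failure = 0. The real system can refine this;
// the point is that the row exists without the agent's cooperation.
const reward = payload.tool_response?.success === false ? 0 : 1;

db.prepare(
  "INSERT INTO learning_experiences (tool, outcome, reward) VALUES (?, ?, ?)"
).run(payload.tool_name ?? "unknown", JSON.stringify(payload.tool_input ?? {}), reward);

db.close();

The design choice that matters: the insert happens in the hook, so the database fills even when the musician skips the instruction.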
Act III: The Nightly-Learner Dream
By December 11th, I had something working that I'd been imagining for months.
What if the orchestra practiced while the conductor slept?
The Nightly-Learner system runs like the brain consolidating memories during sleep. It captures the day's experiences, clusters similar patterns, strengthens successful strategies, and—this is the part that excites me—synthesizes across sections. "What if this test pattern that worked for API validation could help with security scanning?" Like a string technique that might benefit the woodwinds.
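To make "practice while the conductor sleeps" concrete, here's a rough sketch of the consolidation pass. The names (consolidateNightly, Experience, the thresholds) are placeholders I'm using for illustration, not the real code:

// nightly-consolidation.ts (illustrative sketch)
interface Experience {
  domain: string;      // e.g. "api-validation", "security-scanning"
  patternKey: string;  // normalized description of the strategy used
  reward: number;      // 0..1 outcome score captured by the hooks
}

interface ConsolidatedPattern {
  patternKey: string;
  domains: Set<string>;
  meanReward: number;
  uses: number;
}

// Cluster the day's experiences by pattern, average their rewards,
// and surface strong patterns so the synthesis pass can try them
// in domains where they haven't been used yet.
function consolidateNightly(experiences: Experience[]): ConsolidatedPattern[] {
  const byPattern = new Map<string, ConsolidatedPattern>();

  for (const e of experiences) {
    const entry = byPattern.get(e.patternKey) ?? {
      patternKey: e.patternKey,
      domains: new Set<string>(),
      meanReward: 0,
      uses: 0,
    };
    entry.meanReward = (entry.meanReward * entry.uses + e.reward) / (entry.uses + 1);
    entry.uses += 1;
    entry.domains.add(e.domain);
    byPattern.set(e.patternKey, entry);
  }

  // "Strengthen successful strategies": keep only patterns worth replaying.
  return [...byPattern.values()]
    .filter((p) => p.meanReward >= 0.7 && p.uses >= 3)
    .sort((a, b) => b.meanReward - a.meanReward);
}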
This is what I'm committing to test: Deploy the fleet across multiple real projects over the coming months. Measure whether Monday's test generator actually outperforms Friday's. Document the failures as honestly as the successes.
The rehearsal infrastructure is built. The verification is the work ahead.
Act IV: The Contributor
On December 13th, Lalitkumar Bhamare (@fndlalit) submitted a new feature to the Agentic QE Fleet, the project I maintain as my contribution to the Agentic Foundation's open-source initiative.
The feature: AccessibilityAllyAgent. Our new musician in the orchestra.
One billion people globally have disabilities. Most software doesn't work for them. And most teams don't know it until someone files a lawsuit or loses a customer.
The agent scans for WCAG 2.2 compliance, suggests context-aware ARIA fixes (not just "add aria-label" but actual working code), analyzes video content for caption requirements, and maps to the EU Accessibility Act for European market entry.
I haven't verified everything yet. That's the pattern you'll notice—I'm adding musicians faster than I can verify they're playing from the score. The contribution looks solid. The tests pass. But "tests pass" is the phrase that got me into trouble with the ghost database.
What I know for sure: Someone cared enough about accessibility to contribute three days of work to an open-source project. That matters regardless of what verification reveals.
Act V: The 300-Model Question
Here's a question I kept avoiding: Does every part need a Stradivarius?
Claude is excellent. Claude is also expensive. When an agent generates boilerplate test scaffolding, I'm paying virtuoso prices for scales practice.
Version 2.5.0 adds OpenRouter integration—access to 300+ models with automatic routing. Simple parts get simpler instruments. Complex solos get the finest available. The system can swap mid-performance if complexity escalates.
Two claims ride on that:
- A meaningful reduction in API costs
- Equivalent quality across task types

What I've actually verified: the integration works, the routing happens, and the costs are lower on my projects. Whether the quality remains equivalent across different task types—that's another verification debt I'm carrying.
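For the curious, the routing idea is roughly this. A sketch assuming OpenRouter's OpenAI-compatible chat completions endpoint, with placeholder model IDs and a placeholder complexity heuristic rather than the shipped router:

// model-router.ts (sketch; model IDs and the heuristic are examples)
type Complexity = "simple" | "standard" | "complex";

const MODEL_BY_COMPLEXITY: Record<Complexity, string> = {
  simple: "meta-llama/llama-3.1-8b-instruct",   // boilerplate scaffolding
  standard: "openai/gpt-4o-mini",               // routine test generation
  complex: "anthropic/claude-3.5-sonnet",       // the solos
};

function classify(task: string): Complexity {
  // Placeholder heuristic; the real router can escalate mid-task.
  if (task.length < 200) return "simple";
  if (/security|concurrency|architecture/i.test(task)) return "complex";
  return "standard";
}

async function complete(task: string): Promise<string> {
  const model = MODEL_BY_COMPLEXITY[classify(task)];
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: task }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}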
The Lessons That Actually Stuck
Throughout these 8 releases, specific patterns kept repeating until I couldn't ignore them.
"Complete" Often Means "Code Exists Somewhere"
From the brutal-honest-review skill that flagged agents' work:
All claimed "completed" work exists only as uncommitted local changes. Not a single line of the ~9,000 lines of "progress" has been committed to the repository.
The agent wasn't wrong—the code existed. It just wasn't shipped. "Complete" had become disconnected from "delivered." The orchestra took a bow after the dress rehearsal, not the performance.
Tests That Pass CI Can Test Nothing
I found files like this in my own codebase:
it('should handle valid input', async () => {
const response = await handler.handle({ /* valid params */ });
expect(response.success).toBe(true);
expect(response.data).toBeDefined();
});
This test passes. It verifies nothing. The agent who wrote it claimed "comprehensive test coverage." CI agreed—green checkmarks everywhere. Standing ovation for playing the right notes in the wrong order.
I filled 23 stub test files with 19,000+ lines of proper tests. Tests with actual data. Specific assertions. Behavior validation. That work took longer than building the features they test.
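For contrast, the replacement tests look more like this (the handler and the coverage-range scenario are illustrative):

it('rejects coverage reports with overlapping line ranges', async () => {
  const response = await handler.handle({
    file: 'src/payment/refund.ts',
    ranges: [
      { start: 10, end: 42 },
      { start: 40, end: 55 }, // overlaps the previous range
    ],
  });

  // Specific assertions about behavior, not just "it returned something".
  expect(response.success).toBe(false);
  expect(response.error).toMatch(/overlap/i);
  expect(response.data).toBeUndefined();
});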
The Question That Cuts Through
After a week of discoveries like this, I developed a reflex. Whenever an agent claims success, I ask:
"Can you query the database and show me the actual values?"
Not "did you complete the task?" Not "are the tests passing?" Show me the data. Show me the notes on the page. Show me the actual state of the system after your claimed success.
If the answer is hesitation, dig deeper. The conductor who doesn't read the score is just waving a stick.
What I'm Actually Claiming
Let me be precise about what v2.5.0 delivers versus what I hope it delivers:
Verified (I've seen the data):
- Automatic learning capture via PostToolUse hooks—database actually fills now
- Performance optimization from O(n²) to O(n) for coverage analysis
- Binary cache for 6x faster pattern loading
- OpenRouter integration for multi-model routing
Claimed but needs verification:
- That automatic learning actually improves agent performance over time
- That the Nightly-Learner synthesis produces valuable cross-domain insights
- That 70-81% cost savings hold across diverse project types
- That accessibility scanning catches real-world issues I'd miss manually
Unknown:
- Whether these agents will find bugs in your codebase
- Whether the claimed efficiency gains translate to your context
- Whether any of this matters for projects unlike mine
The honest answer to "does this work?" is "it works for me, in my context, for my projects, verified to the degree I've had time to verify." Your mileage will vary. That's not a disclaimer—it's the nature of context-driven practice.
The Live Experiment Continues
On The Quality Forge, I've had "Agent Reasoning Traces" listed as an in-progress experiment since October. The v2.3.0-v2.5.0 releases actually ship the infrastructure for this:
- Learning experiences captured automatically
- Reasoning patterns stored in a searchable bank
- Cross-agent experience sharing via the Nightly-Learner
- Audit trails for every decision
The experiment moves from "building it" to "verifying it works." Over the coming months, I'll document:
- Which patterns actually improve agent performance
- Which cross-domain syntheses produce valuable insights
- Where the system fails and why
- Honest metrics, not marketing metrics
If you're using the Agentic QE Fleet and want to contribute observations, the repo is open. If you find the agents lying to you about completion, please file an issue. I want to know when the orchestra plays off-score.
The Forge Continues
Eight releases in five days sounds impressive. And honestly, the raw throughput of agentic development is genuinely unprecedented.
But throughput isn't the hard part anymore. The hard part is knowing whether what you shipped actually works. The hard part is maintaining the discipline to check the score when the orchestra takes a bow. The hard part is admitting that "tests pass" and "software works" aren't the same thing.
I built systems that claim to learn. Now I need to verify they actually do.
I added a musician who finds accessibility issues. Now I need to verify it catches what humans miss.
I integrated 300 instruments to reduce costs. Now I need to verify that the quality didn't silently degrade.
The forge isn't just where we build. It's where we verify. Where we heat the metal again and again until we know—not believe, not hope, know—that it holds under pressure.
The conductor's real job isn't waving the baton. It's reading the score.
I'm still learning to read it early enough.
Verification command you can run yourself:
# After using any QE agent, confirm learning is captured
sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM learning_experiences"
# Should be > 0. If it's 0, agents are lying to you.
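# Same check for the Q-values that drive reinforcement learning
sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM q_values"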
P.S. Thanks to Lalitkumar Bhamare (@fndlalit) for the AccessibilityAllyAgent, his second contribution to the Agentic QE Fleet. Open source works when people contribute what they know. The Agentic Foundation's initiative depends on practitioners sharing their expertise rather than just consuming frameworks.
Found Your Orchestra Playing Off-Score?
Have your agents been claiming success while delivering empty databases? Share your verification horror stories.