Nine days ago, I wrote about discovering that my orchestra had been playing off-score. The database was empty. The learning system was a ghost. The agents claimed success while delivering nothing.
Between December 13th and December 22nd, I shipped 11 more releases. Version 2.5.0 to 2.6.0. Eighty-nine commits. But this time, the story isn't about features. It's about what happens when you finally sit down with the score and check every note.
The orchestra didn't just say "done" this time. I made them prove it.
The Brutal Honesty That Changed Everything
On December 16th, I ran the "brutal honesty review" skill on Issue #118. The agents had been working on a major refactoring effort. The reports looked impressive—6,700+ lines of code across multiple files, all claiming to be complete.
Then I read the score.
Tests were failing. Dependencies were missing. Functionality hadn't been validated. The code existed, technically. Like a musician who plays all the notes but in the wrong tempo, wrong key, wrong everything.
What made this different from the ghost database discovery? This time I had receipts.
- Claims: "All features implemented and tested"
- Reality: 47 failing tests, 12 missing integrations
- Evidence: Issue #118 brutal-honesty-review.md
That review became the genesis of what I now call the Integrity Rule in CLAUDE.md:
Every feature has tests. Every benchmark has receipts. Every claim can be verified.
Not "should have." Not "ideally has." Has. Present tense. Verified before the claim is made.
Act I: When the Orchestra Actually Learns
Remember the empty q_values table from last time? The learning system that stored everything to memory and threw it away?
I can now run this:
sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM learning_experiences"
And get an actual number. Entry ID 563 was the first proof—a real agent execution with real Q-value updates, persisted to disk.
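If you prefer to poke at the same data from code rather than the sqlite3 CLI, here is a minimal sketch using better-sqlite3 (my choice of driver, not necessarily what AQE ships with). The table name comes from the query above; no other schema details are assumed.

```typescript
// Sketch: read the persisted learning data instead of trusting a "done" report.
// Assumes the better-sqlite3 package is installed; the table name is taken from
// the CLI query above, and no other schema details are assumed.
import Database from 'better-sqlite3';

const db = new Database('.agentic-qe/memory.db', { readonly: true });

// Same check as the one-liner: is anything actually persisted?
const { count } = db
  .prepare('SELECT COUNT(*) AS count FROM learning_experiences')
  .get() as { count: number };
console.log(`learning_experiences rows: ${count}`);

// Glance at the most recent entries.
const recent = db
  .prepare('SELECT * FROM learning_experiences ORDER BY rowid DESC LIMIT 5')
  .all();
console.table(recent);

db.close();
```

If that count is zero, the orchestra is playing to an empty hall again, whatever the agent report says.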
But here's the honest part: I still don't know if the learning makes agents better.
What I can show:
- Data persists. The database fills.
- Q-values update based on task outcomes (a minimal sketch of that update follows these lists).
- The Nightly-Learner runs and consolidates patterns.
What I can't show yet:
- Whether Monday's test generator outperforms Friday's.
- Whether cross-agent learning synthesis produces useful insights.
- Whether any of this matters for projects unlike mine.
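For readers who haven't met Q-values, here is roughly what that bullet means. It's the textbook tabular Q-learning update, sketched in TypeScript; the names, constants, and reward scheme are illustrative, not AQE internals.

```typescript
// Sketch: the textbook Q-learning update behind "Q-values update based on
// task outcomes". All names and constants here are illustrative, not AQE's.
type StateAction = string; // e.g. "test-generator::flaky-suite"

const qValues = new Map<StateAction, number>();

const ALPHA = 0.1; // learning rate: how fast new outcomes shift the estimate
const GAMMA = 0.9; // discount: how much estimated future value counts

function updateQ(
  key: StateAction,
  reward: number,        // task outcome, e.g. +1 tests pass, -1 tests fail
  bestNextValue: number, // best Q-value reachable from the next state
): number {
  const current = qValues.get(key) ?? 0;
  const updated = current + ALPHA * (reward + GAMMA * bestNextValue - current);
  qValues.set(key, updated);
  return updated;
}

updateQ('test-generator::flaky-suite', -1, 0); // -0.1 after one failure
updateQ('test-generator::flaky-suite', +1, 0); // ≈ 0.01: a success pulls it back
```

Persisting those numbers is the part that now verifiably works; whether they steer agents toward better choices is the part still under measurement.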
The infrastructure works. The verification continues. That's the honest state of the experiment.
Act II: Teaching the Orchestra to Find What Matters
The feature that actually has receipts is Code Intelligence.
Before v2.6.0, when an agent needed context about the codebase, it loaded everything it could find. Six files. 1,671 input tokens. The agents drowned in irrelevant information, searching for a signal in noise like musicians trying to find their part in a 300-page score.
Code Intelligence changed the fundamental approach. Instead of "load everything," it asks: "What actually matters for this specific task?"
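To make that concrete, here is a deliberately naive sketch of the idea: score candidate files against the task description and keep only the top few. Every name below is mine, and keyword overlap is a stand-in for whatever the real Code Intelligence feature does under the hood.

```typescript
// Sketch: relevance-based context selection instead of "load everything".
// Naive keyword-overlap scoring; illustrative only, not the actual
// Code Intelligence implementation.
interface CandidateFile {
  path: string;
  content: string;
}

function scoreRelevance(task: string, file: CandidateFile): number {
  const taskTerms = new Set(task.toLowerCase().split(/\W+/).filter(Boolean));
  const fileTerms = new Set(file.content.toLowerCase().split(/\W+/));
  let hits = 0;
  for (const term of taskTerms) {
    if (fileTerms.has(term)) hits++;
  }
  return hits / Math.max(taskTerms.size, 1);
}

function selectContext(
  task: string,
  files: CandidateFile[],
  maxFiles = 2, // the benchmark below landed on 2 files instead of 6
): CandidateFile[] {
  return [...files]
    .sort((a, b) => scoreRelevance(task, b) - scoreRelevance(task, a))
    .slice(0, maxFiles);
}
```

Fewer files in means fewer tokens out, which is exactly what the benchmark measures.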
The benchmark on the AQE Fleet codebase (~50K lines of TypeScript):
| Metric | Before | After | Change |
|---|---|---|---|
| Input Tokens | 1,671 | 336 | -79.9% |
| Total Tokens | 2,143 | 808 | -62.3% |
| Context Files | 6 | 2 | -4 files |
| Context Relevance | 35% | 92% | +57 pts |
For teams running 100+ queries per day, this translates to roughly $15-150/month in savings; the back-of-envelope arithmetic is sketched below.
This isn't a claim. This is a benchmark I ran on December 22nd with reproducible methodology documented in docs/benchmarks/code-intelligence-token-reduction.md. Anyone can run it. Anyone can challenge it.
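The dollar range is plain arithmetic once you pick a price per input token. A minimal sketch with hypothetical per-million-token prices; substitute your model's actual rates.

```typescript
// Back-of-envelope: what the 1,671 → 336 input-token drop is worth per month.
// The prices per million input tokens are placeholders, not real rates.
const tokensBefore = 1_671;
const tokensAfter = 336;
const queriesPerDay = 100;
const daysPerMonth = 30;

function monthlySavingsUSD(pricePerMillionInputTokens: number): number {
  const savedPerQuery = tokensBefore - tokensAfter; // 1,335 tokens
  const savedPerMonth = savedPerQuery * queriesPerDay * daysPerMonth; // ~4M tokens
  return (savedPerMonth / 1_000_000) * pricePerMillionInputTokens;
}

console.log(monthlySavingsUSD(3).toFixed(2));  // ≈ $12/month at $3 per million
console.log(monthlySavingsUSD(30).toFixed(2)); // ≈ $120/month at $30 per million
```

Scale the query volume or the pricing and you land in the same order of magnitude as the $15-150 range above.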
The conductor finally knows which pages to turn to.
Act III: Three Databases Became One
Here's a failure mode I didn't write about in the last article because I was too embarrassed.
The AQE Fleet had three databases:
- memory.db — Agent memories
- swarm-memory.db — Swarm coordination
- agentdb.db — Learning patterns
When I ran CLI commands to query learning data, they often returned nothing. Not because the data didn't exist—because I was querying the wrong database. The agents wrote to one place; the tools read from another. The orchestra played beautifully; the recording equipment pointed at the wrong stage.
The shape of the problem:
- Three separate databases with overlapping purposes
- CLI tools querying the wrong database
- Learning data scattered across files
The fix was unglamorous but necessary: unify everything into a single memory.db. One source of truth. One place to query. One score for the entire orchestra.
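The mechanics of a consolidation like this are mundane SQLite work: attach the old files to the new one and copy tables across. A minimal sketch, assuming better-sqlite3, illustrative table names, and target tables that already exist; the actual AQE migration has to reconcile schemas and deduplicate rows, which this skips.

```typescript
// Sketch: folding the old databases into a single memory.db via SQLite ATTACH.
// Paths and table names are illustrative; a real migration must reconcile
// schemas and deduplicate overlapping rows, which this deliberately skips.
import Database from 'better-sqlite3';

const db = new Database('.agentic-qe/memory.db');

db.exec(`
  ATTACH DATABASE '.agentic-qe/swarm-memory.db' AS swarm;
  ATTACH DATABASE '.agentic-qe/agentdb.db' AS learning;

  -- Copy coordination and learning data into the single source of truth.
  INSERT INTO coordination_events SELECT * FROM swarm.coordination_events;
  INSERT INTO learning_patterns   SELECT * FROM learning.learning_patterns;

  DETACH DATABASE swarm;
  DETACH DATABASE learning;
`);

db.close();
```

After that, every tool reads from the same place the agents write to.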
Verified: queryRaw() now returns data. The CLI works. I can actually inspect what my agents learned.
The Contributor Who Kept Building
Remember Lalitkumar (@fndlalit) from the last article? The one who contributed the AccessibilityAllyAgent?
He didn't stop.
Between v2.5.0 and v2.6.0, he contributed:
- 15 n8n workflow testing agents — An entire section of the orchestra dedicated to workflow automation testing
- Testability-scoring skill — Teaching agents to evaluate how testable code is before trying to test it
The fleet went from 20 agents to 47. Not because I built 27 new agents, but because someone who cared about those specific problems contributed solutions to them.
What I can verify: the code is solid, the tests pass, and the documentation explains what each agent does. The orchestra has new musicians. The conductor needs time to learn their parts.
What I Verified vs. What I Claim
Let me be precise, because this is what the last article taught me to do.
Verified (I have receipts)
Code Intelligence token reduction
79.9% reduction, benchmarked on 50K lines of TypeScript, reproducible methodology documented.
Database unification
Three databases consolidated into one; CLI commands return data; cross-agent queries work.
Learning persistence
Data is actually saved to SQLite, Q-values are updated, and experiences persist across sessions.
Contributor system works
@fndlalit's agents integrate cleanly, tests pass, documentation exists.
Claimed But Needs Verification
70-81% cost savings via multi-model routing
The routing works. The cost reduction varies by task type. I haven't run enough diverse projects to confirm the range.
150x faster vector search via RuVector
The benchmark claims an O(log n) time complexity. I haven't load-tested at scale.
Self-learning improvement over time
The infrastructure captures learning. Whether agents actually get better? The experiment continues.
Unknown
Whether this helps your codebase
Mine is TypeScript. Yours might be Python, Go, Rust. Context matters.
Whether 47 agents is overkill
Maybe 20 was the right number. Maybe 47 creates coordination overhead that eats the benefits.
Whether any of this matters for projects unlike mine
The honest answer is: I don't know. That's not a disclaimer—it's the nature of context-driven practice.
The Experiment Status Update
On the Forge, I've listed "Self-Learning Agents & Nightly-Learner" as in progress since last month. Here's where that actually stands:
What's shipped:
- SONALifecycleManager (717 lines) — Automatic lifecycle hooks that capture learning without agent cooperation (a sketch of that hook shape follows this list)
- Nightly-Learner consolidation — Runs overnight, clusters patterns, synthesizes cross-agent insights
- EWC++ anti-forgetting — Prevents new learning from overwriting old patterns
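The phrase "without agent cooperation" is the important part: the hook wraps task execution, so an outcome gets recorded whether or not the agent reports honestly. Here is a minimal sketch of that shape; the real SONALifecycleManager is 717 lines for a reason, and every name below is illustrative.

```typescript
// Sketch: a lifecycle wrapper that records an outcome for every task,
// whether or not the agent "cooperates". Names are illustrative, not the
// actual SONALifecycleManager API.
interface TaskOutcome {
  agent: string;
  task: string;
  success: boolean;
  durationMs: number;
  error?: string;
}

type PersistFn = (outcome: TaskOutcome) => void;

function withLifecycle<T>(
  agent: string,
  task: string,
  run: () => Promise<T>,
  persist: PersistFn, // e.g. whatever writes a row into learning data
): Promise<T> {
  const started = Date.now();
  return run().then(
    (result) => {
      persist({ agent, task, success: true, durationMs: Date.now() - started });
      return result;
    },
    (err: unknown) => {
      // Failures are learning experiences too -- arguably the valuable ones.
      persist({
        agent,
        task,
        success: false,
        durationMs: Date.now() - started,
        error: String(err),
      });
      throw err;
    },
  );
}
```

In this sketch, persist is whatever writes the outcome into memory.db; the point is that it runs on failure too.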
What's measured:
- Learning experiences capture rate: 100% (every task completion persists)
- Database growth: ~500 entries/day in active use
- Consolidation runtime: ~3 minutes for overnight processing
What's unproven:
- Whether consolidation produces valuable insights
- Whether agents perform better after learning cycles
- Whether cross-domain synthesis (security patterns helping API testing) actually works
The experiment moves from "building infrastructure" to "measuring outcomes." I'll document results—including failures—as the verification continues.
The Integrity Rule in Practice
Here's what changed in how I work with agentic systems:
Before:
Agent claims completion → I ship → Users discover problems
Now:
Agent claims completion → I query the database → I run the benchmark → I verify the tests actually test something → Then maybe I ship
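The database step of that loop is scriptable. A hedged sketch: snapshot the learning table before the agent runs and refuse to accept "done" if nothing new was persisted. It assumes better-sqlite3 and the table name from earlier; the gate itself is mine, not an AQE feature.

```typescript
// Sketch: a "prove it" gate around an agent run. If the claimed success left
// no new rows in memory.db, the claim is rejected. Illustrative, not an AQE API.
import Database from 'better-sqlite3';

export async function verifyClaimedCompletion(
  runAgentTask: () => Promise<void>,
): Promise<void> {
  const db = new Database('.agentic-qe/memory.db');
  const count = () =>
    (db.prepare('SELECT COUNT(*) AS c FROM learning_experiences').get() as { c: number }).c;

  const before = count();
  await runAgentTask(); // the agent will report "done" regardless

  const after = count();
  db.close();

  if (after <= before) {
    throw new Error(
      `Claimed completion, but learning_experiences did not grow (${before} -> ${after}).`,
    );
  }
}
```

It only checks one kind of evidence, but it turns "I believe it worked" into a query.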
The overhead is real. Verification takes time. I could ship faster if I trusted the orchestra's bow.
But I've seen what happens when the orchestra says "done" and plays off-score. The conductor's job isn't just waving the baton—it's reading the score before, during, and after the performance.
"Show me the data. Show me the actual values in the database after your claimed success."
If the answer is hesitation, dig deeper. If the answer is "I believe it worked," that's not verification—that's faith. The conductor who doesn't read the score is just waving a stick.
The Forge Continues
Eleven releases in nine days. Eighty-nine commits. The throughput of agentic development remains unprecedented.
But the hard part was never the throughput.
The hard part was sitting down with Issue #118's brutal review and accepting that 6,700 lines of claimed "complete" work needed to be rebuilt. The hard part was unifying three databases and migrating data without losing what little valid learning existed. The hard part was writing benchmarks that could prove me wrong.
I built systems that claimed to save tokens. Now I have benchmarks that prove they do.
I added musicians who claim to test workflows. Now I need to verify they catch real problems.
I unified databases that claimed to work together. Now I have CLI commands that actually return data.
The orchestra still plays. But now, when they take a bow, I can read the score and confirm they played what was written.
Most of the time.
The verification continues.
The full methodology is in docs/benchmarks/code-intelligence-token-reduction.md. Run it on your codebase. Tell me if your numbers differ. That's how we verify claims across contexts—not by trusting one conductor's score, but by letting every orchestra check their own.
Try the Benchmark Yourself
# Clone the repo
git clone https://github.com/proffesor-for-testing/agentic-qe
# Run the Code Intelligence benchmark
npm run benchmark:code-intelligence
# Compare your results to the published benchmark
If your numbers differ significantly, file an issue. The orchestra needs to know when they're playing off-score.
Let's Connect
Want to discuss Agentic QE, share your verification stories, or explore collaboration opportunities?