
The Conductor Finally Reads the Score

When verification becomes a feature.

Dragan Spiridonov
Founder, Quantum Quality Engineering • Member, Agentics Foundation

Nine days ago, I wrote about discovering that my orchestra had been playing off-score. The database was empty. The learning system was a ghost. The agents claimed success while delivering nothing.

Between December 13th and December 22nd, I shipped 11 more releases. Version 2.5.0 to 2.6.0. Eighty-nine commits. But this time, the story isn't about features. It's about what happens when you finally sit down with the score and check every note.

The orchestra didn't just say "done" this time. I made them prove it.


The Brutal Honesty That Changed Everything

On December 16th, I ran the "brutal honesty review" skill on Issue #118. The agents had been working on a major refactoring effort. The reports looked impressive—6,700+ lines of code across multiple files, all claiming to be complete.

Then I read the score.

Tests were failing. Dependencies were missing. Functionality hadn't been validated. The code existed, technically. Like a musician who plays all the notes but in the wrong tempo, wrong key, wrong everything.

What made this different from the ghost database discovery? This time I had receipts.

Completion Theater Detected:
  • Claims: "All features implemented and tested"
  • Reality: 47 failing tests, 12 missing integrations
  • Evidence: Issue #118 brutal-honesty-review.md

That review became the genesis of what I now call the Integrity Rule in CLAUDE.md:

Every feature has tests. Every benchmark has receipts. Every claim can be verified.

Not "should have." Not "ideally has." Has. Present tense. Verified before the claim is made.


Act I: When the Orchestra Actually Learns

Remember the empty q_values table from last time? The learning system that stored everything to memory and threw it away?

I can now run this:

sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM learning_experiences"

And get an actual number. Entry ID 563 was the first proof—a real agent execution with real Q-value updates, persisted to disk.
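
As a sanity check, a small script along these lines confirms that both tables mentioned here are filling. This is a minimal sketch assuming a Node SQLite driver such as better-sqlite3, not the fleet's own tooling; only the table names come from what's described in this article.

// spot-check-learning.ts -- a minimal sketch, not the fleet's actual tooling.
// Assumes better-sqlite3 is installed; only the table names come from the article.
import Database from "better-sqlite3";

const db = new Database(".agentic-qe/memory.db", { readonly: true });

for (const table of ["learning_experiences", "q_values"]) {
  const row = db.prepare(`SELECT COUNT(*) AS count FROM ${table}`).get() as { count: number };
  console.log(`${table}: ${row.count} rows`);
}

db.close();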

But here's the honest part: I still don't know if the learning makes agents better.

What I verified:
  • Data persists. The database fills.
  • Q-values update based on task outcomes.
  • The Nightly-Learner runs and consolidates patterns.
What I haven't verified:
  • Whether Monday's test generator outperforms Friday's.
  • Whether cross-agent learning synthesis produces useful insights.
  • Whether any of this matters for projects unlike mine.

The infrastructure works. The verification continues. That's the honest state of the experiment.


Act II: Teaching the Orchestra to Find What Matters

The feature that actually has receipts is Code Intelligence.

Before v2.6.0, when an agent needed context about the codebase, it loaded everything it could find. Six files. 1,671 input tokens. The agents drowned in irrelevant information, searching for a signal in noise like musicians trying to find their part in a 300-page score.

Code Intelligence changed the fundamental approach. Instead of "load everything," it asks: "What actually matters for this specific task?"
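
I won't pretend to reproduce the internals here, so treat the following as a conceptual sketch of that question rather than the actual Code Intelligence code; every name in it is hypothetical. The idea: score candidate context files against the task and keep only the top few.

// Hypothetical context-selection sketch -- not the real Code Intelligence module.
interface ContextFile {
  path: string;
  summary: string; // e.g. exported symbols and doc comments
}

// Naive relevance: fraction of task keywords that appear in the file's path or summary.
function relevance(task: string, file: ContextFile): number {
  const keywords = task.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const text = `${file.path} ${file.summary}`.toLowerCase();
  const hits = keywords.filter((w) => text.includes(w)).length;
  return keywords.length === 0 ? 0 : hits / keywords.length;
}

// Instead of "load everything", rank the candidates and keep only the top few.
export function selectContext(task: string, files: ContextFile[], limit = 2): ContextFile[] {
  return [...files]
    .map((file) => ({ file, score: relevance(task, file) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.file);
}

The real implementation is surely more sophisticated than keyword overlap, but the shape is the same: rank the candidates, keep the top two, and stop paying for the rest.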

The benchmark on the AQE Fleet codebase (~50K lines of TypeScript):

Metric               Before   After   Change
Input Tokens         1,671    336     -79.9%
Total Tokens         2,143    808     -62.3%
Context Files        6        2       -4
Context Relevance    35%      92%     +57 pts

At 100 queries/day, that works out to roughly $12/month in savings; for teams running 100+ queries/day, it translates to $15-150/month.

This isn't a claim. This is a benchmark I ran on December 22nd with reproducible methodology documented in docs/benchmarks/code-intelligence-token-reduction.md. Anyone can run it. Anyone can challenge it.

The conductor finally knows which pages to turn to.


Act III: Three Databases Became One

Here's a failure mode I didn't write about in the last article because I was too embarrassed.

The AQE Fleet had three databases:

  • memory.db — Agent memories
  • swarm-memory.db — Swarm coordination
  • agentdb.db — Learning patterns

When I ran CLI commands to query learning data, they often returned nothing. Not because the data didn't exist—because I was querying the wrong database. The agents wrote to one place; the tools read from another. The orchestra played beautifully; the recording equipment pointed at the wrong stage.

Database Fragmentation:
  • 3 separate databases with overlapping purposes
  • CLI tools query wrong database
  • Learning data scattered across files

The fix was unglamorous but necessary: unify everything into a single memory.db. One source of truth. One place to query. One score for the entire orchestra.
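
The mechanics are mundane: attach the legacy databases to the unified one and copy tables across. A sketch of the approach, assuming SQLite's ATTACH and using placeholder table names rather than the fleet's real schema:

// unify-databases.ts -- illustrative only; table names are placeholders, and the
// destination tables are assumed to already exist in memory.db.
import Database from "better-sqlite3";

const db = new Database(".agentic-qe/memory.db");

// Attach the legacy databases so their tables can be read from one connection.
db.exec(`ATTACH DATABASE '.agentic-qe/swarm-memory.db' AS swarm`);
db.exec(`ATTACH DATABASE '.agentic-qe/agentdb.db' AS agentdb`);

// Copy each legacy table into the unified store (placeholder names).
db.exec(`INSERT INTO coordination_events SELECT * FROM swarm.coordination_events`);
db.exec(`INSERT INTO learning_patterns SELECT * FROM agentdb.learning_patterns`);

db.exec(`DETACH DATABASE swarm`);
db.exec(`DETACH DATABASE agentdb`);
db.close();

A real migration has to map each legacy table explicitly and deduplicate overlapping rows; the point is that the destination is a single file.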

Verified: queryRaw() now returns data. The CLI works. I can actually inspect what my agents learned.


The Contributor Who Kept Building

Remember Lalitkumar (@fndlalit) from the last article? The one who contributed the AccessibilityAllyAgent?

He didn't stop.

Between v2.5.0 and v2.6.0, he contributed:

  • 15 n8n workflow testing agents — An entire section of the orchestra dedicated to workflow automation testing
  • Testability-scoring skill — Teaching agents to evaluate how testable code is before trying to test it

The fleet went from 20 agents to 47. Not because I built 27 new agents, but because someone who cared about those specific problems contributed solutions to them.

The honest uncertainty: I haven't verified everything he built. The n8n agents work in his environment, with his workflows. Do they work for your n8n setup? Unknown. That's the nature of community contributions—they solve real problems for real people, and the verification debt is passed along to anyone who adopts them.

What I can verify: the code is solid, the tests pass, and the documentation explains what each agent does. The orchestra has new musicians. The conductor needs time to learn their parts.


What I Verified vs. What I Claim

Let me be precise, because this is what the last article taught me to do.

Verified (I have receipts)

Code Intelligence token reduction

79.9% reduction, benchmarked on 50K lines of TypeScript, reproducible methodology documented.

Database unification

Three databases consolidated into one; CLI commands return data; and cross-agent queries work.

Learning persistence

Data is actually saved to SQLite, Q-values are updated, and experiences persist across sessions.

Contributor system works

@fndlalit's agents integrate cleanly, tests pass, documentation exists.

Claimed But Needs Verification

70-81% cost savings via multi-model routing

The routing works. The cost reduction varies by task type. I haven't run enough diverse projects to confirm the range.

150x faster vector search via RuVector

The benchmark claims O(log n) time complexity. I haven't load-tested it at scale.

Self-learning improvement over time

The infrastructure captures learning. Whether agents actually get better? The experiment continues.

Unknown

Whether this helps your codebase

Mine is TypeScript. Yours might be Python, Go, Rust. Context matters.

Whether 47 agents is overkill

Maybe 20 was the right number. Maybe 47 creates coordination overhead that eats the benefits.

Whether any of this matters for projects unlike mine

The honest answer is: I don't know. That's not a disclaimer—it's the nature of context-driven practice.


The Experiment Status Update

On the Forge, I've listed "Self-Learning Agents & Nightly-Learner" as in progress since last month. Here's where that actually stands:

What's shipped:

  • SONALifecycleManager (717 lines) — Automatic lifecycle hooks that capture learning without agent cooperation (a rough sketch of the hook idea follows this list)
  • Nightly-Learner consolidation — Runs overnight, clusters patterns, synthesizes cross-agent insights
  • EWC++ anti-forgetting — Prevents new learning from overwriting old patterns
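
To make "without agent cooperation" concrete, here is a hypothetical shape for such a hook; every name and column below is invented for illustration, not the real SONALifecycleManager API.

// Hypothetical lifecycle hook -- not the actual SONALifecycleManager API; column names are invented.
import Database from "better-sqlite3";

interface TaskOutcome {
  agentId: string;
  taskType: string;
  success: boolean;
  durationMs: number;
}

export class LifecycleRecorder {
  private db = new Database(".agentic-qe/memory.db");

  // The runtime calls this after every task, so individual agents never have to opt in.
  onTaskComplete(outcome: TaskOutcome): void {
    this.db
      .prepare(
        `INSERT INTO learning_experiences (agent_id, task_type, success, duration_ms, created_at)
         VALUES (?, ?, ?, ?, datetime('now'))`
      )
      .run(outcome.agentId, outcome.taskType, outcome.success ? 1 : 0, outcome.durationMs);
  }
}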

What's measured:

  • Learning experiences capture rate: 100% (every task completion persists)
  • Database growth: ~500 entries/day in active use
  • Consolidation runtime: ~3 minutes for overnight processing

What's unproven:

  • Whether consolidation produces valuable insights
  • Whether agents perform better after learning cycles
  • Whether cross-domain synthesis (security patterns helping API testing) actually works

The experiment moves from "building infrastructure" to "measuring outcomes." I'll document results—including failures—as the verification continues.


The Integrity Rule in Practice

Here's what changed in how I work with agentic systems:

Before

Agent claims completion → I ship → Users discover problems

After

Agent claims completion → I query the database → I run the benchmark → I verify the tests actually test something → Then maybe I ship
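
In practice that gate is a short script that refuses to celebrate until the evidence exists. A rough sketch, with the test command, thresholds, and file paths standing in for whatever your project actually uses:

// verify-before-ship.ts -- a rough sketch of the gate; commands and thresholds are illustrative.
import { execSync } from "node:child_process";
import Database from "better-sqlite3";

function run(cmd: string): void {
  console.log(`$ ${cmd}`);
  execSync(cmd, { stdio: "inherit" }); // throws if the command exits non-zero
}

// 1. The tests must actually run and pass.
run("npm test");

// 2. The benchmark must produce fresh receipts.
run("npm run benchmark:code-intelligence");

// 3. The learning database must contain real data, not an empty shell.
const db = new Database(".agentic-qe/memory.db", { readonly: true });
const { count } = db
  .prepare("SELECT COUNT(*) AS count FROM learning_experiences")
  .get() as { count: number };
if (count === 0) {
  throw new Error("learning_experiences is empty -- completion theater detected");
}

console.log("Evidence found. Now maybe ship.");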

The overhead is real. Verification takes time. I could ship faster if I trusted the orchestra's bow.

But I've seen what happens when the orchestra says "done" and plays off-score. The conductor's job isn't just waving the baton—it's reading the score before, during, and after the performance.

"Show me the data. Show me the actual values in the database after your claimed success."

If the answer is hesitation, dig deeper. If the answer is "I believe it worked," that's not verification—that's faith. The conductor who doesn't read the score is just waving a stick.


The Forge Continues

Eleven releases in nine days. Eighty-nine commits. The throughput of agentic development remains unprecedented.

But the hard part was never the throughput.

The hard part was sitting down with Issue #118's brutal review and accepting that 6,700 lines of claimed "complete" work needed to be rebuilt. The hard part was unifying three databases and migrating data without losing what little valid learning existed. The hard part was writing benchmarks that could prove me wrong.

I built systems that claimed to save tokens. Now I have benchmarks that prove they do.

I added musicians who claim to test workflows. Now I need to verify they catch real problems.

I unified databases that claimed to work together. Now I have CLI commands that actually return data.

The orchestra still plays. But now, when they take a bow, I can read the score and confirm they played what was written.

Most of the time.

The verification continues.


P.S. The Code Intelligence benchmark is available in the repo at docs/benchmarks/code-intelligence-token-reduction.md. Run it on your codebase. Tell me if your numbers differ. That's how we verify claims across contexts—not by trusting one conductor's score, but by letting every orchestra check their own.

Try the Benchmark Yourself

# Clone the repo
git clone https://github.com/proffesor-for-testing/agentic-qe

# Run the Code Intelligence benchmark
npm run benchmark:code-intelligence

# Compare your results to the published benchmark

If your numbers differ significantly, file an issue. The orchestra needs to know when they're playing off-score.

Let's Connect

Want to discuss Agentic QE, share your verification stories, or explore collaboration opportunities?