The Five-Release Journey Where I Forgot to Be a Tester

Let me tell you about the five releases where my agents claimed their learning system was "100% complete" while it silently forgot everything it ever learned. It's a story about the gap between what tests can tell you and what actually matters.

The Symphony That Played in Silence

Picture an orchestra. Every musician is in place. They're playing their parts perfectly—violins, cellos, percussion, all synchronized. The sheet music is flawless. The conductor is confident.

There's just one problem: the speakers aren't plugged in.

That was my learning system between October 26 and November 12, 2025.

I had built a comprehensive Q-learning reinforcement system with my agents. Experience replay buffer? Check. Pattern bank with 85% matching accuracy? Check. All 85 tests passing? Green across the board. Documentation declaring "Learning System - 100% Complete"? Proudly shipped.

What I didn't have: any way for my agents to remember what they'd learned once they restarted.

It took me 17 days and 5 releases to figure that out.

The Version That Didn't Exist

Here's what my agents reported across those five releases, and I did not check the reality:

v1.3.4: "Q-learning 100% complete"
Reality: Q-values lived in a memory Map that vanished every restart

v1.3.5: "87.5%+ success rate"
Reality: No data to measure. The database table was empty.

v1.4.0: "Agent memory infrastructure deployed"
Reality: Infrastructure yes, data no. I'd built the recording studio but forgot to press record.

v1.5.0: "13 domains, 27K lines of code"
Reality: This one actually worked! Just... not the learning part.

v1.5.1: "Hybrid learning complete"
Reality: Still no persistence. Schema bugs hiding in plain sight.

v1.6.0: "Learning persistence fixed"
Reality: FINALLY. First time it actually worked.

The pattern? I kept shipping infrastructure changes and assuming they worked. Documentation followed passing tests, not actual usage verification.

I was like a builder declaring a house complete after installing all the pipes, never bothering to turn on the water to check if anything flowed.

What 85 Tests Can't Tell You

The cruel irony: I had comprehensive tests. They all passed.

I tested that my recordExperience() method could write to the database. Green checkmark.

What I didn't test: whether my agents actually called recordExperience() in production.

They didn't.

Instead, agents called learnFromExecution(), which updated an in-memory Map and never touched the database. It was like having a journal that you write in every night, then burn in the morning.

First lesson learned: Tests validate what you test, not what users need. I tested the infrastructure. I should have tested the journey.

The Five-Version Deception

What made this sting was the parallel success. While learning silently failed, everything else worked beautifully:

v1.4.0: We built 16 specialized QE agents. They executed tasks flawlessly.
v1.5.0: 27,000 lines of code across 13 domains. All functional.
v1.6.0: 8 new TDD subagents, 980+ lines of testing guides.

The fleet was growing. The orchestra was expanding. More instruments, more musicians, better coordination.

Just... still no sound system.

This created a dangerous illusion: Surely if all this other complex stuff works, the learning must work too, right?

Wrong.

Second lesson: Partial success masks total failure. When some features shine, it's easy to assume everything shines.

When I Finally Put My Tester Hat Back On

On November 12, I finally did something I should have done five releases earlier: I actually tested the learning tools. Not with Claude Code - that couldn't use MCP tools from its context. With Roo Code, an alternative MCP client.

I tried something simple: query the Q-values database.

The response: SqliteError: no such column: updated_at

Wait. What?

I dug in. My query handler was looking for a column called updated_at. My database schema had a column called last_updated. Different files. Different agent developers. No schema reference. No cross-check.

Then I tried querying patterns. I could see three rows in the database with my own eyes, but the tool returned an empty array.

More digging. My code was checking for a table called test_patterns. The actual table was named patterns. Agents were looking in the wrong room for something sitting in plain sight next door.

Both bugs took twenty minutes to fix. Just column name changes.

But they revealed something deeper: I'd been shipping features without actually testing them from a user's perspective.

Third lesson: You can't catch integration bugs if you never test the integration. And I know better than this.

The Irony of the Quality Engineer Who Forgot to Test

Here's the part that stings: I'm a quality engineering professional. I've spent 12 years building QE practices. I teach this stuff. I evangelize context-driven testing and the Holistic Testing Model.

And I shipped broken features for five releases without properly testing them.

What happened?

I was working on multiple projects in parallel. My attention was split. The Agentic QE Fleet was just one of several things demanding focus. I'd check that tests passed, glance at the output, see green checkmarks, and move on to the next urgent thing.

I forgot to put my tester's hat on. The one that asks: "But does it actually work the way a user would try to use it?"

The release verification policy was there. I wrote it. I just didn't follow it with enough diligence. I didn't take the time to dig deeper. I didn't validate from the user's perspective. I trusted the infrastructure and the passing tests.

This is the danger zone every developer knows but rarely admits: when you're spread too thin, you cut corners you know you shouldn't cut. You tell yourself "the tests are passing, it's probably fine."

It wasn't fine. And I knew how to catch it — I just didn't do the work.

Fourth lesson: Knowing best practices and following them are two different things. Especially when your attention is fragmented.

The Bottom Line

If there's one thing this 17-day journey taught me, it's this: confidence without verification is just hope with better PR.

Beyond the four lessons already covered, here are four more hard truths I learned:

Policies without enforcement are wishes. Build forcing functions that make skipping verification painful.
Honesty builds trust faster than optimism. "Not ready yet" is braver than "complete" when you haven't verified.
Test coverage is a vanity metric. What matters is coverage of actual user journeys.
Use agents to automate diligence you can't maintain. Teach them to catch what you miss when human attention wavers.

What I'm Changing

Starting with v1.6.1:

Mandatory end-to-end validation before every release. If it's a user-facing feature, I test it like a user would—with Roo Code or whatever tool is needed.
Cross-session persistence tests in every commit. If data is supposed to survive restarts, CI/CD verifies it does automatically.
Real database queries in integration tests. No mocks for persistence tests. Only real metal, real queries, real validation.
Agent-driven validation workflows in CI/CD. QE agents test features from user perspective and fail builds when validation fails.
Concrete verification commands that can't be skipped. Not "verify learning works" but "run these exact commands and expect these exact results or the build fails."
Honest feature status in documentation. Working. Experimental. Planned. No more "100% complete" until I've personally verified it works from user perspective.

This isn't just about better testing. It's about building the forcing functions that catch my mistakes when I'm spread too thin to catch them myself.

Teaching Agents to Catch What I Miss

Here's the real question this journey raised: How do I prevent this from happening again?

I can write better verification checklists. I can add more forcing functions. I can promise myself I'll be more diligent next time.

But the truth is, I'll get stretched thin again. I'll work on multiple projects in parallel again. My attention will fragment again.

So instead of relying on my own consistency, I need to teach my agents to help me catch these problems proactively.

That's my next focus: CI/CD integration with QE agents as gatekeepers.

The vision:

Pre-release validation workflows that agents execute automatically
QE agents that test from user perspective, not just developer perspective
Automated verification of claimed features against actual behavior
Agents that block releases when validation fails, not just warn

Instead of me remembering to run five database queries, I want agents to run them automatically and fail the build if they return empty. Instead of me manually testing MCP tools, I want agents to exercise every tool against real databases and report what works and what doesn't.

The goal isn't to remove human oversight—it's to catch problems before they get to human oversight. Let agents handle the systematic verification I kept forgetting to do.

This is what Agentic QE should really mean: agents as persistent quality guardians who don't get tired, don't get distracted, and don't forget their verification checklists when working on three projects at once.

Final lesson: Use agents to automate the diligence you can't maintain manually. Teach them to catch what you miss when human attention inevitably wavers.

One Year From Now

I want to look back at this story and smile. Not because I fixed the bugs — that's table stakes. But because this was the moment I stopped trusting my own diligence and started teaching agents to verify what I claim.

I want future Dragan to read this and think: "Yeah, that was when I admitted I'm human, I get stretched thin, and I need agents to catch what I miss."

I want the CI/CD validation workflows to be so robust that broken features literally can't ship, even when I'm juggling three projects and forget to be thorough.

And I want other solo founders, other quality engineers, other people building in public to read this and recognize themselves. Not in the bugs — those are easy to relate to. In the gap between knowing and doing. In the shortcuts you take when you're spread too thin. In the irony of being quality professionals who sometimes ship without enough quality.

Because if I can admit that publicly, maybe others can too. And maybe you can build better forcing functions to catch yourself when you're about to make the same mistake again.

Your Turn

So here's my question for you: What have you shipped that you knew you should test better, but didn't have the time or attention to validate properly?

What was your "I'm a quality professional but I'm only human" moment?

And more importantly: what forcing functions did you build to catch yourself next time?

I'd love to hear your stories. Because I've got a feeling I'm not the only quality engineer who's ever been their own worst anti-pattern.

Let's learn from each other's mistakes. And then let's teach our agents to save us from ourselves.

That's what the forge is for.

The Agentic QE Fleet project is open source at github.com/proffesor-for-testing/agentic-qe. Come see how I build quality engineering tools while learning quality engineering lessons.

All commits referenced: Bug diagnosis v1.4.2 (86ccaa6), Final fixes v1.6.0 (6133369). Full technical details in docs/MCP-LEARNING-TOOLS-FIXES.md.

Related from The Quality Forge:

Show Me the Data: Release 1.4.2's Hidden Flaws

When stub tests pass CI but test nothing, and agents report success on broken code.

What is Agentic QE? (And Why PACT Matters)

Moving from testing-as-activity to agents-as-orchestrators with PACT principles.

Join the Discussion

Have your own story about the gap between knowing and doing? Let's talk about building better forcing functions to catch ourselves.

Join a Meetup Email Me