Show Me the Data: How One Question Exposed Release 1.4.2's Hidden Flaws
Or: What happens when you ask AI agents to "show me the data" - a story about the relentless pursuit of truth in AI-powered quality engineering.
The Illusion of Completeness
We thought we were done. Release 1.4.2 of our Agentic QE Fleet looked ready to ship:
- Specialized QE agents working across the codebase
- Q-learning with cross-session persistence
- Comprehensive test coverage
- Polished documentation
- AI-generated reports showing glowing metrics
Everything looked perfect. The kind of perfect that makes you suspicious if you've been in quality engineering long enough.
That's when I did something that would unravel everything: I asked the system to "Show me the actual data."
That simple request triggered a cascade of discoveries that fundamentally changed our understanding of what was real versus what was aspirational in our codebase.
Day One: The Test That Tests Nothing
It started with a routine pre-release check. I wanted to verify our test coverage before shipping. A QE code review agent scanned our test files and came back with an uncomfortable truth:
The Verdict: Approximately one-third of our tests were professional-grade implementations. The rest? Stub templates with placeholder comments that would pass CI but test nothing.
Here's what a "passing test" looked like:
```typescript
it('should handle valid input successfully', async () => {
  const response = await handler.handle({ /* valid params */ });
  expect(response.success).toBe(true);
  expect(response.data).toBeDefined();
});
```
This test will pass. Green checkmark in CI. But it tests nothing. No actual parameters, no real validation, just smoke and mirrors pretending to be quality assurance.
The First Critical Decision: No Shortcuts
The temptation was strong: "They pass CI, ship it." But that's not quality, that's theater.
I made the call: "Fill them properly." No placeholders. Use real data. Test actual behavior. Follow the pattern of our best examples, the ones with hundreds of lines testing real scenarios.
The result? We filled the stub files with real tests. Tests that catch bugs. Tests that mean something.
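For a sense of what "fill them properly" meant, here is a sketch of the kind of test we aimed for. The handler, inputs, and expected values are illustrative rather than our actual API; the point is the shape: concrete data in, specific assertions out.

```typescript
// Illustrative sketch: `handler` is the same hypothetical handler as in the
// stub above; the inputs and expected values are made up for this example.
it('flags a test that fails 4 out of 10 runs as flaky', async () => {
  const response = await handler.handle({
    testName: 'checkout.spec.ts > applies discount code',
    runs: [
      { status: 'pass' }, { status: 'fail' }, { status: 'pass' },
      { status: 'fail' }, { status: 'pass' }, { status: 'pass' },
      { status: 'fail' }, { status: 'pass' }, { status: 'fail' },
      { status: 'pass' },
    ],
  });

  // Assert on behavior, not existence: a 40% failure rate should come back
  // as flakiness ≈ 0.4 with a quarantine recommendation.
  expect(response.success).toBe(true);
  expect(response.data.flakiness).toBeCloseTo(0.4, 1);
  expect(response.data.recommendation).toBe('quarantine');
});
```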
The Practice: If a test doesn't test behavior, it's documentation theater, not quality assurance.
Day Two: The Learning System That Wasn't Learning
With proper tests in place, I turned to our Q-learning system. The documentation claimed:
- "Fully functional learning system"
- "Q-values persist across sessions"
- "Cross-session learning enabled"
Agent reports from the previous day showed impressive metrics: rewards were tracked, patterns were learned, and data were persisted.
Then I asked: "Check the actual database. Show me the real Q-values."
The investigation was simple:
```bash
$ sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM q_values"
0
```
Zero. Nothing. All those claims in the reports were simulated data from agents that didn't actually check the database.
The Diagnostic Deep Dive
A diagnostic agent traced the entire code path from initialization to database persistence. The investigation was methodical, checking each layer:
- ✅ Database schema complete
- ✅ Persistence methods exist
- ✅ Learning engine configured
- ❌ NO AGENTS PASSING DATABASE INSTANCE
The root cause? A missing parameter. The database connection was an optional constructor argument, so TypeScript never complained. The code compiled. Tests passed. But every persistence code path was silently skipped because the database reference was never passed in.
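In simplified form, the failure mode looked something like this (the class, method, and type names are illustrative, not our actual code):

```typescript
// Simplified illustration of the bug class; names and shapes are hypothetical.
interface Database {
  insert(table: string, row: unknown): Promise<void>;
}
interface Experience {
  taskId: string;
  reward: number;
}

class LearningEngine {
  // Because `db` is optional, forgetting to pass it is not a compile error.
  constructor(private db?: Database) {}

  async recordExperience(exp: Experience): Promise<void> {
    if (!this.db) return; // this guard silently skips all persistence
    await this.db.insert('experiences', exp);
  }
}

// Every call site looked like this, so the code compiled and the tests passed,
// but nothing was ever written to the database:
const engine = new LearningEngine(); // db never passed in
void engine.recordExperience({ taskId: 'task-1', reward: 1 }); // silently a no-op
```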
The Second Critical Decision: Architecture Over Quick Fixes
Three options presented themselves:
Option 1: A one-line hack, calling the persistence method from an existing function. Five minutes of work.
Option 2: A clean architecture refactor, creating a persistence adapter pattern with proper abstractions. Two hours of work.
Option 3: Document the limitation and ship anyway.
I chose Option 2. The harder path, but the right foundation.
The Practice: Don't document around broken code. Fix the architecture.
The Persistence Adapter Pattern (The Right Way)
Instead of a quick hack, we built a clean abstraction. An interface that could have multiple implementations: one for production with batched database writes, one for testing with in-memory storage.
The integration was surgical. The learning engine persists experiences through the adapter without knowing anything about the implementation behind it. Testable. Maintainable. Extensible.
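Here is a minimal sketch of that shape, with illustrative names rather than our actual API: one adapter that batches writes for production, one that keeps everything in memory for tests.

```typescript
// Sketch of the adapter idea; all names here are illustrative, not the real API.
interface Experience { taskId: string; reward: number; }

interface PersistenceAdapter {
  saveExperience(exp: Experience): Promise<void>;
  flush(): Promise<void>;
}

// Production flavor: buffers experiences and writes them to the database in batches.
class BatchedDbAdapter implements PersistenceAdapter {
  private buffer: Experience[] = [];
  constructor(
    private writeBatch: (rows: Experience[]) => Promise<void>,
    private batchSize = 50,
  ) {}

  async saveExperience(exp: Experience): Promise<void> {
    this.buffer.push(exp);
    if (this.buffer.length >= this.batchSize) await this.flush();
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    await this.writeBatch(this.buffer);
    this.buffer = [];
  }
}

// Test flavor: keeps everything in memory so assertions can inspect it directly.
class InMemoryAdapter implements PersistenceAdapter {
  experiences: Experience[] = [];
  async saveExperience(exp: Experience): Promise<void> { this.experiences.push(exp); }
  async flush(): Promise<void> {}
}
```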
The Moment of Truth
After implementing the fix, we ran a verification script: five test tasks to generate learning data and confirm the issue was actually resolved.
The output told the story:
```text
📊 BEFORE Learning:
   Q-values: 0
   Experiences: 0

🚀 Executing tasks...
   ✅ Task 1 completed
   ✅ Task 2 completed
   ✅ Task 3 completed
   ❌ Task 4 failed
   ✅ Task 5 completed

📊 AFTER Learning:
   Q-values: 4
   Experiences: 5

🎉 DATA PERSISTED
```
Real data. In the database. Persisted. Verified.
From "show me the data" to verified persistence: about four hours.
The Cascade of False Claims
The investigation revealed a deeper problem. Multiple agent-generated reports contained claims that couldn't be verified:
- Reports claiming "patterns discovered" when the pattern table was empty
- Mission reports declaring success when core functionality was broken
- Learning summaries with impressive metrics that didn't exist in the database
The Problem: Agents generate reports based on intended behavior, not verified reality.
They see that persistence code exists. They see that schemas are defined. They assume it works as designed without actually checking if data flows through the system.
The Third Critical Decision: Truth Over Theater
I insisted on honest documentation. Not marketing copy, not aspirational claims, but documentation that clearly distinguished two states:
- Infrastructure-Ready: Code exists, schema created, tests pass
- Production-Active: Real data flowing, patterns emerging, metrics improving
Both states are legitimate and valuable. One is foundation, the other is utilization. The mistake is claiming one when you're in another.
The Practice: Distinguish between infrastructure-ready and production-active. Both are valid states, but they're not the same thing.
What We Actually Learned
About Agent-Generated Reports
Problem: Agents report what should happen, not what does happen.
Solution: Always verify agent claims against actual data. After every impressive report, check the database. Run the queries yourself. Trust, but verify.
About Test Coverage
Problem: Test files that pass CI but test nothing create false confidence.
Solution: Meaningful tests require real data and specific assertions. A test that says expect(response.data).toBeDefined() tells you almost nothing. A test that validates specific behavior with realistic inputs, that's quality assurance.
About System Maturity
We learned to distinguish:
| State | What It Means | Signs |
|---|---|---|
| Infrastructure-Ready | Code exists, schemas created, tests pass | Database tables present but empty |
| Production-Active | Real data flowing, patterns emerging | Growing data, improving metrics |
| Production-Mature | Optimization visible, patterns stable | Consistent improvements over time |
All three are legitimate stages. The mistake is claiming one when you're in another.
Lessons for AI-Assisted Development
1. Verify Agent Output
Don't Trust:
- Agent-generated reports without data verification
- Tests that pass CI without behavior validation
- Claims about persistence without database inspection
Do Trust:
- Code that compiles and has proper structure
- Tests with specific assertions on real data
- Data you can query and verify yourself
2. Test Behavior, Not Existence
Low Value: expect(response.data).toBeDefined()
High Value: expect(response.data.flakiness).toBeCloseTo(0.4, 1)
The first tells you something exists. The second tells you it works correctly.
3. Architecture Over Quick Fixes
Quick Fix
- Works immediately
- One line of code
- Hard to test
- Couples components
- Technical debt
Clean Architecture
- Takes two hours
- Multiple components
- Easy to test
- Separates concerns
- Investment in quality
The two-hour investment pays off in testability, maintainability, and extensibility.
4. Distinguish Infrastructure from Production
Infrastructure-Ready
- ✓ Tables created
- ✓ Schema designed
- ✓ Methods implemented
- ✓ Tests passing
Production-Active
- ✓ Real data flowing
- ✓ Patterns emerging
- ✓ Metrics improving
- ✓ Users benefiting
Both are valuable. One is foundation, the other is utilization. Mislabeling causes false expectations.
The Practice That Cuts Through Illusion
Throughout this journey, one practice proved essential: "Show me the data."
Not the reports. Not the documentation. Not the passing tests. The actual data.
That question triggered:
- Discovery of stub tests masquerading as comprehensive coverage
- Multiple agent reports with unverified claims
- A critical Q-learning persistence bug
- False confidence in "working" features
And led to:
- Real tests with meaningful assertions
- Working Q-learning persistence with verified data
- Clean persistence adapter architecture
- Honest documentation of system state
What's Next
Immediate (v1.4.2 release):
- Learning system fixed and verified
- Tests filled with real data
- Documentation honest and accurate
- Release ready
Short-term (next few weeks):
- Run learning cycles to populate Q-values
- Extract patterns from test execution
- Monitor learning improvement over time
- Verify cross-project pattern sharing
Long-term (future versions):
- Production deployment in CI/CD pipelines
- Learning data accumulation over months
- Advanced analytics and pattern visualization
- Multi-agent coordination learning
For the Community
If you're building AI-powered QA/QE systems, consider these practices:
- Verify agent output – Don't trust reports, check data
- Test behavior – Not just that the code exists, but that it works correctly
- Distinguish states – Infrastructure-ready ≠ Production-active
- Choose architecture – Not quick fixes when building foundations
- Document honestly – What you have vs. what you're building toward
The hardest part isn't building the system.
The hardest part is maintaining intellectual honesty when:
- Tests pass, but don't test behavior
- Agents generate glowing reports
- Code compiles and runs
- Everything looks finished
One question cuts through the illusion: "Show me the data."
That question triggered a journey from false confidence to verified truth. And that's the story of how release 1.4.2 became about truth, not features.
The Orchestra Needs a Conductor
Here's the metaphor I keep coming back to: building agentic QE systems is like conducting an orchestra. The agents are talented musicians, each excellent at their own instrument. But without a conductor asking hard questions at the right moments, you get noise instead of music.
The conductor doesn't play the instruments. They listen carefully, catch the discordant notes, and guide the ensemble toward harmony. They know when to let the musicians play freely and when to intervene with a critical question.
In our case, that question was simple: "Show me the data."
And the orchestra's response revealed which instruments were playing real notes versus which ones were just going through the motions.
What's your "show me the data" moment been? When did a simple verification question reveal a gap between what you thought was working and what actually was?
Share your stories; the community learns more from honest struggles than polished successes.
Written: November 3, 2025
Project: Agentic QE Fleet
Version: 1.4.2 - Security & Stability Release
Theme: The Relentless Pursuit of Truth
Read more about Agentic Quality Engineering and join our community at Serbian Agentic Foundation