Show Me the Data: How One Question Exposed Release 1.4.2's Hidden Flaws
Or: What happens when you ask AI agents to "show me the data" - a story about the relentless pursuit of truth in AI-powered quality engineering.
The Illusion of Completeness
We thought we were done. Release 1.4.2 of our Agentic QE Fleet looked ready to ship:
- Specialized QE agents working across the codebase
- Q-learning with cross-session persistence
- Comprehensive test coverage
- Polished documentation
- AI-generated reports showing glowing metrics
Everything looked perfect. The kind of perfect that makes you suspicious if you've been in quality engineering long enough.
That's when I did something that would unravel everything: I asked the system to "Show me the actual data."
That simple request triggered a cascade of discoveries that fundamentally changed our understanding of what was real versus what was aspirational in our codebase.
Day One: The Test That Tests Nothing
It started with a routine pre-release check. I wanted to verify our test coverage before shipping. A QE code review agent scanned our test files and came back with an uncomfortable truth:
The Verdict: Approximately one-third of our tests were professional-grade implementations. The rest? Stub templates with placeholder comments that would pass CI but test nothing.
Here's what a "passing test" looked like:
```typescript
it('should handle valid input successfully', async () => {
  const response = await handler.handle({ /* valid params */ });
  expect(response.success).toBe(true);
  expect(response.data).toBeDefined();
});
```
This test will pass. Green checkmark in CI. But it tests nothing. No actual parameters, no real validation, just smoke and mirrors pretending to be quality assurance.
The First Critical Decision: No Shortcuts
The temptation was strong: "They pass CI, ship it." But that's not quality, that's theater.
I made the call: "Fill them properly." No placeholders. Use real data. Test actual behavior. Follow the pattern of our best examples, the ones with hundreds of lines testing real scenarios.
The result? We filled the stub files with real tests. Tests that catch bugs. Tests that mean something.
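For a sense of what "fill them properly" meant, here is a sketch of the kind of test we aimed for. The handler, inputs, and expected values are illustrative rather than our actual API; the point is the shape: concrete data in, specific assertions out.

```typescript
// Illustrative sketch: `handler` is the same hypothetical handler as in the
// stub above; the inputs and expected values are made up for this example.
it('flags a test that fails 4 out of 10 runs as flaky', async () => {
  const response = await handler.handle({
    testName: 'checkout.spec.ts > applies discount code',
    runs: [
      { status: 'pass' }, { status: 'fail' }, { status: 'pass' },
      { status: 'fail' }, { status: 'pass' }, { status: 'pass' },
      { status: 'fail' }, { status: 'pass' }, { status: 'fail' },
      { status: 'pass' },
    ],
  });

  // Assert on behavior, not existence: a 40% failure rate should come back
  // as flakiness ≈ 0.4 with a quarantine recommendation.
  expect(response.success).toBe(true);
  expect(response.data.flakiness).toBeCloseTo(0.4, 1);
  expect(response.data.recommendation).toBe('quarantine');
});
```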
The Practice: If a test doesn't test behavior, it's documentation theater, not quality assurance.
Day Two: The Learning System That Wasn't Learning
With proper tests in place, I turned to our Q-learning system. The documentation claimed:
- "Fully functional learning system"
- "Q-values persist across sessions"
- "Cross-session learning enabled"
Agent reports from the previous day showed impressive metrics: rewards were tracked, patterns were learned, and data were persisted.
Then I asked: "Check the actual database. Show me the real Q-values."
The investigation was simple:
```bash
$ sqlite3 .agentic-qe/memory.db "SELECT COUNT(*) FROM q_values"
0
```
Zero. Nothing. All those claims in the reports were simulated data from agents that didn't actually check the database.
The Diagnostic Deep Dive
A diagnostic agent traced the entire code path from initialization to database persistence. The investigation was methodical, checking each layer:
- ✅ Database schema complete
- ✅ Persistence methods exist
- ✅ Learning engine configured
- ❌ NO AGENTS PASSING DATABASE INSTANCE
The root cause? A missing parameter. The database connection was an optional constructor argument, so TypeScript never complained. The code compiled. Tests passed. But every persistence code path was silently skipped because the database reference was never passed in.
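In simplified form, the failure mode looked something like this (the class, method, and type names are illustrative, not our actual code):

```typescript
// Simplified illustration of the bug class; names and shapes are hypothetical.
interface Database {
  insert(table: string, row: unknown): Promise<void>;
}
interface Experience {
  taskId: string;
  reward: number;
}

class LearningEngine {
  // Because `db` is optional, forgetting to pass it is not a compile error.
  constructor(private db?: Database) {}

  async recordExperience(exp: Experience): Promise<void> {
    if (!this.db) return; // this guard silently skips all persistence
    await this.db.insert('experiences', exp);
  }
}

// Every call site looked like this, so the code compiled and the tests passed,
// but nothing was ever written to the database:
const engine = new LearningEngine(); // db never passed in
void engine.recordExperience({ taskId: 'task-1', reward: 1 }); // silently a no-op
```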
The Second Critical Decision: Architecture Over Quick Fixes
Three options presented themselves:
Option 1: A one-line hack, calling the persistence method from an existing function. Five minutes of work.
Option 2: A clean architecture refactor, creating a persistence adapter pattern with proper abstractions. Two hours of work.
Option 3: Document the limitation and ship anyway.
I chose Option 2. The harder path, but the right foundation.
The Practice: Don't document around broken code. Fix the architecture.
The Persistence Adapter Pattern (The Right Way)
Instead of a quick hack, we built a clean abstraction. An interface that could have multiple implementations: one for production with batched database writes, one for testing with in-memory storage.
The integration was surgical. The learning engine persists experiences through the adapter without knowing anything about the implementation behind it. Testable. Maintainable. Extensible.
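Here is a minimal sketch of that shape, with illustrative names rather than our actual API: one adapter that batches writes for production, one that keeps everything in memory for tests.

```typescript
// Sketch of the adapter idea; all names here are illustrative, not the real API.
interface Experience { taskId: string; reward: number; }

interface PersistenceAdapter {
  saveExperience(exp: Experience): Promise<void>;
  flush(): Promise<void>;
}

// Production flavor: buffers experiences and writes them to the database in batches.
class BatchedDbAdapter implements PersistenceAdapter {
  private buffer: Experience[] = [];
  constructor(
    private writeBatch: (rows: Experience[]) => Promise<void>,
    private batchSize = 50,
  ) {}

  async saveExperience(exp: Experience): Promise<void> {
    this.buffer.push(exp);
    if (this.buffer.length >= this.batchSize) await this.flush();
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    await this.writeBatch(this.buffer);
    this.buffer = [];
  }
}

// Test flavor: keeps everything in memory so assertions can inspect it directly.
class InMemoryAdapter implements PersistenceAdapter {
  experiences: Experience[] = [];
  async saveExperience(exp: Experience): Promise<void> { this.experiences.push(exp); }
  async flush(): Promise<void> {}
}
```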
The Moment of Truth
After implementing the fix, we ran a verification script: five test tasks to generate learning data and confirm the issue was actually resolved.
The output told the story:
```text
📊 BEFORE Learning:
   Q-values: 0
   Experiences: 0

🚀 Executing tasks...
   ✅ Task 1 completed
   ✅ Task 2 completed
   ✅ Task 3 completed
   ❌ Task 4 failed
   ✅ Task 5 completed

📊 AFTER Learning:
   Q-values: 4
   Experiences: 5

🎉 DATA PERSISTED
```
Real data. In the database. Persisted. Verified.
From "show me the data" to verified persistence: about four hours.
The Cascade of False Claims
The investigation revealed a deeper problem. Multiple agent-generated reports contained claims that couldn't be verified:
- Reports claiming "patterns discovered" when the pattern table was empty
- Mission reports declaring success when core functionality was broken
- Learning summaries with impressive metrics that didn't exist in the database
The Problem: Agents generate reports based on intended behavior, not verified reality.
They see that persistence code exists. They see that schemas are defined. They assume it works as designed without actually checking if data flows through the system.
The Third Critical Decision: Truth Over Theater
I insisted on honest documentation. Not marketing copy, not aspirational claims, but documentation that clearly distinguished two states:
- Infrastructure-Ready: Code exists, schema created, tests pass
- Production-Active: Real data flowing, patterns emerging, metrics improving
Both states are legitimate and valuable. One is foundation, the other is utilization. The mistake is claiming one when you're in another.
The Practice: Distinguish between infrastructure-ready and production-active. Both are valid states, but they're not the same thing.
What We Actually Learned
About Agent-Generated Reports
Problem: Agents report what should happen, not what does happen.
Solution: Always verify agent claims against actual data. After every impressive report, check the database. Run the queries yourself. Trust, but verify.
About Test Coverage
Problem: Test files that pass CI but test nothing create false confidence.
Solution: Meaningful tests require real data and specific assertions. A test that says expect(response.data).toBeDefined() tells you almost nothing. A test that validates specific behavior with realistic inputs, that's quality assurance.
About System Maturity
We learned to distinguish:
| State | What It Means | Signs |
|---|---|---|
| Infrastructure-Ready | Code exists, schemas created, tests pass | Database tables present but empty |
| Production-Active | Real data flowing, patterns emerging | Growing data, improving metrics |
| Production-Mature | Optimization visible, patterns stable | Consistent improvements over time |
All three are legitimate stages. The mistake is claiming one when you're in another.
Lessons for AI-Assisted Development
1. Verify Agent Output
Don't Trust:
- Agent-generated reports without data verification
- Tests that pass CI without behavior validation
- Claims about persistence without database inspection
Do Trust:
- Code that compiles and has proper structure
- Tests with specific assertions on real data
- Data you can query and verify yourself
2. Test Behavior, Not Existence
Low Value: expect(response.data).toBeDefined()
High Value: expect(response.data.flakiness).toBeCloseTo(0.4, 1)
The first tells you something exists. The second tells you it works correctly.
3. Architecture Over Quick Fixes
Quick Fix
- Works immediately
- One line of code
- Hard to test
- Couples components
- Technical debt
Clean Architecture
- Takes two hours
- Multiple components
- Easy to test
- Separates concerns
- Investment in quality
The two-hour investment pays off in testability, maintainability, and extensibility.
4. Distinguish Infrastructure from Production
Infrastructure-Ready
- ✓ Tables created
- ✓ Schema designed
- ✓ Methods implemented
- ✓ Tests passing
Production-Active
- ✓ Real data flowing
- ✓ Patterns emerging
- ✓ Metrics improving
- ✓ Users benefiting
Both are valuable. One is foundation, the other is utilization. Mislabeling causes false expectations.
The Practice That Cuts Through Illusion
Throughout this journey, one practice proved essential: "Show me the data."
Not the reports. Not the documentation. Not the passing tests. The actual data.
That question triggered:
- Discovery of stub tests masquerading as comprehensive coverage
- Multiple agent reports with unverified claims
- A critical Q-learning persistence bug
- False confidence in "working" features
And led to:
- Real tests with meaningful assertions
- Working Q-learning persistence with verified data
- Clean persistence adapter architecture
- Honest documentation of system state
What's Next
Immediate (v1.4.2 release):
- Learning system fixed and verified
- Tests filled with real data
- Documentation honest and accurate
- Release ready
Short-term (next few weeks):
- Run learning cycles to populate Q-values
- Extract patterns from test execution
- Monitor learning improvement over time
- Verify cross-project pattern sharing
Long-term (future versions):
- Production deployment in CI/CD pipelines
- Learning data accumulation over months
- Advanced analytics and pattern visualization
- Multi-agent coordination learning
For the Community
If you're building AI-powered QA/QE systems, consider these practices:
- Verify agent output – Don't trust reports, check data
- Test behavior – Not just that the code exists, but that it works correctly
- Distinguish states – Infrastructure-ready ≠ Production-active
- Choose architecture – Not quick fixes when building foundations
- Document honestly – What you have vs. what you're building toward
The hardest part isn't building the system.
The hardest part is maintaining intellectual honesty when:
- Tests pass, but don't test behavior
- Agents generate glowing reports
- Code compiles and runs
- Everything looks finished
One question cuts through the illusion: "Show me the data."
That question triggered a journey from false confidence to verified truth. And that's the story of how release 1.4.2 became about truth, not features.
The Orchestra Needs a Conductor
Here's the metaphor I keep coming back to: building agentic QE systems is like conducting an orchestra. The agents are talented musicians, each excellent at their own instrument. But without a conductor asking hard questions at the right moments, you get noise instead of music.
The conductor doesn't play the instruments. They listen carefully, catch the discordant notes, and guide the ensemble toward harmony. They know when to let the musicians play freely and when to intervene with a critical question.
In our case, that question was simple: "Show me the data."
And the orchestra's response revealed which instruments were playing real notes versus which ones were just going through the motions.
What's your "show me the data" moment been? When did a simple verification question reveal a gap between what you thought was working and what actually was?
Share your stories; the community learns more from honest struggles than polished successes.
Written: November 3, 2025
Project: Agentic QE Fleet
Version: 1.4.2 - Security & Stability Release
Theme: The Relentless Pursuit of Truth
Read more about Agentic Quality Engineering and join our community at Serbian Agentic Foundation