It was Monday evening, November 3rd, and I had a simple goal: to understand LionAGI by building something real with it.
By Tuesday morning, I had a production-ready multi-agent testing system. By Tuesday afternoon, I had learned why "production-ready" and "framework-aligned" aren't the same thing.
This is the story of building the LionAGI QE Fleet in 22 hours, and why the most valuable part wasn't the building.
Act I: Monday Evening - Fast Start (19:00 CET)
I opened my terminal and started a conversation with Claude Code:
"I would like to try to build a new version of my Agentic QE fleet using LionAGI. Here is the link to my repo. Can you help me with the setup and guide me on how to proceed?"
Within 30 minutes, I had more than just architecture: I had the first working agent implemented in Python.
The Claude Code agent had analyzed my Agentic QE Fleet TypeScript implementation and created:
- Complete project structure with proper Python packaging
- BaseQEAgent foundation using LionAGI's Branch pattern (a minimal sketch follows this list)
- Core framework classes (QEOrchestrator, QEMemory, ModelRouter)
- First specialized agent (TestGeneratorAgent) fully functional
- pyproject.toml with all dependencies
- README with quick start guide
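For a sense of what that foundation looks like in code, here is a minimal sketch of a Branch-backed base agent. It is not the repository's exact code: it assumes lionagi's Branch and iModel as exposed in recent releases, and the QE-specific class names are illustrative.

```python
# Minimal sketch of a Branch-backed QE agent. Assumes `pip install lionagi`;
# class and attribute names here are illustrative, not the repo's exact code.
from lionagi import Branch, iModel


class BaseQEAgent:
    """Wraps a LionAGI Branch so each agent keeps its own isolated state."""

    def __init__(self, name: str, system_prompt: str, model: str = "gpt-4o-mini"):
        self.name = name
        self.branch = Branch(
            system=system_prompt,
            chat_model=iModel(provider="openai", model=model),
        )

    async def run(self, instruction: str) -> str:
        # communicate() sends one instruction and returns the model's reply,
        # keeping the exchange in this branch's conversation history.
        return await self.branch.communicate(instruction)


# Example specialization: a test-generation agent (illustrative prompt).
class TestGeneratorAgent(BaseQEAgent):
    def __init__(self):
        super().__init__(
            name="test-generator",
            system_prompt="You write focused pytest unit tests for Python code.",
        )
```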
I pushed to GitHub with the commit message: "Initial commit: LionAGI QE Fleet foundation"
What I felt: This is incredibly fast.
What I didn't see yet: Fast execution doesn't guarantee the right decisions.
Act II: The Swarm (19:30-20:00)
Time to implement all 18 agents. Rather than doing them sequentially, I spawned 6 parallel agent tasks:
19:35 - Swarm launched:
- Core Testing agents (TestGenerator, TestExecutor, FleetCommander)
- Performance & Security agents
- Strategic agents
- Advanced Testing agents
- Specialized agents
- MCP integration for Claude Code compatibility
I watched multiple terminal windows generating code in parallel. Test files appearing. Documentation growing. Dependencies installing.
19:50 - All tasks completed:
- ✅ Created private GitHub repository
- ✅ Implemented all 18 agents from the original fleet
- ✅ Added complete MCP integration for Claude Code (sketched after this list)
- ✅ Created comprehensive unit tests (175+ functions)
- ✅ Generated extensive documentation
- ✅ Committed and pushed to GitHub (2 commits, 75 files)
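For readers who haven't touched MCP: the integration amounts to exposing fleet capabilities as tools that Claude Code can call. Here is a minimal sketch using the official MCP Python SDK; the server name and the single tool are illustrative, not the fleet's actual tool surface.

```python
# Minimal MCP server sketch using the official Python SDK (`pip install mcp`).
# The tool shown is illustrative; the real fleet exposes its agents as tools.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lionagi-qe-fleet")


@mcp.tool()
def generate_tests(module_path: str) -> str:
    """Ask the test-generation agent to draft pytest tests for a module."""
    # In the real system this would delegate to a LionAGI-backed agent;
    # here we return a placeholder so the sketch stays self-contained.
    return f"# TODO: generated tests for {module_path}"


if __name__ == "__main__":
    # Runs over stdio by default, which is how Claude Code launches MCP servers.
    mcp.run()
```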
The system was "production-ready":
- Multi-model routing for cost optimization (idea sketched after this list)
- Q-learning integration for continuous improvement
- Comprehensive test coverage
- Hooks system for observability
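The multi-model routing bullet deserves one concrete illustration. The real ModelRouter is more involved; this sketch only shows the core idea of sending routine tasks to a cheaper model, and the tiers and heuristic are invented for the example.

```python
# Sketch of the multi-model routing idea: cheap models for routine tasks,
# stronger models for hard ones. Tier names and heuristics are illustrative.
from lionagi import iModel

MODEL_TIERS = {
    "simple": iModel(provider="openai", model="gpt-4o-mini"),  # cheap, fast
    "complex": iModel(provider="openai", model="gpt-4o"),      # pricier, stronger
}


def route_model(task_description: str) -> iModel:
    """Pick a model tier from a crude complexity heuristic."""
    hard_signals = ("architecture", "security", "concurrency", "flaky")
    tier = "complex" if any(s in task_description.lower() for s in hard_signals) else "simple"
    return MODEL_TIERS[tier]
```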
Total time: Less than 1 hour from concept to working system.
I went to bed thinking about what I'd just built.
Act III: Using My Own Tools (Tuesday Morning)
Tuesday morning started with a decision: use my existing Agentic QE fleet (the TypeScript one based on Claude Code subagents) to analyze what I'd just built.
This is the meta-moment in agentic engineering: using your previous generation of agents to evaluate your current generation.
I pointed my TypeScript fleet at the new Python LionAGI codebase. Six specialized agents ran in parallel:
- qe-coverage-analyzer: Examined test coverage
- qe-code-complexity: Analyzed code quality
- qe-quality-analyzer: Comprehensive quality assessment
- qe-security-scanner: Security vulnerability scan
- qe-requirements-validator: Requirements validation
- qe-deployment-readiness: Production readiness check
The agents produced an 822-line analysis report:
Overall Assessment: B+ (85/100)
Status: Production-Ready with Minor Improvements Recommended
The report identified strengths:
- ✅ Excellent Architecture (90/100)
- ✅ Strong Type Safety (95/100)
- ✅ Good Documentation (82/100)
- ✅ 85% Requirements Met (18 of 19 agents with full MCP integration)
But also gaps:
- Test Coverage: 47% (target: 80%)
- Security Score: 72/100 (target: 90/100)
- 16 missing agent tests
- Critical blockers: pytest not configured, security scans not run, dependency resolution issues
The interesting part: My existing tools could evaluate the new system's technical quality.
The missing part: They couldn't evaluate whether I'd used the LionAGI framework correctly.
Act IV: Iterating with Existing Tools (Tuesday Afternoon)
By Tuesday afternoon, I was using my existing TypeScript-based QE agents to drive improvements in the new Python system.
I directed my qe-test-generator agent to create comprehensive tests for the missing coverage:
- 378 lines for test generation agents
- 295 lines for test execution agents
- Property-based testing with hypothesis (example pattern after this list)
- Async patterns with pytest-asyncio
- Mock external dependencies
- Coverage tracking
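Two of those patterns, property-based tests with hypothesis and async tests with pytest-asyncio, look roughly like this. The coroutine under test is a stand-in written for the example, not the fleet's real agent code.

```python
# Sketch of the two testing patterns used: pytest-asyncio for async paths,
# hypothesis for property-based coverage. The coroutine is a stand-in.
import asyncio

import pytest
from hypothesis import given, strategies as st


async def generate_test_stub(function_name: str) -> str:
    """Stand-in for an agent call: drafts a pytest stub for a function name."""
    await asyncio.sleep(0)  # simulate an awaited model call
    return f"def test_{function_name}():\n    assert {function_name}() is not None\n"


@pytest.mark.asyncio
async def test_stub_contains_function_name():
    stub = await generate_test_stub("parse_config")
    assert "test_parse_config" in stub


@given(st.from_regex(r"[a-z_][a-z0-9_]{0,30}", fullmatch=True))
def test_stub_is_generated_for_any_identifier(name):
    # Hypothesis drives many identifiers through the same code path;
    # asyncio.run is fine here because this test itself is synchronous.
    stub = asyncio.run(generate_test_stub(name))
    assert stub.startswith(f"def test_{name}")
```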
The progress felt productive:
✨ Generating tests for 13 missing agent test files...
⏱️ Next: Generate tests for MCP server and tools
The workflow: My proven tools were instrumental in building the new system. One generation of agents scaffolding the next.
The blind spot: Both generations of agents made the same architectural assumption - that wrapping LionAGI primitives was the right approach. Neither questioned whether we should be building those abstractions at all.
Act V: The Feedback (Tuesday Afternoon, 15:15)
On Tuesday around noon, I reached out to Ocean Li, the creator of LionAGI, for feedback on what I'd built.
His response arrived that afternoon at 15:15. It was thorough, honest, and exactly what I needed to hear.
What Ocean Said I Got Right ✅
"Real LionAGI Integration"
"You're actually using our framework properly: Branch for agent state isolation, Session for lifecycle management, Builder for workflow orchestration, ReAct reasoning with tools. This is sophisticated usage—not toy project level."
"Observability System"
"Your hooks system (<1ms overhead, cost tracking) is production-grade. The architecture for monitoring agent operations is well thought out."
"Test Culture"
"581 tests (not just the 128+ you mentioned!) with solid async patterns. You clearly care about quality."
"Documentation"
"The onboarding experience is exceptional—clear examples, progressive learning path. New users can actually get started."
I felt validated. The system was substantial.
What Ocean Said I Got Wrong ⚠️
Then came the critical section: "Where You Can Simplify"
"Main Issue: You're Wrapping LionAGI Too Much"
Ocean showed me my architecture:
# My approach (3 layers)
QEFleet
  → QEOrchestrator (reimplements workflow logic)
    → Session (LionAGI's built-in orchestration)
The problem: LionAGI already provides Session (lifecycle management), Builder (workflow orchestration), and ReAct (reasoning patterns). By creating QEFleet → QEOrchestrator → Session, I'd duplicated what the framework does better.
His example:
# Instead of wrapping:
- Delete QEFleet → Use Session + Builder
- Delete QEMemory → Use Session.context (built-in)
- Delete QETask → Use LionAGI's Instruct pattern
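To make the suggestion concrete for myself, here is my rough approximation of what "just use Session and Builder" could look like. This is not Ocean's code, and the exact calls (add_operation, get_graph, Session.flow) are assumptions based on LionAGI's published examples, so treat it as a direction rather than a drop-in.

```python
# My rough approximation of the "trust the framework" direction: no QEFleet,
# no QEOrchestrator, just LionAGI primitives. Call signatures (add_operation,
# get_graph, Session.flow) are assumptions based on published examples.
import asyncio

from lionagi import Builder, Session

session = Session()
builder = Builder("qe_workflow")

# Each node is a framework-native operation instead of a custom QETask.
generate = builder.add_operation(
    "communicate",
    instruction="Write pytest unit tests for the parser module.",
)
review = builder.add_operation(
    "communicate",
    depends_on=[generate],
    instruction="Review the generated tests and list missing edge cases.",
)


async def main():
    # Session executes the dependency graph that Builder assembled.
    return await session.flow(builder.get_graph())


if __name__ == "__main__":
    print(asyncio.run(main()))
```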
Why this matters:
"When LionAGI evolves (and it will - we have 5 years of patterns in v0), your custom wrappers will drift from best practices."
Every abstraction layer I created was a form of technical debt.
"Study LionAGI v0"
"Study v0 patterns at /users/lion/projects/open-source/lionagi/ - 5 years of multi-agent patterns. See how Ocean actually orchestrates agents at scale."
The Missing Pieces
Ocean identified what I needed but didn't have:
- Live examples: "Can you show this working on a real codebase? Record a demo of agents finding actual bugs?"
- Cost analysis: "You claim 80% savings - can you benchmark this? Show actual API costs with/without routing?"
- Integration story: "How does this fit into CI/CD? GitHub Actions? Pre-commit hooks?"
- Memory persistence: "Right now, everything's in-memory (lost on restart). For production, you'll need Redis or PostgreSQL."
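On the persistence point, the fix is conceptually small: swap the in-memory dict behind QEMemory for an external store. A minimal sketch with redis-py follows; the key layout and class shape are mine, not the repo's.

```python
# Minimal sketch of persistent fleet memory backed by Redis (`pip install redis`).
# Key layout and class shape are illustrative, not the repo's actual QEMemory.
import json

import redis


class PersistentQEMemory:
    """Dict-style store that survives restarts by writing through to Redis."""

    def __init__(self, url: str = "redis://localhost:6379/0", namespace: str = "qe"):
        self._r = redis.Redis.from_url(url, decode_responses=True)
        self._ns = namespace

    def _key(self, key: str) -> str:
        return f"{self._ns}:{key}"

    def set(self, key: str, value: dict) -> None:
        self._r.set(self._key(key), json.dumps(value))

    def get(self, key: str) -> dict | None:
        raw = self._r.get(self._key(key))
        return json.loads(raw) if raw is not None else None


# Usage: memory = PersistentQEMemory(); memory.set("run:42", {"coverage": 0.47})
```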
The Summary Hit Hard
"You built something real." The LionAGI usage is sophisticated (not toy project level). The architecture is clean. The test culture is mature.
"Path forward": Simplify by trusting LionAGI more. You don't need to rebuild what we already provide—just use it directly. Study v0 patterns to see how Ocean orchestrates agents at scale.
"Expected outcome": ~2000-3000 fewer lines of code to maintain, better alignment with framework evolution.
Act VI: The Reflection (Tuesday Evening)
I sat with Ocean's feedback for a while.
Here's what 22 hours of rapid building got me:
The Impressive Part:
- Comprehensive multi-agent system operational
- All testing infrastructure working
- 3000+ lines of documentation
- MCP integration functional
- Community interest validated (contributor PRs already coming in)
The Problematic Part:
- Over-engineered abstractions
- Framework misalignment
- Future maintenance burden
- Missing production requirements
The Verification Reality
Here's the ratio that AI productivity metrics don't capture:
Generation: 10% of the time (parallel execution)
Verification: 90% of the time
- Does it actually work?
- Does it follow framework patterns?
- Will it survive framework evolution?
- Does it solve real problems?
Ocean's feedback took 30 seconds to read. But it represented the most valuable part of the entire project: Understanding what not to do.
The Lessons: What This Journey Taught Me
1. Speed Creates Impressive Demos, Not Sustainable Systems
The metric I celebrated: "Built in 22 hours!"
The metrics that matter:
- How many bugs does it find?
- What's the actual cost savings?
- Can someone else maintain it?
- Will it survive framework evolution?
2. Framework Understanding Matters More Than Rapid Implementation
I spent hours creating abstractions that duplicated framework primitives.
Why: I didn't investigate the framework deeply enough before building. I delegated architectural decisions to my agents, who also lacked in-depth LionAGI knowledge.
Cost: Technical debt, maintenance burden, and drift from best practices.
Learning: When using a new framework, invest time in understanding its primitives and patterns first. Agents can generate code fast, but they need human guidance on framework-appropriate architecture.
3. Expert Feedback Is Worth More Than Rapid Iteration
Ocean's 5-minute review saved me weeks of maintenance work.
The most valuable feedback isn't just "this is great" or "this is broken" - it's "here's what you don't see yet."
4. Building in Public Accelerates Learning
Monday evening: Solo rapid development
Tuesday morning: Using existing tools for verification
Tuesday afternoon: Expert feedback revealing blind spots
Wednesday-Thursday: Iterative refactoring
Friday: Community presentation and validation
The pattern: Build → Verify → Share → Learn → Refactor → Share again
5. The 10-90 Rule Still Applies
10%: AI generates an impressive codebase
90%: Human verifies, aligns, validates, and improves
This ratio hasn't changed. What has changed is how quickly we can reach the 10% mark. The 90% still requires human judgment.
The Orchestra Metaphor Returns
I call my approach "The Conductor's Method": orchestrating specialized agents the way a conductor leads the musicians in an orchestra.
What I learned building LionAGI QE Fleet:
Inexperienced Conductor: Walks into the orchestra without studying the score, asks the musicians what they think the composition should be, then tries to coordinate based on an incomplete understanding.
Experienced Conductor: Studies the score deeply first, understands the composer's intent and patterns, then guides the orchestra based on that knowledge.
I was the inexperienced conductor. I let my agents make architectural decisions without first learning what LionAGI's "composition" actually required. Ocean's feedback and my existing agents' verification revealed where that lack of preparation led me astray.
What Happens Next
The repository is public: github.com/proffesor-for-testing/lionagi-qe-fleet
Current status (Saturday, November 8th):
- ✅ Initial implementation complete (the fast part)
- ✅ Verification by existing tools (technical quality validated)
- ✅ Expert feedback received (framework alignment issues identified)
- ✅ Community presentation delivered (AI Hackerspace Friday)
- ✅ First contributor engaged (community validation)
- 🔄 Refactoring in progress (the real work continues)
Next steps:
- Continue framework alignment improvements
- Document refactoring journey with before/after comparisons
- Add real-world validation with live bug-finding examples
- Create a cost analysis with actual benchmarks
- Build CI/CD integration examples
The commitment: Show the messy middle, not just the polished outcomes.
For the Agentic Engineering Community
If you're exploring multi-agent frameworks:
- Invest in framework understanding first, not rapid implementation
- Guide your agents with architectural knowledge, don't delegate framework decisions to them
- Use verification layers (both automated agents and expert humans) to catch misalignments
- Measure real impact, not impressive metrics
- Build in public, even when it reveals your mistakes
Closing: The Value of Five Days
Monday evening: "Look how fast I can build!"
Tuesday afternoon: "Look where I went wrong!"
Wednesday-Thursday: "Let me fix this iteratively!"
Friday: "Let me share what I learned!"
Saturday onward: "Let me keep improving!"
The impressive part: AI can generate sophisticated systems rapidly.
The hard part: Humans must guide with framework knowledge, not delegate architectural decisions.
The valuable part: Multiple verification layers, both automated tools and expert humans, that reveal what you missed.
Building the LionAGI QE Fleet in 22 hours was easy.
Learning from Ocean's feedback—and from my own verification tools—why it needed refactoring: that was the real achievement.
Now comes the actual work: making it sustainable.
Join the conversation:
- Have you delegated framework architectural decisions to AI agents?
- How do you balance rapid prototyping with deep framework understanding?
- What verification strategies do you use beyond just code generation?
Share your experiences at our next Serbian Agentic Foundation meetup, or start a discussion in the LionAGI QE Fleet repository.
Dragan Spiridonov
Founder, Quantum Quality Engineering
Member, Agentics Foundation | Founder, Serbian Agentic Foundation Chapter
Learning in public since 2016. Investigating deeply since Tuesday.
quantum-qe.dev | forge-quality.dev | @github/proffesor-for-testing
P.S. from Ocean: "Your project inspired me to improve LionAGI's multi-agent orchestration docs, thanks for pushing the framework!"
Sometimes the best outcome isn't what you built, but what you helped others improve.