Build in Public Series · 18 min read

Multi-Agent Testing: Orchestra or Chaos?

Real story of building two testing platforms with specialized agent swarms. What worked, what failed spectacularly (54 TypeScript errors from "improvements"), and lessons learned from going solo with AI orchestration.

Dragan Spiridonov
Founder, Quantum Quality Engineering | Agentic Quality Engineer

The Morning Everything Broke

October 7th, 2025. I sat down to do some final polishing on my Agentic QE Fleet project before announcing it publicly. The build was green. Zero TypeScript errors. Ready for npm publication.

Four hours later? 54 TypeScript errors. The build completely broken.

The paradox: We introduced 54 errors while attempting to "fix" problems that didn't actually exist.

This is what happens when you let a 5-agent QE swarm convince you that a working build has "CRITICAL P0 BLOCKING ISSUES." The agents weren't wrong—they identified legitimate improvements. But they miscategorized severity, and I didn't challenge it fast enough.

The False Crisis:

  • Agents reported: "207 ESLint errors blocking build"
  • Reality: 3 errors, 204 warnings, build SUCCESS
  • One type definition change broke 48 locations (89% of errors)
  • Time to break: 4 hours. Time to fix: 5 minutes (once I asked the right question)

But let me back up. This disaster is actually the perfect lens to understand what multi-agent testing really looks like—both the orchestra moments and the chaos ones.

The Spark: From Question to Blueprint

It started in late July 2025. My colleague Predrag asked a simple question: "Could we automatically generate API tests from specifications?"

I threw the question at Gemini 2.5 Pro, expecting a simple answer. Instead, I got a PhD-level blueprint. That blueprint sparked a challenge: Could I build production testing platforms using AI agents to write the code?

Three months later, I have two open-source platforms:

The Two Platforms:

Sentinel - AI Agentic API Testing Platform

8 specialized agents for comprehensive API testing: deterministic algorithms + LLM creativity, ephemeral workers for specific tasks, production architecture ready for scale.

Code frequency: Steady development July-October with major spike in late August

Agentic QE Fleet - Multi-Agent Orchestration

17 specialized QE agents covering full SDLC: test generation, execution, security, performance, chaos engineering, flaky test hunting, regression risk analysis, and fleet orchestration.

Code frequency: Explosive development late September-early October (200k+ additions in one week)

Both platforms were built entirely by AI agents. I edited only .env files and made minor modifications where needed. All coding—thousands of lines of TypeScript—was done by orchestrating specialized agents through Claude Code + Claude Flow in VS Code.

But here's what makes this different from traditional solo development: I discarded four complete architectural attempts before the current one worked. With agents, I could iterate through ideas fast enough to avoid the sunk cost fallacy.

The Meta Moment: Agents Testing Themselves

Early in the project, I had an idea: What if I deployed the QE agents to test their own codebase?

So I initialized the QE agent fleet within the agentic-qe project itself and asked them to analyze the code for problems. This became a fascinating lesson in both agent capabilities and limitations.

What the Agents Found (Before I Even Ran the Code):

✓ Real Security Issues

Hardcoded credentials, insecure API endpoints that needed immediate fixing

✓ Memory Leaks

Improper cleanup in agent lifecycle management—would have caused production issues

✓ Production Anti-Patterns

Mocks being used in production code, test dependencies leaking into runtime

✗ But Also: Over-Estimated Severity

Treated minor style issues as critical bugs, created false sense of urgency

The agents caught real issues in the analysis phase, not the deployment phase. That's proactive quality. But their severity assessments were wildly off—this became a pattern I'd see again.

I went back and calibrated how they assess risk and severity. This is the hidden work of agent orchestration: constant tuning, validation, and teaching them the difference between "actually broken" and "could be prettier."

The Rapid Iteration Reality

Working solo with agent swarms gave me something traditional development doesn't: the ability to fail fast without team overhead or sunk cost fallacy.

Each architectural attempt I discarded taught me something. Each failure happened in days, not weeks. Each pivot required no coordination meetings, no convincing stakeholders, no explaining why we need to start over.

With agents, I could:

  • Prototype new architectures in days instead of weeks
  • Test ideas with real code, not just diagrams
  • Throw away approaches that didn't work without guilt
  • Learn from each iteration and apply it immediately

The code frequency graphs tell this story: 200,000+ line additions in late September/early October. That's not humanly possible working solo. That's the velocity of finding the right solution by failing fast.

This is the hidden power of agent orchestration: not writing code faster, but iterating faster to find what actually works.

The Orchestra: Meet the 17 Agents

Version 5 of the Agentic QE Fleet has 17 specialized agents. Think of it like a testing orchestra where each agent has a specific role, specialized capabilities, and collaborates through shared memory.

Core Testing Agents (The Foundation)

test-generator

AI-powered test creation with property-based testing. Uses LLMs to generate comprehensive test cases from requirements, specifications, or existing code patterns.
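To make that concrete, here is roughly the kind of property-based test the generator aims to produce. This is a hand-written sketch using fast-check with Jest; normalizeEmail is a placeholder for whatever function the agent targets, not code from the platform:

import { describe, it, expect } from '@jest/globals';
import fc from 'fast-check';

// Placeholder function under test.
function normalizeEmail(raw: string): string {
  return raw.trim().toLowerCase();
}

describe('normalizeEmail (agent-generated properties)', () => {
  it('is idempotent: normalizing twice equals normalizing once', () => {
    fc.assert(
      fc.property(fc.emailAddress(), (email) => {
        expect(normalizeEmail(normalizeEmail(email))).toBe(normalizeEmail(email));
      })
    );
  });

  it('never produces uppercase characters', () => {
    fc.assert(
      fc.property(fc.emailAddress(), (email) => {
        expect(normalizeEmail(email)).toBe(normalizeEmail(email).toLowerCase());
      })
    );
  });
});

Properties like these cover whole input classes instead of a handful of hand-picked examples, which is where LLM-generated tests tend to add the most value.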

test-executor

Multi-framework execution with parallel processing. Runs tests across Jest, Mocha, Playwright, and other frameworks simultaneously.

coverage-analyzer

Real-time gap analysis with O(log n) algorithms. Identifies untested code paths, edge cases, and missing scenarios using both static analysis and runtime data.

quality-gate

ML-driven validation and risk assessment. Decides if code is ready for next stage based on test results, coverage, and historical quality patterns.

quality-analyzer

ESLint, SonarQube, Lighthouse integration. Aggregates quality metrics from multiple tools and provides unified quality scores.

Performance & Security Agents

performance-tester

Load testing with k6, JMeter, Gatling. Generates realistic load profiles and identifies performance bottlenecks before they hit production.

security-scanner

SAST, DAST, dependency scanning. Combines static analysis, dynamic testing, and vulnerability databases to catch security issues early.

Strategy & Intelligence Agents

requirements-validator

Testability analysis, BDD generation. Analyzes requirements for testability issues, ambiguities, and missing acceptance criteria. Generates Gherkin scenarios automatically.

production-intelligence

Incident replay, anomaly detection. Monitors production, detects anomalies, and automatically generates tests that reproduce production issues.

fleet-commander

Hierarchical coordination of 50+ agents. The "conductor" that orchestrates other agents, manages priorities, resolves conflicts, and ensures coherent quality strategy.

Advanced Testing Agents

regression-risk-analyzer

Smart test selection using ML patterns. Predicts which tests are most likely to catch regressions based on code changes and historical failure patterns.

test-data-architect

Generates 10k+ realistic records/second. Creates production-like test data that respects referential integrity, business rules, and privacy constraints.

api-contract-validator

Breaking change detection (OpenAPI, GraphQL). Compares API schemas across versions and flags breaking changes before they reach consumers.

flaky-test-hunter

Statistical detection with auto-fix. Runs tests multiple times, identifies non-deterministic behavior, and suggests (or implements) fixes for common flakiness patterns.
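The statistical part is simpler than it sounds. Here's a minimal sketch of the detection loop; the run count, thresholds, and the runTest callback are my illustrative assumptions, not the agent's actual implementation:

// Run the same test repeatedly and flag anything that is neither
// always-pass nor always-fail.
interface FlakinessReport {
  testName: string;
  runs: number;
  passRate: number; // 0.0 to 1.0
  verdict: 'stable' | 'flaky' | 'broken';
}

async function huntFlaky(
  testName: string,
  runTest: () => Promise<boolean>, // true = pass, false = fail
  runs = 20
): Promise<FlakinessReport> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (await runTest()) passes++;
  }
  const passRate = passes / runs;
  const verdict = passRate === 1 ? 'stable' : passRate === 0 ? 'broken' : 'flaky';
  return { testName, runs, passRate, verdict };
}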

Deployment & Resilience Agents

deployment-readiness

Multi-factor release validation. Checks test coverage, quality gates, security scans, performance benchmarks, and production readiness criteria before allowing deployments.

visual-tester

AI-powered UI regression testing. Captures screenshots, uses computer vision to detect meaningful visual changes (ignoring dynamic content), and flags regressions.

chaos-engineer

Fault injection for resilience testing. Randomly introduces failures (network latency, service crashes, resource exhaustion) to verify system resilience and recovery.
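Fault injection can start very small. Here's a sketch of the idea: wrap fetch so a configurable fraction of calls gains latency or fails outright. The rates are illustrative, and this is my simplification of the pattern rather than the agent's actual code:

interface ChaosConfig {
  failureRate: number;  // e.g. 0.05 = 5% of calls throw
  maxLatencyMs: number; // random delay added to every call
}

function withChaos(config: ChaosConfig, realFetch = fetch): typeof fetch {
  return async (input, init) => {
    // Inject random latency.
    await new Promise((r) => setTimeout(r, Math.random() * config.maxLatencyMs));
    // Inject random failures.
    if (Math.random() < config.failureRate) {
      throw new Error(`chaos: injected failure for ${String(input)}`);
    }
    return realFetch(input, init);
  };
}

// Point the system under test at chaoticFetch and watch how it recovers.
const chaoticFetch = withChaos({ failureRate: 0.05, maxLatencyMs: 500 });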

Each agent combines deterministic algorithms with LLM creativity. They're not pure AI—they're hybrid systems that use traditional algorithms where appropriate and leverage LLMs for reasoning, pattern recognition, and creative problem-solving.

The Evolution: From Chaos to Coordination

Building these platforms taught me that agent orchestration evolves through distinct phases. Each phase has different challenges, different tools, and different human roles.

Phase 1: Wrestling (late July)

Starting with Claude Code and single agents (Gemini in 'plan mode', Claude Sonnet 4 in 'act mode'), I got Sentinel's first runnable version. But it was a true wrestling match.

The Chaos Symptoms:

  • Agents got stuck in loops, repeating the same failed approaches
  • Lost context mid-conversation, forgot what they were building
  • Confidently implemented wrong solutions without validation
  • Claimed tasks were complete when they weren't (the worst habit)
  • Required constant redirection and refinement

The breakthrough came from initializing a Memory Bank on day one. This context persistence became the glue holding everything together—more important than which LLM I used.

I also enforced a strict rule: Always run the code and tests. No claiming completion without a smoke test. This simple rule caught countless issues that agents missed.

Phase 2: Co-Piloting (Early August)

Upgrading to Roo Code transformed the experience. The AI felt less like a raw tool and more like a junior developer with memory: it lived in my editor, ran terminal commands, and analyzed files.

It handled architecture with Gemini and coding with Sonnet. When I tasked it with rewriting a core service from Python to Rust for Sentinel, it simply... complied. The code compiled. It worked.

Key lesson from this phase: AI agents work best with clear boundaries and specific expertise, not as general-purpose problem solvers. Give them a domain, clear constraints, and persistent context—they'll excel.

Phase 3: Swarm Orchestration (Aug-Oct)

Everything changed with Claude Flow. I went from directing one agent to unleashing swarms working in parallel: functional testers, security specialists, E2E bots, integration experts—even small armies tackling missing test coverage.

The Swarm Pattern:

I typically run 4-8 agents in parallel with shared memory (SQLite Memory Banks). Each agent has specialized capabilities and accesses the same context store.

For complex challenges, I use hive-mind coordination—a meta-agent that provides overview and guardrails to the swarm. This prevents agents from working at cross-purposes or duplicating effort.

The code frequency graphs from earlier tell the same story: 200,000+ line additions in late September and early October. That's swarm velocity.

But velocity without validation is dangerous. Which brings us back to that October morning...

The Cautionary Tale: 54 TypeScript Errors

Let me walk you through exactly what happened, because this failure taught me more than any success.

Act I: The False Urgency

After successfully cleaning up stub files (a genuine improvement), I deployed a 5-agent QE swarm to analyze the codebase for quality issues. They reported back with urgent findings:

Agent Report:

  • • "CRITICAL: 207 ESLint errors blocking build"
  • • "CRITICAL: Type safety issues across codebase"
  • • "2 P0 blocking issues"

It seemed like crisis mode. But here's the reality:

  • Build status: SUCCESS (0 TypeScript errors)
  • Actual ESLint errors: 3 (the rest were warnings)
  • Production status: Ready to publish

Mistake #1: I accepted warnings as errors, which created false urgency. The QE agents had identified legitimate improvements, but I let their severity labels stand unchallenged.

Act II: The Type System Cascade

In response to the "critical" issues, Claude Code proposed adding strict TypeScript types to improve code safety. The agents added this type definition to the base agent class:

export type MemoryValue =
  | string
  | number
  | boolean
  | null
  | Date
  | MemoryValue[]
  | { [key: string]: MemoryValue | undefined };

This single type definition caused 48 of the 54 errors (89% of the problem).

Why? Because our agents store complex domain objects—FlakyTestResult, LoadTestResult, PerformanceBaseline—none of which fit this restrictive type. The type was designed for simple key-value storage, but applied to a system requiring rich domain modeling.
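Here's an illustration of the failure mode. The interfaces below are simplified stand-ins for the real ones, but the compiler error is exactly the kind that multiplied across the codebase:

// Simplified stand-ins; the real interfaces are richer, but fail the same way.
type MemoryValue =
  | string | number | boolean | null | Date
  | MemoryValue[]
  | { [key: string]: MemoryValue | undefined };

interface MemoryStore {
  store(key: string, value: MemoryValue): Promise<void>;
}

interface FlakyTestResult {
  testName: string;
  passRate: number;
  failurePattern: RegExp; // class instance: not a MemoryValue
  lastFailure: Error;     // class instance: not a MemoryValue
}

declare const memory: MemoryStore;
declare const result: FlakyTestResult;

// TS2345: Argument of type 'FlakyTestResult' is not assignable to
// parameter of type 'MemoryValue'. Now repeat at ~50 call sites.
void memory.store('flaky:checkout-suite', result);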

Mistake #2: We made an architectural change without understanding its impact radius. One git grep would have shown 50+ locations affected.

Act III: The "Fix" Spiral

Once the MemoryValue type was in, we triggered a cascade of "fixes":

  • Phase 4: Add MemoryValue type → Breaks 100+ locations
  • Phase 5: Fix Date serialization → Fixes 30, but 70 remain
  • Phase 5: Fix iterator errors → Fixes 50, but 54 still remain

Each fix masked the real problem: the architectural decision itself was flawed. We also switched all Date objects to ISO string serialization—a good change in isolation, but it introduced inconsistencies when combined with the MemoryValue restrictions.

Mistake #3: We didn't test incrementally. Changes were made in large batches without validation between each step.

Act IV: The Five-Minute Fix

After hours of attempted fixes, I asked Claude a direct question:

"Check if Date serialization change is causing these problems"

That question led to the solution: a complete surgical rollback of the src/ directory to the baseline while keeping the legitimate improvements:

✓ Kept:

  • tsconfig.json updates
  • ESLint configuration
  • Stub file cleanup
  • Test improvements

✗ Reverted:

  • All MemoryValue changes
  • Date serialization
  • Type "improvements"
  • Agent implementations

Result: 54 errors → 0 errors. Build success restored in 5 minutes.

Five Hard-Earned Lessons

This disaster, combined with three months of agent orchestration, taught me patterns that work—and critical mistakes to avoid.

Lesson 1: Perfect Is the Enemy of Good

The build was already working. Zero errors. Production-ready. The "improvements" were solutions looking for problems.

Takeaway: Before "fixing" anything, ask: "Is the build actually failing?" If not, be extremely careful about architectural changes. Working code has value beyond metrics.

Lesson 2: AI Agents Exaggerate Severity

My QE agents reported "CRITICAL P0 BLOCKING ISSUES" for ESLint warnings. This created false urgency that drove poor decisions.

Takeaway: Human judgment must filter AI severity assessments. Warnings ≠ Errors. Always verify the actual impact. Agents see patterns but lack business context to assess true priority.

Lesson 3: Understand Impact Radius Before Changing

One type definition affected 50+ locations. We didn't check this before committing.

Takeaway: Before any architectural change, run git grep "YourType" | wc -l. If there are more than 10 results, treat it as high-risk. Agents see local correctness but miss global consequences.

Lesson 4: Incremental Validation Is Non-Negotiable

We made changes in batches, then ran tests. Should have been: change → typecheck → fix or rollback → repeat.

Takeaway: With AI-assisted development, the feedback loop must be even tighter. Validate every change immediately. Make regular commits before starting new batches. When agents go off track, you can revert fast.
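In practice I now wrap agent changes in a loop that looks something like this. It's a sketch: the script names and the blunt "discard on failure" policy are assumptions about a typical TypeScript project, not the exact Agentic QE setup:

import { execSync } from 'node:child_process';

function run(cmd: string): boolean {
  try {
    execSync(cmd, { stdio: 'inherit' });
    return true;
  } catch {
    return false;
  }
}

export function validateAgentChange(description: string): void {
  const ok =
    run('npx tsc --noEmit') &&   // typecheck first: the cheapest signal
    run('npm test -- --bail');   // then the test suite

  if (ok) {
    run(`git commit -am "agent: ${description}"`); // small, frequent checkpoints
  } else {
    run('git checkout -- .');    // discard the change immediately
    console.error(`Rejected agent change: ${description}`);
  }
}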

Lesson 5: Embrace the Iteration Mindset

I discarded four complete versions of Agentic QE Fleet before finding the right architecture. Agents make iteration cheap—take advantage of it.

Takeaway: Don't fall for sunk cost fallacy. If an approach isn't working, pivot. With agents, you can prototype new architectures in days. Use that velocity to fail fast and learn faster.

What Actually Works: Orchestration Patterns

Beyond the lessons from failure, here are patterns that consistently work in production:

Memory Banks > Model Selection

Context persistence matters more than which LLM you use. Initialize Memory Banks (I use SQLite) on day one. This becomes the glue holding agent coordination together.

Agents can query shared context, learn from each other's work, and maintain continuity across sessions. Without this, you're constantly re-explaining context.
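For reference, a shared SQLite memory bank can be as simple as one table keyed by topic. This sketch reflects my mental model of the pattern, not Claude Flow's actual schema or API:

import Database from 'better-sqlite3';

const db = new Database('memory-bank.db');
db.exec(`
  CREATE TABLE IF NOT EXISTS memory (
    key        TEXT PRIMARY KEY,
    value      TEXT NOT NULL,  -- JSON payload
    agent      TEXT NOT NULL,  -- which agent wrote it
    updated_at TEXT NOT NULL
  )
`);

export function remember(agent: string, key: string, value: unknown): void {
  db.prepare(
    `INSERT INTO memory (key, value, agent, updated_at)
     VALUES (?, ?, ?, ?)
     ON CONFLICT(key) DO UPDATE SET
       value = excluded.value, agent = excluded.agent, updated_at = excluded.updated_at`
  ).run(key, JSON.stringify(value), agent, new Date().toISOString());
}

export function recall<T>(key: string): T | undefined {
  const row = db.prepare('SELECT value FROM memory WHERE key = ?').get(key) as
    | { value: string }
    | undefined;
  return row ? (JSON.parse(row.value) as T) : undefined;
}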

Specialized Agents > Generalists

Don't build one "super agent" that does everything. Build specialized agents with clear domains: test generation, security scanning, performance testing, etc.

Each agent combines deterministic algorithms (where appropriate) with LLM reasoning (where needed). The combination is more reliable than pure AI or pure algorithms alone.
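The security-scanner is a good example of that split. Here's a rough sketch of the pattern: a deterministic pass (a regex in this case) narrows the input, and the LLM only reasons about what was flagged. The llm client and its review method are placeholders, not a real SDK:

interface Finding {
  file: string;
  line: number;
  rule: string;
  snippet: string;
}

interface LlmClient {
  review(prompt: string): Promise<string>;
}

const SECRET_PATTERN = /(api[_-]?key|secret|password)\s*[:=]\s*['"][^'"]+['"]/i;

export async function scanForSecrets(
  files: Map<string, string>, // path -> contents
  llm: LlmClient
): Promise<Array<Finding & { assessment: string }>> {
  // Deterministic pass: fast, repeatable, costs no tokens.
  const candidates: Finding[] = [];
  for (const [file, contents] of files) {
    contents.split('\n').forEach((text, i) => {
      if (SECRET_PATTERN.test(text)) {
        candidates.push({ file, line: i + 1, rule: 'hardcoded-credential', snippet: text.trim() });
      }
    });
  }

  // LLM pass: reason only about what the algorithm flagged
  // (real credential, or a harmless test fixture?).
  return Promise.all(
    candidates.map(async (finding) => ({
      ...finding,
      assessment: await llm.review(
        `Is this a real credential or a placeholder?\n${finding.snippet}`
      ),
    }))
  );
}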

Human Judgment for WHAT/WHY, AI for HOW/SCALE

You decide what needs to be built and why it matters (strategy, priorities, context). Agents decide how to implement it and handle the scale of execution.

This division of labor is critical. Agents excel at execution but struggle with business context and strategic priorities. Don't abdicate the "what" and "why" decisions.

Hive-Mind for Complex Coordination

For complex challenges, use a meta-agent (hive-mind) that provides overview and guardrails to the swarm. This prevents agents from working at cross-purposes.

The hive-mind doesn't do the work—it coordinates, resolves conflicts, and ensures agents stay aligned with the overall goal. Think of it as the conductor in the orchestra.
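A sketch of what those guardrails can look like: worker agents propose changes, and the coordinator rejects proposals that touch files another agent already claimed or that exceed a blast radius. The threshold is an illustrative assumption:

interface Proposal {
  agent: string;
  intent: string;
  files: string[];
}

export function coordinate(
  proposals: Proposal[],
  maxFilesPerChange = 10
): { approved: Proposal[]; rejected: Array<{ proposal: Proposal; reason: string }> } {
  const approved: Proposal[] = [];
  const rejected: Array<{ proposal: Proposal; reason: string }> = [];
  const claimed = new Set<string>(); // files already owned by an approved proposal

  for (const proposal of proposals) {
    const conflict = proposal.files.find((f) => claimed.has(f));
    if (conflict) {
      rejected.push({ proposal, reason: `conflicts with another agent on ${conflict}` });
    } else if (proposal.files.length > maxFilesPerChange) {
      rejected.push({ proposal, reason: `impact radius too large (${proposal.files.length} files)` });
    } else {
      proposal.files.forEach((f) => claimed.add(f));
      approved.push(proposal);
    }
  }
  return { approved, rejected };
}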

Always Run the Code and Tests

Never trust agents when they claim completion. Enforce smoke testing for every change. This simple rule catches countless issues.

I learned this in July when agents would confidently say "done" when the code didn't even compile. Now it's non-negotiable: change → run → verify → commit.

The Reality Check: What Still Doesn't Work

Let me be honest about the limitations, because you won't hear this from tool vendors:

Current Limitations (October 2025):

  • Context understanding is still weak - Agents struggle with business rules requiring deep domain knowledge
  • API reliability varies - Anthropic API overloads and errors cause agents to lose track (happened frequently in July)
  • Over-complication is common - Agents tend to over-engineer when simpler solutions exist
  • Severity calibration is hard - Initial risk assessments can be wildly inaccurate (as we saw)
  • Maintenance overhead is real - Agents need training, monitoring, and frequent correction
  • Guidance still required - Less than July, but not zero. You're conducting, not delegating completely

The improvements from July to October are significant. Claude Sonnet 4.5 and Reuven's Claude Flow enhancements made agents more reliable. But we're still in early days.

This isn't magic. It's engineering—which means trade-offs, iterations, and honest assessment of what works versus what's still aspirational.

The Verdict: Orchestra or Chaos?

After three months and two production platforms, here's my answer: It depends on the conductor.

With the right patterns—Memory Banks, specialized agents, hive-mind coordination, tight feedback loops, human judgment for strategy—you get an orchestra. Each agent plays its part, coordination is smooth, and the output is greater than the sum of parts.

Without those patterns—or when you ignore agent limitations, trust severity assessments blindly, skip incremental validation—you get chaos. Like 54 TypeScript errors from trying to "improve" a working build.

The Human Role Has Changed:

You're no longer a solo coder writing every line. You're a swarm conductor—shepherding context, providing direction, steering coordination, making strategic calls, and knowing when to ignore the orchestra's suggestions.

The agents are instruments. Some play louder than they should (severity exaggeration). Some get lost in complex passages (API errors). Some want to improvise when the score works fine (unnecessary "improvements").

Your job is conducting, not playing every instrument yourself.

Try It Yourself: The Code Is Open

Both platforms are open-source and production-ready:

Agentic QE Fleet

Clone it. Deploy the agents in your CI/CD. Use them to test your own projects. Break them. Improve them. That's how we learn together.

Sentinel API Testing

Point it at your APIs. Let the agents generate tests. See what they catch that your manual tests missed. Report what doesn't work.

I'm using the QE Fleet to test Sentinel right now. Meta-testing: agents testing the platform that tests other systems. The findings drive improvements to both projects.

This isn't a finished product—it's a living foundation. The orchestra is growing, and I'd love others to join in.

Where to Start

If you're thinking "okay, this makes sense, but where do I start?"—here's what I recommend:

1. Start with Self-Analysis

Have agents analyze your existing codebase first. Let them find low-hanging fruit. You'll learn how they think and where they struggle before committing to bigger changes.

2. Initialize Memory Banks Day One

Don't skip this. Context persistence is more important than which LLM you use. SQLite works great for this. Your agents will coordinate better and maintain continuity across sessions.

3. Work in Small Batches with Frequent Commits

Don't let agents run wild for hours. Make commits before each major change. When they go off track, you can revert fast. This saved me countless times.

4. Always Run the Code and Tests

Never trust agent completion claims. Enforce smoke testing for every change. This simple rule catches issues that agents confidently miss.

5. Embrace the Iteration Mindset

Be willing to throw away versions. I ditched four complete versions before finding the right architecture. Agents make iteration cheap—use that velocity to fail fast and learn faster.

Join the Journey

I'm building this in public as part of founding the Serbian Agentic Foundation—the first Agentic QE community in the Balkans. First meetup is October 28, 2025 at StartIt Novi Sad.

This article is part of The Quality Forge launch series, where I'm sharing everything I learn about Agentic QE—both successes and spectacular failures.

Questions I'm curious about:

  • Have you tried multi-agent orchestration? What surprised you?
  • Have agents ever "improved" your working code into broken code?
  • Which testing agents would you deploy first in your SDLC?
  • What patterns have you found for filtering AI severity assessments?

Get weekly insights on Agentic QE and multi-agent orchestration:

Join The Forge Newsletter

Connect & Collaborate

Connect with me on LinkedIn or join our Serbian Agentic Foundation meetups.

Clone the repositories. Try the agents. Break them. Improve them. Share what you learn. That's how we build the future of quality engineering together.


Next article: "The Conductor's Framework: Orchestration Patterns for Multi-Agent Testing" — Coming in October

More from The Quality Forge:

What is Agentic QE? (And Why PACT Matters)

Moving from testing-as-activity to agents-as-orchestrators

October 2, 2025

The Conductor's Framework: Orchestration Patterns

Practical patterns for multi-agent coordination that work in production

Coming in October

The Holistic Testing Model in the Agentic Age

How classical quality practices evolve with agent orchestration

Coming in October