The last article closed with a line about the room quoting back. The room did not stop. It got wider. In the three weeks since, I started a new collaboration that put hardware in my hands for the first time in decades. I shipped seven fleet releases. I made a decision I had been circling for months — to open-source the self-learning system I have been building alongside every project this year. I sat on a panel with practitioners who are building the agentic stack nobody has written a textbook for yet. I was confirmed as a speaker at two European conferences. And somewhere around week two, I hit a wall I should have seen coming.
This is the article about all of it. Not a highlight reel. A field report from the week the bandwidth ran out and the work kept arriving.
What Changed on April 14
The week after the last article, I started working with Reuven — Ruv — on his new platform, Cognitum [cognitum.one]. Cognitum is an agentic operating system built around what Ruv calls Seeds: small programmable hardware units based on Raspberry Pi and ESP32, connected to sensors, designed to run AI agents at the edge. Ruv’s open-source projects already sit at a serious scale — Ruview has over 50,000 GitHub stars, Ruflo over 30,000. My name was about to sit next to code already in production for tens of thousands of users.
The quality engineering part was familiar. Draft a test strategy. Analyze code and existing tests. Run the AQE fleet to analyze security, accessibility patterns, and architecture. The unfamiliar part was the hardware. Flashing firmware. Connecting sensors over USB. Watching a device boot, fail, then boot again while trying to reproduce a timing issue that only appears on the third power cycle.
That last part taught me something I should have known but had to learn by doing. My AQE security agents flagged USB device connections as a risk. They were right to flag it — connecting an unknown device to a computer is a legitimate attack surface. But they were wrong about the context. When you are the developer, and the device is the product, and USB is the documented secure path for provisioning, then flagging it as a threat is noise. The security analysis was technically correct and practically useless.
I filed the finding. The fleet needs a way to distinguish connecting a device you are building from connecting an unknown device somebody handed you. That distinction is context. Context is what makes a test useful. This is not a new lesson. It is the oldest lesson in context-driven testing, applied to a domain I had not worked in before.
Seven Releases in Three Weeks
The Cognitum work consumed most of my bandwidth. Deploy a new version, test it, find bugs, fix them, create a new build, retest, increase coverage to prevent regressions, repeat. Every day. A new system meant a lot of learning, and most of my focus went there.
What that meant for the fleet was that I had less time for it than at any point since the project started. Seven releases still landed, but each one was smaller and more targeted than the ambitious multi-feature releases of the weeks before. The constraint forced discipline. Ship the fix. Ship the hardening. Ship the migration. Move on.
v3.9.12 — A Practical Annoyance
Running aqe init after Ruflo’s own init duplicated hooks and hung for 2 minutes on a redundant pretraining step. Users who work across both tools now get a clean init in seconds instead of staring at a frozen terminal.
v3.9.13 — The Opus 4.7 Migration
Anthropic released a new model, and the fleet needed to survive the transition without breaking. Sonnet 4.6 became the fleet-wide default. Opus 4.7 became the opt-in escalation target. Every hard-coded model reference to the retiring Sonnet 4 was removed. The work was not glamorous. It was the kind of release designed to ensure nothing changes for users on the day the old model goes away.
What made it harder was the week leading up to the release. Claude had stability problems before and after the Opus 4.7 launch. Sessions would degrade. Responses would drift. The fleet’s behavior became harder to predict, not because the fleet changed but because the foundation under it was shifting. This is now a pattern I am learning to recognize — when your tool depends on a model provider, the provider’s release cycle becomes part of your release risk. You cannot test your way out of somebody else’s deployment.
v3.9.14 — Security and Supply-Chain Hardening
Fifteen critical npm vulnerabilities eliminated. A command-injection path in aqe learning repair closed. The shipped tarball dropped from twenty megabytes to under ten. An eleven-agent QE swarm had audited the prior release and surfaced five P0 blockers. Every one of them was the kind of thing that would not show up in a feature test. They show up in a security review, a supply-chain scan, a what-happens-when-somebody-crafts-a-malicious-path test.
v3.9.15 — Browser Skill to Production
The eval file became a real, runnable evaluation. A CI workflow gates changes. Linux ARM64 users — including Raspberry Pi — get a working browser path after the init process. That last part was important to the Cognitum work. The hardware I was testing on needed the same tooling I was shipping.
v3.9.16 — Three CLI Commands for Inspecting the Fleet’s Knowledge
aqe brain diff to compare two exports. aqe brain search for offline filtered search. aqe upgrade to diagnose which optional native deps are present on your platform. The last one exists because users kept reporting slow performance, and the answer was always the same — a missing native binding. Now the tool tells them before they have to ask.
v3.9.17 — The One-Line Fix That Closed a Loop I Had Left Open for Weeks
The routing hook was reading $PROMPT from the shell environment, but Claude Code does not export prompts as environment variables. Every prompt was routed as empty. The learning loop was recording nothing. The fix was small. The consequence was not. The fleet’s ability to learn which agent handles which kind of prompt was silently broken, and I had not noticed because the tests passed. The tests were not testing the integration. The tests were testing the code.
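For the curious, here is a minimal sketch of the shape of that fix, assuming a Node-based hook; the function names are illustrative, not the fleet's actual source. Claude Code hands hook input to the script as a JSON payload on stdin, so the prompt has to be read from there rather than from the environment.

```typescript
// Hypothetical sketch of the v3.9.17 fix, not the fleet's actual hook source.
// Claude Code delivers hook input as a JSON payload on stdin; the prompt is a
// field of that payload, not an exported environment variable.
import { stdin } from "node:process";

async function readHookPayload(): Promise<{ prompt?: string }> {
  let raw = "";
  for await (const chunk of stdin) raw += chunk;
  return raw ? JSON.parse(raw) : {};
}

async function routePrompt(): Promise<void> {
  // The broken version read process.env.PROMPT, which Claude Code never sets,
  // so every prompt arrived here as an empty string and nothing was learned.
  const { prompt = "" } = await readHookPayload();
  if (!prompt) return; // nothing to route, nothing to record

  // ...select the matching agent, route, and record the outcome...
}

routePrompt();
```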
v3.9.18 — Four MCP Contract Issues
Governance blocking legitimate tool calls. Generated jest tests that would not run. Coverage errors that blamed the wrong input. Test imports pointing at temp paths that do not exist on the user’s machine. It also shipped the agentic-qe-fleet Claude Code plugin — one command to install eleven QE agents, nine skills, and nine slash commands.
Seven releases. The headline is not the count. It is that every one of them was a response to a real user running into a real wall. The Cognitum work put me on the other side of that wall. I was the user who needed the ARM64 browser path. I was the user whose security agent flagged a legitimate workflow. I was the user who did not notice that the routing loop was dead. Dogfooding at this intensity is uncomfortable. It is also the fastest feedback loop I have ever had.
The System Gets a Name
For months on this blog, I have referred to a self-learning system I built alongside my own work. A personal knowledge base that captures patterns from my consultancy, my reading, and my projects, so they are searchable next time I need them. I described it by function, not by name, because the function was the point, and the name was not ready.
The name is Nagual.
It comes from Carlos Castaneda’s books, which I started reading when I was nineteen in Novi Sad. Castaneda wrote about two domains — the Tonal, the island of everything you can name and measure, and the Nagual, the vast ocean of potential surrounding it. That duality has shaped how I think about systems for over thirty years. What you can name and what you cannot. What fits in a box and what does not. The system I built lives at that boundary. It stores patterns as living hypotheses. Each one must earn its place through repeated success in the real world. Patterns that stop being useful undergo temporal decay. The island stays lean.
Two weeks ago, I made the decision to open-source it as Nagual-QE — the same core engine, shipped with 515 quality-engineering seed patterns, so a new user gets value on day one. Rust-native, local-first, your data stays on your machine. A browser dashboard, an HTTP API, and a CLI that fits into any workflow.
The seed patterns come from my own practice — over 12 years of quality engineering, condensed into reusable hypotheses covering test strategy, agentic architecture, security patterns, failure-mode classification, and the MAST taxonomy I use to record why things fail. Every pattern starts at a baseline score. Apply it, and it works; the score goes up. Apply it, and it fails; the score drops harder. Being confidently wrong costs more than being uncertain. The patterns that survive are the ones that keep proving themselves in the field.
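To make that asymmetry concrete, here is a toy version of the update rule, written in TypeScript for readability; the real engine is Rust-native, and every constant below is invented for illustration. A success nudges the score up, a failure pulls it down in proportion to how confident the pattern was, and unused patterns decay back toward the baseline.

```typescript
// Illustrative scoring sketch only; Nagual's actual constants and formulas differ.
interface PatternScore {
  confidence: number;      // 0..1, starts at a neutral baseline
  lastAppliedAt: number;   // epoch millis of the last real-world use
}

const BASELINE = 0.5;
const SUCCESS_STEP = 0.05;   // invented constant
const FAILURE_STEP = 0.15;   // invented constant
const DECAY_PER_DAY = 0.01;  // invented temporal-decay rate

function recordOutcome(p: PatternScore, worked: boolean, now: number): PatternScore {
  // Failing while confident costs more than failing while uncertain.
  const step = worked ? SUCCESS_STEP : -FAILURE_STEP * p.confidence;
  const confidence = Math.min(1, Math.max(0, p.confidence + step));
  return { confidence, lastAppliedAt: now };
}

function decay(p: PatternScore, now: number): PatternScore {
  // Patterns that stop being used drift back toward the baseline.
  const days = (now - p.lastAppliedAt) / 86_400_000;
  const confidence = BASELINE + (p.confidence - BASELINE) * Math.exp(-DECAY_PER_DAY * days);
  return { ...p, confidence };
}
```

The exact numbers are not the point. The point is that the penalty scales with the confidence, which is what "being confidently wrong costs more" looks like in code.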
I shared the launch on LinkedIn, and the response surprised me. Not the volume — I have no illusions about reach. What surprised me was who responded. Practitioners I respect asked concrete questions about the architecture. People I had never spoken to wanted to know how to connect it to their own workflow. The question that came up most was: how do I wire this into my development environment so the learning happens automatically?
The answer is hooks. In Claude Code, you configure hooks in .claude/settings.json — events like session-start, user-prompt, post-task, post-bash. Each hook calls the Nagual API to search for relevant patterns, store new learnings, or record outcomes. When I start a session, the hook loads patterns relevant to the current domain. When a task completes, it records whether the approach worked or failed, with a MAST failure classification if it failed. When I run a build, it tracks the result. The learning happens in the background. The knowledge compounds without me having to remember to write things down.
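As a rough sketch of what one of those hooks can look like, assuming a local Nagual server and hypothetical endpoint names (the real API guide ships with the project), a session-start hook might search for patterns in the current domain and print them where the session can pick them up:

```typescript
// Hypothetical session-start hook. The endpoint, port, and query parameters
// are placeholders, not the published Nagual-QE API.
async function loadRelevantPatterns(domain: string): Promise<string> {
  const res = await fetch(
    `http://localhost:8787/api/patterns/search?domain=${encodeURIComponent(domain)}&limit=10`
  );
  if (!res.ok) return ""; // a missing knowledge base should never block the session
  const patterns: Array<{ problem: string; solution: string }> = await res.json();
  return patterns.map((p) => `- ${p.problem}: ${p.solution}`).join("\n");
}

loadRelevantPatterns("quality-engineering").then((ctx) => {
  if (ctx) console.log(ctx); // the hook's output becomes extra context for the session
});
```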
For Claude Desktop or any LLM that supports tool use, the same API works through direct HTTP calls. You give the agent the API guide and credentials. It searches before solving a problem. It stores the solution after. The system argues back — it does not silently accept a pattern. It scores it, decays it if it stops working, and flags it for review if it scores too high on novelty. The constitution I wrote for it enforces eight principles mechanically, not inspirationally. Try to delete a pattern without a backup? Blocked. Try to promote something to the highest tier without sufficient evidence? Blocked.
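"Mechanically, not inspirationally" means the principles are code paths, not reminders. A toy illustration, with invented tier names and thresholds rather than Nagual's actual constitution:

```typescript
// Invented guard sketch; the real constitution, tiers, and thresholds differ.
interface PatternMeta {
  tier: "candidate" | "proven" | "core";
  outcomes: number;    // how many real-world outcomes have been recorded
  hasBackup: boolean;
}

function guardDelete(p: PatternMeta): void {
  if (!p.hasBackup) throw new Error("Blocked: delete requires a backup first.");
}

function guardPromoteToTopTier(p: PatternMeta): void {
  if (p.outcomes < 5) throw new Error("Blocked: not enough evidence to promote.");
}
```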
Since April 13, I have stored over two hundred patterns. Research papers on agentic failure modes. Empirical findings on multi-agent reliability. Cognitum architecture notes. ATD 2026 session outlines. Consultancy frameworks. Security scoring models. The system cross-references them through hybrid search — full-text for the precise and known, neural embeddings for the semantic and hidden — and sometimes surfaces connections I would not have found manually.
That is the practice working. Not the tool. The practice.
When the Hard Thinking Happens
There is a question underneath every memory system: when does the hard thinking happen? At write time, or at query time?
Andrej Karpathy’s wiki approach — which went viral, with over 40,000 bookmarks — compiles understanding at write time. You process information into structured wiki entries. The LLM synthesizes once. The wiki becomes a static artifact that the model reads as context. It is clean. It is readable. It is also frozen the moment you write it. If your understanding changes, the wiki does not update itself. If you were wrong when you compiled it, the error is baked in.
Nate B. Jones built Open Brain, which takes the opposite approach — storing raw material and synthesizing at query time. Every question gets a fresh synthesis from the full corpus. Nothing is pre-digested. The advantage is that the answer always reflects the current state of everything you have stored. The disadvantage is that each query incurs the full cost of reasoning, and quality depends entirely on retrieval.
Nagual does neither and both. Patterns are structured at write time — problem, solution, domain, tags, confidence — but they are not compiled into a static synthesis. They are living hypotheses with Bayesian scores that update on every outcome. The hard thinking happens three times: at store time (when you articulate what you learned), at outcome time (when reality votes on whether it worked), and at query time (when the system retrieves, ranks, and presents the patterns most relevant to what you are doing now). The write-time structure gives you the precision of a wiki. The outcome tracking gives you what neither a wiki nor a raw store provides — the ability to tell which of your patterns are actually reliable and which are confidently wrong.
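In data terms, the record itself shows where each of those three moments leaves its mark. The field names below are illustrative, not Nagual's schema:

```typescript
// Illustrative record shape, not Nagual's actual schema.
interface Pattern {
  // Write time: you articulate the hypothesis.
  problem: string;
  solution: string;
  domain: string;
  tags: string[];

  // Outcome time: reality votes and the score moves.
  confidence: number;    // Bayesian-style score, updated on every outcome
  successes: number;
  failures: number;

  // Query time: retrieval and ranking use all of the above, plus recency,
  // to decide what surfaces next to the work in front of you.
  lastAppliedAt: string; // ISO timestamp feeding temporal decay
}
```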
This is not a technology argument. It is a testing argument. A wiki that does not track outcomes is an oracle without validation. A raw store that synthesizes on every query is an oracle without calibration. A system that records outcomes and decays unreliable patterns is an oracle that tests itself.
What I Am Thinking About
The Cognitum work put a hundred patterns into the system in three weeks. The research reading put in another hundred. The Nagual timeline since April 13 shows patterns on agentic security scoring, multi-agent failure taxonomies, benchmark validity, prompt-injection surfaces, EU AI Act compliance deadlines, enterprise adoption metrics, JIT test generation, and a production case study from a Slovak company deploying agents in document processing.
Three threads stand out.
The first is about evaluation validity. Multiple papers from April 2026 — Shah et al. at MSR, Li and Storhaug at FSE, the BenchGuard auditing framework — converge on the same finding: the way we evaluate agentic systems is structurally unsound. Benchmarks reward gaming. Judge reliability is unvalidated. Aggregate pass rates hide failure distributions. The baseline crisis is real — most published evaluations do not include one. I stored nine patterns from this thread alone. The practical consequence for my work is that every claim I make about AQE fleet performance needs a defensible evaluation, not a benchmark number.
The second is about fatigue. I noticed it around week two. The Cognitum work is intensive — deploy, test, fix, retest, all day, every day. The fleet releases continued. The Nagual public launch needed attention. My duties as Secretary of the Agentics Foundation Board needed attention. Three confirmed sessions for Agile Testing Days 2026. The HUSTEF confirmation came through. The Agentics Foundation panel — with Ofer Shaal and Scott McMillan, hosted by Anne Cantera and Mahnaz Hajesmaeili — was recorded on April 25.
The load doubled, and my bandwidth did not. I started making mistakes I normally do not make. Missing things in reviews. Skipping verification steps I know better than to skip. The routing-loop bug in v3.9.17 is the clearest example — the learning system was recording nothing, and I did not notice for weeks because the tests passed. That is not a tooling failure. That is a bandwidth failure. The tests were not wrong. I was not looking at what the tests were not testing.
The third thread is about identity. Nate B. Jones made an observation in his comparison of memory architectures that landed harder than the technical content. The question is not which architecture is better. The question is: what kind of practitioner are you becoming? A wiki compiler processes information into static knowledge. A query-time synthesizer delegates thinking to the retrieval moment. A system that tracks outcomes is building a practitioner who treats their own knowledge as provisional — testable, falsifiable, subject to decay.
I built Nagual because I wanted to be the third kind. The weeks since the public launch have been the first real test of whether the system works when someone other than me looks at it.
What I Tell the Room
At the Agentics Foundation panel on April 25, the conversation turned to a question nobody on stage had a rehearsed answer for. What skills does this work actually require? Not what the job posting says. Not what the vendor pitch promises. What it takes to sit in front of a system that has no textbook and make a decision about whether it is working.
I gave the same answer I have been giving at meetups and on podcasts for months — the five lines. Master the basics before the agents. Invest in thinking, not just tools. Design systems, not just tests. Build your own projects. Verify everything.
What I would add after these three weeks is a sixth line, different from the one I added in the last article. That one was about being worthy of being quoted. This one is about being honest when the load exceeds the bandwidth.
When the work is arriving faster than you can verify it, slow down. The cost of shipping unverified is always higher than the cost of shipping late.
I know this. I have written it. I have told rooms full of people. And in week two of the Cognitum work, I almost stopped doing it myself. The routing loop was silent for weeks. The security agents were wrong about context, and I filed the finding instead of fixing it immediately. The test strategy was drafted, but not all of it was executed.
The system I built, Nagual, recorded all of this. Every pattern stored, every outcome tracked, every failure classified. The island stays lean because the constitution does not let me pretend the gaps are not there.
The room is wider than it was three weeks ago. The bandwidth is the same. The discipline is the only thing that scales.
When the load doubles, slow down.
Yesterday, the regular May 1st B&B with friends (BBQ & Beer). This morning, a retrospective and a new article; this afternoon, manual work around the house. Tomorrow, resting: walking in nature, recharging body, mind, and soul for the new challenges ahead.
This is the twenty-eighth article in The Quality Forge series. Previous: “The Room That Quoted Back” described the week the community started using my words, and the weight that came with them. This one describes the three weeks after — when the room got wider, the load doubled, and the bandwidth did not. The releases described are public on github.com/proffesor-for-testing/agentic-qe. Cognitum is at cognitum.one. Nagual-QE is open-source and shipping with 515 quality-engineering seed patterns.
Dragan Spiridonov is the Founder of Quantum Quality Engineering, an Agentic Quality Engineer, Secretary of the Agentics Foundation Board, and lead of the Education and Certification Chapter. He is currently building the Serbian Agentic Foundation Chapter in partnership with StartIt centers across Serbia.