<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://outloop.blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://outloop.blog/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-06-17T18:48:55-04:00</updated><id>https://outloop.blog/feed.xml</id><title type="html">outloop.blog</title><subtitle>Technical writing by Dr. Nobel Khandaker on building reliable distributed systems with multiagents — architecture, reliability, and operational excellence.</subtitle><author><name>Dr. Nobel Khandaker</name></author><entry><title type="html">Agentic Engineering: From Architecture Document to Delivery Plan</title><link href="https://outloop.blog/2026/04/24/agentic-project-planning-from-architecture-to-delivery.html" rel="alternate" type="text/html" title="Agentic Engineering: From Architecture Document to Delivery Plan" /><published>2026-04-24T09:00:00-04:00</published><updated>2026-05-25T09:27:23-04:00</updated><id>https://outloop.blog/2026/04/24/agentic-project-planning-from-architecture-to-delivery</id><content type="html" xml:base="https://outloop.blog/2026/04/24/agentic-project-planning-from-architecture-to-delivery.html"><![CDATA[<p>Architecture documents are often treated as the end of design work. In an effective engineering organization, they are the beginning of delivery work. The architecture document and the developer design decisions are usually converted into a concrete, executable task backlog for the engineering team. The engineering team lead and the program or product manager work together to perform this conversion.</p>

<!--more-->

<p>Recently, I worked on planning the rewrite of a complex piece of industrial software. The source design describes replacing a legacy SignalR/ASP.NET edge server with a Go-based middleware service that coordinates field device systems, operator controller apps, and admin dashboards at remote industrial sites. It specifies the architectural style, transport strategy, runtime behavior, safety constraints, persistence rules, observability expectations, security model, test strategy, and rollout approach.</p>

<p>We started with a strong architecture document, but it did not tell the team which work must happen first, which work can happen in parallel, which ambiguities must be resolved before sprint planning, or how to turn a safety requirement like “E-Stop p99 &lt; 100 ms” (where E-Stop is the emergency stop signal) into stories, acceptance criteria, test gates, and release evidence.</p>

<p>This is where agentic planning became useful for us. Agents can read the architecture document, extract its delivery-relevant facts, challenge gaps, and synthesize a backlog that preserves the architecture’s intent. Our goal was not to have an agent invent a project plan. The goal was to have agents compile, cross-check, and structure the plan from the design.</p>

<p>This article walks through that process using the <strong>anonymized</strong> Jira delivery plan as the case study.</p>

<hr />

<h2 id="the-source-material">The Source Material</h2>

<p>The architecture document defines an edge-management middleware service with these major characteristics:</p>

<ul>
  <li>A Go modular monolith using hexagonal architecture, also known as Ports and Adapters.</li>
  <li>A domain core responsible for device group management, field-device finite state machines, ownership rules, E-Stop propagation, and device-group limit computation.</li>
  <li>ZeroMQ for real-time communication with field devices and controller apps.</li>
  <li>REST APIs for admin dashboard and auth sidecar communication.</li>
  <li>Watermill for cold-path command routing.</li>
  <li>Go channels for the hot sensor and safety path.</li>
  <li>PostgreSQL with an ORM for durable state.</li>
  <li>Authentication and authorization through a sidecar.</li>
  <li>Configuration through a sidecar.</li>
  <li>Hosted error tracking and metrics for operational observability.</li>
  <li>Docker Compose as the per-site deployment target.</li>
</ul>

<p>It also defines constraints that are not optional:</p>

<ul>
  <li>E-Stop must reach all field devices within 100 ms at p99.</li>
  <li>Safety and control messages must not be silently dropped.</li>
  <li>Restarts must be safe by default.</li>
  <li>Devices must be treated as offline until reconnection proves otherwise.</li>
  <li>Controller ownership after restart is only a logical association until an operator explicitly reconfirms motion-affecting operations.</li>
  <li>Partial network partition inside a device group causes disband and operator alert.</li>
  <li>The system must support up to 100 field devices, 10 controller apps, and 10 admin dashboards.</li>
  <li>Production rollout starts at Customer A, then Customer B, with rollback to the legacy system.</li>
</ul>

<p>An agent cannot ignore any of that. A useful plan has to preserve those constraints and turn them into executable work.</p>

<hr />

<h2 id="the-planning-problem">The Planning Problem</h2>

<p>The hard part is that architecture documents are organized for understanding, while delivery plans are organized for execution.</p>

<p>The architecture has sections like:</p>

<ul>
  <li>Transport strategy by message criticality.</li>
  <li>Hexagonal architecture design.</li>
  <li>Runtime messaging.</li>
  <li>Hot path versus cold path.</li>
  <li>Graceful startup, restart, and shutdown.</li>
  <li>Network partition and state recovery.</li>
  <li>Data and persistence.</li>
  <li>Observability and SLOs.</li>
  <li>Security architecture.</li>
  <li>Testing strategies.</li>
  <li>Rollout and migration.</li>
</ul>

<p>A delivery plan needs different questions answered:</p>

<ul>
  <li>What must be built first so the rest of the team can work?</li>
  <li>Which requirements are safety-critical and need stronger evidence?</li>
  <li>Which architecture decisions imply reusable work tracks?</li>
  <li>Which external dependencies can block progress?</li>
  <li>Where should quality gates live?</li>
  <li>What belongs in sprint stories versus milestones versus release criteria?</li>
  <li>What assumptions must be resolved before implementation starts?</li>
</ul>

<p>Agentic planning is useful because agents are good at repeatedly transforming structured information across levels of abstraction. In this case, the process turned a 30-page architecture design into a six-milestone delivery plan plus a scoped placeholder for audio/video streaming.</p>

<p><img src="/assets/img/agentic_planning_workflow.png" alt="Agentic planning workflow: architecture document passes through extraction, risk analysis, and milestone structuring with human review gates" /></p>

<p>The important point here is the review loop. Agents accelerate the conversion, but the human team members remain responsible for whether the plan is coherent, safe, and aligned with real team constraints.</p>

<hr />

<h2 id="step-1-extract-delivery-relevant-facts">Step 1: Extract Delivery-Relevant Facts</h2>

<p>The first agent task is not to generate stories. It is to extract facts.</p>

<p>For the anonymized service, the extraction pass organized the architecture document into planning inputs:</p>

<ul>
  <li>Product scope: fleet management, user management, state management, message routing, testing, observability, perception, automation hooks, and audio/video streaming.</li>
  <li>Non-goals: dashboard UI design, controller app architecture, legacy gateway connection, controller app logging pipeline, wireless and physical E-Stop systems handled outside the middleware service, and strict compliance implementation.</li>
  <li>Runtime components: Go service, ZeroMQ transport, Watermill router, PostgreSQL, an ORM, structured logging, hosted observability, and Docker Compose.</li>
  <li>Domain responsibilities: FSM, device group management rules, ownership eligibility, most-restrictive limit computation, E-Stop propagation, idempotency.</li>
  <li>Message classes: safety, control, telemetry, administrative, peripheral.</li>
  <li>State categories: checkpointed, reconstructable, and not persisted.</li>
  <li>SLOs: command success rate, E-Stop latency, availability, dead-letter rate, error budget.</li>
  <li>Testing obligations: unit tests, mutation tests, fuzz tests, integration tests, contract tests, E2E simulation, chaos, performance, soak.</li>
  <li>Rollout obligations: data migration, red-green cutover, rollback, and Customer A / Customer B site acceptance testing.</li>
</ul>

<p>This fact extraction is where agents prevent a common planning failure: treating all sections of the architecture document as equal. They are not equal. Some sections describe implementation mechanics. Some describe business behavior. Some describe operational proof. Some describe release risk.</p>

<p>For example, “ZeroMQ client” is an implementation fact. “E-Stop must reach all field devices within 100 ms” is a safety constraint. “legacy document layout to ORM entities” is a migration risk. A good plan needs all three, but it should not handle them at the same level.</p>
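<p>A minimal sketch of that idea, assuming a hypothetical <code class="language-plaintext highlighter-rouge">Fact</code> record and three illustrative kinds named after the examples above. The real extraction pass used richer categories. The point is only that each fact carries its kind, so later steps can treat the kinds differently.</p>

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical fact kinds, named after the three examples in the text.
KINDS = ("implementation", "safety_constraint", "migration_risk")


@dataclass(frozen=True)
class Fact:
    text: str
    kind: str


def group_by_kind(facts):
    """Group extracted facts by kind and reject kinds the plan does not know."""
    grouped = defaultdict(list)
    for fact in facts:
        if fact.kind not in KINDS:
            raise ValueError(f"unknown fact kind: {fact.kind!r}")
        grouped[fact.kind].append(fact.text)
    return dict(grouped)


facts = [
    Fact("ZeroMQ client", "implementation"),
    Fact("E-Stop must reach all field devices within 100 ms", "safety_constraint"),
    Fact("Legacy document layout to ORM entities", "migration_risk"),
]
grouped = group_by_kind(facts)
for kind in KINDS:
    print(kind, grouped[kind])
```

<p>Rejecting unknown kinds is a deliberate choice in this sketch: a fact that does not fit a known category should prompt a human decision rather than slip into a default bucket.</p>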

<hr />

<h2 id="step-2-convert-architecture-boundaries-into-work-boundaries">Step 2: Convert Architecture Boundaries into Work Boundaries</h2>

<p>The architecture chose hexagonal boundaries because the middleware service sits between multiple transport protocols and a safety-critical domain. That choice is also a planning gift.</p>

<p>The domain core can be built and tested separately from adapters. The ZeroMQ and REST adapters can be developed against ports. The persistence layer can implement outbound ports without leaking ORM details into domain code. The hot path and warm/cold path can become separate implementation tracks.</p>

<p>The delivery plan used those boundaries directly:</p>

<p><img src="/assets/img/agentic_planning_system_boundary.png" alt="System boundary diagram showing hexagonal architecture with transport adapters, domain core, and persistence ports" loading="lazy" /></p>

<p>This gave the plan a natural decomposition:</p>

<ul>
  <li>M1 establishes repository, CI, composition root, local stack, health, logging, and port skeletons.</li>
  <li>M2 builds core domain and messaging MVP.</li>
  <li>M3 adds persistence, authentication, admin APIs, peripheral integration, and configuration.</li>
  <li>M4 completes safety, reliability, recovery, and security behavior.</li>
  <li>M5 proves the system through testing, observability, performance, and chaos.</li>
  <li>M6 handles migration and production cutover.</li>
  <li>M7 holds audio/video streaming as in-scope but underspecified.</li>
</ul>

<p>That milestone order did not copy the architecture document section-by-section. It translated architecture dependency into delivery dependency.</p>
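<p>The translation from dependency to order can be sketched with a topological sort. The edges below are illustrative assumptions on my part (for example, that M2 and M3 both depend only on M1), not the plan's actual dependency map, and M7 is left out because it is a placeholder.</p>

```python
from graphlib import TopologicalSorter

# Illustrative edges: milestone -> milestones it depends on (assumed, not from the plan).
MILESTONE_DEPS = {
    "M1": set(),
    "M2": {"M1"},
    "M3": {"M1"},
    "M4": {"M2", "M3"},
    "M5": {"M4"},
    "M6": {"M5"},
}


def delivery_order(deps):
    """Return one valid delivery order; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = delivery_order(MILESTONE_DEPS)
print(order)
```

<p>With these edges, M2 and M3 may appear in either order, which is exactly the kind of freedom a team can use for parallel work.</p>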

<hr />

<h2 id="step-3-separate-the-hot-path-from-the-warmcold-path">Step 3: Separate the Hot Path from the Warm/Cold Path</h2>

<p>The architecture makes a crucial runtime distinction:</p>

<ul>
  <li>Hot path: <code class="language-plaintext highlighter-rouge">ZeroMQ -&gt; Go worker -&gt; channel -&gt; batch worker -&gt; domain ports</code>.</li>
  <li>Cold path: <code class="language-plaintext highlighter-rouge">ZeroMQ/REST -&gt; Watermill -&gt; command handler -&gt; domain ports</code>.</li>
</ul>

<p>That distinction affects planning. The hot path exists because sensor and safety traffic need low latency and predictable allocation behavior. The warm/cold path exists because ownership, device configuration, mode switches, admin operations, and peripheral commands benefit from validation, logging, retry, and workflow-style handling.</p>

<p>The plan therefore split messaging work into separate epics:</p>

<ul>
  <li>ZeroMQ pub/sub adapter.</li>
  <li>Message envelope and protobuf codec.</li>
  <li>Watermill router and bounded class queues.</li>
  <li>Outbound publisher and retry.</li>
  <li>Golden-message fixture validation.</li>
  <li>Go-channel ingest worker.</li>
  <li>Batch processor with fan-out.</li>
  <li>Entity-partitioned sequential processing.</li>
  <li>Hot-path backpressure metrics.</li>
  <li>Safety timing smoke benchmark.</li>
  <li>Cold-path command handlers.</li>
</ul>

<p>This is a good example of agentic planning preserving design intent. A weaker plan might have created a single “Implement messaging” epic. That would hide the highest-risk part of the architecture inside a broad bucket. The agent-generated plan instead kept the hot and cold paths visible.</p>

<p><img src="/assets/img/agentic_planning_message_processing_paths.png" alt="Message processing paths: hot path for safety-critical signals under 100ms, warm path for control commands, cold path for admin operations" loading="lazy" /></p>

<p>The diagram above is more than technical documentation. It is a delivery planning device. It tells the team which work can proceed independently and where integration risk will appear.</p>
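<p>As a rough sketch of the split described in this step, path selection can be expressed as a lookup on message class. The assignment of classes to paths below is my reading of the text (sensor and safety traffic on the hot path, everything else on the cold path) and the function name is hypothetical.</p>

```python
# Assumed mapping of message classes to processing paths.
HOT_CLASSES = frozenset({"safety", "telemetry"})
COLD_CLASSES = frozenset({"control", "administrative", "peripheral"})


def select_path(message_class):
    """Return "hot" or "cold" for a known message class; fail on anything else."""
    if message_class in HOT_CLASSES:
        return "hot"
    if message_class in COLD_CLASSES:
        return "cold"
    raise ValueError(f"unclassified message class: {message_class!r}")


for name in ("safety", "telemetry", "control", "administrative", "peripheral"):
    print(name, select_path(name))
```

<p>The sketch fails on unknown classes instead of defaulting to one path, so a new message type cannot reach the safety path, or miss it, by accident.</p>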

<hr />

<h2 id="step-4-promote-safety-constraints-into-delivery-gates">Step 4: Promote Safety Constraints into Delivery Gates</h2>

<p>Safety requirements should not sit passively in a requirements section. They need to become acceptance criteria, benchmarks, tests, and milestone exit conditions.</p>

<p>The architecture says:</p>

<ul>
  <li>E-Stop has a 100 ms end-to-end timing budget.</li>
  <li>Safety messages bypass normal queues.</li>
  <li>If no ACK is received before deadline, the sender transitions to fail-safe behavior.</li>
  <li>E-Stop propagation applies across all field devices.</li>
  <li>Restart must recover or surface unconfirmed safety commands.</li>
</ul>

<p>The delivery plan turns that into M4, “Safety, Reliability &amp; Recovery,” with concrete stories:</p>

<ul>
  <li>High-priority E-Stop lane.</li>
  <li>All-members ACK aggregator.</li>
  <li>CurveZMQ encryption and device identity.</li>
  <li>Unconfirmed E-Stop integration contract.</li>
  <li>Controller key revocation.</li>
  <li>E-Stop benchmark smoke gate.</li>
  <li>Key/cert lifecycle and replacement.</li>
  <li>Outbox schema and write-before-publish.</li>
  <li>Startup outbox scanner.</li>
  <li>Replay policy for unfinished safety commands.</li>
  <li>Recovery surfacing for unreplayed commands.</li>
</ul>

<p>It also resolves the timing budget into per-hop gates:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Controller -&gt; Service           30 ms
Service -&gt; device fan-out       40 ms
ACK path back                   30 ms
-------------------------------------
Total                          100 ms
</code></pre></div></div>

<p>That budget matters because it changes the shape of the stories. “Implement E-Stop” is not a useful story. “All-members ACK aggregator with a 70 ms site-publish-to-ACK window and operator-visible unconfirmed state” is useful.</p>
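<p>To make that story concrete, here is a minimal sketch of the budget check and the all-members ACK rule. The 30/40/30 figures come from the plan. The function names, the data shapes, and the choice to treat an ACK arriving exactly at the deadline as on time are assumptions for illustration.</p>

```python
# Per-hop E-Stop budget from the plan, in milliseconds.
HOP_BUDGET_MS = {
    "controller_to_service": 30,
    "service_to_device_fanout": 40,
    "ack_path_back": 30,
}
TOTAL_BUDGET_MS = 100


def ack_window_ms(budget):
    """Window from service publish to last ACK: fan-out plus the ACK path."""
    return budget["service_to_device_fanout"] + budget["ack_path_back"]


def aggregate_acks(members, ack_times_ms, window_ms):
    """Return (confirmed, unconfirmed_members).

    ack_times_ms maps a member to its ACK arrival time, measured from the
    moment the service published the E-Stop. A missing or late ACK leaves
    that member unconfirmed so the operator can see it.
    """
    unconfirmed = sorted(
        m for m in members
        if m not in ack_times_ms or ack_times_ms[m] > window_ms
    )
    return (not unconfirmed, unconfirmed)


assert sum(HOP_BUDGET_MS.values()) == TOTAL_BUDGET_MS
window = ack_window_ms(HOP_BUDGET_MS)
confirmed, unconfirmed = aggregate_acks(
    ["device-1", "device-2", "device-3"],
    {"device-1": 12.0, "device-2": 69.5},
    window,
)
print(window, confirmed, unconfirmed)
```

<p>In this example the third device never acknowledges, so the group is not confirmed and the device shows up in the unconfirmed list.</p>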

<p>Agents are particularly helpful here because they can propagate one safety constraint into multiple artifacts:</p>

<ul>
  <li>Story acceptance criteria.</li>
  <li>Metrics.</li>
  <li>Benchmark thresholds.</li>
  <li>Chaos test scenarios.</li>
  <li>Dead-letter behavior.</li>
  <li>Restart recovery rules.</li>
  <li>Milestone exit criteria.</li>
</ul>

<p>The result is that safety becomes an execution structure, not just a paragraph in the design.</p>
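<p>One of those artifacts, the benchmark threshold, can be sketched as a small gate. This assumes the benchmark yields a list of end-to-end latencies in milliseconds and uses the nearest-rank definition of p99. The real smoke gate may compute percentiles differently.</p>

```python
def percentile_nearest_rank(samples, percent):
    """Nearest-rank percentile for an integer percent between 1 and 100."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = -(-percent * len(ordered) // 100)  # ceiling of percent * n / 100
    return ordered[rank - 1]


def estop_gate_passes(latencies_ms, threshold_ms=100):
    """True when the p99 latency is strictly below the threshold."""
    return threshold_ms > percentile_nearest_rank(latencies_ms, 99)


mostly_fast = [10] * 99 + [150]   # one slow sample in a hundred
too_many_slow = [10] * 98 + [150] * 2
print(estop_gate_passes(mostly_fast), estop_gate_passes(too_many_slow))
```

<p>A gate like this only means something with enough samples. With a hundred samples, a second slow run is already enough to fail it.</p>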

<hr />

<h2 id="step-5-build-state-matrices-before-writing-stories">Step 5: Build State Matrices Before Writing Stories</h2>

<p>The Jira plan contains two authoritative matrices that are more important than they may look:</p>

<ul>
  <li>Persistence matrix.</li>
  <li>Message-class matrix.</li>
</ul>

<p>These matrices force the plan to make operational semantics explicit.</p>

<p>The persistence matrix classifies state as:</p>

<ul>
  <li>Checkpointed: survives restart in PostgreSQL.</li>
  <li>Reconstructable: rebuilt from live transport, heartbeats, telemetry, or recomputation.</li>
  <li>Not persisted: process memory or runtime metrics.</li>
</ul>

<p>For example, device membership in a group, controller ownership assignments, command-key watermarks, trackfiles, revocation lists, and safety outbox entries are checkpointed. Device connectivity, current sensor values, and derived device group limits are reconstructable. Hot-path queue contents and in-flight worker buffers are not persisted.</p>

<p>That classification drives implementation work:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Runtime state
     |
     +--&gt; checkpointed ---------&gt; PostgreSQL, migrations, recovery tests
     |
     +--&gt; reconstructable ------&gt; startup reconciliation, reconnect behavior
     |
     +--&gt; not persisted --------&gt; metrics, safe loss, no recovery promise
</code></pre></div></div>

<p>Without that matrix, the team would discover restart semantics piecemeal during implementation. With it, agents can generate persistence stories, recovery tests, and acceptance criteria from a shared source of truth.</p>
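<p>A minimal sketch of the matrix as data, assuming a simple dictionary form. The state items and categories follow the examples above. The work items per category paraphrase the diagram and are illustrative, not the plan's actual story list.</p>

```python
# A few state items from the persistence matrix and their categories.
PERSISTENCE_MATRIX = {
    "device group membership": "checkpointed",
    "controller ownership assignments": "checkpointed",
    "safety outbox entries": "checkpointed",
    "connectivity": "reconstructable",
    "current sensor values": "reconstructable",
    "derived device group limits": "reconstructable",
    "hot-path queue contents": "not_persisted",
    "in-flight worker buffers": "not_persisted",
}

# Illustrative delivery work implied by each category.
WORK_BY_CATEGORY = {
    "checkpointed": ["PostgreSQL schema and migration", "restart recovery test"],
    "reconstructable": ["startup reconciliation", "reconnect behavior test"],
    "not_persisted": ["runtime metric", "documented safe-loss behavior"],
}


def work_for(state_item):
    """Return (category, implied work items) for a classified state item."""
    category = PERSISTENCE_MATRIX[state_item]
    return category, WORK_BY_CATEGORY[category]


def unclassified(state_items):
    """Return state items that have no row in the matrix yet."""
    return sorted(set(state_items) - set(PERSISTENCE_MATRIX))


print(work_for("safety outbox entries"))
print(unclassified(["connectivity", "operator session tokens"]))
```

<p>The second call is the useful one during review: any state item an engineer mentions that is missing from the matrix is a restart question nobody has answered yet. The item "operator session tokens" is a made-up example of such a gap.</p>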

<p>The message-class matrix performs the same function for transport behavior. Safety, control, telemetry, administrative, and peripheral messages do not share the same retry, queueing, timeout, deduplication, or dead-letter semantics. The plan makes those differences explicit so stories do not accidentally apply one policy to every message class.</p>

<p>For example:</p>

<ul>
  <li>Safety bypasses normal queues and uses strict ACK deadlines.</li>
  <li>Control uses bounded wait, at-least-once behavior, command keys, and dead-letter handling.</li>
  <li>Telemetry can drop oldest or coalesce by entity because freshness matters more than replay.</li>
  <li>Administrative work must return explicit errors or retry hints.</li>
  <li>Peripheral commands can be last-write-wins where appropriate.</li>
</ul>

<p>This is one of the most valuable outputs of the agentic planning process. The agents did not just produce tickets. They produced intermediate planning artifacts that make the tickets safer.</p>
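<p>The message-class matrix can be sketched the same way. The policy labels below are my shorthand for the bullets above, and the coalescing function shows one possible reading of "coalesce by entity" for telemetry, assuming each reading carries an entity ID and a sequence number.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    bypass_queues: bool
    under_pressure: str  # shorthand label for the behavior described in the matrix


POLICIES = {
    "safety": Policy(True, "strict_ack_deadline"),
    "control": Policy(False, "bounded_wait_then_dead_letter"),
    "telemetry": Policy(False, "drop_oldest_or_coalesce"),
    "administrative": Policy(False, "explicit_error_or_retry_hint"),
    "peripheral": Policy(False, "last_write_wins"),
}


def coalesce_by_entity(readings):
    """Keep only the newest reading per entity.

    readings is an iterable of (entity_id, sequence, value) tuples.
    """
    latest = {}
    for entity_id, sequence, value in readings:
        current = latest.get(entity_id)
        if current is None or sequence > current[0]:
            latest[entity_id] = (sequence, value)
    return latest


readings = [
    ("dev-1", 1, 20.5),
    ("dev-2", 1, 7.0),
    ("dev-1", 2, 21.0),
    ("dev-1", 1, 99.0),  # stale duplicate, ignored
]
print(POLICIES["safety"].bypass_queues, coalesce_by_entity(readings))
```

<p>Only the safety class bypasses queues in this table, which makes it easy to spot a story that quietly applies the telemetry policy to control traffic or the reverse.</p>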

<hr />

<h2 id="step-6-resolve-ambiguity-into-a-decisions-log">Step 6: Resolve Ambiguity into a Decisions Log</h2>

<p>Architecture documents often contain open questions. Some are harmless. Some are project blockers disguised as implementation details.</p>

<p>The source architecture had open questions around partial partition behavior, clock synchronization, E-Stop timing budget, revocation, and MAC-versus-key identity. The delivery plan records those as resolved decisions:</p>

<ul>
  <li>Protobuf contract source of truth is an internal IPC contract repository on the active v2 development branch.</li>
  <li>Field-device FSM is final and documented.</li>
  <li>E-Stop budget is 30/40/30 ms.</li>
  <li>Partial partition means disband the device group and alert the operator.</li>
  <li>Clock synchronization uses external NTP on all nodes.</li>
  <li>Controller key revocation uses an admin-triggered revocation list.</li>
  <li>MAC is a human label only; the key is the authoritative identity.</li>
  <li>RPO is 24 hours and RTO is 4 hours.</li>
  <li>Deployment target is Docker Compose per customer site.</li>
  <li>Observability hosting uses managed error tracking and metrics platforms.</li>
  <li>TLS certificates come from an internal CA owned by the team.</li>
  <li>SAT completion requires 14-day soak, zero SEV-1, SLO targets met, and customer sign-off.</li>
  <li>Audio/video streaming is in scope for v2 but needs more PRD detail before story breakdown.</li>
</ul>

<p>This is a critical habit. Agents can draft around ambiguity, but delivery cannot safely proceed if important ambiguity remains hidden. A decision log gives every story a traceable foundation.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Open question
      |
      v
Clarification with sponsor / tech lead / dependency team
      |
      v
Decision log row
      |
      v
Stories, acceptance criteria, tests, release gates
</code></pre></div></div>

<p>The best agentic plans make assumptions visible. They do not bury them in prose.</p>
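<p>A small traceability check illustrates the idea. The decision IDs, the story names, and the "Partition handling" story with its dangling reference are hypothetical. Only the decision summaries are taken from the log above.</p>

```python
# Hypothetical IDs for three decisions from the log.
DECISIONS = {
    "D-01": "E-Stop budget is 30/40/30 ms",
    "D-02": "Partial partition disbands the device group and alerts the operator",
    "D-03": "Controller key revocation uses an admin-triggered revocation list",
}

# Story title -> decision IDs it relies on (illustrative).
STORIES = {
    "All-members ACK aggregator": ["D-01"],
    "Controller key revocation": ["D-03"],
    "Audio/video streaming": [],
    "Partition handling": ["D-02", "D-09"],
}


def untraceable(stories, decisions):
    """Return stories with no decision reference or with an unknown one."""
    problems = {}
    for title, refs in stories.items():
        unknown = [ref for ref in refs if ref not in decisions]
        if not refs:
            problems[title] = "no decision referenced"
        elif unknown:
            problems[title] = "unknown decision: " + ", ".join(unknown)
    return problems


print(untraceable(STORIES, DECISIONS))
```

<p>A check like this does not judge whether a decision is right. It only guarantees that every story points at something a human agreed to.</p>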

<hr />

<h2 id="step-7-turn-test-strategy-into-continuous-quality-tracks">Step 7: Turn Test Strategy into Continuous Quality Tracks</h2>

<p>The architecture document includes a test pyramid and detailed test categories. A typical project plan might move all of that to the end under a “Testing” milestone. That is too late for this system.</p>

<p>The delivery plan instead creates continuous quality and observability swimlanes from M1 through M5.</p>

<p>Quality starts in M1 with CI, fakes, and reproducible local development. It continues in M2 with domain invariants, fixture validation, and smoke benchmarks. M3 extends integration coverage across persistence, auth, and admin APIs. M4 validates recovery and safety-critical mechanisms. M5 completes the full proof with contract tests, E2E simulation, chaos, fuzzing, mutation testing, performance, and soak.</p>

<p>The CI tiers reinforce that structure:</p>

<ul>
  <li>PR: lint, vet, unit tests, fast domain tests, small adapter tests, golden-message contract smoke, build verification.</li>
  <li>Merge/main: broader integration suites, migration checks, targeted benchmarks, image publish, staging-oriented verification.</li>
  <li>Nightly: full E2E simulation, fuzz corpus expansion, mutation testing, broader contract verification, scripted chaos.</li>
  <li>Pre-release: 24-hour soak, full performance suite, rollout rehearsals, restore rehearsal, and site-specific pre-cutover validation.</li>
</ul>

<p>That tiering is important. It prevents the plan from pretending every test should run on every pull request. Agents can help here by matching test cost to trigger frequency.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Fast feedback                         Deep proof
     |                                    |
     v                                    v
PR checks -&gt; main checks -&gt; nightly checks -&gt; pre-release gates
</code></pre></div></div>
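<p>Matching test cost to trigger frequency can be sketched as a threshold rule. The runtime limits and the estimated runtimes below are invented for illustration. Only the tier names and the 24-hour soak come from the plan.</p>

```python
# Hypothetical upper bounds on runtime (minutes) for each CI tier, fastest first.
TIER_LIMITS_MINUTES = [("pr", 10), ("main", 60), ("nightly", 8 * 60)]


def assign_tier(runtime_minutes):
    """Place a check in the first tier whose limit covers its runtime."""
    for tier, limit in TIER_LIMITS_MINUTES:
        if limit >= runtime_minutes:
            return tier
    return "pre-release"


# Estimated runtimes in minutes (assumed values).
CHECKS = {
    "unit tests": 3,
    "integration suite": 40,
    "full E2E simulation": 240,
    "24-hour soak": 24 * 60,
}
plan = {name: assign_tier(minutes) for name, minutes in CHECKS.items()}
print(plan)
```

<p>Runtime is only one input. A cheap check that guards a safety invariant may still deserve a place on every pull request, which is a call for the team rather than a threshold.</p>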

<p>The same pattern appears in observability. M1 establishes logs, error tracking, health endpoints, and correlation IDs. Later milestones add queue metrics, hot-path metrics, benchmark output, safety/recovery signals, dashboards, alert routing, and SLO burn-rate surfaces.</p>

<hr />

<h2 id="step-8-model-parallelization-explicitly">Step 8: Model Parallelization Explicitly</h2>

<p>A delivery plan is only useful if it accounts for the team that will execute it. This anonymized plan assumes three to four engineers, two-week sprints, and roughly nine months of work.</p>

<p>After M1 lands, the plan maps work to parallel tracks:</p>

<ul>
  <li>Engineer A: domain core, E-Stop, outbox, restart, dead-letter and recovery.</li>
  <li>Engineer B: ZeroMQ, hot path, cold path, E2E harness.</li>
  <li>Engineer C: persistence, auth, admin APIs, config, migration tooling.</li>
  <li>Engineer D, where available: local stack, observability, contract tests, chaos, performance, dashboards, runbooks.</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M1 foundation
     |
     +--&gt; Domain track --------&gt; Safety / recovery
     |
     +--&gt; Messaging track -----&gt; Hot path / cold path / E2E
     |
     +--&gt; Integration track ---&gt; Persistence / auth / admin / migration
     |
     +--&gt; Quality track -------&gt; Contracts / chaos / SLOs / release proof
</code></pre></div></div>

<p>The plan also identifies blockers to parallelization:</p>

<ul>
  <li>FSM definition must finish before meaningful command handler wiring.</li>
  <li>Auth sidecar wiring blocks authenticated admin APIs.</li>
  <li>E-Stop priority lane touches both domain and messaging code, so domain and messaging engineers should pair.</li>
</ul>

<p>This is the kind of information agents can infer from dependencies, but humans should review carefully. Parallelization is not just about keeping everyone busy. It is about reducing queueing time without creating integration chaos.</p>
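<p>One way to surface those blockers for review is to list every dependency that crosses a track boundary. The story keys, track assignments, and the E-Stop lane's dependency on the ZeroMQ adapter are assumptions made for this sketch.</p>

```python
# Story -> owning track and dependencies (illustrative data).
STORIES = {
    "fsm-definition": {"track": "domain", "depends_on": []},
    "zeromq-adapter": {"track": "messaging", "depends_on": []},
    "command-handler-wiring": {"track": "messaging", "depends_on": ["fsm-definition"]},
    "auth-sidecar-wiring": {"track": "integration", "depends_on": []},
    "authenticated-admin-apis": {"track": "integration", "depends_on": ["auth-sidecar-wiring"]},
    "estop-priority-lane": {"track": "domain", "depends_on": ["zeromq-adapter"]},
}


def cross_track_blockers(stories):
    """Return sorted (blocked, blocker) pairs whose tracks differ."""
    pairs = []
    for key, story in stories.items():
        for dependency in story["depends_on"]:
            if stories[dependency]["track"] != story["track"]:
                pairs.append((key, dependency))
    return sorted(pairs)


for blocked, blocker in cross_track_blockers(STORIES):
    print(f"{blocked} waits on {blocker}")
```

<p>Dependencies inside one track, such as the admin APIs waiting on auth wiring, are left out on purpose, since one engineer can sequence those alone. The cross-track pairs are where pairing or an explicit handoff is worth planning.</p>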

<hr />

<h2 id="step-9-keep-release-work-in-the-plan">Step 9: Keep Release Work in the Plan</h2>

<p>The architecture document includes rollout and migration. The delivery plan preserves that as M6 rather than treating it as an operations afterthought.</p>

<p>M6 includes:</p>

<ul>
  <li>Data migration tooling from the legacy document layout to ORM-backed entities.</li>
  <li>Red-green rollout runbooks.</li>
  <li>Rollback plan to the legacy app/gateway/edge server setup.</li>
  <li>Customer A activation.</li>
  <li>Customer B activation.</li>
  <li>Site acceptance testing.</li>
  <li>Post-launch retrospective and on-call handoff.</li>
</ul>

<p>That matters because architecture migration is not done when the code compiles. It is done when production data is migrated, the site is cut over, rollback is rehearsed, support can operate the system, and customer acceptance criteria are met.</p>

<p>The plan’s SAT (site acceptance testing) criteria are concrete: 14-day soak, zero SEV-1, SLO targets met, and customer sign-off. Those criteria prevent “done” from becoming subjective at the end of the program.</p>
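<p>Because the criteria are concrete, they can be checked mechanically. The following is a sketch under the assumption that the evidence is collected into a simple record. The field names are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SatEvidence:
    soak_days: int
    sev1_incidents: int
    slo_targets_met: bool
    customer_signed_off: bool


def sat_failures(evidence):
    """Return the SAT criteria that the evidence does not yet satisfy."""
    failures = []
    if 14 > evidence.soak_days:
        failures.append("14-day soak")
    if evidence.sev1_incidents != 0:
        failures.append("zero SEV-1")
    if not evidence.slo_targets_met:
        failures.append("SLO targets met")
    if not evidence.customer_signed_off:
        failures.append("customer sign-off")
    return failures


print(sat_failures(SatEvidence(14, 0, True, True)))
print(sat_failures(SatEvidence(10, 1, True, False)))
```

<p>An empty list means the site can be declared accepted. Anything else names exactly what is still missing.</p>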

<hr />

<h2 id="what-agents-did-well">What Agents Did Well</h2>

<p>The strongest parts of the plan are the places where agents used the architecture document as a constraint system.</p>

<p>They mapped architecture sections into delivery tracks. Hexagonal architecture became foundation, domain, adapter, and integration work. Hot and cold paths became separate messaging epics. Safety constraints became timing gates, outbox work, and recovery stories. Testing strategy became CI tiers and a continuous quality swimlane. Rollout notes became migration and cutover milestones.</p>

<p>They also created useful intermediate artifacts:</p>

<ul>
  <li>Persistence matrix.</li>
  <li>Message-class matrix.</li>
  <li>Availability SLI definition.</li>
  <li>Cross-milestone dependency map.</li>
  <li>Decisions log.</li>
  <li>Story artifact policy.</li>
  <li>CI tiers.</li>
</ul>

<p>Those artifacts make the plan auditable. A reviewer can trace a story back to a design section, a decision, or a risk.</p>

<hr />

<h2 id="where-our-human-team-members-still-matters">Where Our Human Team Members Still Matter</h2>

<p>Agents can structure the plan, but they cannot own the consequences.</p>

<p>Humans still need to validate:</p>

<ul>
  <li>Whether the 30/40/30 ms E-Stop budget is realistic on actual site networks.</li>
  <li>Whether simulated field-device behavior is representative enough before SAT.</li>
  <li>Whether the field-device contract will stabilize in time.</li>
  <li>Whether Docker Compose per site is operationally sufficient for 99.99 percent availability.</li>
  <li>Whether the team has enough Go, ZeroMQ, Watermill, auth, observability, and field deployment experience.</li>
  <li>Whether audio/video streaming can remain a placeholder without undermining the release.</li>
  <li>Whether the plan’s sequencing fits staffing, procurement, and customer-site constraints.</li>
</ul>
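<p>For the availability question in that list, a little arithmetic helps frame the review. This sketch converts an availability target into an allowed-downtime budget. The function name is hypothetical and the calculation assumes downtime is measured in wall-clock minutes over the whole period.</p>

```python
def downtime_budget_minutes(availability_percent, period_days):
    """Allowed downtime in minutes for a target availability over a period."""
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * period_days * 24 * 60


monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)
print(round(monthly, 2), round(yearly, 2))
```

<p>At 99.99 percent the budget is roughly 4.3 minutes per 30 days, or about 52.6 minutes per year, which gives reviewers a concrete number to hold against restart time and cutover procedures on a single-host Docker Compose deployment.</p>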

<p>Agentic planning reduces planning labor. It does not remove technical accountability.</p>

<hr />

<h2 id="a-repeatable-agentic-planning-pattern">A Repeatable Agentic Planning Pattern</h2>

<p>The agent-created plan accurately captures that the service is not simply a rewrite from C#/SignalR to Go/ZeroMQ. It is a safety-sensitive architecture migration with strict latency, restart, identity, idempotency, observability, and rollout requirements. It also makes clear where the team can parallelize and where it must serialize work. That is what good agentic project planning should produce. Not a bigger backlog. A clearer one. The following figure shows the steps that led to the final executable plan.</p>

<p><img src="/assets/img/agentic_planning_repeatable_pattern.png" alt="Agentic planning pattern" loading="lazy" /></p>

<p>Agents are most valuable when they help teams preserve architectural intent all the way down to executable work. In this case, the architecture document defined the system. The planning agents turned that definition into delivery structure: milestones, risk controls, quality gates, and release evidence.</p>

<blockquote>
  <p>That is the difference between an architecture document that is admired and an architecture document that ships.</p>
</blockquote>]]></content><author><name>Dr. Nobel Khandaker</name></author><category term="practices" /><category term="agentic-planning" /><category term="project-management" /><category term="architecture" /><category term="delivery" /><category term="jira" /><summary type="html"><![CDATA[How agentic planning turns a 30-page architecture design into milestones, epics, stories, risk matrices, and release gates — without losing architectural intent.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://outloop.blog/assets/img/agentic_planning_workflow.png" /><media:content medium="image" url="https://outloop.blog/assets/img/agentic_planning_workflow.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Agentic Engineering: Spec-Driven Development</title><link href="https://outloop.blog/2026/04/23/spec-driven-development-with-coding-agents.html" rel="alternate" type="text/html" title="Agentic Engineering: Spec-Driven Development" /><published>2026-04-23T09:00:00-04:00</published><updated>2026-05-25T09:27:23-04:00</updated><id>https://outloop.blog/2026/04/23/spec-driven-development-with-coding-agents</id><content type="html" xml:base="https://outloop.blog/2026/04/23/spec-driven-development-with-coding-agents.html"><![CDATA[<p><strong>Spec-Driven Development (SDD)</strong> is the process of creating a living contract between human developers and coding agents, in which the <em>Specification</em> (the <em>what</em> and <em>why</em>) is deliberately decoupled from the <em>Implementation</em> (the <em>how</em>). SDD allows a human developer to become an architect who guides the agent to build and ship high-quality software. In this post, I summarize my experience using SDD in software engineering. The prompts and the skills are from Paul’s SDD course; see the <a href="https://github.com/https-deeplearning-ai/sc-spec-driven-development-files">DeepLearning.AI course repo on GitHub</a>.</p>

<!--more-->

<hr />

<h2 id="why-spec-driven-development">Why Spec-Driven Development?</h2>

<p>The problems I usually face with vibe coding are (1) lost chat histories and context and (2) the lack of a shared architecture/dev contract. These usually result in poor coordination among team members on long, complex development projects.</p>

<blockquote>
  <p><strong>SDD is appropriate for projects with significant complexity.</strong> If you can accomplish what you need in one short prompt, SDD will not provide any advantage.</p>
</blockquote>

<hr />

<h2 id="my-sdd-workflow-at-a-glance">My SDD Workflow at a Glance</h2>

<p>The SDD workflow has two major layers: a one-time <strong>project initialization</strong> step (the Constitution) and a <strong>repeating feature loop</strong>.</p>

<p><img src="/assets/img/sdd_workflow.png" alt="Spec-Driven Development workflow: Constitution followed by the Specify → Implement → Validate → Replan feature loop" /></p>

<hr />

<h2 id="phase-0-create-the-project-constitution">Phase 0: Create the Project Constitution</h2>

<p>The Constitution is the <strong>agent-agnostic and structured foundation of the entire project</strong>. It is a global, high-level set of documents, stored in a <code class="language-plaintext highlighter-rouge">specs/</code> directory, that captures the agreement between the developers on our team and the agent:</p>

<p><img src="/assets/img/sdd_constitution.png" alt="Constitution components: mission.md, tech-stack.md, and roadmap.md in a specs/ directory" loading="lazy" /></p>

<h3 id="step-by-step-drafting-the-constitution">Step-by-Step: Drafting the Constitution</h3>

<h4 id="step-1--create-the-knowledge-base">Step 1 — create the knowledge base</h4>

<p>Before talking to the agent, I create a knowledge base for the project:</p>

<ul>
  <li>Any existing READMEs, stakeholder notes, or product requirements</li>
  <li>Architecture documents, dev design documents</li>
  <li>Technical constraints from your organization (preferred languages, deployment targets)</li>
  <li>Your opinions on tech stack, testing frameworks, architecture patterns</li>
</ul>

<h4 id="step-2--create-the-constitution">Step 2 — create the constitution</h4>

<p>I use a prompt similar to the following to build the constitution:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I am building a new web application. Help me create a Project
Constitution with three files in a specs/ directory:
  - mission.md: vision, audience, scope, guiding principles
  - tech-stack.md: frameworks, deployment, technical constraints
  - roadmap.md: phases and features, organized in small steps

Please read README.md for background context, then ask me
questions — one at a time — to clarify what you need.
Use the AskUserQuestion tool if available.
</code></pre></div></div>

<h4 id="step-3--qa-with-agents">Step 3 — Q&amp;A with agents</h4>

<p>I usually find that the agent’s questions surface a few decisions that the knowledge base missed, e.g., architecture patterns, external packages, and speed-vs-fidelity trade-offs.</p>

<h4 id="step-4--human-in-the-loop-review">Step 4 — Human-in-the-loop review</h4>

<p>In this stage, I review all three files and <strong>ask the agent</strong> to fix any gaps. I avoid making any changes manually to ensure that the whole constitution remains in sync.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The mission left out our target audience. Please add:
"The primary audience is internal engineering teams at 
our organization." Also, use SQLite for this prototype.
</code></pre></div></div>

<h4 id="step-5--commit-the-constitution">Step 5 — Commit the Constitution</h4>

<p>I commit the constitution to the repo.</p>

<blockquote>
  <p><strong>Key insight:</strong> The Constitution is a <em>living document</em>. Version it and update it as the project evolves — always via the agent, in its own branch, so you can track which Constitution version produced which code.</p>
</blockquote>

<hr />

<h2 id="phase-1-feature-specification">Phase 1: Feature Specification</h2>

<p>For every feature, I create the <strong>feature spec</strong>. This is the most important step in the loop: don’t rush it, but don’t micromanage it either.</p>

<h3 id="the-feature-spec-files">The Feature Spec Files</h3>

<p>Each feature lives on its own branch and produces three spec files:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>specs/feature-XX/
├── plan.md          ← Approach, task groups, sequence of work
├── requirements.md  ← Functional &amp; non-functional requirements
└── validation.md    ← Scorecard: concrete success criteria
</code></pre></div></div>

<h3 id="step-by-step-feature-specification">Step-by-Step: Feature Specification</h3>

<h4 id="step-1--a-clean-context-and-a-new-git-branch">Step 1 — a clean context and a new git branch</h4>

<p>I always clear the agent’s context before starting, which forces the agent to load everything it needs from the spec files rather than from the memory of a previous session.</p>

<h4 id="step-2--create-a-feature-spec">Step 2 — create a feature spec</h4>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Find the next phase on specs/roadmap.md and make a branch, ask me about the feature spec. Create:

A new directory YYYY-MM-DD-feature-name under specs for this feature work
In there:
plan.md as a series of numbered task groups.
requirements.md for the scope, decisions, context
validation.md for how to know the implementation succeeded and can be merged
Refer to specs/mission.md and specs/tech-stack.md for guidance.

Important: You must use your AskUserQuestion tool, grouped on these 3, before writing to disk.
</code></pre></div></div>
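
<p>To make the requested layout concrete, here is a minimal Python sketch of the directory that the prompt asks the agent to create. It is illustrative only: the agent normally creates these files itself, and the helper name <code class="language-plaintext highlighter-rouge">scaffold_feature_spec</code>, the stub contents, and the example feature name are assumptions rather than part of the course material.</p>

```python
from datetime import date
from pathlib import Path
import tempfile

# File names come from the prompt above; the stub contents are hypothetical.
SPEC_FILES = {
    "plan.md": "# Plan\n\n1. Task group one\n",
    "requirements.md": "# Requirements\n\n## Scope\n\n## Decisions\n\n## Context\n",
    "validation.md": "# Validation\n\n- [ ] Success criterion one\n",
}


def scaffold_feature_spec(specs_root: Path, feature_name: str, day: date) -> Path:
    """Create specs_root/YYYY-MM-DD-feature-name/ containing the three spec files."""
    feature_dir = specs_root / f"{day.isoformat()}-{feature_name}"
    # exist_ok=False: refuse to overwrite a spec directory that already exists.
    feature_dir.mkdir(parents=True, exist_ok=False)
    for file_name, stub in SPEC_FILES.items():
        (feature_dir / file_name).write_text(stub, encoding="utf-8")
    return feature_dir


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing in a real repo is touched.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_feature_spec(Path(tmp) / "specs", "user-login", date(2026, 4, 23))
        print(created.name)
        print(sorted(p.name for p in created.iterdir()))
```

<p>Failing when the directory already exists is a deliberate choice in this sketch, so that a committed spec is never overwritten silently.</p>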

<h4 id="step-3--make-decisions-at-the-right-altitude">Step 3 — Make decisions at the right altitude</h4>

<p>The agent usually asks for key architectural and product decisions. The key is to steer at a high level (goals, mission) and not to over-specify (e.g., variable names or detailed implementation steps).</p>

<h4 id="step-4--review-and-refine-all-three-spec-files">Step 4 — Review and refine all three spec files</h4>

<p>In this stage, I carefully review <code class="language-plaintext highlighter-rouge">plan.md</code>, <code class="language-plaintext highlighter-rouge">requirements.md</code>, and <code class="language-plaintext highlighter-rouge">validation.md</code>. If something is wrong, I ask the agent to correct it, which keeps the requirements and the validation scorecard in sync.</p>

<h4 id="step-5--commit-the-feature-spec">Step 5 — commit the feature spec</h4>

<p>Only after this commit do I ask the agent to start implementation.</p>

<hr />

<h2 id="phase-2-feature-implementation">Phase 2: Feature Implementation</h2>

<p>With the feature spec committed, let the agent implement it.</p>

<h3 id="step-by-step-feature-implementation">Step-by-Step: Feature Implementation</h3>

<h4 id="step-1--clear-the-agents-context">Step 1 — Clear the agent’s context</h4>

<p>Start each implementation session fresh by clearing the agent’s context (<code class="language-plaintext highlighter-rouge">/clear</code>).</p>

<h4 id="step-2--send-the-implementation-prompt">Step 2 — Send the implementation prompt</h4>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Read specs/feature-01/ and implement all task groups defined
in plan.md, following requirements.md.
Work in small commits, one task group at a time.
</code></pre></div></div>

<h4 id="step-3--choosing-the-step-size">Step 3 — choosing the step size</h4>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────────────────────────────────────────────────────────────┐
│                IMPLEMENTATION STEP SIZE OPTIONS               │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ALL TASK GROUPS   Faster, more to review at once             │
│  AT ONCE           Best when you trust the spec fully         │
│                                                               │
│  ONE TASK GROUP    Smaller, easier to review                  │
│  PER PROMPT        Best for: security, auth, DB schema work   │
│                    Small mistakes compound less               │
│                    Especially useful in new codebases         │
│                                                               │
└───────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<h4 id="step-4--observe-progress-and-review">Step 4 — observe progress and review</h4>

<p>I usually read the agent’s summary of its work (individual task groups) and review the diff of the changes.</p>

<h4 id="step-5--running-the-app">Step 5 — running the app</h4>

<p>I ask the agent to self-validate against <code class="language-plaintext highlighter-rouge">validation.md</code> at the end of implementation and run the app.</p>

<hr />

<h2 id="phase-3-validation-of-the-feature-as-a-whole">Phase 3: Validation of the Feature as a Whole</h2>

<blockquote>
  <p><strong>Note on cognitive debt:</strong> Because agents generate code so fast, developers can accumulate <em>cognitive debt</em> — the mental load of tracking what the code is doing and how it has evolved. Keeping changes manageable and reviewing incrementally is how you keep this debt under control.</p>
</blockquote>

<h3 id="step-by-step-feature-validation">Step-by-Step: Feature Validation</h3>

<h4 id="step-1--start-with-the-commit-view">Step 1 — Start with the commit view</h4>

<p>I use the diff/commit view in the IDE and review changes at a <strong>high level</strong>:</p>

<ul>
  <li>Does the feature work as described in the spec?</li>
  <li>Are the right patterns, components, and structures being used?</li>
  <li>Avoid reviewing CSS class names or variable names — focus on intent and architecture.</li>
</ul>

<h4 id="step-2--raising-issues-via-the-agent">Step 2 — raising issues via the agent</h4>

<p>If I find a code issue or a spec omission, I ask the agent to fix it, which keeps all artifacts in sync:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The Home component puts all three sub-components in a single file.
Please split them into their own files and update any spec
documents or README mentions that reference the file structure.
</code></pre></div></div>

<p>If a code mistake traces back to something ambiguous in the spec, I ask the agent to fix the spec so that the issue does not reappear.</p>

<h4 id="step-3--run-tests">Step 3 — run tests</h4>

<p>Here I review and run the tests in the IDE and use the IDE debugger to step through execution. If the testing framework wasn’t configured during implementation, I add it in a replanning step (see Phase 4).</p>

<h4 id="step-4--constitution-updates-within-a-feature-branch">Step 4 — constitution updates within a feature branch</h4>

<p>Small updates to the Constitution (e.g., checking off a roadmap step) can stay on the feature branch. If a larger constitutional change is needed, I create a separate branch for it so that I can track which version of the spec produced which code.</p>

<h4 id="step-5--mark-validation-complete-and-merge">Step 5 — Mark validation complete and merge</h4>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────────────────────────────────────────────────┐
│                      VALIDATION CHECKLIST                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  [ ] Feature works as described in the spec                      │
│  [ ] All scorecard items in validation.md are satisfied          │
│  [ ] Code follows patterns established in the Constitution       │
│  [ ] Tests pass                                                  │
│  [ ] Related docs / specs updated for any scope changes found    │
│  [ ] Branch merged to main with a meaningful commit message      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
</code></pre></div></div>
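
<p>The scorecard item in this checklist can be checked mechanically. The sketch below assumes that <code class="language-plaintext highlighter-rouge">validation.md</code> records its criteria as Markdown task-list items; that format and the function name <code class="language-plaintext highlighter-rouge">open_items</code> are assumptions, since nothing above prescribes how the scorecard is written.</p>

```python
import re

# Assumption: the scorecard uses Markdown task-list items, "- [ ]" or "- [x]".
CHECKBOX = re.compile(r"^\s*[-*]\s+\[([ xX])\]\s+(.*)$")


def open_items(scorecard_text: str) -> list[str]:
    """Return the text of every scorecard item that is still unchecked."""
    pending = []
    for line in scorecard_text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


if __name__ == "__main__":
    sample = """# Validation
- [x] Feature works as described in the spec
- [ ] Tests pass
- [x] Related docs updated
"""
    remaining = open_items(sample)
    print(remaining)
    print("ready to merge" if not remaining else "not ready to merge")
```

<p>A check like this only confirms that every box is ticked; whether each criterion was ticked honestly is still a human review question.</p>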

<hr />

<h2 id="phase-4-project-replanning">Phase 4: Project Replanning</h2>

<p>After every feature merge, <strong>I am careful not to immediately jump to the next feature</strong>. The replanning step updates the constitution, roadmap, and workflow to keep the whole process in sync.</p>

<h3 id="what-replanning-covers">What Replanning Covers</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────────────────────────────────────────────────┐
│                       THE REPLANNING STEP                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  A) CONSTITUTION UPDATES                                         │
│     • Add testing frameworks or tooling you settled on           │
│     • Record new architectural decisions made during impl        │
│     • Add responsive design requirements, new constraints, etc.  │
│     • Keep the living document current                           │
│                                                                  │
│  B) ROADMAP REVIEW                                               │
│     • Is the next roadmap item still the right thing to do?      │
│     • Can upcoming features be tackled together in one phase?    │
│     • Are there dependencies to re-order?                        │
│     • New info from stakeholders or product managers?            │
│                                                                  │
│  C) WORKFLOW IMPROVEMENT (Skills &amp; Automation)                   │
│     • Package repetitive prompts into Agent Skills               │
│     • Create or improve changelog automation                     │
│     • Add linting, formatting, test-writing to validation step   │
│     • Decide: is this skill project-specific or global?          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<h3 id="step-by-step-replanning">Step-by-Step: Replanning</h3>

<h4 id="step-1--create-a-replanning-branch">Step 1 — Create a replanning branch</h4>

<p>Keeping constitution updates <strong>on their own branch</strong> lets you track which version of the spec produced which code.</p>

<h4 id="step-2--update-the-constitution-with-what-you-learned">Step 2 — Update the Constitution with what you learned</h4>

<p>For example, adding a testing framework after the first feature revealed the gap:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Update specs/tech-stack.md to add Pytest as our testing
framework with these preferences: [...]

Also update specs/feature-01/requirements.md and the
implementation to add tests using this framework.
</code></pre></div></div>

<p>And if a product update comes in from stakeholders:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>We just learned that our product will run on desktop as well as smartphones.
Update the product specs, feature specs, and any existing
code to emphasize responsive design.
</code></pre></div></div>

<blockquote>
  <p><strong>Guidance:</strong> If the new work triggered by a product update is small, implementing it during replanning is fine. If it’s large, I schedule it as its own feature on the roadmap instead.</p>
</blockquote>

<h4 id="step-3--agent-skills-for-repetitive-workflow-steps">Step 3 — agent Skills for repetitive workflow steps</h4>

<p>I have used Claude to write skills (<em>global</em> or <em>local per project</em>) that capture the common prompts:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want to stop repeating the feature spec prompt. Use your skill creator to help me write a "feature spec" local skill. Here is the previous prompt:

Find the next phase on specs/roadmap.md and make a branch, ask me about the feature spec. Create:

A new directory YYYY-MM-DD-feature-name under specs for this feature work
In there:
plan.md as a series of numbered task groups.
requirements.md for the scope, decisions, context
validation.md for how to know the implementation succeeded and can be merged
Refer to specs/mission.md and specs/tech-stack.md for guidance.

Important: You must use your AskUserQuestion tool, grouped on these 3, before writing to disk.
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want to keep a CHANGELOG.md in the project root, with headings for dates. If no changelog, examine git commits and add bullets for each date. Then, as we work, we will manually invoke this skill before merging. Help me write a skill for this.
</code></pre></div></div>
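
<p>One way the “examine git commits and add bullets for each date” step of this prompt could work is sketched below in Python. The function names are hypothetical and the output layout (one second-level heading per date) is an assumption; the skill that the agent actually writes may look quite different.</p>

```python
import subprocess
from collections import defaultdict


def read_commits(repo_dir: str = ".") -> list[tuple[str, str]]:
    """Return (YYYY-MM-DD, subject) pairs from git history, newest first."""
    out = subprocess.run(
        # %ad = author date (short form via --date), %x09 = tab, %s = subject
        ["git", "log", "--date=short", "--pretty=format:%ad%x09%s"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(line.split("\t", 1)) for line in out.splitlines() if "\t" in line]


def render_changelog(commits: list[tuple[str, str]]) -> str:
    """Group commit subjects under one heading per date, newest date first."""
    by_date = defaultdict(list)
    for day, subject in commits:
        by_date[day].append(subject)
    sections = []
    for day in sorted(by_date, reverse=True):  # ISO dates sort chronologically
        bullets = "\n".join(f"- {subject}" for subject in by_date[day])
        sections.append(f"## {day}\n\n{bullets}")
    return "# Changelog\n\n" + "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    # Sample data instead of a live repository, so the sketch runs anywhere.
    sample = [
        ("2026-04-23", "Add feature spec"),
        ("2026-04-22", "Create constitution"),
        ("2026-04-23", "Fix typo in roadmap"),
    ]
    print(render_changelog(sample))
```

<p>Keeping the rendering separate from the <code class="language-plaintext highlighter-rouge">git log</code> call makes the grouping logic easy to test without a repository.</p>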

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Create a validation skill with the following steps:
  1. Update CHANGELOG.md
  2. Run linter &amp; auto-fix
  3. Run formatter
  4. Run test suite, report failures
  5. Ask agent to fix any test failures
  6. Update README if public API changed
  7. Commit with a standardized message format
</code></pre></div></div>
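
<p>The order of these steps matters, because later steps only make sense once the earlier ones succeed. A minimal Python sketch of that run-in-order, stop-at-first-failure behavior follows. The name <code class="language-plaintext highlighter-rouge">run_steps</code> and the stand-in commands are hypothetical; a real skill would call the project’s own linter, formatter, and test runner.</p>

```python
import subprocess
import sys


def run_steps(steps: list[tuple[str, list[str]]]) -> list[tuple[str, bool]]:
    """Run each (label, command) pair in order and stop at the first failure."""
    results = []
    for label, command in steps:
        ok = subprocess.run(command).returncode == 0
        results.append((label, ok))
        if not ok:
            break  # later steps (e.g. commit) must not run after a failure
    return results


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; swap in the real
    # linter, formatter, and test runner for an actual project.
    py = sys.executable
    demo = [
        ("lint", [py, "-c", "print('lint ok')"]),
        ("format", [py, "-c", "print('format ok')"]),
        ("tests", [py, "-c", "raise SystemExit(1)"]),
        ("commit", [py, "-c", "print('never reached')"]),
    ]
    for label, ok in run_steps(demo):
        print(label, "passed" if ok else "FAILED")
```

<p>In the demo, the failing test step ends the run, so the commit step never executes; that is the point at which the failure would go back to the agent.</p>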

<h4 id="step-4--commit-and-merge-the-replanning-branch">Step 4 — Commit and merge the replanning branch</h4>

<hr />

<h2 id="managing-ai-fatigue">Managing AI Fatigue</h2>

<p>As I begin each new feature, I establish a clean <em>flow state</em> before diving in. Running through this checklist prevents AI fatigue and context contamination between features:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────────────────────────────────────────────────┐
│                  FEATURE KICKOFF CHECKLIST                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  [ ] All previous feature work committed and merged to main?     │
│  [ ] Constitution updated with learnings from the last feature?  │
│  [ ] Roadmap reviewed — is this still the right next feature?    │
│  [ ] Agent context cleared (/clear)?                             │
│      (Ensures specs capture intent, not memory snapshots,        │
│       and focuses the agent's limited context budget on          │
│       the next task only)                                        │
│  [ ] New feature branch created?                                 │
│  [ ] Fresh feature spec prompt ready to send?                    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<h3 id="strategies-to-combat-ai-fatigue">Strategies to Combat AI Fatigue</h3>

<p>Agents can generate massive amounts of code very quickly, making the human-in-the-loop review <strong>exhausting</strong>. Use these strategies:</p>

<ul>
  <li>Review at a <strong>high level</strong> — does it match the spec and reflect your intent?</li>
  <li>Don’t nitpick variable names, CSS classes, or minor style choices</li>
  <li>For complex areas (security, database schema), implement <strong>one task group at a time</strong></li>
  <li>Use the agent’s <strong>sub-agent review</strong> for a thorough second look: ask the agent to spawn several sub-agents to do a deep review of the entire project with the feature change. Sub-agents give the review more reasoning space and preserve the main agent’s context window rather than polluting it.</li>
  <li>When you find an omission (e.g., “prop types should be in a standalone TypeScript type file”), fix the code via the agent <em>and</em> update the spec so that the rule carries over to all future features.</li>
</ul>

<hr />

<h2 id="shipping-an-mvp">Shipping an MVP</h2>

<p>If I am confident in the constitution, I sometimes build the rest of the roadmap in a single pass to produce, for example, an MVP.</p>

<hr />

<h2 id="brownfield--legacy-projects">Brownfield / Legacy Projects</h2>

<p>I have used SDD for new and existing codebases:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────────────────────────────────┬───────────────────────────────────┐
│          GREENFIELD               │            BROWNFIELD             │
│       (New project)               │       (Existing codebase)         │
├───────────────────────────────────┼───────────────────────────────────┤
│ Draft Constitution via            │ Agent generates Constitution by   │
│ conversation with agent           │ reading existing code             │
│                                   │                                   │
│ Agent asks questions to           │ Agent extracts: file structure,   │
│ discover your preferences         │ framework versions, patterns,     │
│                                   │ then asks clarifying questions    │
│                                   │                                   │
│ Roadmap starts from scratch       │ Roadmap aligns to existing        │
│ based on your product vision      │ TODO.md, issue trackers, or docs  │
│                                   │                                   │
│ Feature loop begins immediately   │ Feature loop begins immediately   │
│ after Constitution is committed   │ after Constitution is committed   │
└───────────────────────────────────┴───────────────────────────────────┘
</code></pre></div></div>

<h3 id="step-by-step-recipe-for-onboarding-a-legacy-project">Step-by-Step: Recipe for Onboarding a Legacy Project</h3>

<h4 id="step-1--gather-existing-documentation">Step 1 — Gather existing documentation</h4>

<p>Collect <code class="language-plaintext highlighter-rouge">README.md</code>, <code class="language-plaintext highlighter-rouge">TODO.md</code>, issue tracker exports, any architecture docs, and existing product requirement documents. Your legacy project might have plans in spreadsheets, Word documents, or Jira — add as much context as you can.</p>

<h4 id="step-2--send-the-legacy-constitution-prompt">Step 2 — Send the legacy Constitution prompt</h4>

<p>The prompt is nearly the same as for a greenfield project, with one key addition: tell the agent to <strong>look for roadmap items in existing artifacts</strong>.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I am introducing Spec-Driven Development to an existing project.
Please read all files in this directory and generate a Constitution:

  specs/mission.md      — based on the README and any product context
  specs/tech-stack.md   — based on the actual frameworks, versions,
                          and file structure you find in the codebase
  specs/roadmap.md      — based on TODO.md and any outstanding work

Discover and, in a sense, reverse-engineer the SDD artifacts
from the existing codebase. Ask me questions to fill any gaps
you cannot determine from the code.
</code></pre></div></div>

<h4 id="step-3--review-and-commit">Step 3 — Review and commit</h4>

<p>Review all three files, correct any wrong assumptions the agent made to fill gaps, and then commit.</p>

<h4 id="step-4--continue-with-the-standard-feature-loop">Step 4 — Continue with the standard feature loop</h4>

<p>From this point, the workflow is <strong>identical</strong> to a greenfield project. The spec is now the <em>memory of the project</em> — it does not fade. The Constitution helps align all future code changes made by the agent with what past developers have already created.</p>

<hr />

<h2 id="building-and-automating-your-own-workflow">Building and Automating Your Own Workflow</h2>

<p>Once you are comfortable with the core loop, begin automating and customizing using:</p>

<h3 id="mcp-servers">MCP servers</h3>

<p>I generally use the Context7 MCP server so that the agent can consult up-to-date library documentation.</p>

<h3 id="research-backlogs">Research Backlogs</h3>

<p>When you have an idea mid-feature that you want to explore without committing to it, use a research backlog:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want to explore switching to Turso for our database.
Research this topic with me, but do not add it to the roadmap yet.
When we are done, write a report to specs/research/turso-db.md.
</code></pre></div></div>

<p>You can later ask the agent to schedule the research on the roadmap with a link to the backlog file. As your research backlog grows, write a skill to automate the research workflow.</p>

<hr />

<h2 id="key-principles-summary">Key Principles Summary</h2>

<p>The complete SDD process distilled into one reference:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╔══════════════════════════════════════════════════════════════════════╗
║           SPEC-DRIVEN DEVELOPMENT — COMPLETE REFERENCE               ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  CONSTITUTION  (Once per project — a living document)                ║
║  ──────────────────────────────────────────────────────              ║
║  mission.md     → The WHY   (vision, audience, scope)                ║
║  tech-stack.md  → The HOW   (frameworks, constraints, schema)        ║
║  roadmap.md     → The WHAT  (features, phases, sequence)             ║
║                                                                      ║
║  FEATURE LOOP  (Repeat for every feature)                            ║
║  ──────────────────────────────────────────────────────              ║
║  1. SPECIFY    New branch  →  /clear  →  interview agent             ║
║                Commit plan.md, requirements.md, validation.md        ║
║                                                                      ║
║  2. IMPLEMENT  /clear  →  implement prompt                           ║
║                Review diffs as the agent works                       ║
║                Small, frequent commits                               ║
║                                                                      ║
║  3. VALIDATE   Code review at a high level                           ║
║                Fix via the agent (keeps specs in sync)               ║
║                Run tests &amp; validation scorecard                      ║
║                Merge feature branch to main                          ║
║                                                                      ║
║  4. REPLAN     Update the Constitution with what you learned         ║
║                Review and adjust the roadmap                         ║
║                Package repeated prompts as Agent Skills              ║
║                                                                      ║
║  ALWAYS                                                              ║
║  ──────────────────────────────────────────────────────              ║
║  • Clear context (/clear) at the start of each major step            ║
║  • Dedicated branch per feature — and per replanning cycle           ║
║  • Human-in-the-loop: YOU decide, the agent elaborates               ║
║  • Steer at the right altitude — goals, not variable names           ║
║  • When you find a gap, fix the SPEC — it is the project memory      ║
║  • Commit often — small steps compound into great results            ║
║                                                                      ║
║  ANTI-PATTERNS TO AVOID                                              ║
║  ──────────────────────────────────────────────────────              ║
║  ✗ Skipping the spec and starting implementation directly            ║
║  ✗ Editing spec files manually instead of via the agent              ║
║  ✗ Carrying context forward across features without /clear           ║
║  ✗ Nitpicking low-level code instead of reviewing intent             ║
║  ✗ Rushing to the next feature without replanning                    ║
║  ✗ Implementing a large chunk on a weak Constitution                 ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝
</code></pre></div></div>

<hr />

<h2 id="closing-thoughts">Closing Thoughts</h2>

<blockquote>
  <p>“The best code starts with a great spec.”</p>
</blockquote>

<p>The specs you craft, the Constitution you maintain, and the workflow you automate are what separate a thoughtfully engineered software product from a pile of AI-generated code that only one session ever understood.</p>

<p>Start small: pick your next feature, write a spec before writing any code, and see how much more confidently and consistently the agent delivers.</p>

<hr />]]></content><author><name>Dr. Nobel Khandaker</name></author><category term="practices" /><category term="spec-driven-development" /><category term="coding-agents" /><category term="ai" /><category term="claude-code" /><category term="workflow" /><summary type="html"><![CDATA[Spec-driven development with Claude Code — project constitutions, feature loops, validation, replanning, and agent-agnostic workflows.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://outloop.blog/assets/img/sdd_workflow.png" /><media:content medium="image" url="https://outloop.blog/assets/img/sdd_workflow.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Agentic Engineering: Taming a Legacy Codebase</title><link href="https://outloop.blog/2026/04/22/taming-legacy-codebase-with-claude.html" rel="alternate" type="text/html" title="Agentic Engineering: Taming a Legacy Codebase" /><published>2026-04-22T08:00:00-04:00</published><updated>2026-05-25T09:27:23-04:00</updated><id>https://outloop.blog/2026/04/22/taming-legacy-codebase-with-claude</id><content type="html" xml:base="https://outloop.blog/2026/04/22/taming-legacy-codebase-with-claude.html"><![CDATA[<h2 id="why-this-article-exists">Why This Article Exists</h2>

<p>Every engineering team eventually inherits a codebase that has outgrown its original design. Features were shipped, deadlines were met, and somewhere along the way the foundations quietly cracked. Hardcoded secrets found their way into source control. <code class="language-plaintext highlighter-rouge">async void</code> started creeping into timer callbacks. Collections got shared across threads without locks. A comment saying <code class="language-plaintext highlighter-rouge">// TODO: fix this properly</code> quietly turned into a permanent resident.</p>

<!--more-->

<p>This article documents how our team used Claude to audit and harden three legacy services over a multi-month effort. The services were modest in size — roughly fifteen to sixty source files each — but they sat on the critical path of a real-time control system. A crash in any of them meant downtime for physical equipment in the field. That made the usual “just rewrite it” advice completely unacceptable.</p>

<p>What follows is a practical playbook. We walk through how we used Claude for five distinct jobs:</p>

<ol>
  <li><strong>Refactoring</strong> sprawling services back into something understandable.</li>
  <li><strong>Fixing bugs and logging gaps</strong> that hid failures from operators.</li>
  <li><strong>Eliminating race conditions</strong> in collection access, timers, and state flags.</li>
  <li><strong>Improving code quality</strong> across naming, dead code, and error handling.</li>
  <li><strong>Reducing technical debt</strong> measurably, without freezing feature work.</li>
</ol>

<p>High-risk changes were verified with tests first or alongside the fix, especially around concurrency, startup behavior, and logging. Every code sample below has been obfuscated — names, domains, and specifics have been changed — but the patterns, shapes, and lessons are exactly what we encountered in production.</p>

<hr />

<h2 id="the-landscape-we-inherited">The Landscape We Inherited</h2>

<p>Three services sat at the heart of the system. A device-side proxy ran on embedded Linux hardware and bridged a local message bus to a central coordinator over SignalR. A server-side coordinator aggregated state from hundreds of connected devices and fanned out commands to operator consoles. A mobile controller gave field operators a touchscreen interface to issue commands.</p>

<p><img src="/assets/img/refactored_components.png" alt="Monolithic DataBridgeService split into transport, command dispatch, state, and peripheral control components" /></p>

<p>Across the audit documents we generated for these services, Claude surfaced well over <strong>140 issue instances</strong> across five categories:</p>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th style="text-align: right">Issues Found</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bugs and logic errors</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td>Race conditions and thread safety</td>
      <td style="text-align: right">44</td>
    </tr>
    <tr>
      <td>Security and secret management</td>
      <td style="text-align: right">9</td>
    </tr>
    <tr>
      <td>Logging and observability gaps</td>
      <td style="text-align: right">38</td>
    </tr>
    <tr>
      <td>Code quality and tech debt</td>
      <td style="text-align: right">24</td>
    </tr>
  </tbody>
</table>

<p>Those counts came from multiple audits run on different services and at different points in time, so they are best read as a backlog inventory rather than a mathematically exact “open issue count” at one instant. That distinction matters. Engineers can smell inflated metrics immediately.</p>

<p>What the numbers did tell us, reliably, was where the risk clustered: concurrency, observability, and unsafe async usage. No single human could hold that entire list in working memory while also writing code. That is exactly the kind of problem where Claude earns its keep.</p>

<h2 id="what-claude-did-and-what-human-developers-still-owned">What Claude Did, and What Human Developers Still Owned</h2>

<p>Claude was most useful in four places:</p>

<ol>
  <li><strong>Inventory.</strong> Scanning a module and producing a structured issue list with file paths, line numbers, and candidate fixes.</li>
  <li><strong>Pattern matching.</strong> Finding every variation of the same bug class: <code class="language-plaintext highlighter-rouge">SafeFireAndForget()</code> without an error callback, mutable dictionaries shared across threads, <code class="language-plaintext highlighter-rouge">Task.Delay</code> calls with unclear units, timer callbacks implemented as <code class="language-plaintext highlighter-rouge">async void</code>.</li>
  <li><strong>Drafting small fixes.</strong> Once we had a failing test or a clearly defined target, Claude was effective at drafting the minimal change.</li>
  <li><strong>Summarizing conventions.</strong> It was surprisingly good at inferring the codebase’s implicit design rules from the parts that were already clean.</li>
</ol>

<p>Humans still owned the parts that actually determine whether a system stays safe:</p>

<ol>
  <li><strong>Severity.</strong> Claude could identify a race; engineers still had to decide whether it was a theoretical code smell, a real production risk, or already serialized by some upstream mechanism.</li>
  <li><strong>Merge readiness.</strong> Many source audits included issues that were fixed on a branch, partially fixed, or still open. Humans had to reconcile audit output with reality.</li>
  <li><strong>Test strategy.</strong> Claude could suggest a fix, but engineers still had to decide whether the defect needed a unit test, a stress harness, an integration test, or simply a code review plus a reproducer.</li>
  <li><strong>Stopping.</strong> We did not pursue zero issues at any cost. Some low-severity findings were documented and left alone because the churn was not worth it.</li>
</ol>

<hr />

<h2 id="our-working-loop">Our Working Loop</h2>

<p>Every change followed the same four-phase cycle. The loop became muscle memory within the first two weeks.</p>

<p><img src="/assets/img/refactoring_cycle.png" alt="Four-phase working loop: audit, plan, implement with TDD, verify, feeding back to audit" loading="lazy" /></p>

<p>The key insight was that Claude is excellent at the first two phases — the ones humans find tedious — and genuinely helpful at the third phase when guided by an explicit failing test. The fourth phase still belongs to humans, but it becomes fast when the diff is small and the test captures the intent.</p>

<p>Throughout the effort, we produced markdown audit documents. Each one listed every issue, its severity, its file and line numbers, a short explanation of the danger, and a proposed fix. These documents became the working backlog for each service and were regenerated regularly as the code changed.</p>

<hr />

<h2 id="part-1--refactoring">Part 1 — Refactoring</h2>

<p>We asked Claude for a refactoring proposal. The useful part was not “AI architecture.” It was the dependency map: which methods touched transport, which ones mutated shared state, and which timers or callbacks crossed those boundaries. From there, the split along transport, command dispatch, state, and peripheral control became fairly obvious.</p>

<p>Rather than inventing style guides from scratch, we had Claude summarize the <em>implicit</em> conventions it observed in the existing clean code, and then used those as the refactoring targets for the rest. Three principles emerged:</p>

<ul>
  <li><strong>One responsibility per class, one mutation site per field.</strong> If a field was being written in three places, we refactored so only one method could mutate it.</li>
  <li><strong>Pass cancellation through every async boundary.</strong> No exceptions.</li>
  <li><strong>Return snapshots, not references.</strong> Any public getter on shared mutable state returned a copy.</li>
</ul>

<p>These rules sound obvious, but until Claude cataloged the violations across <strong>sixty</strong> files, we had no idea how widespread they were.</p>
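
<p>As a minimal sketch of the first and third rules (the class, field, and method names here are hypothetical and not taken from the audited services), a small registry can funnel all writes through one method and hand callers a copy instead of the live dictionary:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System.Collections.Generic;

// Hypothetical names throughout; only the shape matters.
public sealed class DeviceRegistry
{
    private readonly object _gate = new();
    private readonly Dictionary&lt;string, string&gt; _statusByDevice = new();

    // The single mutation site for the field.
    public void SetStatus(string deviceId, string status)
    {
        lock (_gate)
        {
            _statusByDevice[deviceId] = status;
        }
    }

    // Callers receive a copy, so later writes never change what they hold.
    public IReadOnlyDictionary&lt;string, string&gt; Snapshot()
    {
        lock (_gate)
        {
            return new Dictionary&lt;string, string&gt;(_statusByDevice);
        }
    }
}
</code></pre></div></div>

<p>A caller that holds the result of <code class="language-plaintext highlighter-rouge">Snapshot()</code> can enumerate it freely; a later <code class="language-plaintext highlighter-rouge">SetStatus</code> call changes the registry but not the copy. The cost is one allocation per read, which is usually acceptable for small collections.</p>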

<hr />

<h2 id="part-2--fixing-bugs-and-logging-blind-spots">Part 2 — Fixing Bugs and Logging Blind Spots</h2>

<p>The audit surfaced bugs ranging from the embarrassing to the genuinely dangerous. We will walk through two representative examples and then describe the logging work, which turned out to have the highest operational return.</p>

<h3 id="the-task-is-always-faulted-bug">The “task is always faulted” bug</h3>

<p>A senior engineer on the team had written a background task launcher that was supposed to log a critical message if the task failed. It looked something like this:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">task</span><span class="p">.</span><span class="nf">ContinueWith</span><span class="p">(</span><span class="n">t</span> <span class="p">=&gt;</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">IsFaulted</span> <span class="p">||</span> <span class="n">t</span><span class="p">.</span><span class="n">IsCompleted</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">_logger</span><span class="p">.</span><span class="nf">LogCritical</span><span class="p">(</span><span class="s">"Background task faulted! {ex}"</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">Exception</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">});</span>
</code></pre></div></div>

<p>Claude flagged this immediately. <code class="language-plaintext highlighter-rouge">ContinueWith</code> only runs <em>after</em> the antecedent completes, so <code class="language-plaintext highlighter-rouge">t.IsCompleted</code> is always true inside the callback. Every normal, successful completion was being logged as a critical failure with a null exception. Worse, the parent health-check loop was checking <code class="language-plaintext highlighter-rouge">IsFaulted || IsCompleted</code> on the <em>returned</em> continuation task, which reports completed no matter how the antecedent ended, so the health loop was restarting the task on every one-second tick. A silent restart storm had been running in production for months, masked by the “critical” log spam nobody read anymore.</p>

<p>The fix was small:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">task</span><span class="p">.</span><span class="nf">ContinueWith</span><span class="p">(</span><span class="n">t</span> <span class="p">=&gt;</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">IsFaulted</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">_logger</span><span class="p">.</span><span class="nf">LogCritical</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">Exception</span><span class="p">,</span> <span class="s">"Background task faulted"</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">},</span> <span class="n">TaskContinuationOptions</span><span class="p">.</span><span class="n">OnlyOnFaulted</span><span class="p">);</span>

<span class="k">return</span> <span class="n">task</span><span class="p">;</span> <span class="c1">// return the original, not the continuation</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">★ Insight ─────────────────────────────────────</code></p>

<ul>
  <li>The key was returning the <em>original</em> task rather than the continuation. The health-check loop needs to observe the real task’s fault status, not a continuation that always completes normally.</li>
  <li><code class="language-plaintext highlighter-rouge">TaskContinuationOptions.OnlyOnFaulted</code> is belt-and-suspenders: even if someone later changes the predicate, the continuation will only fire on the fault path.</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">─────────────────────────────────────────────────</code></p>

<h3 id="the-unit-mismatch-bug">The “unit mismatch” bug</h3>

<p>Another entry from the audit:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// waitTime is in seconds</span>
<span class="k">await</span> <span class="n">Task</span><span class="p">.</span><span class="nf">Delay</span><span class="p">(</span><span class="n">waitTime</span><span class="p">);</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Task.Delay(int)</code> interprets its argument as milliseconds. A value intended to wait thirty seconds was waiting thirty milliseconds. This bug had lived in startup code for over a year. Nobody noticed because the system mostly worked, but a handful of intermittent initialization failures were finally attributed to it once Claude surfaced the issue alongside the documentation comment that explicitly said the unit was seconds.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">await</span> <span class="n">Task</span><span class="p">.</span><span class="nf">Delay</span><span class="p">(</span><span class="n">TimeSpan</span><span class="p">.</span><span class="nf">FromSeconds</span><span class="p">(</span><span class="n">waitTime</span><span class="p">));</span>
</code></pre></div></div>

<p>We took the lesson and asked Claude to scan every <code class="language-plaintext highlighter-rouge">Task.Delay</code> call site. It found two more with ambiguous units and converted all of them to <code class="language-plaintext highlighter-rouge">TimeSpan</code>-based calls as a policy.</p>

<h3 id="the-logging-overhaul">The logging overhaul</h3>

<p>Logging was the area with the highest return on effort. The pattern was painfully consistent: important operations had no logs, unimportant operations had too many, and nothing was correlated. An operator investigating a failed command in the middle of the night had no way to trace a single request from the user interface through the coordinator into the device and back.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    BEFORE                                   AFTER
    ─────────────────                        ────────────────────────────
    [10:42:01] INFO  Cmd received            [10:42:01] INFO  CorrelationId=a7f2...
    [10:42:01] INFO  Processing                              Cmd received
    [10:42:03] WARN  Something happened      [10:42:01] INFO  CorrelationId=a7f2...
    [10:42:05] INFO  Done                                    Processing {CommandId} for {DeviceId}
                                             [10:42:03] WARN  CorrelationId=a7f2...
    Which command? Which device?                             TransportError at step {Step}
    No way to join the dots.                 [10:42:05] INFO  CorrelationId=a7f2...
                                                             Completed {CommandId} in {ElapsedMs}ms
</code></pre></div></div>

<p>We asked Claude to produce a logging plan. The plan identified 38 gaps across seven categories: missing correlation IDs, unlogged services, <code class="language-plaintext highlighter-rouge">SafeFireAndForget</code> callbacks with no error handler, missing audit trails on user-facing operations, log level misuse, missing high-performance <code class="language-plaintext highlighter-rouge">LoggerMessage</code> source-generator usage, and minimal test coverage for logging behavior.</p>

<p>The most impactful change was introducing correlation IDs at the message boundary and threading them through the pipeline via Serilog’s <code class="language-plaintext highlighter-rouge">LogContext</code> (in the <code class="language-plaintext highlighter-rouge">Serilog.Context</code> namespace). A single command could now be followed from hub entry, through the command handler, into domain events, and back to the response, all tagged with the same identifier.</p>
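
<p>A minimal sketch of that pattern, assuming the Serilog package is referenced and the logger was built with <code class="language-plaintext highlighter-rouge">Enrich.FromLogContext()</code>. The <code class="language-plaintext highlighter-rouge">CommandPipeline</code> class, its method, and the property name are illustrative assumptions, not the production code:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Threading.Tasks;
using Serilog;
using Serilog.Context;

public static class CommandPipeline
{
    // Hypothetical entry point standing in for the hub or message boundary.
    public static async Task HandleAsync(ILogger log, string commandId, string deviceId)
    {
        var correlationId = Guid.NewGuid().ToString("N");

        // Every event logged inside this scope, including after awaits,
        // carries the CorrelationId property.
        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            log.Information("Processing {CommandId} for {DeviceId}", commandId, deviceId);
            await Task.Delay(TimeSpan.FromMilliseconds(10)); // stand-in for real work
            log.Information("Completed {CommandId}", commandId);
        }
    }
}
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">Enrich.FromLogContext()</code> is missing from the logger configuration, the pushed property never reaches the log events, so that wiring is worth covering with a test.</p>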

<p>Just as important were the places where the system was failing silently. One startup path loaded the in-memory device cache using <code class="language-plaintext highlighter-rouge">SafeFireAndForget()</code>. If that task threw during initialization, the service would start “successfully” but every later device lookup would fail because the cache was empty. Another path fired safety-related operations without any durable audit trail. These were not style issues. They were production forensics failures.</p>

<p>A representative fix for silent fire-and-forget failures (obfuscated):</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Before — exception silently swallowed</span>
<span class="n">initializationTask</span><span class="p">.</span><span class="nf">SafeFireAndForget</span><span class="p">();</span>

<span class="c1">// After — critical startup failure is loud</span>
<span class="n">initializationTask</span><span class="p">.</span><span class="nf">SafeFireAndForget</span><span class="p">(</span><span class="n">ex</span> <span class="p">=&gt;</span>
    <span class="n">logger</span><span class="p">.</span><span class="nf">LogCritical</span><span class="p">(</span><span class="n">ex</span><span class="p">,</span>
        <span class="s">"Initial data fetch failed — service started with empty cache"</span><span class="p">));</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">★ Insight ─────────────────────────────────────</code></p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">SafeFireAndForget</code> helper is popular in mobile and server .NET code because it lets you call async methods from sync contexts. But its default behavior — swallow everything — is a silent failure generator. Every high-value call site needed an <code class="language-plaintext highlighter-rouge">onException</code> handler, and some startup-path failures deserved <code class="language-plaintext highlighter-rouge">LogCritical</code>, not <code class="language-plaintext highlighter-rouge">LogError</code>.</li>
  <li>The biggest logging win was not “more logs.” It was better logs: correlation, identifiers, and audit trails on operations that operators actually care about.</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">─────────────────────────────────────────────────</code></p>

<p>We added logging assertions around key paths and used them to catch regressions, but this was not a codebase with exhaustive log-level test coverage. The useful lesson was narrower: once logging becomes part of your operational contract, it deserves tests just like any other behavior.</p>

<hr />

<h2 id="part-3--eliminating-race-conditions">Part 3 — Eliminating Race Conditions</h2>

<p>This was the longest phase and the one with the most learning. Race conditions are the pathology that legacy .NET codebases are most prone to, because C# makes it very easy to share a <code class="language-plaintext highlighter-rouge">Dictionary</code> or a <code class="language-plaintext highlighter-rouge">bool</code> between threads without anything shouting at you.</p>

<p>One of the concurrency audits found 44 race conditions across a service and its related components. They fell into six archetypes:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  ARCHETYPE                                       COUNT
  ─────────────────────────────────────────────   ─────
  1. Non-thread-safe collection shared across      12
     SignalR handlers, timers, and UI thread
  2. Plain bool / enum flag read and written        9
     from different threads (no volatile)
  3. Timer callback racing with Dispose or          7
     reassignment of the same timer field
  4. async void in timer / event contexts           6
     with no try-catch
  5. Check-then-act on state that can change        6
     across an await (TOCTOU)
  6. Fire-and-forget background loops with          4
     no cancellation or restart monitoring
</code></pre></div></div>

<p>Those categories appeared across the broader set of audits as well. We address the first three below.</p>

<h3 id="archetype-1--unprotected-shared-collection">Archetype 1 — Unprotected shared collection</h3>

<p>Pattern we found, obfuscated:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">readonly</span> <span class="n">Dictionary</span><span class="p">&lt;</span><span class="n">DeviceId</span><span class="p">,</span> <span class="n">DeviceState</span><span class="p">&gt;</span> <span class="n">DeviceCollection</span> <span class="p">=</span> <span class="k">new</span><span class="p">();</span>

<span class="c1">// Written from hub callback thread</span>
<span class="n">DeviceCollection</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span>

<span class="c1">// Enumerated from UI-bound property getter</span>
<span class="k">public</span> <span class="n">IEnumerable</span><span class="p">&lt;</span><span class="n">DeviceState</span><span class="p">&gt;</span> <span class="n">AllDevices</span> <span class="p">=&gt;</span> <span class="n">DeviceCollection</span><span class="p">.</span><span class="n">Values</span><span class="p">;</span>

<span class="c1">// Cleared from state machine event</span>
<span class="n">DeviceCollection</span><span class="p">.</span><span class="nf">Clear</span><span class="p">();</span>
</code></pre></div></div>

<p>Three threads, zero synchronization. The first time a user opened the device list while a state transition fired, we got an <code class="language-plaintext highlighter-rouge">InvalidOperationException</code> from the enumerator. In a test environment it was easy to reproduce; in production it took months.</p>

<p>The fix had two reasonable shapes. For dictionaries where we needed cheap concurrent access, <code class="language-plaintext highlighter-rouge">ConcurrentDictionary&lt;TKey, TValue&gt;</code> was often the right answer. For dictionaries where we wanted atomic “swap the whole thing” semantics — for example, replacing the entire device list at login — we used an <code class="language-plaintext highlighter-rouge">ImmutableDictionary</code> and a single-reference swap:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">private</span> <span class="n">ImmutableDictionary</span><span class="p">&lt;</span><span class="n">DeviceId</span><span class="p">,</span> <span class="n">DeviceState</span><span class="p">&gt;</span> <span class="n">_devices</span> <span class="p">=</span>
    <span class="n">ImmutableDictionary</span><span class="p">&lt;</span><span class="n">DeviceId</span><span class="p">,</span> <span class="n">DeviceState</span><span class="p">&gt;.</span><span class="n">Empty</span><span class="p">;</span>

<span class="k">public</span> <span class="n">ImmutableDictionary</span><span class="p">&lt;</span><span class="n">DeviceId</span><span class="p">,</span> <span class="n">DeviceState</span><span class="p">&gt;</span> <span class="n">Devices</span> <span class="p">=&gt;</span> <span class="n">_devices</span><span class="p">;</span>

<span class="k">public</span> <span class="k">void</span> <span class="nf">ReplaceAll</span><span class="p">(</span><span class="n">IEnumerable</span><span class="p">&lt;</span><span class="n">KeyValuePair</span><span class="p">&lt;</span><span class="n">DeviceId</span><span class="p">,</span> <span class="n">DeviceState</span><span class="p">&gt;&gt;</span> <span class="n">items</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">_devices</span> <span class="p">=</span> <span class="n">items</span><span class="p">.</span><span class="nf">ToImmutableDictionary</span><span class="p">();</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">void</span> <span class="nf">AddOrUpdate</span><span class="p">(</span><span class="n">DeviceId</span> <span class="n">id</span><span class="p">,</span> <span class="n">DeviceState</span> <span class="n">state</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">ImmutableInterlocked</span><span class="p">.</span><span class="nf">AddOrUpdate</span><span class="p">(</span><span class="k">ref</span> <span class="n">_devices</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="p">(</span><span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">state</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">★ Insight ─────────────────────────────────────</code></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ImmutableInterlocked.AddOrUpdate</code> gives you atomic updates without a lock: it retries a compare-and-swap on the field until its update wins, so concurrent writers are safe, although the update delegate may run more than once. Readers get a consistent snapshot because the dictionary reference they captured is literally immutable.</li>
  <li>This was not a universal replacement for locks. In some audited code, a plain lock was simpler and safer because readers and writers also needed to coordinate side effects like event raises or timer replacement.</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">─────────────────────────────────────────────────</code></p>
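
<p>For the first shape, where cheap concurrent access matters more than whole-collection swaps, a sketch along these lines may be enough. The store and its string-based keys and values are hypothetical simplifications:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical store; string keys and values keep the sketch self-contained.
public sealed class ConcurrentDeviceStore
{
    private readonly ConcurrentDictionary&lt;string, string&gt; _devices = new();

    // Safe to call from any thread; the last writer wins per key.
    public void AddOrUpdate(string id, string state) =&gt; _devices[id] = state;

    public bool Remove(string id) =&gt; _devices.TryRemove(id, out _);

    // ToArray() takes a point-in-time copy, so callers can enumerate
    // the result while other threads keep writing.
    public IReadOnlyList&lt;KeyValuePair&lt;string, string&gt;&gt; Snapshot() =&gt; _devices.ToArray();
}
</code></pre></div></div>

<p>Enumerating a <code class="language-plaintext highlighter-rouge">ConcurrentDictionary</code> directly does not throw while writers are active, but it is not a moment-in-time view, which is why the sketch copies with <code class="language-plaintext highlighter-rouge">ToArray()</code>. Clearing and repopulating it is still not one atomic step, so the swap-based shape remains the better fit for “replace everything” operations.</p>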

<h3 id="archetype-2--plain-flag-across-threads">Archetype 2 — Plain flag across threads</h3>

<p>The smallest possible bug with the largest possible impact:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">private</span> <span class="k">static</span> <span class="kt">bool</span> <span class="n">_isBusy</span><span class="p">;</span>

<span class="k">public</span> <span class="k">static</span> <span class="k">void</span> <span class="nf">BeginWork</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">_isBusy</span> <span class="p">=</span> <span class="k">true</span><span class="p">;</span>
    <span class="n">Task</span><span class="p">.</span><span class="nf">Run</span><span class="p">(()</span> <span class="p">=&gt;</span>
    <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">_isBusy</span><span class="p">)</span>  <span class="c1">// compiler may hoist this read out of the loop</span>
        <span class="p">{</span>
            <span class="nf">DoStep</span><span class="p">();</span>
        <span class="p">}</span>
    <span class="p">});</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">static</span> <span class="k">void</span> <span class="nf">StopWork</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">_isBusy</span> <span class="p">=</span> <span class="k">false</span><span class="p">;</span>
</code></pre></div></div>

<p>The caller signals <code class="language-plaintext highlighter-rouge">StopWork</code>, but nothing guarantees the background task ever observes the write. If the JIT hoists the read out of the loop, the task spins forever and the app hangs. The fix is either <code class="language-plaintext highlighter-rouge">volatile</code> or, better, a <code class="language-plaintext highlighter-rouge">CancellationTokenSource</code>:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">private</span> <span class="k">static</span> <span class="n">CancellationTokenSource</span><span class="p">?</span> <span class="n">_cts</span><span class="p">;</span>

<span class="k">public</span> <span class="k">static</span> <span class="k">void</span> <span class="nf">BeginWork</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">_cts</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">CancellationTokenSource</span><span class="p">();</span>
    <span class="kt">var</span> <span class="n">token</span> <span class="p">=</span> <span class="n">_cts</span><span class="p">.</span><span class="n">Token</span><span class="p">;</span>
    <span class="n">Task</span><span class="p">.</span><span class="nf">Run</span><span class="p">(()</span> <span class="p">=&gt;</span>
    <span class="p">{</span>
        <span class="k">while</span> <span class="p">(!</span><span class="n">token</span><span class="p">.</span><span class="n">IsCancellationRequested</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="nf">DoStep</span><span class="p">();</span>
        <span class="p">}</span>
    <span class="p">},</span> <span class="n">token</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">static</span> <span class="k">void</span> <span class="nf">StopWork</span><span class="p">()</span>
<span class="p">{</span>
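    <span class="c1">// Take ownership atomically so concurrent stops cannot double-dispose.</span>
    <span class="kt">var</span> <span class="n">cts</span> <span class="p">=</span> <span class="n">Interlocked</span><span class="p">.</span><span class="nf">Exchange</span><span class="p">(</span><span class="k">ref</span> <span class="n">_cts</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span>
    <span class="n">cts</span><span class="p">?.</span><span class="nf">Cancel</span><span class="p">();</span>
    <span class="n">cts</span><span class="p">?.</span><span class="nf">Dispose</span><span class="p">();</span>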
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">★ Insight ─────────────────────────────────────</code></p>

<ul>
  <li>A <code class="language-plaintext highlighter-rouge">CancellationToken</code> has the cross-thread memory semantics baked in — the reader always sees the cancelled state after <code class="language-plaintext highlighter-rouge">Cancel()</code> returns. You do not have to think about <code class="language-plaintext highlighter-rouge">volatile</code> because you delegated that worry to the framework.</li>
  <li>A secondary benefit: <code class="language-plaintext highlighter-rouge">CancellationToken</code> composes. You can pass it to <code class="language-plaintext highlighter-rouge">Task.Delay</code>, <code class="language-plaintext highlighter-rouge">HttpClient.SendAsync</code>, database calls, and loop checks with a single mechanism. A <code class="language-plaintext highlighter-rouge">volatile bool</code> cannot do that.</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">─────────────────────────────────────────────────</code></p>
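
<p>To illustrate the composition point, here is a hedged sketch of a polling loop in which one token both ends the loop and interrupts the wait between steps. The <code class="language-plaintext highlighter-rouge">Worker</code> class and its parameters are assumptions made for illustration:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Threading;
using System.Threading.Tasks;

public static class Worker
{
    // Hypothetical polling loop; doStep stands in for the real work.
    public static async Task RunAsync(Action doStep, TimeSpan interval, CancellationToken token)
    {
        try
        {
            while (!token.IsCancellationRequested)
            {
                doStep();
                // The same token interrupts the wait, so a stop request
                // does not have to sit out the rest of the interval.
                await Task.Delay(interval, token);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown; treated as a normal exit in this sketch.
        }
    }
}
</code></pre></div></div>

<p>With a <code class="language-plaintext highlighter-rouge">volatile bool</code>, the loop would only notice a stop request after the full delay had elapsed. Here the cancelled <code class="language-plaintext highlighter-rouge">Task.Delay</code> throws, the loop unwinds, and the returned task completes promptly.</p>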

<h3 id="archetype-3--timer-callback-racing-its-own-field">Archetype 3 — Timer callback racing its own field</h3>

<p>This one was the most subtle. A service had a reusable timer field that was reassigned on restart:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_heartbeatTimer</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Timer</span><span class="p">(</span><span class="n">HeartbeatInterval</span><span class="p">);</span>
<span class="n">_heartbeatTimer</span><span class="p">.</span><span class="n">Elapsed</span> <span class="p">+=</span> <span class="n">OnHeartbeatElapsed</span><span class="p">;</span>
<span class="n">_heartbeatTimer</span><span class="p">.</span><span class="n">AutoReset</span> <span class="p">=</span> <span class="k">false</span><span class="p">;</span>
<span class="n">_heartbeatTimer</span><span class="p">.</span><span class="nf">Start</span><span class="p">();</span>
</code></pre></div></div>

<p>If the caller invoked this twice quickly, the first timer’s <code class="language-plaintext highlighter-rouge">Elapsed</code> handler could still fire after the field had been reassigned. The handler then mutated the “current” timer state even though it was the <em>previous</em> timer’s callback. Worse, the old timer was orphaned: nothing stopped or disposed it, so it stayed scheduled and could still run its callback one more time.</p>

<p>The fix was to wrap the swap in a lock and eagerly stop-and-dispose the prior timer:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">private</span> <span class="k">readonly</span> <span class="kt">object</span> <span class="n">_timerLock</span> <span class="p">=</span> <span class="k">new</span><span class="p">();</span>
<span class="k">private</span> <span class="n">Timer</span><span class="p">?</span> <span class="n">_heartbeatTimer</span><span class="p">;</span>

<span class="k">private</span> <span class="k">void</span> <span class="nf">ReplaceHeartbeatTimer</span><span class="p">(</span><span class="n">TimeSpan</span> <span class="n">interval</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">lock</span> <span class="p">(</span><span class="n">_timerLock</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">_heartbeatTimer</span> <span class="k">is</span> <span class="n">not</span> <span class="k">null</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="n">_heartbeatTimer</span><span class="p">.</span><span class="nf">Stop</span><span class="p">();</span>
            <span class="n">_heartbeatTimer</span><span class="p">.</span><span class="n">Elapsed</span> <span class="p">-=</span> <span class="n">OnHeartbeatElapsed</span><span class="p">;</span>
            <span class="n">_heartbeatTimer</span><span class="p">.</span><span class="nf">Dispose</span><span class="p">();</span>
        <span class="p">}</span>
        <span class="n">_heartbeatTimer</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Timer</span><span class="p">(</span><span class="n">interval</span><span class="p">.</span><span class="n">TotalMilliseconds</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="n">AutoReset</span> <span class="p">=</span> <span class="k">false</span><span class="p">,</span>
        <span class="p">};</span>
        <span class="n">_heartbeatTimer</span><span class="p">.</span><span class="n">Elapsed</span> <span class="p">+=</span> <span class="n">OnHeartbeatElapsed</span><span class="p">;</span>
        <span class="n">_heartbeatTimer</span><span class="p">.</span><span class="nf">Start</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We also audited every <code class="language-plaintext highlighter-rouge">Timer.Elapsed</code> handler we could find for <code class="language-plaintext highlighter-rouge">async void</code> lambdas. An exception thrown out of <code class="language-plaintext highlighter-rouge">async void</code> can tear down the process or vanish into an unobserved failure path depending on context. In either case, it is unacceptable in infrastructure code. The fix was to wrap the handler and surface the failure explicitly:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_disconnectTimer</span><span class="p">.</span><span class="n">Elapsed</span> <span class="p">+=</span> <span class="k">async</span> <span class="p">(</span><span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">)</span> <span class="p">=&gt;</span>
<span class="p">{</span>
    <span class="k">try</span>
    <span class="p">{</span>
        <span class="k">await</span> <span class="nf">PutOfflineAsync</span><span class="p">();</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">_publisher</span> <span class="k">is</span> <span class="n">not</span> <span class="k">null</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="k">await</span> <span class="n">_publisher</span><span class="p">.</span><span class="nf">PublishAllEventsAsync</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">catch</span> <span class="p">(</span><span class="n">Exception</span> <span class="n">ex</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">_onError</span><span class="p">?.</span><span class="nf">Invoke</span><span class="p">(</span><span class="n">ex</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">_onError</code> callback was injected at construction time, which kept the domain layer free of an <code class="language-plaintext highlighter-rouge">ILogger</code> dependency but still let failures be surfaced by the application layer. That single pattern — inject an <code class="language-plaintext highlighter-rouge">Action&lt;Exception&gt;</code> at the seam — let us rescue dozens of silent failure sites across both services.</p>

<h3 id="the-burndown">The burndown</h3>

<p><img src="/assets/img/issue_fix_burndown.png" alt="Race-condition burndown across 16 weeks, from 44 issues down to 1 acceptable residual" loading="lazy" /></p>

<p>We did not fix all 44. We explicitly chose to leave one. It was a static integer increment in a path that was already serialized by the SignalR hub pipeline — meaning a race was theoretically possible but not reachable given how the method was called. We documented the reasoning and moved on. Not every bug is worth fixing, but every bug is worth understanding.</p>

<hr />

<h2 id="part-4--reducing-technical-debt">Part 4 — Reducing Technical Debt</h2>

<p>“Technical debt” is a vague term, so we tried to reduce it to observable indicators. We tracked five, all derived from audit output:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  Indicator                           Start   Mid   Later
  ─────────────────────────────────  ──────  ────  ────
  Hardcoded secrets in source           3     0     0
  Files commented out in entirety       4     1     0
  TODO/FIXME comments                  28    19    11
  Handlers throwing NotImplemented      5     2     0
  Services without ILogger injection    4     1     0
</code></pre></div></div>

<p>The secret audit was the single highest-leverage activity we did. Claude found a hardcoded GitHub personal access token in <code class="language-plaintext highlighter-rouge">nuget.config</code> — a file not covered by <code class="language-plaintext highlighter-rouge">.gitignore</code>. We rotated the token, moved it to a CI secret, and added the file to <code class="language-plaintext highlighter-rouge">.gitignore</code> within the same hour. An AES key was hardcoded in a crypto helper; we moved it to environment-variable-backed configuration and generated a random IV per encryption instead of the all-zeros IV that had shipped.</p>

<p>The remaining eleven <code class="language-plaintext highlighter-rouge">TODO</code> comments were all triaged. Each one either got a ticket, got a documentation comment explaining why we were leaving it, or was deleted because the thing it was pointing at no longer existed.</p>

<hr />

<h2 id="testing-strategy-throughout">Testing Strategy Throughout</h2>

<p>For the risky changes, the workflow was:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    ┌──────────────────────────────────────────────────────┐
    │  1. Write the failing test that captures the defect. │
    │  2. Ask Claude to propose the minimal fix.           │
    │  3. Apply the fix, run the test, watch it go green.  │
    │  4. Run the full suite to catch regressions.         │
    └──────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>For race condition fixes this was not trivial. Concurrency bugs do not reliably reproduce. We leaned on two techniques:</p>

<ul>
  <li><strong>Deterministic stress harnesses.</strong> For collection-access issues, we wrote tests that spun up dozens of tasks all hammering the same API, then asserted invariants. Before the fix, these tests failed within ten iterations. After, they ran ten thousand iterations clean.</li>
  <li><strong>Injectable time.</strong> For timer-related issues, we replaced <code class="language-plaintext highlighter-rouge">System.Timers.Timer</code> with a small abstraction that accepted a test clock, so the test could advance time manually and observe every callback fire in a known order.</li>
</ul>

<p>A representative stress test (obfuscated):</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Fact</span><span class="p">]</span>
<span class="k">public</span> <span class="k">async</span> <span class="n">Task</span> <span class="nf">Concurrent_AddOrUpdate_does_not_throw_or_lose_data</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">var</span> <span class="n">service</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">DeviceRegistry</span><span class="p">();</span>
    <span class="k">const</span> <span class="kt">int</span> <span class="n">WRITERS</span> <span class="p">=</span> <span class="m">16</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">int</span> <span class="n">ITERATIONS</span> <span class="p">=</span> <span class="m">500</span><span class="p">;</span>

    <span class="kt">var</span> <span class="n">writers</span> <span class="p">=</span> <span class="n">Enumerable</span><span class="p">.</span><span class="nf">Range</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">WRITERS</span><span class="p">).</span><span class="nf">Select</span><span class="p">(</span><span class="n">w</span> <span class="p">=&gt;</span> <span class="n">Task</span><span class="p">.</span><span class="nf">Run</span><span class="p">(()</span> <span class="p">=&gt;</span>
    <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">ITERATIONS</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="kt">var</span> <span class="n">id</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">DeviceId</span><span class="p">(</span><span class="n">w</span> <span class="p">*</span> <span class="n">ITERATIONS</span> <span class="p">+</span> <span class="n">i</span><span class="p">);</span>
            <span class="n">service</span><span class="p">.</span><span class="nf">AddOrUpdate</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">new</span> <span class="nf">DeviceState</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">online</span><span class="p">:</span> <span class="k">true</span><span class="p">));</span>
        <span class="p">}</span>
    <span class="p">})).</span><span class="nf">ToArray</span><span class="p">();</span>

    <span class="kt">var</span> <span class="n">reader</span> <span class="p">=</span> <span class="n">Task</span><span class="p">.</span><span class="nf">Run</span><span class="p">(()</span> <span class="p">=&gt;</span>
    <span class="p">{</span>
        <span class="kt">int</span> <span class="n">seen</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">seen</span> <span class="p">&lt;</span> <span class="n">WRITERS</span> <span class="p">*</span> <span class="n">ITERATIONS</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="n">seen</span> <span class="p">=</span> <span class="n">service</span><span class="p">.</span><span class="n">Devices</span><span class="p">.</span><span class="n">Count</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">});</span>

    <span class="k">await</span> <span class="n">Task</span><span class="p">.</span><span class="nf">WhenAll</span><span class="p">(</span><span class="n">writers</span><span class="p">.</span><span class="nf">Concat</span><span class="p">(</span><span class="k">new</span><span class="p">[]</span> <span class="p">{</span> <span class="n">reader</span> <span class="p">}));</span>

    <span class="n">service</span><span class="p">.</span><span class="n">Devices</span><span class="p">.</span><span class="nf">Should</span><span class="p">().</span><span class="nf">HaveCount</span><span class="p">(</span><span class="n">WRITERS</span> <span class="p">*</span> <span class="n">ITERATIONS</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Before the fix, this test intermittently threw <code class="language-plaintext highlighter-rouge">InvalidOperationException</code> from the reader. After the synchronization change, it ran clean repeatedly.</p>

<p>Not every defect was amenable to strict test-first development. Some issues were better handled as:</p>

<ul>
  <li>a small reproducer plus code review,</li>
  <li>a logging assertion on a failure path,</li>
  <li>or an integration test added after the fix when the seam became testable.</li>
</ul>

<p>The practical lesson was not “TDD solves legacy systems.” It was narrower: if you are changing concurrency, startup, or observability logic, you need some repeatable proof that the system got safer.</p>

<hr />

<h2 id="what-worked-what-surprised-us">What Worked, What Surprised Us</h2>

<p>Three things worked far better than expected.</p>

<p><strong>The audit was the unlock.</strong> The moment we had a single markdown document listing every issue by severity, with file and line references, the work became tractable. Without it we would have spent the whole project arguing about priorities.</p>

<p><strong>Claude’s plans were often better than our gut priorities.</strong> The recommended fix order consistently put safety-critical issues ahead of ergonomics, even when a developer might have reached for the more satisfying refactor first. Following the plan rather than the vibe saved us from merging a “nice” change while a real bug was still live.</p>

<p><strong>Small, tight edits compounded.</strong> The average fix was under twenty lines. The average PR was under two hundred. We deliberately resisted the temptation to bundle changes. Small PRs reviewed fast; fast review meant more PRs per week; more PRs per week meant the backlog actually burned down instead of drifting.</p>

<p>Two things surprised us.</p>

<p><strong>Logging gave the best ROI of any category.</strong> We initially ranked logging as a medium-priority cleanup. In practice, the correlation-ID work and startup-failure visibility changed how fast we could debug real incidents.</p>

<p><strong>The residual race conditions were the right call to leave alone.</strong> We originally planned to hit zero. Once the count dropped into the single digits, the remaining ones were all in paths that were serialized by upstream mechanisms or would only fire under contention we could not realistically reproduce. Fixing them would have churned a lot of code for no measurable benefit. “Document and move on” turned out to be a valid move.</p>

<hr />

<h2 id="a-playbook-you-can-steal">A Playbook You Can Steal</h2>

<p>If you are staring down a legacy codebase of your own, here is the compact version of the playbook:</p>

<ol>
  <li><strong>Ask Claude to audit a single module.</strong> Start narrow. Get the markdown output.</li>
  <li><strong>Tag every issue with severity.</strong> Safety-critical first, data-integrity next, everything else by effort.</li>
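  <li><strong>Write the failing test before the fix.</strong> Make this the default, even for one-line changes. Where a defect cannot be reproduced in a test, fall back to a small reproducer plus code review, and add the test once the seam is testable.</li>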
  <li><strong>Keep PRs small.</strong> One issue, one PR. Your reviewers will thank you and your burndown will be honest.</li>
  <li><strong>Regenerate the audit monthly.</strong> New issues creep in. Catching them when they are one line old is cheap.</li>
  <li><strong>Invest in logging early.</strong> Correlation IDs and structured log tests repay their cost within weeks.</li>
  <li><strong>Delete more than you write.</strong> Commented-out files, stale TODOs, unused branches — none of them are getting better with age.</li>
</ol>

<p>The core realization is that Claude is not a replacement for engineering judgment. It is a tireless pair for the parts of engineering that humans are worst at: the inventory, the tabular bookkeeping, the “did we check every <code class="language-plaintext highlighter-rouge">Dictionary</code> in the codebase?” grind. Pairing Claude’s completeness with a human’s priority-setting turns the dreaded “tech debt week” into something closer to a steady drumbeat of small, confident improvements.</p>

<p>If I were distilling this into one rule for engineering teams, it would be this: use Claude to widen the search space, not to waive review. Let it find the patterns, draft the boring fixes, and keep the backlog honest. But keep the decisions about severity, test depth, and merge readiness with engineers who understand the system’s actual failure modes.</p>

<p>And every time one of those improvements lands — every time a race condition that used to page the on-call engineer at three in the morning stops paging anyone — you feel the debt getting lighter. That is the whole point.</p>

<hr />

<p><em>Written from the trenches of a real mission-critical .NET codebase. Names, domains, and code samples have been obfuscated; the patterns and lessons are exactly as we found them.</em></p>]]></content><author><name>Dr. Nobel Khandaker</name></author><category term="systems" /><category term="refactoring" /><category term="technical-debt" /><category term="dotnet" /><category term="concurrency" /><category term="reliability" /><summary type="html"><![CDATA[How a small engineering team used Claude to pay down years of technical debt across three safety-critical .NET services — without breaking production.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://outloop.blog/assets/img/refactoring_cycle.png" /><media:content medium="image" url="https://outloop.blog/assets/img/refactoring_cycle.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Reverb: A Semantic Cache That Knows When Its Answers Go Stale</title><link href="https://outloop.blog/2026/04/22/reverb-semantic-cache-with-knowledge-aware-invalidation.html" rel="alternate" type="text/html" title="Reverb: A Semantic Cache That Knows When Its Answers Go Stale" /><published>2026-04-22T07:00:00-04:00</published><updated>2026-05-31T10:55:04-04:00</updated><id>https://outloop.blog/2026/04/22/reverb-semantic-cache-with-knowledge-aware-invalidation</id><content type="html" xml:base="https://outloop.blog/2026/04/22/reverb-semantic-cache-with-knowledge-aware-invalidation.html"><![CDATA[<p>Caching LLM responses seems, at first glance, like a simple optimization.
Record the prompt, record the answer, serve the answer next time the same
prompt comes in. In practice it is a surprisingly deep problem, and the two
standard approaches both fail in characteristic ways. Exact-match caches miss
on anything short of a byte-identical prompt, which is almost never how users
actually ask questions. TTL-based caches serve confidently-stale answers for
hours after the underlying knowledge base has changed — the classic
hallucination vector dressed up as “we cached it.”</p>

<p><a href="https://github.com/nobelk/reverb">Reverb</a> is a Go library and standalone
service that addresses both failure modes. It combines a <em>two-tier cache</em>
(exact SHA-256 match, then embedding-cosine similarity) with <strong>knowledge-aware
invalidation</strong>: every cached entry tracks the source documents it was derived
from, and a change-data-capture pipeline evicts entries by <em>causality</em> when
their sources change. TTLs become a backstop, not the primary correctness
mechanism.</p>

<!--more-->

<h2 id="two-tiered-approach">Two-tiered approach</h2>

<p>The exact-match tier is cheap and essential — a SHA-256 hash of the
normalized prompt plus namespace and model ID, looked up in a store.
Sub-millisecond latency, perfect precision, zero false positives. It catches
retries, duplicated user requests, and programmatic callers that issue the
same prompt on a schedule. In production workloads this tier alone typically
handles 20–40% of traffic, depending on how much of the workload is
human-in-the-loop.</p>

<p>The semantic tier is where it gets interesting. Two users phrasing the same
question differently — <em>“how do I reset my password?”</em> vs. <em>“password reset
help”</em> — should get the same answer. The tier computes an embedding for the
incoming prompt, searches a vector index for top-k nearest neighbors above a
configurable cosine-similarity threshold (0.95 by default), and returns the
closest hit. Latency climbs to ~50ms, which is still one to two orders of
magnitude faster than actually calling the LLM, and recall improves
substantially.</p>

<p>The fallthrough contract is the part that makes it work: exact misses do not
fail, they degrade to a semantic lookup. Semantic misses do not fail, they
degrade to a real LLM call that then writes back through both tiers. Three
states — exact hit, semantic hit, miss — all with correct fallback.</p>
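
<p>The fallthrough can be sketched in a few lines. This is a minimal Python illustration, not Reverb’s Go API: <code class="language-plaintext highlighter-rouge">TwoTierCache</code>, <code class="language-plaintext highlighter-rouge">exact_key</code>, the <code class="language-plaintext highlighter-rouge">embed</code> and <code class="language-plaintext highlighter-rouge">call_llm</code> callables, and the brute-force scan are all hypothetical stand-ins, and the normalization rule (lowercase, collapsed whitespace) is an assumption.</p>

```python
import hashlib
import math


def exact_key(namespace, model_id, prompt):
    """SHA-256 over namespace, model ID, and a normalized prompt (assumed rule)."""
    normalized = " ".join(prompt.lower().split())
    payload = f"{namespace}\x00{model_id}\x00{normalized}".encode()
    return hashlib.sha256(payload).hexdigest()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    """Hypothetical two-tier cache: exact hash first, then embedding similarity."""

    def __init__(self, embed, call_llm, threshold=0.95):
        self.embed = embed          # prompt -> list of floats
        self.call_llm = call_llm    # prompt -> answer string
        self.threshold = threshold  # minimum cosine similarity for a semantic hit
        self.exact = {}             # exact key -> answer
        self.vectors = []           # (embedding, answer) pairs, scanned linearly

    def lookup(self, namespace, model_id, prompt):
        # Tier 1: exact match on the hashed, normalized prompt.
        key = exact_key(namespace, model_id, prompt)
        if key in self.exact:
            return self.exact[key], "exact"

        # Tier 2: nearest stored embedding at or above the threshold.
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.vectors:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return best_answer, "semantic"

        # Miss: call the model, then write back through both tiers.
        answer = self.call_llm(prompt)
        self.exact[key] = answer
        self.vectors.append((query, answer))
        return answer, "miss"
```

<p>The sketch leaves out namespace scoping in the semantic tier and all invalidation. Its only job is to show that each tier degrades to the next instead of failing.</p>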

<h2 id="architecture">Architecture</h2>

<p>Reverb is built around clean interfaces for each pluggable component, which
is what lets it scale down to an in-memory dev setup and up to Redis plus
HNSW plus NATS-driven CDC without code changes. The top-level flow:</p>

<p><img src="/assets/img/reverb_architecture.png" alt="Reverb architecture: two-tier cache with SHA-256 exact match and embedding similarity, connected to CDC-driven invalidation" /></p>

<p>Notice that the invalidation path and the lookup path share
no state beyond the store itself.</p>

<ul>
  <li>CDC events can fire at any time — a webhook from your CMS, a NATS JetStream
message, a polling loop against a content API.</li>
  <li>The invalidation engine consults the lineage index to figure out which
specific cache entries to evict. Every other cached entry keeps its hit rate.</li>
</ul>

<p>Two interesting design choices are:</p>

<ul>
  <li>the two-tier fallthrough, which means the cache has a meaningful answer for
<em>most</em> queries, not just byte-identical ones</li>
  <li>the lineage-based invalidation, which means stale-knowledge hallucinations
stop being an accepted cost of caching</li>
</ul>

<p>Neither is a novel technique in isolation — CDN cache tags and hierarchical CPU caches use similar approaches. The novelty is in recognizing that LLM responses are derived data with explicit sources, and that derived-data systems have
known-correct invalidation disciplines that work just as well when the
derivation is a <em>transformer inference</em>.</p>

<h2 id="lineage-as-the-first-class-concept">Lineage as the first-class concept</h2>

<p>When you <code class="language-plaintext highlighter-rouge">Store()</code> an entry, you hand Reverb a list of <code class="language-plaintext highlighter-rouge">sources</code> — the
documents that contributed to the LLM’s answer. Each source is a <code class="language-plaintext highlighter-rouge">(source_id,
content_hash)</code> pair. The lineage index maintains a <em>bidirectional mapping</em>:
source IDs to the set of cache entries they contributed to, and cache entries
to the set of sources they depend on. When a CDC listener reports a change
for <code class="language-plaintext highlighter-rouge">source_id = "doc:password-guide"</code>, the engine asks the lineage index for
all dependent entries and walks through them:</p>

<ul>
  <li>If the source has been <em>deleted</em> (zero hash), invalidate every dependent
entry.</li>
  <li>If the source still exists but the <code class="language-plaintext highlighter-rouge">content_hash</code> differs from the stored
value, invalidate.</li>
  <li>If the content hash matches (the webhook fired but nothing actually
changed), do nothing. Idempotency is free.</li>
</ul>

<p>Recording lineage is also a contract the application has to honor. Reverb does <strong>not</strong>
infer provenance by itself. Your retrieval layer, tool wrapper, or orchestration
code must tell it which source documents actually contributed to the answer.
If you omit a source, Reverb cannot invalidate on that source’s change; if you
over-attach unrelated sources, you will evict too aggressively. The cache is
only as causally correct as the lineage you record at write time.</p>

<p>Compare this to the naive alternative — TTL-based invalidation, tuned
conservatively at, say, 6 hours. During those 6 hours, the cache can serve
any number of answers derived from a document that changed 5 minutes after
the entry was cached. The user experiences a confident, fluent, completely
wrong answer. With lineage-based invalidation, the moment your CMS pushes
the webhook, the relevant cache entries are gone.</p>

<p>The operational sequence is short and predictable:</p>

<ol>
  <li>At <code class="language-plaintext highlighter-rouge">t0</code>, the application stores an answer plus <code class="language-plaintext highlighter-rouge">sources=[("doc:password-guide", hash_v1)]</code>.</li>
  <li>At <code class="language-plaintext highlighter-rouge">t1</code>, the source document changes and the CMS emits a CDC event carrying <code class="language-plaintext highlighter-rouge">hash_v2</code>.</li>
  <li>At <code class="language-plaintext highlighter-rouge">t2</code>, Reverb looks up every cache entry linked to <code class="language-plaintext highlighter-rouge">doc:password-guide</code>.</li>
  <li>At <code class="language-plaintext highlighter-rouge">t3</code>, entries whose stored hash is still <code class="language-plaintext highlighter-rouge">hash_v1</code> are evicted from the store and vector index.</li>
  <li>At <code class="language-plaintext highlighter-rouge">t4</code>, the next lookup misses the cache and regenerates against fresh knowledge.</li>
</ol>
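
<p>The sequence above, together with the three invalidation rules, fits in a small sketch. The Python below is illustrative only and is not Reverb’s implementation: <code class="language-plaintext highlighter-rouge">LineageIndex</code>, <code class="language-plaintext highlighter-rouge">record</code>, <code class="language-plaintext highlighter-rouge">on_change</code>, and the empty-string <code class="language-plaintext highlighter-rouge">DELETED</code> marker (standing in for the zero hash) are assumed names.</p>

```python
from collections import defaultdict

DELETED = ""  # assumed stand-in for the "zero hash" that marks a deleted source


class LineageIndex:
    """Hypothetical bidirectional map between sources and cache entries."""

    def __init__(self):
        self.entries_by_source = defaultdict(set)  # source_id -> {entry_id}
        self.sources_by_entry = {}                 # entry_id -> {source_id: content_hash}

    def record(self, entry_id, sources):
        """Register the (source_id, content_hash) pairs an entry was derived from."""
        self.sources_by_entry[entry_id] = dict(sources)
        for source_id, _ in sources:
            self.entries_by_source[source_id].add(entry_id)

    def on_change(self, source_id, new_hash):
        """Apply one CDC event and return the entry IDs it evicted."""
        evicted = []
        for entry_id in sorted(self.entries_by_source.get(source_id, ())):
            stored_hash = self.sources_by_entry[entry_id][source_id]
            # Evict on deletion or on a hash mismatch; a matching hash is a no-op.
            if new_hash == DELETED or new_hash != stored_hash:
                evicted.append(entry_id)
        for entry_id in evicted:
            # Remove the entry from both directions of the mapping.
            for linked_source in self.sources_by_entry.pop(entry_id):
                self.entries_by_source[linked_source].discard(entry_id)
        return evicted
```

<p>A real engine would also remove the evicted IDs from the store and the vector index. The point here is that a repeated or no-op event evicts nothing, and entries with unrelated sources are never touched.</p>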

<p>This is not a new idea in the abstract — database query caches have done
tuple-level invalidation for decades, and CDN cache tag invalidation is a
production pattern at scale. The contribution is noticing <em>that LLM response
caches have exactly the same dependency structure and applying the same
discipline</em>.</p>

<h2 id="the-pluggable-backends-discipline">The pluggable-backends discipline</h2>

<p>Reverb exposes four interfaces, each with two or more implementations:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">embedding.Provider</code></strong> — OpenAI, Ollama, or a deterministic fake for
tests. The fake (<code class="language-plaintext highlighter-rouge">fake.New(n)</code>) is a hash-based embedder that produces
stable vectors for stable inputs, which makes integration tests
reproducible without requiring an API key. This is the kind of detail that
signals the library was written by someone who actually runs tests in CI.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">vector.Index</code></strong> — a brute-force flat index (O(n)) and an HNSW index
(approximately logarithmic search in practice). You start with flat, and
when you outgrow it you swap in HNSW with no other code changes.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">store.Store</code></strong> — memory for dev, Redis for production, BadgerDB for
embedded use cases. The <code class="language-plaintext highlighter-rouge">conformance</code> subpackage ships a shared test suite
that every store implementation must pass, which is how the interface
stays honest over time.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">cdc.Listener</code></strong> — webhook, polling, NATS. Each is a different
architectural fit: webhook for push-based CMS integrations, polling for
systems you cannot modify, NATS for high-volume event streams.</li>
</ul>
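
<p>To make the idea of a hash-based fake embedder concrete, here is a minimal Python sketch. It is an assumption about the general technique, not a port of Reverb’s Go <code class="language-plaintext highlighter-rouge">fake.New(n)</code>: the function name <code class="language-plaintext highlighter-rouge">fake_embed</code>, the per-dimension SHA-256 scheme, and the default of eight dimensions are all hypothetical.</p>

```python
import hashlib
import math


def fake_embed(text, dims=8):
    """Deterministic unit vector derived from hashes; carries no semantic meaning."""
    raw = []
    for i in range(dims):
        # One digest per dimension, so any dimension count is supported.
        digest = hashlib.sha256(f"{i}:{text}".encode()).digest()
        # Map the first four bytes to a float in [-1.0, 1.0].
        raw.append(int.from_bytes(digest[:4], "big") / 0xFFFFFFFF * 2.0 - 1.0)
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]
```

<p>Identical inputs always produce identical vectors, so exact and near-exact lookups behave reproducibly in tests. Different inputs produce unrelated vectors, which means a fake like this cannot exercise real paraphrase matching.</p>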

<p>The interface-driven design makes Reverb realistic to adopt: start with all-in-memory (zero
infrastructure), move to Redis plus HNSW when you outgrow a single process,
swap the CDC listener when your source-of-truth changes. None of those
migrations need to touch the application code.</p>

<h2 id="deployment-shapes">Deployment shapes</h2>

<p>Reverb can be deployed in three forms, depending on how you want to use it:</p>

<ol>
  <li><strong>A Go library</strong>, linked directly into an application. Fastest path,
lowest latency, no extra process to manage. The <code class="language-plaintext highlighter-rouge">pkg/reverb</code> facade is
the full public API.</li>
  <li><strong>A standalone HTTP server</strong> (<code class="language-plaintext highlighter-rouge">cmd/reverb</code>). Language-agnostic REST API
under <code class="language-plaintext highlighter-rouge">/v1/</code> — <code class="language-plaintext highlighter-rouge">lookup</code>, <code class="language-plaintext highlighter-rouge">store</code>, <code class="language-plaintext highlighter-rouge">invalidate</code>, <code class="language-plaintext highlighter-rouge">entries/{id}</code>, <code class="language-plaintext highlighter-rouge">stats</code>,
<code class="language-plaintext highlighter-rouge">healthz</code>. This is the path if your application is in Python or
TypeScript and you want to cache centrally.</li>
  <li><strong>A standalone gRPC server</strong>, same operations as the HTTP API but with
protobuf-defined contracts in <code class="language-plaintext highlighter-rouge">pkg/server/proto/reverb.proto</code>. Clients
in any language can generate their own stubs.</li>
</ol>

<p>The HTTP and gRPC servers share the same underlying <code class="language-plaintext highlighter-rouge">Client</code>, so you can
deploy both protocols side-by-side from the same binary and pick whichever
your calling environment prefers.</p>

<h2 id="where-this-fits">Where this fits</h2>

<p>I think semantic caching is about to become table stakes for production
LLM systems in the same way that ordinary HTTP caching became table stakes
for the web in the 2000s. The <em>cost pressure</em> is enormous — every cache hit
is an LLM call that did not happen — and the latency improvement is
user-perceptible. But “cache LLM responses” is the easy version of the problem.
The hard version is <em>“cache LLM responses correctly, even when the world
the LLM is reasoning about changes out from under the cache.”</em> That is the
problem Reverb is built to solve.</p>

<hr />

<p>Reverb handles the <em>knowledge freshness</em> dimension of agent reliability. For the <em>trust</em> side — knowing which agents to rely on based on observed behavior — see <a href="/2026/04/22/multitrust-subjective-logic-for-multi-agent-systems.html">MultiTrust</a>. For detecting when agents get stuck waiting on each other, see <a href="/2026/04/22/tangle-deadlock-detection-for-langgraph.html">Tangle</a>.</p>]]></content><author><name>Dr. Nobel Khandaker</name></author><category term="systems" /><category term="llm" /><category term="caching" /><category term="go" /><category term="distributed-systems" /><category term="reliability" /><summary type="html"><![CDATA[LLM response caches usually invalidate by TTL and hope for the best. Reverb invalidates by causality — when a source document changes, only the cached answers derived from it get evicted.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://outloop.blog/assets/img/reverb_architecture.png" /><media:content medium="image" url="https://outloop.blog/assets/img/reverb_architecture.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">MultiTrust: Subjective Logic as a Runtime for Multi-Agent Trust</title><link href="https://outloop.blog/2026/04/22/multitrust-subjective-logic-for-multi-agent-systems.html" rel="alternate" type="text/html" title="MultiTrust: Subjective Logic as a Runtime for Multi-Agent Trust" /><published>2026-04-22T06:00:00-04:00</published><updated>2026-05-25T09:27:23-04:00</updated><id>https://outloop.blog/2026/04/22/multitrust-subjective-logic-for-multi-agent-systems</id><content type="html" xml:base="https://outloop.blog/2026/04/22/multitrust-subjective-logic-for-multi-agent-systems.html"><![CDATA[<p>In multiagent systems, trust in an agent is a valuable asset: it lets other agents reason about future collaboration, coordination, and planning.
Most “trust score” implementations in agentic systems are a single float
between 0 and 1. That number is doing two jobs at once — representing how much
positive evidence an agent has accumulated, and representing how <em>confident</em>
the system is in that judgment — and it collapses them into a value that makes
the two indistinguishable. A brand-new agent with no history and a seasoned
agent that has run 10,000 tasks with an even win/loss record both land at 0.5.
The scalar has no room to say <em>“I don’t know yet.”</em></p>

<p><a href="https://github.com/nobelk/multitrust">MultiTrust</a> fixes this by reaching for
the right math. It represents trust as a <strong>Subjective Logic opinion</strong> — a
triple of (belief, disbelief, uncertainty) that sums to one — and exposes the
whole machinery as an MCP server, so any Model Context Protocol-aware agent
can consult it as a standard tool call.</p>

<!--more-->

<h2 id="what-subjective-logic-buys-you">What Subjective Logic buys you</h2>

<p>Subjective Logic, developed by Audun Jøsang in the early 2000s, is a
probabilistic logic designed precisely for reasoning under uncertainty where
the uncertainty itself must be represented. An opinion looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">opinion</span> <span class="o">=</span> <span class="nc">Opinion</span><span class="p">(</span>
    <span class="n">belief</span>      <span class="o">=</span> <span class="mf">0.60</span><span class="p">,</span>   <span class="c1"># evidence supports trusting the agent
</span>    <span class="n">disbelief</span>   <span class="o">=</span> <span class="mf">0.12</span><span class="p">,</span>   <span class="c1"># evidence against
</span>    <span class="n">uncertainty</span> <span class="o">=</span> <span class="mf">0.28</span><span class="p">,</span>   <span class="c1"># we don't have enough data to be sure
</span>    <span class="n">base_rate</span>   <span class="o">=</span> <span class="mf">0.50</span><span class="p">,</span>   <span class="c1"># prior: how trustworthy is a "typical" agent?
</span><span class="p">)</span>
<span class="c1"># belief + disbelief + uncertainty == 1.0 (invariant)
</span></code></pre></div></div>

<p>The projected trust score — what you use to make a gating decision — is
<code class="language-plaintext highlighter-rouge">belief + uncertainty × base_rate</code>. This is the clever bit. A vacuous
opinion <code class="language-plaintext highlighter-rouge">(0, 0, 1)</code> projects to <code class="language-plaintext highlighter-rouge">base_rate</code>: in the absence of evidence, you
fall back to the population prior. As evidence accumulates, uncertainty
shrinks, and the projection converges on the observed share of positive evidence. You
get cold-start behavior and seasoned-agent behavior from the same formula,
with no special-casing.</p>
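<p>To make the projection concrete, here is a minimal sketch of an opinion type with the sum-to-one invariant and the projection formula. The <code class="language-plaintext highlighter-rouge">Opinion</code> class below is an illustrative stand-in written for this post, not MultiTrust’s own class, and the <code class="language-plaintext highlighter-rouge">projected()</code> method name is an assumption.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Illustrative stand-in for a Subjective Logic opinion (not MultiTrust's class)."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def __post_init__(self) -> None:
        # Enforce the invariant belief + disbelief + uncertainty == 1.
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"components must sum to 1, got {total}")

    def projected(self) -> float:
        # The uncertainty mass is shared out according to the prior (base rate).
        return self.belief + self.uncertainty * self.base_rate


vacuous = Opinion(0.0, 0.0, 1.0)      # no evidence at all
example = Opinion(0.60, 0.12, 0.28)   # the opinion shown earlier

print(vacuous.projected())   # falls back to the base rate
print(example.projected())   # belief plus a share of the uncertainty
```

<p>The vacuous opinion projects to the base rate with no special case in the code; the same method handles every other opinion.</p>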

<p>Under the hood, evidence maps to opinions through the Beta distribution:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>belief      = positive / (positive + negative + W)
disbelief   = negative / (positive + negative + W)
uncertainty = W        / (positive + negative + W)
</code></pre></div></div>

<p>where <code class="language-plaintext highlighter-rouge">W</code> is a prior weight (typically 2). Every call to <code class="language-plaintext highlighter-rouge">submit_evidence()</code>
is an update to the positive/negative counters; the opinion recomputes
deterministically. Apart from <code class="language-plaintext highlighter-rouge">W</code>, the mapping has no magic numbers and no
hand-tuned constants that drift out of sync with reality.</p>

<p>That sounds abstract until you compare cold start against real history. With
<code class="language-plaintext highlighter-rouge">base_rate = 0.5</code> and <code class="language-plaintext highlighter-rouge">W = 2</code>, the mapping makes the distinction explicit:</p>

<table>
  <thead>
    <tr>
      <th>Agent state</th>
      <th style="text-align: right">Evidence <code class="language-plaintext highlighter-rouge">(positive, negative)</code></th>
      <th style="text-align: right">Opinion <code class="language-plaintext highlighter-rouge">(b, d, u)</code></th>
      <th style="text-align: right">Projected trust</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Brand-new agent</td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">(0, 0)</code></td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">(0.00, 0.00, 1.00)</code></td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">0.50</code></td>
    </tr>
    <tr>
      <td>Early promising run</td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">(3, 0)</code></td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">(0.60, 0.00, 0.40)</code></td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">0.80</code></td>
    </tr>
    <tr>
      <td>Mixed but well-observed</td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">(50, 50)</code></td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">(0.49, 0.49, 0.02)</code></td>
      <td style="text-align: right"><code class="language-plaintext highlighter-rouge">0.50</code></td>
    </tr>
  </tbody>
</table>

<p>The important comparison is the first row against the third. Both project to
<code class="language-plaintext highlighter-rouge">0.50</code>, but they mean opposite things. The brand-new agent is at <code class="language-plaintext highlighter-rouge">0.5</code>
because the system has no evidence and is falling back to the prior. The
seasoned but inconsistent agent is at <code class="language-plaintext highlighter-rouge">0.5</code> because the system has a lot of
evidence and that evidence is genuinely split. A scalar score hides that
difference; the opinion keeps it visible.</p>
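<p>The table can be reproduced in a few lines. This is a sketch of the evidence mapping described above, not MultiTrust’s implementation; the function names <code class="language-plaintext highlighter-rouge">opinion_from_evidence</code> and <code class="language-plaintext highlighter-rouge">projected_trust</code> are hypothetical.</p>

```python
def opinion_from_evidence(positive, negative, prior_weight=2.0):
    """Map evidence counts to a (belief, disbelief, uncertainty) triple."""
    total = positive + negative + prior_weight
    return positive / total, negative / total, prior_weight / total


def projected_trust(opinion, base_rate=0.5):
    """Project an opinion triple to a scalar: belief + uncertainty * base_rate."""
    belief, _disbelief, uncertainty = opinion
    return belief + uncertainty * base_rate


rows = {
    "Brand-new agent": (0, 0),
    "Early promising run": (3, 0),
    "Mixed but well-observed": (50, 50),
}
for label, (pos, neg) in rows.items():
    b, d, u = opinion_from_evidence(pos, neg)
    trust = projected_trust((b, d, u))
    print(f"{label:24s} b={b:.2f} d={d:.2f} u={u:.2f} projected={trust:.2f}")
```

<p>The first and third rows print the same projected value, but their uncertainty components sit at opposite ends of the range, which is the signal a caller can act on.</p>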

<h2 id="architecture">Architecture</h2>

<p>MultiTrust is organized around a single async orchestrator, <code class="language-plaintext highlighter-rouge">TrustManager</code>,
with pluggable backends for storage, evidence accumulation, and exposure. The
MCP server is one of several entry points — you can also use the library
directly, gate async functions with decorators, or export/import snapshots
between environments.</p>

<p><img src="/assets/img/multitrust_architecture.png" alt="MultiTrust architecture: MCP server and decorators feed evidence through TrustManager into pluggable storage backends" /></p>

<p>The flow is deliberately one-directional:</p>

<ul>
  <li>Callers submit observations as <code class="language-plaintext highlighter-rouge">Evidence</code> records (agent, authority,
positive count, negative count, an optional rule name).</li>
  <li>The <code class="language-plaintext highlighter-rouge">TrustManager</code> fuses those into Subjective Logic opinions using the
canonical operators — cumulative fusion for independent authorities,
averaging for redundant ones.</li>
  <li>Opinions are persisted in the trust store.</li>
  <li>When asked for a trust score, the manager applies time decay (opinions
drift toward vacuous at a configurable half-life), projects the current
opinion, and returns the scalar.</li>
</ul>
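<p>The fusion and decay steps can be sketched in evidence space. For opinions built from evidence counts with the same prior weight and base rate, cumulative fusion gives the same result as pooling the raw counts, which the code below checks numerically. The decay function is an assumption about how a half-life could be applied (scaling evidence weight exponentially with age); the post does not spell out MultiTrust’s exact decay model, and all function names here are hypothetical. Averaging fusion is not shown.</p>

```python
def to_opinion(positive, negative, prior_weight=2.0):
    """Evidence counts to a (belief, disbelief, uncertainty) triple."""
    total = positive + negative + prior_weight
    return positive / total, negative / total, prior_weight / total


def cumulative_fuse(a, b):
    """Cumulative fusion of two opinions with uncertainty > 0 and equal base rates."""
    (b1, d1, u1), (b2, d2, u2) = a, b
    denom = u1 + u2 - u1 * u2
    return (
        (b1 * u2 + b2 * u1) / denom,
        (d1 * u2 + d2 * u1) / denom,
        (u1 * u2) / denom,
    )


def decay_evidence(positive, negative, age_seconds, half_life_seconds):
    """Assumed decay model: evidence weight halves every half-life."""
    factor = 0.5 ** (age_seconds / half_life_seconds)
    return positive * factor, negative * factor


# Two independent authorities observed the same agent.
fused = cumulative_fuse(to_opinion(8, 2), to_opinion(1, 4))
pooled = to_opinion(8 + 1, 2 + 4)   # pooling the raw counts instead
print(fused)
print(pooled)

# After ten half-lives almost no evidence is left, so the opinion is near vacuous.
stale = to_opinion(*decay_evidence(8, 2, age_seconds=10 * 3600, half_life_seconds=3600))
print(stale)
```

<p>Because decayed evidence shrinks toward zero while the prior weight stays fixed, the uncertainty component grows back toward one, which is the “drift toward vacuous” behavior described in the list.</p>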

<p>The <code class="language-plaintext highlighter-rouge">EvidenceLedger</code> is the piece that pulls its weight in production. It
stores the <em>individual</em> observations that contributed to an opinion, with
authority IDs and rule names. When something goes wrong and you need to
defend a trust decision — <em>why did we route this request to agent X?</em> —
<code class="language-plaintext highlighter-rouge">explain_trust()</code> produces a structured breakdown showing which authorities
and <em>which rules</em> moved the score, <em>by how much</em>, and <em>when</em>.</p>

<p>A representative explanation looks less like a mystery score and more like an
audit trail:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>agent: fact-checker
current opinion: b=0.31 d=0.46 u=0.23 base_rate=0.50
projected trust: 0.425
top contributors:
  - validator / factual_consistency : -0.18  (7 negative observations)
  - latency_monitor / timeout_rate  : -0.05  (3 degraded responses)
  - editor / accepted_corrections   : +0.04  (2 successful recoveries)
decision at threshold 0.60: blocked
</code></pre></div></div>

<p>That is the practical advantage of carrying belief, disbelief, uncertainty,
authority, and rule names all the way through the runtime: when the system
changes its behavior, you can inspect the reason instead of reverse-engineering
it from a number.</p>
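<p>One way to produce this kind of breakdown is leave-one-out attribution: recompute the projected score without each contributor’s evidence and report the difference. The sketch below illustrates that idea only. The <code class="language-plaintext highlighter-rouge">Observation</code> record and the <code class="language-plaintext highlighter-rouge">explain</code> function are hypothetical, the attribution method is my assumption rather than what <code class="language-plaintext highlighter-rouge">explain_trust()</code> actually does, and the numbers it prints are not meant to match the sample output above. Timestamps are left out for brevity.</p>

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    authority: str
    rule: str
    positive: int
    negative: int


def projected(positive, negative, base_rate=0.5, prior_weight=2.0):
    """Projected trust from raw counts: belief + uncertainty * base_rate."""
    total = positive + negative + prior_weight
    return (positive + prior_weight * base_rate) / total


def explain(observations, base_rate=0.5):
    """Return the score and each (authority, rule) pair's leave-one-out effect on it."""
    grouped = defaultdict(lambda: [0, 0])
    for obs in observations:
        grouped[(obs.authority, obs.rule)][0] += obs.positive
        grouped[(obs.authority, obs.rule)][1] += obs.negative
    pos_all = sum(p for p, _ in grouped.values())
    neg_all = sum(n for _, n in grouped.values())
    score = projected(pos_all, neg_all, base_rate)
    deltas = {
        key: score - projected(pos_all - p, neg_all - n, base_rate)
        for key, (p, n) in grouped.items()
    }
    # Largest movers first, regardless of direction.
    return score, sorted(deltas.items(), key=lambda item: -abs(item[1]))


ledger = [
    Observation("validator", "factual_consistency", 0, 7),
    Observation("latency_monitor", "timeout_rate", 0, 3),
    Observation("editor", "accepted_corrections", 2, 0),
]
score, contributors = explain(ledger)
print(f"projected trust: {score:.3f}")
for (authority, rule), delta in contributors:
    print(f"  - {authority} / {rule}: {delta:+.3f}")
```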

<h2 id="a-motivating-example">A motivating example</h2>

<p>The repository ships a
<a href="https://github.com/nobelk/multitrust/blob/main/examples/hallucination_firewall.py"><code class="language-plaintext highlighter-rouge">hallucination_firewall.py</code></a>
demo that captures the intended use case. A research pipeline has a
fact-checker agent whose accuracy silently degrades — perhaps the underlying
model was updated, perhaps a prompt regression slipped in, perhaps the
content it’s checking has drifted out of its training distribution. Each
failed fact-check is submitted as negative evidence against the agent.
Within a dozen or so observations, the opinion shifts enough that
<code class="language-plaintext highlighter-rouge">is_trusted(threshold=0.6)</code> returns false, and the orchestrator gates the
fact-checker out of the pipeline <em>before</em> its mistakes reach the final
answer.</p>
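<p>The gating behavior is easy to simulate with the evidence mapping from earlier. The sketch below is not the repository’s demo; it assumes a base rate of 0.5, a prior weight of 2, no time decay, and that an agent counts as trusted while its projected score is at or above the threshold. The function names are hypothetical.</p>

```python
def projected_trust(positive, negative, base_rate=0.5, prior_weight=2.0):
    """Projected trust from raw evidence counts."""
    total = positive + negative + prior_weight
    belief = positive / total
    uncertainty = prior_weight / total
    return belief + uncertainty * base_rate


def failures_until_blocked(positive, threshold=0.6, limit=1000):
    """Count consecutive failures needed before the agent drops below the threshold."""
    negative = 0
    while projected_trust(positive, negative) >= threshold:
        negative += 1
        if negative > limit:
            raise RuntimeError("threshold never crossed within limit")
    return negative


# Agents with a clean record of different lengths, then a run of failures.
for history in (10, 15, 30):
    print(history, "successes, blocked after", failures_until_blocked(history), "failures")
```

<p>The number of failures needed grows with the length of the prior clean record, which is why time decay matters: without it, a long history would keep a degrading agent above the threshold for longer.</p>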

<p>The critical thing is that this happens <em>gradually and mathematically</em>, not
through a hand-tuned heuristic. The same framework handles the other
direction too — agents recover trust as they accumulate positive evidence,
and the time-decay mechanism ensures ancient evidence stops dominating
current behavior.</p>

<p>If you are building a multi-agent system where different agents have
different reliability profiles — and in practice, every non-trivial
multi-agent system has this — you eventually need a way to represent and
reason about that. Rolling a scalar score is the obvious first move, and it
will be wrong in the three places that matter: <em>cold start</em>, <em>recovery after
degradation</em>, and <em>explainability</em>. Subjective Logic is a two-decades-old,
well-studied framework that gets all three right. MultiTrust is a small,
modern, MCP-native implementation of it. The combination of principled math
and standard-protocol exposure is, I think, the shape this category of tool
should take.</p>

<hr />

<p>MultiTrust addresses the <em>trust</em> dimension of multi-agent reliability. Two companion pieces cover adjacent failure modes: <a href="/2026/04/22/tangle-deadlock-detection-for-langgraph.html">Tangle</a> detects deadlocks and livelocks when agents form circular waits, and <a href="/2026/04/22/reverb-semantic-cache-with-knowledge-aware-invalidation.html">Reverb</a> ensures cached LLM responses don’t go stale when the underlying knowledge changes.</p>]]></content><author><name>Dr. Nobel Khandaker</name></author><category term="systems" /><category term="agents" /><category term="trust" /><category term="mcp" /><category term="subjective-logic" /><category term="reliability" /><summary type="html"><![CDATA[Scalar trust scores pretend certainty they do not have. MultiTrust models trust as a Subjective Logic opinion — belief, disbelief, uncertainty — and exposes it as an MCP tool any agent can call.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://outloop.blog/assets/img/multitrust_architecture.png" /><media:content medium="image" url="https://outloop.blog/assets/img/multitrust_architecture.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Tangle: Deadlock and Livelock Detection for LangGraph Agents</title><link href="https://outloop.blog/2026/04/22/tangle-deadlock-detection-for-langgraph.html" rel="alternate" type="text/html" title="Tangle: Deadlock and Livelock Detection for LangGraph Agents" /><published>2026-04-22T05:00:00-04:00</published><updated>2026-05-31T10:55:04-04:00</updated><id>https://outloop.blog/2026/04/22/tangle-deadlock-detection-for-langgraph</id><content type="html" xml:base="https://outloop.blog/2026/04/22/tangle-deadlock-detection-for-langgraph.html"><![CDATA[<p>Multi-agent LLM workflows are, from a concurrency standpoint, small distributed
systems. They hold resources, they wait on each other, and — like every other
distributed system we have ever built — they can get stuck. The failure mode is
worse than an outright crash: <em>no exception is raised</em>, <em>no timer fires</em>, <em>no agent knows anything is wrong</em>. The workflow just stops producing tokens. The operator sees a spinner.</p>

<p><a href="https://github.com/nobelk/tangle">Tangle</a> is a small Python library that
catches this class of failure in real time for LangGraph workflows (and, via
OpenTelemetry, for anything else). It reuses an idea that has been sitting in
operating-systems textbooks since 1972 — the Wait-For Graph — and applies it at
the agent layer, where the same topology has quietly reappeared. For livelocks, which leave no cycle in the graph, the current implementation adds repeated-pattern detection over message digests.</p>

<!--more-->

<h2 id="the-failure-mode">The failure mode</h2>

<p>Consider a four-agent pipeline: <code class="language-plaintext highlighter-rouge">researcher → writer → reviewer → editor</code>.
Each agent, under certain states, may wait on output from another. Introduce a
back-edge — say, <code class="language-plaintext highlighter-rouge">editor → researcher</code> for a re-research pass — and the
dependency graph now contains a cycle. If every agent in the cycle is
simultaneously in its waiting state, none of them can advance. None of them
will ever advance. The workflow is deadlocked.</p>

<p>Livelock is the subtler sibling: no circular wait, but two agents bounce the
same rejected message back and forth forever. A reviewer rejects a draft, the
writer revises in a way that changes nothing material, the reviewer rejects
the revision. Tokens keep being spent. Progress is zero.</p>

<p>In practice, a deadlock trace looks like this:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>researcher waiting for writer
writer waiting for reviewer
reviewer waiting for editor
editor waiting for researcher   # closing edge; cycle exists now
</code></pre></div></div>

<p>At that moment, you do not need a timeout to guess the workflow is stuck. The
structure itself is enough: every agent in the cycle is waiting on another
agent in the same cycle, so no further progress is possible without external
intervention.</p>

<p>Both failures are detectable in principle. The question is how to detect them
cheaply enough that instrumentation doesn’t dominate the workflow’s own cost.</p>

<h2 id="architecture">Architecture</h2>

<p>Tangle separates <em>event ingestion</em> from <em>detection</em> from <em>resolution</em>. The
three stages are deliberately independent — you can swap SDK hooks for OTLP
spans, switch cycle detection to livelock detection per event type, and chain
resolvers in any order. The shape of the system:</p>

<p><img src="/assets/img/tangle_architecture.png" alt="Tangle architecture: event ingestion from LangGraph hooks and OpenTelemetry feeding cycle and livelock detectors" /></p>

<p>Events flow in from one of three sources. Each event is a small, typed record
(e.g., <code class="language-plaintext highlighter-rouge">REGISTER</code>, <code class="language-plaintext highlighter-rouge">WAIT_FOR</code>, <code class="language-plaintext highlighter-rouge">RELEASE</code>, <code class="language-plaintext highlighter-rouge">SEND</code>, <code class="language-plaintext highlighter-rouge">CANCEL</code>, <code class="language-plaintext highlighter-rouge">COMPLETE</code>). They hit
<code class="language-plaintext highlighter-rouge">TangleMonitor.process_event()</code>, which updates the Wait-For Graph and
dispatches to the appropriate detector: <code class="language-plaintext highlighter-rouge">WAIT_FOR</code> events touch cycle
detection, <code class="language-plaintext highlighter-rouge">SEND</code> events touch livelock pattern matching. When either fires,
the resolver chain runs in order and halts on the first resolver that
succeeds.</p>

<h2 id="why-a-wait-for-graph">Why a Wait-For Graph?</h2>

<p>The Wait-For Graph (WFG) is one of those classical constructs that keeps
reappearing in disguise. Holt described it in 1972 for kernel deadlock
detection. Database engines use it for transaction lock cycles. Applications
built on lock services like Chubby and ZooKeeper reason about it implicitly. The
insight in Tangle is that an LLM agent holding a conversational turn is, for
the purposes of progress analysis, <em>isomorphic</em> to a process holding a
resource. Same graph, different vertices.</p>

<p>That matters because cycle detection on a WFG is a well-understood problem with
well-understood complexity. Tangle uses two complementary algorithms:</p>

<ol>
  <li><strong>Incremental DFS on edge-add.</strong> When a new <code class="language-plaintext highlighter-rouge">WAIT_FOR</code> edge is added, follow
existing edges from its target to see whether you can reach its source; if you
can, the new edge closes a cycle. O(V+E) worst case, but in practice tiny
because multi-agent graphs are shallow.</li>
  <li><strong>Periodic Kahn’s-algorithm scan</strong> over the whole graph. A topological sort
that fails is a cycle that exists. This is the belt-and-suspenders pass
that catches anything the incremental detector might race against during
concurrent edits.</li>
</ol>
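<p>Both checks fit in a small class. The sketch below is a simplified illustration of the two algorithms, not Tangle’s code: the <code class="language-plaintext highlighter-rouge">WaitForGraph</code> class and its method names are hypothetical, and it omits edge removal on <code class="language-plaintext highlighter-rouge">RELEASE</code> as well as any locking for concurrent edits.</p>

```python
from collections import defaultdict, deque


class WaitForGraph:
    """Minimal wait-for graph; an edge a -> b means "a is waiting for b"."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_wait(self, waiter, target):
        """Add the edge and return the cycle it closes, or None."""
        path = self._find_path(target, waiter)
        self.edges[waiter].add(target)
        return [waiter] + path if path is not None else None

    def _find_path(self, start, goal):
        # Iterative DFS over existing edges, from the new edge's target to its source.
        stack, seen = [(start, [start])], {start}
        while stack:
            node, path = stack.pop()
            if node == goal:
                return path
            for nxt in self.edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, path + [nxt]))
        return None

    def has_cycle(self):
        """Full scan with Kahn's algorithm: leftover nodes mean a cycle exists."""
        nodes = set(self.edges) | {t for ts in self.edges.values() for t in ts}
        indegree = {n: 0 for n in nodes}
        for targets in self.edges.values():
            for t in targets:
                indegree[t] += 1
        ready = deque(n for n in nodes if indegree[n] == 0)
        removed = 0
        while ready:
            node = ready.popleft()
            removed += 1
            for t in self.edges.get(node, ()):
                indegree[t] -= 1
                if indegree[t] == 0:
                    ready.append(t)
        return removed != len(nodes)


wfg = WaitForGraph()
trace = [("researcher", "writer"), ("writer", "reviewer"),
         ("reviewer", "editor"), ("editor", "researcher")]
for waiter, target in trace:
    cycle = wfg.add_wait(waiter, target)
    if cycle:
        print("deadlock:", " -> ".join(cycle))
print("full scan finds a cycle:", wfg.has_cycle())
```

<p>Replaying the four-agent trace from earlier, the first three edges add nothing suspicious and the fourth one returns the full cycle at the moment it is added.</p>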

<p>Livelock is different — no cycle appears in the graph. Instead, Tangle
fingerprints each <code class="language-plaintext highlighter-rouge">SEND</code> event’s message payload with
<a href="https://xxhash.com/">xxhash</a> (chosen for speed over cryptographic strength —
this is a signal, not a security claim) and keeps the last N digests in a
ring buffer. When the same digest reappears more than <code class="language-plaintext highlighter-rouge">livelock_min_repeats</code>
times in the window, the detector fires. No semantic understanding of the
message is required; identical repeated content <em>is</em> the signal.</p>
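<p>A sliding-window digest counter along these lines can be sketched with the standard library. This is an illustration, not Tangle’s detector: the class and parameter names are hypothetical, <code class="language-plaintext highlighter-rouge">hashlib.blake2b</code> stands in for xxhash so the example has no third-party dependency, and I read “more than <code class="language-plaintext highlighter-rouge">livelock_min_repeats</code> times” as a strict comparison.</p>

```python
import hashlib
from collections import Counter, deque


class LivelockDetector:
    """Flags a payload whose digest shows up too often in a sliding window."""

    def __init__(self, window=50, min_repeats=3):
        self.window = window
        self.min_repeats = min_repeats
        self.recent = deque()      # digests in arrival order
        self.counts = Counter()    # digest -> occurrences inside the window

    def observe(self, payload: bytes) -> bool:
        """Record a message; return True once its digest exceeds min_repeats."""
        # blake2b is a stand-in for xxhash: a fingerprint, not a security claim.
        digest = hashlib.blake2b(payload, digest_size=8).digest()
        self.recent.append(digest)
        self.counts[digest] += 1
        if len(self.recent) > self.window:
            evicted = self.recent.popleft()
            self.counts[evicted] -= 1
            if self.counts[evicted] == 0:
                del self.counts[evicted]
        return self.counts[digest] > self.min_repeats


detector = LivelockDetector(window=10, min_repeats=3)
messages = [b"draft v1"] + [b"rejected: tone"] * 4
fired = [detector.observe(m) for m in messages]
print(fired)
```

<p>Keeping a counter next to the ring buffer makes each observation constant-time, at the cost of missing loops whose messages differ by even one byte.</p>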

<p>That distinction matters operationally. Deadlock detection here is structural:
if a Wait-For Graph contains a cycle, the workflow is blocked in a precise,
checkable sense. Livelock detection is heuristic: Tangle is looking for repeated
message patterns that strongly suggest non-progress, not proving a theorem about
all possible non-progress states. That is still a useful line to draw in
production. You can treat deadlocks as mechanically certain and livelocks as
high-confidence warnings that deserve intervention.</p>

<h2 id="the-langgraph-integration">The LangGraph integration</h2>

<p>What makes this practical for day-to-day use is that instrumentation is
two decorators:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">tangle</span> <span class="kn">import</span> <span class="n">TangleConfig</span><span class="p">,</span> <span class="n">TangleMonitor</span>
<span class="kn">from</span> <span class="n">tangle.integrations.langgraph</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">tangle_node</span><span class="p">,</span>
    <span class="n">tangle_conditional_edge</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">config</span> <span class="o">=</span> <span class="nc">TangleConfig</span><span class="p">(</span>
    <span class="n">resolution</span><span class="o">=</span><span class="sh">"</span><span class="s">cancel_youngest</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">livelock_min_repeats</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">monitor</span> <span class="o">=</span> <span class="nc">TangleMonitor</span><span class="p">(</span><span class="n">config</span><span class="o">=</span><span class="n">config</span><span class="p">,</span> <span class="n">on_detection</span><span class="o">=</span><span class="k">print</span><span class="p">)</span>

<span class="nd">@tangle_node</span><span class="p">(</span><span class="n">monitor</span><span class="p">,</span> <span class="n">agent_id</span><span class="o">=</span><span class="sh">"</span><span class="s">reviewer</span><span class="sh">"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">reviewer</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">{</span><span class="sh">"</span><span class="s">feedback</span><span class="sh">"</span><span class="p">:</span> <span class="nf">review</span><span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="sh">"</span><span class="s">draft</span><span class="sh">"</span><span class="p">])}</span>

<span class="nd">@tangle_conditional_edge</span><span class="p">(</span><span class="n">monitor</span><span class="p">,</span> <span class="n">from_agent</span><span class="o">=</span><span class="sh">"</span><span class="s">reviewer</span><span class="sh">"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">route_after_review</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">state</span><span class="p">[</span><span class="sh">"</span><span class="s">feedback</span><span class="sh">"</span><span class="p">]</span> <span class="o">==</span> <span class="sh">"</span><span class="s">approved</span><span class="sh">"</span><span class="p">:</span>
        <span class="k">return</span> <span class="sh">"</span><span class="s">__end__</span><span class="sh">"</span>
    <span class="k">return</span> <span class="sh">"</span><span class="s">writer</span><span class="sh">"</span>   <span class="c1"># back-edge — potential loop
</span></code></pre></div></div>

<p>The decorators emit <code class="language-plaintext highlighter-rouge">REGISTER</code>, <code class="language-plaintext highlighter-rouge">SEND</code>, <code class="language-plaintext highlighter-rouge">WAIT_FOR</code>, and <code class="language-plaintext highlighter-rouge">RELEASE</code> events
transparently. Existing LangGraph code keeps working. Tracking is activated
per-invocation by threading a <code class="language-plaintext highlighter-rouge">tangle_workflow_id</code> through the state dict, so
you can roll out detection to a subset of production traffic without changing
the graph definition.</p>

<p>For non-LangGraph workflows (or for multi-language stacks), Tangle can
reconstruct the same events from <em>OpenTelemetry spans</em>, which means any tracing
instrumentation you already have becomes deadlock-aware for free.</p>

<h2 id="resolution-not-just-detection">Resolution, not just detection</h2>

<p>Detection without a response is an alert that wakes someone up at 3am. Tangle
ships several built-in resolvers, for example:</p>

<ul>
  <li><strong>Alert</strong> — the cheap default. Hand a structured <code class="language-plaintext highlighter-rouge">Detection</code> to a callback,
let the application decide.</li>
  <li><strong>Cancel youngest</strong> — kill the most recently joined agent in the cycle. In
practice this is the right default for review/revise loops: it breaks the
cycle with minimal loss of context.</li>
  <li><strong>Tiebreaker prompt injection</strong> — for livelocks, inject a system message
that explicitly names the repeated pattern and asks the agent to change
tack. Cheaper than restarting the workflow.</li>
  <li><strong>Escalate</strong> — POST the detection to an external webhook for human or
upstream-service intervention.</li>
</ul>

<p>The chain executes in order and stops on the first success. Configure it once;
the per-detection behavior emerges from the config, not from scattered
try/except blocks in agent code.</p>
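<p>The first-success chain is a small pattern in its own right. The sketch below shows the control flow only; the <code class="language-plaintext highlighter-rouge">Detection</code> record, the resolver functions, and <code class="language-plaintext highlighter-rouge">run_chain</code> are hypothetical stand-ins rather than Tangle’s API, and treating an exception as a failed resolver is my assumption.</p>

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Detection:
    kind: str      # "deadlock" or "livelock"
    agents: tuple  # agents involved, oldest first


Resolver = Callable[[Detection], bool]  # returns True when it handled the detection


def run_chain(detection: Detection, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Run resolvers in order; return the name of the first one that succeeds."""
    for resolver in resolvers:
        try:
            if resolver(detection):
                return resolver.__name__
        except Exception:
            # Assumed policy: a resolver that raises counts as a failure,
            # and the chain moves on to the next one.
            continue
    return None


def tiebreaker_prompt(detection):
    # Only applicable to livelocks in this sketch.
    return detection.kind == "livelock"


def cancel_youngest(detection):
    print("cancelling", detection.agents[-1])
    return True


def alert(detection):
    print("alert:", detection)
    return True


chain = [tiebreaker_prompt, cancel_youngest, alert]
print(run_chain(Detection("deadlock", ("writer", "reviewer")), chain))
print(run_chain(Detection("livelock", ("writer", "reviewer")), chain))
```

<p>Ordering the chain from least to most disruptive means the cheap fix is tried first and the alert only fires when nothing else applied.</p>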

<h2 id="where-this-fits">Where this fits</h2>

<p>If you are running LangGraph in production, especially with conditional edges
or multi-agent negotiation patterns, you have almost certainly hit a workflow
that hung. The standard mitigation — a coarse wall-clock timeout — is a
blunt instrument: it catches deadlocks eventually, but at the cost of
cancelling any slow-but-healthy run that exceeds the budget. Tangle’s
contribution is to give the same workflow a <em>structural</em> reason to cancel
(a cycle in the WFG, a repeated digest pattern) rather than a <em>temporal</em> one
(we waited too long). That distinction matters at scale, because it
decouples correctness from tail latency.</p>

<p>The approach generalizes past LangGraph. Any system where autonomous
components exchange messages and occasionally wait on each other — agent
frameworks, workflow orchestrators, multi-model ensembles — has the same
failure modes. Tangle is an early, careful implementation of what I suspect
will become a valuable tool for building reliable, fault-tolerant agentic infrastructure: <em>progress monitors</em> that
treat liveness as a first-class property, not a property you check by
inference after everything has already gone quiet.</p>

<hr />

<p>Tangle covers the <em>liveness</em> dimension of multi-agent reliability — detecting when workflows stop making progress. For the <em>trust</em> layer — modeling which agents are reliable based on accumulated evidence — see <a href="/2026/04/22/multitrust-subjective-logic-for-multi-agent-systems.html">MultiTrust</a>. For ensuring cached responses stay fresh when source knowledge changes, see <a href="/2026/04/22/reverb-semantic-cache-with-knowledge-aware-invalidation.html">Reverb</a>.</p>]]></content><author><name>Dr. Nobel Khandaker</name></author><category term="systems" /><category term="agents" /><category term="langgraph" /><category term="reliability" /><category term="distributed-systems" /><summary type="html"><![CDATA[Multi-agent LangGraph workflows can hang silently when agents form circular waits. Tangle borrows a 50-year-old operating-systems idea — the Wait-For Graph — to catch it in milliseconds.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://outloop.blog/assets/img/tangle_architecture.png" /><media:content medium="image" url="https://outloop.blog/assets/img/tangle_architecture.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>