outloop.blog

Agentic Engineering: From Architecture Document to Delivery Plan

2026-04-24T09:00:00-04:00

Architecture documents are often treated as the end of design work. In an effective engineering organization, they are the beginning of delivery work. The architecture document and the developer design decisions are usually converted into a concrete, executable task backlog for the engineering team. The engineering team lead and the program or product manager work together to perform this conversion.

Recently, I worked on creating such a backlog for the rewrite of a complex piece of industrial software. The source design describes replacing a legacy SignalR/ASP.NET edge server with a Go-based middleware service that coordinates field device systems, operator controller apps, and admin dashboards at remote industrial sites. It specifies the architectural style, transport strategy, runtime behavior, safety constraints, persistence rules, observability expectations, security model, test strategy, and rollout approach.

We started with a strong architecture document, but it did not tell a team which work must happen first, which work can happen in parallel, which ambiguities must be resolved before sprint planning, or how to turn a safety requirement like “E-Stop p99 < 100 ms” (E-Stop being the emergency stop signal) into stories, acceptance criteria, test gates, and release evidence.

This is where agentic planning became useful for us. Agents can read the architecture document, extract its delivery-relevant facts, challenge gaps, and synthesize a backlog that preserves the architecture’s intent. Our goal was not to have an agent invent a project plan. The goal was to have agents compile, cross-check, and structure the plan from the design.

This article walks through that process using the anonymized Jira delivery plan as the case study.

The Source Material

The architecture document defines an edge-management middleware service with these major characteristics:

A Go modular monolith using hexagonal architecture, also known as Ports and Adapters.
A domain core responsible for device group management, field-device finite state machines, ownership rules, E-Stop propagation, and device group’s limit computation.
ZeroMQ for real-time communication with field devices and controller apps.
REST APIs for admin dashboard and auth sidecar communication.
Watermill for cold-path command routing.
Go channels for the hot sensor and safety path.
PostgreSQL with an ORM for durable state.
Authentication and authorization through a sidecar.
Configuration through a sidecar.
Hosted error tracking and metrics for operational observability.
Docker Compose as the per-site deployment target.

It also defines constraints that are not optional:

E-Stop must reach all field devices within 100 ms at p99.
Safety and control messages must not be silently dropped.
Restarts must be safe by default.
Devices must be treated as offline until reconnection proves otherwise.
Controller ownership after restart is only a logical association until an operator explicitly reconfirms motion-affecting operations.
Partial network partition inside a device group causes disband and operator alert.
The system must support up to 100 field devices, 10 controller apps, and 10 admin dashboards.
Production rollout starts at Customer A, then Customer B, with rollback to the legacy system.

An agent cannot ignore any of that. A useful plan has to preserve those constraints and turn them into executable work.

The Planning Problem

The hard part is that architecture documents are organized for understanding, while delivery plans are organized for execution.

The architecture has sections like:

Transport strategy by message criticality.
Hexagonal architecture design.
Runtime messaging.
Hot path versus cold path.
Graceful startup, restart, and shutdown.
Network partition and state recovery.
Data and persistence.
Observability and SLOs.
Security architecture.
Testing strategies.
Rollout and migration.

A delivery plan needs different questions answered:

What must be built first so the rest of the team can work?
Which requirements are safety-critical and need stronger evidence?
Which architecture decisions imply reusable work tracks?
Which external dependencies can block progress?
Where should quality gates live?
What belongs in sprint stories versus milestones versus release criteria?
What assumptions must be resolved before implementation starts?

Agentic planning is useful because agents are good at repeatedly transforming structured information across levels of abstraction. In this case, the process turned a 30-page architecture design into a six-milestone delivery plan plus a scoped placeholder for audio/video streaming.

The important point here is the review loop. Agents accelerate the conversion, but the human team members remain responsible for whether the plan is coherent, safe, and aligned with real team constraints.

Step 1: Extract Delivery-Relevant Facts

The first agent task is not to generate stories. It is to extract facts.

For the anonymized service, the extraction pass organized the architecture document into planning inputs:

Product scope: fleet management, user management, state management, message routing, testing, observability, perception, automation hooks, and audio/video streaming.
Non-goals: dashboard UI design, controller app architecture, legacy gateway connection, controller app logging pipeline, wireless and physical E-Stop systems handled outside the middleware service, and strict compliance implementation.
Runtime components: Go service, ZeroMQ transport, Watermill router, PostgreSQL, an ORM, structured logging, hosted observability, and Docker Compose.
Domain responsibilities: FSM, device group management rules, ownership eligibility, most-restrictive limit computation, E-Stop propagation, idempotency.
Message classes: safety, control, telemetry, administrative, peripheral.
State categories: checkpointed, reconstructable, and not persisted.
SLOs: command success rate, E-Stop latency, availability, dead-letter rate, error budget.
Testing obligations: unit tests, mutation tests, fuzz tests, integration tests, contract tests, E2E simulation, chaos, performance, soak.
Rollout obligations: data migration, red-green cutover, rollback, and Customer A / Customer B site acceptance testing.

This fact extraction is where agents prevent a common planning failure: treating all sections of the architecture document as equal. They are not equal. Some sections describe implementation mechanics. Some describe business behavior. Some describe operational proof. Some describe release risk.

For example, “ZeroMQ client” is an implementation fact. “E-Stop must reach all field devices within 100 ms” is a safety constraint. “legacy document layout to ORM entities” is a migration risk. A good plan needs all three, but it should not handle them at the same level.

Step 2: Convert Architecture Boundaries into Work Boundaries

The architecture chose hexagonal boundaries because the middleware service sits between multiple transport protocols and a safety-critical domain. That choice is also a planning gift.

The domain core can be built and tested separately from adapters. The ZeroMQ and REST adapters can be developed against ports. The persistence layer can implement outbound ports without leaking ORM details into domain code. The hot path and warm/cold path can become separate implementation tracks.
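To make the port boundary concrete, here is a minimal Go sketch of a domain rule tested against an in-memory fake instead of a real adapter. The names `DeviceStore`, `GroupService`, and `memStore`, and the one-group-per-device rule, are illustrative assumptions and do not come from the architecture document.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceStore is a hypothetical outbound port. The domain depends on this
// interface, never on an ORM or PostgreSQL type.
type DeviceStore interface {
	SaveMembership(deviceID, groupID string) error
	GroupOf(deviceID string) (string, bool)
}

// ErrAlreadyGrouped is returned when a device is already a group member.
var ErrAlreadyGrouped = errors.New("device already belongs to a group")

// GroupService is a hypothetical slice of the domain core.
type GroupService struct {
	store DeviceStore
}

// Join applies an illustrative rule: a device may belong to one group only.
func (s *GroupService) Join(deviceID, groupID string) error {
	if _, ok := s.store.GroupOf(deviceID); ok {
		return ErrAlreadyGrouped
	}
	return s.store.SaveMembership(deviceID, groupID)
}

// memStore is an in-memory fake adapter used for domain tests.
type memStore struct{ groups map[string]string }

func newMemStore() *memStore { return &memStore{groups: map[string]string{}} }

func (m *memStore) SaveMembership(deviceID, groupID string) error {
	m.groups[deviceID] = groupID
	return nil
}

func (m *memStore) GroupOf(deviceID string) (string, bool) {
	g, ok := m.groups[deviceID]
	return g, ok
}

func main() {
	svc := &GroupService{store: newMemStore()}
	fmt.Println(svc.Join("dev-1", "group-a")) // first join succeeds
	fmt.Println(svc.Join("dev-1", "group-b")) // second join is rejected
}
```

Because the domain only sees the interface, a persistence engineer could later supply an ORM-backed implementation without touching `GroupService`.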

The delivery plan used those boundaries directly, which gave it a natural decomposition:

M1 establishes repository, CI, composition root, local stack, health, logging, and port skeletons.
M2 builds core domain and messaging MVP.
M3 adds persistence, authentication, admin APIs, peripheral integration, and configuration.
M4 completes safety, reliability, recovery, and security behavior.
M5 proves the system through testing, observability, performance, and chaos.
M6 handles migration and production cutover.
M7 holds audio/video streaming as in-scope but underspecified.

That milestone order did not copy the architecture document section-by-section. It translated architecture dependency into delivery dependency.

Step 3: Separate the Hot Path from the Warm/Cold Path

The architecture makes a crucial runtime distinction:

Hot path: ZeroMQ -> Go worker -> channel -> batch worker -> domain ports.
Cold path: ZeroMQ/REST -> Watermill -> command handler -> domain ports.

That distinction affects planning. The hot path exists because sensor and safety traffic need low latency and predictable allocation behavior. The warm/cold path exists because ownership, device configuration, mode switches, admin operations, and peripheral commands benefit from validation, logging, retry, and workflow-style handling.
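The hot path's channel-to-batch-worker hop could look roughly like the following Go sketch. It is a simplified illustration under stated assumptions: the `Reading` type, the buffer size of 8, and the size-only flush rule are hypothetical and not taken from the design.

```go
package main

import "fmt"

// Reading is a hypothetical sensor sample; the field names are illustrative.
type Reading struct {
	DeviceID string
	Value    float64
}

// batchWorker drains the ingest channel and hands fixed-size batches to
// flush. A partial batch is flushed when the channel is closed.
func batchWorker(in <-chan Reading, size int, flush func([]Reading)) {
	batch := make([]Reading, 0, size)
	for r := range in {
		batch = append(batch, r)
		if len(batch) == size {
			flush(batch)
			batch = make([]Reading, 0, size) // fresh slice; flush may keep the old one
		}
	}
	if len(batch) > 0 {
		flush(batch)
	}
}

// runPipeline pushes readings through a bounded channel and returns the
// batch sizes seen by the domain-side flush function.
func runPipeline(readings []Reading, batchSize int) []int {
	in := make(chan Reading, 8) // bounded buffer: the producer blocks when full
	done := make(chan []int)
	go func() {
		var sizes []int
		batchWorker(in, batchSize, func(b []Reading) { sizes = append(sizes, len(b)) })
		done <- sizes
	}()
	for _, r := range readings {
		in <- r
	}
	close(in)
	return <-done
}

func main() {
	fmt.Println(runPipeline(make([]Reading, 7), 3)) // [3 3 1]
}
```

A production hot path would probably also flush on a timer so that a slow trickle of readings is not held back waiting for a full batch; that is omitted here to keep the sketch deterministic.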

The plan therefore split messaging work into separate epics:

ZeroMQ pub/sub adapter.
Message envelope and protobuf codec.
Watermill router and bounded class queues.
Outbound publisher and retry.
Golden-message fixture validation.
Go-channel ingest worker.
Batch processor with fan-out.
Entity-partitioned sequential processing.
Hot-path backpressure metrics.
Safety timing smoke benchmark.
Cold-path command handlers.

This is a good example of agentic planning preserving design intent. A weaker plan might have created a single “Implement messaging” epic. That would hide the highest-risk part of the architecture inside a broad bucket. The agent-generated plan instead kept the hot and cold paths visible.

The diagram above is more than technical documentation. It is a delivery planning device. It tells the team which work can proceed independently and where integration risk will appear.

Step 4: Promote Safety Constraints into Delivery Gates

Safety requirements should not sit passively in a requirements section. They need to become acceptance criteria, benchmarks, tests, and milestone exit conditions.

The architecture says:

E-Stop has a 100 ms end-to-end timing budget.
Safety messages bypass normal queues.
If no ACK is received before deadline, the sender transitions to fail-safe behavior.
E-Stop propagation applies across the devices.
Restart must recover or surface unconfirmed safety commands.

The delivery plan turns that into M4, “Safety, Reliability & Recovery,” with concrete stories:

High-priority E-Stop lane.
All-members ACK aggregator.
CurveZMQ encryption and device identity.
Unconfirmed E-Stop integration contract.
Controller key revocation.
E-Stop benchmark smoke gate.
Key/cert lifecycle and replacement.
Outbox schema and write-before-publish.
Startup outbox scanner.
Replay policy for unfinished safety commands.
Recovery surfacing for unreplayed commands.

It also resolves the timing budget into per-hop gates:

Controller -> Service           30 ms
Service -> device fan-out       40 ms
ACK path back                   30 ms
-------------------------------------
Total                          100 ms

That budget matters because it changes the shape of the stories. “Implement E-Stop” is not a useful story. “All-members ACK aggregator with a 70 ms site-publish-to-ACK window and operator-visible unconfirmed state” is useful.

Agents are particularly helpful here because they can propagate one safety constraint into multiple artifacts:

Story acceptance criteria.
Metrics.
Benchmark thresholds.
Chaos test scenarios.
Dead-letter behavior.
Restart recovery rules.
Milestone exit criteria.

The result is that safety becomes an execution structure, not just a paragraph in the design.

Step 5: Build State Matrices Before Writing Stories

The Jira plan contains two authoritative matrices that are more important than they may look:

Persistence matrix.
Message-class matrix.

These matrices force the plan to make operational semantics explicit.

The persistence matrix classifies state as:

Checkpointed: survives restart in PostgreSQL.
Reconstructable: rebuilt from live transport, heartbeats, telemetry, or recomputation.
Not persisted: process memory or runtime metrics.

For example, device membership in a group, controller ownership assignments, command-key watermarks, trackfiles, revocation lists, and safety outbox entries are checkpointed. Device connectivity, current sensor values, and derived device group limits are reconstructable. Hot-path queue contents and in-flight worker buffers are not persisted.

That classification drives implementation work:

Runtime state
     |
     +--> checkpointed ---------> PostgreSQL, migrations, recovery tests
     |
     +--> reconstructable ------> startup reconciliation, reconnect behavior
     |
     +--> not persisted --------> metrics, safe loss, no recovery promise

Without that matrix, the team would discover restart semantics piecemeal during implementation. With it, agents can generate persistence stories, recovery tests, and acceptance criteria from a shared source of truth.
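A persistence matrix of this kind can be expressed as data that both code and tests read. The Go sketch below is a hypothetical encoding: the entry names paraphrase the examples above, and the restart actions are shorthand for the work each category implies.

```go
package main

import "fmt"

// Persistence is the category a piece of runtime state belongs to.
// The zero value is Unclassified so that a missing entry is never
// mistaken for a safe default.
type Persistence int

const (
	Unclassified Persistence = iota
	Checkpointed
	Reconstructable
	NotPersisted
)

// persistenceMatrix paraphrases the examples in the text; the key names
// are illustrative.
var persistenceMatrix = map[string]Persistence{
	"group membership":        Checkpointed,
	"controller ownership":    Checkpointed,
	"safety outbox entries":   Checkpointed,
	"connectivity":            Reconstructable,
	"current sensor values":   Reconstructable,
	"derived group limits":    Reconstructable,
	"hot-path queue contents": NotPersisted,
	"in-flight worker buffers": NotPersisted,
}

// restartAction describes what restart handling each category implies.
func restartAction(p Persistence) string {
	switch p {
	case Checkpointed:
		return "load from PostgreSQL"
	case Reconstructable:
		return "rebuild during startup reconciliation"
	case NotPersisted:
		return "discard; no recovery promise"
	default:
		return "unclassified; needs a decision"
	}
}

func main() {
	for _, name := range []string{"safety outbox entries", "connectivity", "hot-path queue contents", "video frames"} {
		fmt.Printf("%s: %s\n", name, restartAction(persistenceMatrix[name]))
	}
}
```

Making `Unclassified` the zero value means any state that nobody has categorized shows up as an open question instead of silently inheriting a recovery promise.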

The message-class matrix performs the same function for transport behavior. Safety, control, telemetry, administrative, and peripheral messages do not share the same retry, queueing, timeout, deduplication, or dead-letter semantics. The plan makes those differences explicit so stories do not accidentally apply one policy to every message class.

For example:

Safety bypasses normal queues and uses strict ACK deadlines.
Control uses bounded wait, at-least-once behavior, command keys, and dead-letter handling.
Telemetry can drop oldest or coalesce by entity because freshness matters more than replay.
Administrative work must return explicit errors or retry hints.
Peripheral commands can be last-write-wins where appropriate.

This is one of the most valuable outputs of the agentic planning process. The agents did not just produce tickets. They produced intermediate planning artifacts that make the tickets safer.

Step 6: Resolve Ambiguity into a Decisions Log

Architecture documents often contain open questions. Some are harmless. Some are project blockers disguised as implementation details.

The source architecture had open questions around partial partition behavior, clock synchronization, E-Stop timing budget, revocation, and MAC-versus-key identity. The delivery plan records those as resolved decisions:

Protobuf contract source of truth is an internal IPC contract repository on the active v2 development branch.
Field-device FSM is final and documented.
E-Stop budget is 30/40/30 ms.
Partial partition means disband the device group and alert the operator.
Clock synchronization uses external NTP on all nodes.
Controller key revocation uses an admin-triggered revocation list.
MAC is a human label only; the key is the authoritative identity.
RPO is 24 hours and RTO is 4 hours.
Deployment target is Docker Compose per customer site.
Observability hosting uses managed error tracking and metrics platforms.
TLS certificates come from an internal CA owned by the team.
SAT completion requires 14-day soak, zero SEV-1, SLO targets met, and customer sign-off.
Audio/video streaming is in scope for v2 but needs more PRD detail before story breakdown.

This is a critical habit. Agents can draft around ambiguity, but delivery cannot safely proceed if important ambiguity remains hidden. A decision log gives every story a traceable foundation.

Open question
      |
      v
Clarification with sponsor / tech lead / dependency team
      |
      v
Decision log row
      |
      v
Stories, acceptance criteria, tests, release gates
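The traceability this flow aims for can be checked mechanically. The following Go sketch flags stories that cite no decision, or cite one that is missing or unresolved. The `Decision` and `Story` types and all IDs are hypothetical; the real plan lives in Jira and its field names are not given in the source.

```go
package main

import "fmt"

// Decision is a hypothetical decisions-log row.
type Decision struct {
	ID       string
	Summary  string
	Resolved bool
}

// Story is a hypothetical backlog item that cites decision IDs.
type Story struct {
	Key       string
	Decisions []string
}

// untraceable returns the keys of stories that cite no decision, or cite
// one that is missing from the log or still unresolved.
func untraceable(stories []Story, log map[string]Decision) []string {
	var out []string
	for _, s := range stories {
		ok := len(s.Decisions) > 0
		for _, id := range s.Decisions {
			if d, found := log[id]; !found || !d.Resolved {
				ok = false
			}
		}
		if !ok {
			out = append(out, s.Key)
		}
	}
	return out
}

func main() {
	log := map[string]Decision{
		"D-ESTOP-BUDGET": {ID: "D-ESTOP-BUDGET", Summary: "E-Stop budget is 30/40/30 ms", Resolved: true},
		"D-OPEN-EXAMPLE": {ID: "D-OPEN-EXAMPLE", Summary: "hypothetical open question", Resolved: false},
	}
	stories := []Story{
		{Key: "STORY-1", Decisions: []string{"D-ESTOP-BUDGET"}},
		{Key: "STORY-2", Decisions: []string{"D-OPEN-EXAMPLE"}},
		{Key: "STORY-3"},
	}
	fmt.Println(untraceable(stories, log)) // [STORY-2 STORY-3]
}
```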

The best agentic plans make assumptions visible. They do not bury them in prose.

Step 7: Turn Test Strategy into Continuous Quality Tracks

The architecture document includes a test pyramid and detailed test categories. A typical project plan might move all of that to the end under a “Testing” milestone. That is too late for this system.

The delivery plan instead creates continuous quality and observability swimlanes from M1 through M5.

Quality starts in M1 with CI, fakes, and reproducible local development. It continues in M2 with domain invariants, fixture validation, and smoke benchmarks. M3 extends integration coverage across persistence, auth, and admin APIs. M4 validates recovery and safety-critical mechanisms. M5 completes the full proof with contract tests, E2E simulation, chaos, fuzzing, mutation testing, performance, and soak.

The CI tiers reinforce that structure:

PR: lint, vet, unit tests, fast domain tests, small adapter tests, golden-message contract smoke, build verification.
Merge/main: broader integration suites, migration checks, targeted benchmarks, image publish, staging-oriented verification.
Nightly: full E2E simulation, fuzz corpus expansion, mutation testing, broader contract verification, scripted chaos.
Pre-release: 24-hour soak, full performance suite, rollout rehearsals, restore rehearsal, and site-specific pre-cutover validation.

That tiering is important. It prevents the plan from pretending every test should run on every pull request. Agents can help here by matching test cost to trigger frequency.

Fast feedback                         Deep proof
     |                                    |
     v                                    v
PR checks -> main checks -> nightly checks -> pre-release gates
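A tier table like this can be kept as data so that each trigger selects its own suites. The sketch below makes one loud assumption that the source does not state: tiers are cumulative, so a later tier also runs everything from earlier tiers. The suite names are a subset of the list above, and the types are hypothetical.

```go
package main

import "fmt"

// Tier is a CI trigger, ordered from cheapest and most frequent to
// most expensive and least frequent.
type Tier int

const (
	PR Tier = iota
	Main
	Nightly
	PreRelease
)

// Suite names the first tier at which a test suite runs.
type Suite struct {
	Name      string
	FirstTier Tier
}

// suites is an illustrative subset of the plan's CI tiers.
var suites = []Suite{
	{"unit tests", PR},
	{"golden-message contract smoke", PR},
	{"integration suites", Main},
	{"migration checks", Main},
	{"full E2E simulation", Nightly},
	{"mutation testing", Nightly},
	{"24-hour soak", PreRelease},
}

// suitesFor returns the suites that run for a trigger.
// Assumption: tiers are cumulative, which the plan does not state.
func suitesFor(trigger Tier) []string {
	var names []string
	for _, s := range suites {
		if s.FirstTier <= trigger {
			names = append(names, s.Name)
		}
	}
	return names
}

func main() {
	fmt.Println(len(suitesFor(PR)), len(suitesFor(Nightly)), len(suitesFor(PreRelease))) // 2 6 7
}
```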

The same pattern appears in observability. M1 establishes logs, error tracking, health endpoints, and correlation IDs. Later milestones add queue metrics, hot-path metrics, benchmark output, safety/recovery signals, dashboards, alert routing, and SLO burn-rate surfaces.

Step 8: Model Parallelization Explicitly

A delivery plan is only useful if it accounts for the team that will execute it. This anonymized plan assumes three to four engineers, two-week sprints, and roughly nine months of work.

After M1 lands, the plan maps work to parallel tracks:

Engineer A: domain core, E-Stop, outbox, restart, dead-letter and recovery.
Engineer B: ZeroMQ, hot path, cold path, E2E harness.
Engineer C: persistence, auth, admin APIs, config, migration tooling.
Engineer D, where available: local stack, observability, contract tests, chaos, performance, dashboards, runbooks.

M1 foundation
     |
     +--> Domain track --------> Safety / recovery
     |
     +--> Messaging track -----> Hot path / cold path / E2E
     |
     +--> Integration track ---> Persistence / auth / admin / migration
     |
     +--> Quality track -------> Contracts / chaos / SLOs / release proof

The plan also identifies blockers to parallelization:

FSM definition must finish before meaningful command handler wiring.
Auth sidecar wiring blocks authenticated admin APIs.
E-Stop priority lane touches both domain and messaging code, so domain and messaging engineers should pair.
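Blockers like these form a dependency graph, and a topological sort gives one valid execution order. The Go sketch below uses Kahn's algorithm with sorted tie-breaking so the output is deterministic. The work-item names are shortened from the blockers above, and the `order` function is illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// order returns work items in an order that respects blockers, using
// Kahn's algorithm. blockedBy maps an item to the items it waits for.
func order(blockedBy map[string][]string) ([]string, error) {
	indeg := map[string]int{}
	unblocks := map[string][]string{}
	for item, deps := range blockedBy {
		if _, ok := indeg[item]; !ok {
			indeg[item] = 0
		}
		for _, d := range deps {
			if _, ok := indeg[d]; !ok {
				indeg[d] = 0
			}
			indeg[item]++
			unblocks[d] = append(unblocks[d], item)
		}
	}
	var ready []string
	for item, n := range indeg {
		if n == 0 {
			ready = append(ready, item)
		}
	}
	sort.Strings(ready) // deterministic tie-breaking
	var out []string
	for len(ready) > 0 {
		item := ready[0]
		ready = ready[1:]
		out = append(out, item)
		for _, next := range unblocks[item] {
			indeg[next]--
			if indeg[next] == 0 {
				ready = append(ready, next)
			}
		}
		sort.Strings(ready)
	}
	if len(out) != len(indeg) {
		return nil, errors.New("dependency cycle")
	}
	return out, nil
}

func main() {
	plan, err := order(map[string][]string{
		"command-handlers": {"fsm-definition"},
		"admin-apis":       {"auth-sidecar"},
	})
	fmt.Println(plan, err) // [auth-sidecar admin-apis fsm-definition command-handlers] <nil>
}
```

A cycle error is useful planning feedback in itself: it means two items each claim to be blocked by the other and a human has to break the tie.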

This is the kind of information agents can infer from dependencies, but humans should review carefully. Parallelization is not just about keeping everyone busy. It is about reducing queueing time without creating integration chaos.

Step 9: Keep Release Work in the Plan

The architecture document includes rollout and migration. The delivery plan preserves that as M6 rather than treating it as an operations afterthought.

M6 includes:

Data migration tooling from the legacy document layout to ORM-backed entities.
Red-green rollout runbooks.
Rollback plan to the legacy app/gateway/Edge server setup.
Customer A activation.
Customer B activation.
Site acceptance testing.
Post-launch retrospective and on-call handoff.

That matters because architecture migration is not done when the code compiles. It is done when production data is migrated, the site is cut over, rollback is rehearsed, support can operate the system, and customer acceptance criteria are met.

The plan’s SAT (site acceptance testing) criteria are concrete: 14-day soak, zero SEV-1, SLO targets met, and customer sign-off. Those criteria prevent “done” from becoming subjective at the end of the program.

What Agents Did Well

The strongest parts of the plan are the places where agents used the architecture document as a constraint system.

They mapped architecture sections into delivery tracks. Hexagonal architecture became foundation, domain, adapter, and integration work. Hot and cold paths became separate messaging epics. Safety constraints became timing gates, outbox work, and recovery stories. Testing strategy became CI tiers and a continuous quality swimlane. Rollout notes became migration and cutover milestones.

They also created useful intermediate artifacts:

Persistence matrix.
Message-class matrix.
Availability SLI definition.
Cross-milestone dependency map.
Decisions log.
Story artifact policy.
CI tiers.

Those artifacts make the plan auditable. A reviewer can trace a story back to a design section, a decision, or a risk.

Where Our Human Team Members Still Matter

Agents can structure the plan, but they cannot own the consequences.

Humans still need to validate:

Whether the 30/40/30 ms E-Stop budget is realistic on actual site networks.
Whether simulated field-device behavior is representative enough before SAT.
Whether the field-device contract will stabilize in time.
Whether Docker Compose per site is operationally sufficient for 99.99 percent availability.
Whether the team has enough Go, ZeroMQ, Watermill, auth, observability, and field deployment experience.
Whether audio/video streaming can remain a placeholder without undermining the release.
Whether the plan’s sequencing fits staffing, procurement, and customer-site constraints.

Agentic planning reduces planning labor. It does not remove technical accountability.

A Repeatable Agentic Planning Pattern

The agent-created plan accurately captures that the service is not simply a rewrite from C#/SignalR to Go/ZeroMQ. It is a safety-sensitive architecture migration with strict latency, restart, identity, idempotency, observability, and rollout requirements. It also makes clear where the team can parallelize and where it must serialize work. That is what good agentic project planning should produce. Not a bigger backlog. A clearer one. The following figure shows the steps that led to the final executable plan.

Agents are most valuable when they help teams preserve architectural intent all the way down to executable work. In this case, the architecture document defined the system. The planning agents turned that definition into delivery structure: milestones, risk controls, quality gates, and release evidence.

That is the difference between an architecture document that is admired and an architecture document that ships.

Agentic Engineering: Spec-Driven Development

2026-04-23T09:00:00-04:00

Spec-Driven Development (SDD) is the process of creating a living contract between human developers and coding agents, in which the Specification (the what and why) is deliberately decoupled from the Implementation (the how). SDD allows a human developer to become an architect who guides the agent to build and ship high-quality software. In this blog, I summarize my experience of using SDD in software engineering. The prompts and the skills are from Paul’s SDD course; see the DeepLearning.AI course repo on GitHub.

Why Spec-Driven Development?

The problems I usually face with vibe coding are 1) lost chat histories and context and 2) lack of a shared architecture/dev contract. These usually result in poor coordination among our team members for complex and long development projects.

SDD is appropriate for projects with significant complexity. If you can accomplish what you need in one short prompt, SDD will not provide any advantage.

My SDD Workflow at a Glance

The SDD workflow has two major layers: a one-time project initialization step (the Constitution) and a repeating feature loop.

Phase 0: Create the Project Constitution

The Constitution is the agent-agnostic, structured foundation of the entire project. It is a global, high-level set of documents that captures the agreement between the developers in our team and the agent. It is stored in a specs/ directory as three files: mission.md, tech-stack.md, and roadmap.md.

Step-by-Step: Drafting the Constitution

Step 1 — create the knowledge base

Before talking to the agent, I create a knowledge base for the project:

Any existing READMEs, stakeholder notes, or product requirements
Architecture documents, dev design documents
Technical constraints from your organization (preferred languages, deployment targets)
Your opinions on tech stack, testing frameworks, architecture patterns

Step 2 — create the constitution

I use a prompt similar to the following to build the constitution:

I am building a new web application. Help me create a Project
Constitution with three files in a specs/ directory:
  - mission.md: vision, audience, scope, guiding principles
  - tech-stack.md: frameworks, deployment, technical constraints
  - roadmap.md: phases and features, organized in small steps

Please read README.md for background context, then ask me
questions — one at a time — to clarify what you need.
Use the AskUserQuestion tool if available.

Step 3 — Q&A with agents

I usually find that the agent clarifies a few decisions that have been missed in the knowledge base, e.g., architecture patterns, external packages, speed-vs-fidelity trade-offs.

Step 4 — Human-in-the-loop review

In this stage, I review all three files and ask the agent to fix any gaps. I avoid making any changes manually to ensure that the whole constitution remains in sync.

The mission left out our target audience. Please add:
"The primary audience is internal engineering teams at 
our organization." Also, use SQLite for this prototype.

Step 5 — Commit the Constitution

I commit the constitution to the repo.

Key insight: The Constitution is a living document. Version it and update it as the project evolves — always via the agent, in its own branch, so you can track which Constitution version produced which code.

Phase 1: Feature Specification

For every feature, I create the feature spec. This is the most important step in the loop — the key is not to rush it, but don’t micro-manage it either.

The Feature Spec Files

Each feature lives on its own branch and produces three spec files:

specs/feature-XX/
├── plan.md          ← Approach, task groups, sequence of work
├── requirements.md  ← Functional & non-functional requirements
└── validation.md    ← Scorecard: concrete success criteria

Step-by-Step: Feature Specification

Step 1 — a clean context and a new git branch

I always clear the agent’s context before starting, which forces the agent to load everything it needs from the spec files rather than from the memory of a previous session.

Step 2 — create a feature spec

Find the next phase on specs/roadmap.md and make a branch, ask me about the feature spec. Create:

A new directory YYYY-MM-DD-feature-name under specs for this feature work
In there:
plan.md as a series of numbered task groups.
requirements.md for the scope, decisions, context
validation.md for how to know the implementation succeeded and can be merged
Refer to specs/mission.md and specs/tech-stack.md for guidance.

Important: You must use your AskUserQuestion tool, grouped on these 3, before writing to disk.

Step 3 — Make decisions at the right altitude

The agent usually asks for key architectural and product decisions. The key is to steer at a high level (goals, missions) and not to over-specify (e.g., variable names, detailed implementation steps).

Step 4 — Review and refine all three spec files

In this stage, I carefully review plan.md, requirements.md, and validation.md. If something is wrong, I ask the agent to correct it, which keeps the requirements and the validation scorecard in sync.

Step 5 — commit the feature spec

Only after this commit do I ask the agent to start implementation.

Phase 2: Feature Implementation

With the feature spec committed, let the agent implement it.

Step-by-Step: Feature Implementation

Step 1 — Clear the agent’s context

I start each implementation session with a fresh agent context.

Step 2 — Send the implementation prompt

Read specs/feature-01/ and implement all task groups defined
in plan.md, following requirements.md.
Work in small commits, one task group at a time.

Step 3 — choosing the step size

┌───────────────────────────────────────────────────────────────┐
│                IMPLEMENTATION STEP SIZE OPTIONS               │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ALL TASK GROUPS   Faster, more to review at once             │
│  AT ONCE           Best when you trust the spec fully         │
│                                                               │
│  ONE TASK GROUP    Smaller, easier to review                  │
│  PER PROMPT        Best for: security, auth, DB schema work   │
│                    Small mistakes compound less               │
│                    Especially useful in new codebases         │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Step 4 — observe progress and review

I usually read the agent’s summary of its work (individual task groups) and review the diff of the changes.

Step 5 — running the app

I ask the agent to self-validate against validation.md at the end of implementation and run the app.

Phase 3: Feature Validation

Note on cognitive debt: Because agents generate code so fast, developers can accumulate cognitive debt — the mental load of tracking what the code is doing and how it has evolved. Keeping changes manageable and reviewing incrementally is how you keep this debt under control.

Step-by-Step: Feature Validation

Step 1 — Start with the commit view

I use the diff/commit view in the IDE and review changes at a high level:

Does the feature work as described in the spec?
Are the right patterns, components, and structures being used?
Avoid reviewing CSS class names or variable names — focus on intent and architecture.

Step 2 — raising issues via the agent

If I find a code issue or a spec omission, I ask the agent to fix it, which keeps all artifacts in sync:

The Home component puts all three sub-components in a single file.
Please split them into their own files and update any spec
documents or README mentions that reference the file structure.

If a code mistake traces back to something ambiguous in the spec, I ask the agent to fix the spec so that the issue does not reappear.

Step 3 — run tests

Here I review and run the tests via the IDE and use the IDE debugger to step through execution. If the testing framework wasn’t configured during implementation, I add it via a replanning step (see Phase 4).

Step 4 — constitution updates within a feature branch

Small updates to the Constitution (e.g., checking off a roadmap step) can stay on the feature branch. If a larger constitutional change is needed, I create a separate branch for it so that I can track which version of the spec produced which code.

Step 5 — Mark validation complete and merge

┌──────────────────────────────────────────────────────────────────┐
│                      VALIDATION CHECKLIST                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  [ ] Feature works as described in the spec                      │
│  [ ] All scorecard items in validation.md are satisfied          │
│  [ ] Code follows patterns established in the Constitution       │
│  [ ] Tests pass                                                  │
│  [ ] Related docs / specs updated for any scope changes found    │
│  [ ] Branch merged to main with a meaningful commit message      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Phase 4: Project Replanning

After every feature merge, I am careful not to immediately jump to the next feature. The replanning step updates the constitution, roadmap, and workflow to keep the whole process in sync.

What Replanning Covers

┌──────────────────────────────────────────────────────────────────┐
│                       THE REPLANNING STEP                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  A) CONSTITUTION UPDATES                                         │
│     • Add testing frameworks or tooling you settled on           │
│     • Record new architectural decisions made during impl        │
│     • Add responsive design requirements, new constraints, etc.  │
│     • Keep the living document current                           │
│                                                                  │
│  B) ROADMAP REVIEW                                               │
│     • Is the next roadmap item still the right thing to do?      │
│     • Can upcoming features be tackled together in one phase?    │
│     • Are there dependencies to re-order?                        │
│     • New info from stakeholders or product managers?            │
│                                                                  │
│  C) WORKFLOW IMPROVEMENT (Skills & Automation)                   │
│     • Package repetitive prompts into Agent Skills               │
│     • Create or improve changelog automation                     │
│     • Add linting, formatting, test-writing to validation step   │
│     • Decide: is this skill project-specific or global?          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Step-by-Step: Replanning

Step 1 — Create a replanning branch

Keeping constitution updates on their own branch lets you track which version of the spec produced which code.

Step 2 — Update the Constitution with what you learned

Example: adding a testing framework after the first feature revealed the gap:

Update specs/tech-stack.md to add Pytest as our testing
framework with these preferences: [...]

Also update specs/feature-01/requirements.md and the
implementation to add tests using this framework.

And if a product update comes in from stakeholders:

We just learned that our product will run on desktop as well as smartphones.
Update the product specs, feature specs, and any existing
code to emphasize the responsive design.

Guidance: If the new work triggered by a product update is small, implementing it during replanning is fine. If it’s large, I schedule it as its own feature on the roadmap instead.

Step 3 — Agent Skills for repetitive workflow steps

I have used Claude to write skills (global, or local to a project) that capture the common prompts:

I want to stop repeating the feature spec prompt. Use your skill creator to help me write a "feature spec" local skill. Here is the previous prompt:

Find the next phase on specs/roadmap.md and make a branch, ask me about the feature spec. Create:

A new directory YYYY-MM-DD-feature-name under specs for this feature work
In there:
plan.md as a series of numbered task groups.
requirements.md for the scope, decisions, context
validation.md for how to know the implementation succeeded and can be merged
Refer to specs/mission.md and specs/tech-stack.md for guidance.

Important: You must use your AskUserQuestion tool, grouped on these 3, before writing to disk.

I want to keep a CHANGELOG.md in the project root, with headings for dates. If no changelog, examine git commits and add bullets for each date. Then, as we work, we will manually invoke this skill before merging. Help me write a skill for this.

Create a validation skill with the following steps:
Update CHANGELOG.md
Run linter & auto-fix
Run formatter
Run test suite, report failures
Ask agent to fix any test failures
Update README if public API changed
Commit with a standardized message format

Step 4 — Commit and merge the replanning branch

Managing AI Fatigue

Before I begin each new feature, I establish a clean flow state. Running through this checklist prevents AI fatigue and context contamination between features:

┌──────────────────────────────────────────────────────────────────┐
│                  FEATURE KICKOFF CHECKLIST                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  [ ] All previous feature work committed and merged to main?     │
│  [ ] Constitution updated with learnings from the last feature?  │
│  [ ] Roadmap reviewed — is this still the right next feature?    │
│  [ ] Agent context cleared (/clear)?                             │
│      (Ensures specs capture intent, not memory snapshots,        │
│       and focuses the agent's limited context budget on          │
│       the next task only)                                        │
│  [ ] New feature branch created?                                 │
│  [ ] Fresh feature spec prompt ready to send?                    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Strategies to Combat AI Fatigue

Agents can generate massive amounts of code very quickly, making the human-in-the-loop review exhausting. Use these strategies:

Review at a high level — does it match the spec and reflect your intent?
Don’t nitpick variable names, CSS classes, or minor style choices
For complex areas (security, database schema), implement one task group at a time
Use the agent’s sub-agent review for a thorough second look: ask the agent to spawn several sub-agents to do a deep review of the entire project with the feature change. Sub-agents give the review more reasoning space and preserve the main agent’s context window rather than polluting it.
When you find an omission (e.g., “prop types should be in a standalone TypeScript type file”), fix the code via the agent and update the spec — it will apply automatically to all future features.

Shipping an MVP

If I am confident in the constitution, I sometimes build the rest of the roadmap in a single pass to produce, for example, an MVP.

Brownfield / Legacy Projects

I have used SDD for new and existing codebases:

┌───────────────────────────────────┬───────────────────────────────────┐
│          GREENFIELD               │            BROWNFIELD             │
│       (New project)               │       (Existing codebase)         │
├───────────────────────────────────┼───────────────────────────────────┤
│ Draft Constitution via            │ Agent generates Constitution by   │
│ conversation with agent           │ reading existing code             │
│                                   │                                   │
│ Agent asks questions to           │ Agent extracts: file structure,   │
│ discover your preferences         │ framework versions, patterns,     │
│                                   │ then asks clarifying questions    │
│                                   │                                   │
│ Roadmap starts from scratch       │ Roadmap aligns to existing        │
│ based on your product vision      │ TODO.md, issue trackers, or docs  │
│                                   │                                   │
│ Feature loop begins immediately   │ Feature loop begins immediately   │
│ after Constitution is committed   │ after Constitution is committed   │
└───────────────────────────────────┴───────────────────────────────────┘

Step-by-Step: Recipe for Onboarding a Legacy Project

Step 1 — Gather existing documentation

Collect README.md, TODO.md, issue tracker exports, any architecture docs, and existing product requirement documents. Your legacy project might have plans in spreadsheets, Word documents, or Jira — add as much context as you can.

Step 2 — Send the legacy Constitution prompt

The prompt is nearly the same as for a greenfield project, with one key addition: tell the agent to look for roadmap items in existing artifacts.

I am introducing Spec-Driven Development to an existing project.
Please read all files in this directory and generate a Constitution:

  specs/mission.md      — based on the README and any product context
  specs/tech-stack.md   — based on the actual frameworks, versions,
                          and file structure you find in the codebase
  specs/roadmap.md      — based on TODO.md and any outstanding work

Discover and, in a sense, reverse-engineer the SDD artifacts
from the existing codebase. Ask me questions to fill any gaps
you cannot determine from the code.

Step 3 — Review and commit

Review all three files, correct any incorrect assumptions the agent made to fill gaps, and then commit.

Step 4 — Continue with the standard feature loop

From this point, the workflow is identical to a greenfield project. The spec is now the memory of the project — it does not fade. The Constitution helps align all future code changes made by the agent with what past developers have already created.

Building and Automating Your Own Workflow

Once you are comfortable with the core loop, begin automating and customizing using:

MCP servers

I generally use the context7 MCP server so the agent can look up current library documentation.

Research Backlogs

When you have an idea mid-feature that you want to explore without committing to it, use a research backlog:

I want to explore switching to Turso for our database.
Research this topic with me, but do not add it to the roadmap yet.
When we are done, write a report to specs/research/turso-db.md.

You can later ask the agent to schedule the research on the roadmap with a link to the backlog file. As your research backlog grows, write a skill to automate the research workflow.

Key Principles Summary

The complete SDD process distilled into one reference:

╔══════════════════════════════════════════════════════════════════════╗
║           SPEC-DRIVEN DEVELOPMENT — COMPLETE REFERENCE               ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  CONSTITUTION  (Once per project — a living document)                ║
║  ──────────────────────────────────────────────────────              ║
║  mission.md     → The WHY   (vision, audience, scope)                ║
║  tech-stack.md  → The HOW   (frameworks, constraints, schema)        ║
║  roadmap.md     → The WHAT  (features, phases, sequence)             ║
║                                                                      ║
║  FEATURE LOOP  (Repeat for every feature)                            ║
║  ──────────────────────────────────────────────────────              ║
║  1. SPECIFY    New branch  →  /clear  →  interview agent             ║
║                Commit plan.md, requirements.md, validation.md        ║
║                                                                      ║
║  2. IMPLEMENT  /clear  →  implement prompt                           ║
║                Review diffs as the agent works                       ║
║                Small, frequent commits                               ║
║                                                                      ║
║  3. VALIDATE   Code review at a high level                           ║
║                Fix via the agent (keeps specs in sync)               ║
║                Run tests & validation scorecard                      ║
║                Merge feature branch to main                          ║
║                                                                      ║
║  4. REPLAN     Update the Constitution with what you learned         ║
║                Review and adjust the roadmap                         ║
║                Package repeated prompts as Agent Skills              ║
║                                                                      ║
║  ALWAYS                                                              ║
║  ──────────────────────────────────────────────────────              ║
║  • Clear context (/clear) at the start of each major step            ║
║  • Dedicated branch per feature — and per replanning cycle           ║
║  • Human-in-the-loop: YOU decide, the agent elaborates               ║
║  • Steer at the right altitude — goals, not variable names           ║
║  • When you find a gap, fix the SPEC — it is the project memory      ║
║  • Commit often — small steps compound into great results            ║
║                                                                      ║
║  ANTI-PATTERNS TO AVOID                                              ║
║  ──────────────────────────────────────────────────────              ║
║  ✗ Skipping the spec and starting implementation directly            ║
║  ✗ Editing spec files manually instead of via the agent              ║
║  ✗ Carrying context forward across features without /clear           ║
║  ✗ Nitpicking low-level code instead of reviewing intent             ║
║  ✗ Rushing to the next feature without replanning                    ║
║  ✗ Implementing a large chunk on a weak Constitution                 ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝

Closing Thoughts

“The best code starts with a great spec.”

The specs you craft, the Constitution you maintain, and the workflow you automate are what separate a thoughtfully engineered software product from a pile of AI-generated code that only one session ever understood.

Start small: pick your next feature, write a spec before writing any code, and see how much more confidently and consistently the agent delivers.

Agentic Engineering: Taming a Legacy Codebase

2026-04-22T08:00:00-04:00

Why This Article Exists

Every engineering team eventually inherits a codebase that has outgrown its original design. Features were shipped, deadlines were met, and somewhere along the way the foundations quietly cracked. Hardcoded secrets found their way into source control. async void started creeping into timer callbacks. Collections got shared across threads without locks. A comment saying // TODO: fix this properly quietly turned into a permanent resident.

This article documents how our team used Claude to audit and harden three legacy services over a multi-month effort. The services were modest in size — roughly fifteen to sixty source files each — but they sat on the critical path of a real-time control system. A crash in any of them meant downtime for physical equipment in the field. That made the usual “just rewrite it” advice completely unacceptable.

What follows is a practical playbook. We walk through how we used Claude for five distinct jobs:

Refactoring sprawling services back into something understandable.
Fixing bugs and logging gaps that hid failures from operators.
Eliminating race conditions in collection access, timers, and state flags.
Improving code quality across naming, dead code, and error handling.
Reducing technical debt measurably, without freezing feature work.

High-risk changes were verified with tests first or alongside the fix, especially around concurrency, startup behavior, and logging. Every code sample below has been obfuscated — names, domains, and specifics have been changed — but the patterns, shapes, and lessons are exactly what we encountered in production.

The Landscape We Inherited

Three services sat at the heart of the system. A device-side proxy ran on embedded Linux hardware and bridged a local message bus to a central coordinator over SignalR. A server-side coordinator aggregated state from hundreds of connected devices and fanned out commands to operator consoles. A mobile controller gave field operators a touchscreen interface to issue commands.

Across the audit documents we generated for these services, Claude surfaced well over 140 issue instances across five categories:

  Category                              Issues Found
  ─────────────────────────────────     ────────────
  Bugs and logic errors                       32
  Race conditions and thread safety           44
  Security and secret management               9
  Logging and observability gaps              38
  Code quality and tech debt                  24

Those counts came from multiple audits run on different services and at different points in time, so they are best read as a backlog inventory rather than a mathematically exact “open issue count” at one instant. That distinction matters. Engineers can smell inflated metrics immediately.

What the numbers did tell us, reliably, was where the risk clustered: concurrency, observability, and unsafe async usage. No single human could hold that entire list in working memory while also writing code. That is exactly the kind of problem where Claude earns its keep.

What Claude Did, and What Human Developers Still Owned

Claude was most useful in four places:

Inventory. Scanning a module and producing a structured issue list with file paths, line numbers, and candidate fixes.
Pattern matching. Finding every variation of the same bug class: SafeFireAndForget() without an error callback, mutable dictionaries shared across threads, Task.Delay calls with unclear units, timer callbacks implemented as async void.
Drafting small fixes. Once we had a failing test or a clearly defined target, Claude was effective at drafting the minimal change.
Summarizing conventions. It was surprisingly good at inferring the codebase’s implicit design rules from the parts that were already clean.

Humans still owned the parts that actually determine whether a system stays safe:

Severity. Claude could identify a race; engineers still had to decide whether it was a theoretical code smell, a real production risk, or already serialized by some upstream mechanism.
Merge readiness. Many source audits included issues that were fixed on a branch, partially fixed, or still open. Humans had to reconcile audit output with reality.
Test strategy. Claude could suggest a fix, but engineers still had to decide whether the defect needed a unit test, a stress harness, an integration test, or simply a code review plus a reproducer.
Stopping. We did not pursue zero issues at any cost. Some low-severity findings were documented and left alone because the churn was not worth it.

Our Working Loop

Every change followed the same four-phase cycle. The loop became muscle memory within the first two weeks.

The key insight was that Claude is excellent at the first two phases — the ones humans find tedious — and genuinely helpful at the third phase when guided by an explicit failing test. The fourth phase still belongs to humans, but it becomes fast when the diff is small and the test captures the intent.

Throughout the effort, we produced markdown audit documents. Each one listed every issue, its severity, its file and line numbers, a short explanation of the danger, and a proposed fix. These documents became the working backlog for each service and were regenerated regularly as the code changed.

Part 1 — Refactoring

We asked Claude for a refactoring proposal. The useful part was not “AI architecture.” It was the dependency map: which methods touched transport, which ones mutated shared state, and which timers or callbacks crossed those boundaries. From there, the split along transport, command dispatch, state, and peripheral control became fairly obvious.

Rather than inventing style guides from scratch, we had Claude summarize the implicit conventions it observed in the existing clean code, and then used those as the refactoring targets for the rest. Three principles emerged:

One responsibility per class, one mutation site per field. If a field was being written in three places, we refactored so only one method could mutate it.
Pass cancellation through every async boundary. No exceptions.
Return snapshots, not references. Any public getter on shared mutable state returned a copy.

These rules sound obvious, but before Claude cataloged every violation across sixty files, we had no idea how widespread the pattern breaks actually were.
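
The three rules can be sketched together in one small class. This is a hypothetical illustration, not code from the audited services: DeviceRegistry, Register, Snapshot, and RefreshAsync are invented names, and the delay stands in for real I/O.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example: all names are invented for illustration.
public sealed class DeviceRegistry
{
    private readonly object _gate = new();
    private readonly List<string> _deviceIds = new();

    // Rule 1: the only method that mutates _deviceIds.
    public void Register(string deviceId)
    {
        lock (_gate)
        {
            _deviceIds.Add(deviceId);
        }
    }

    // Rule 3: callers receive a copy, never the live list.
    public IReadOnlyList<string> Snapshot()
    {
        lock (_gate)
        {
            return _deviceIds.ToArray();
        }
    }

    // Rule 2: the token is accepted and passed to every awaited call.
    public async Task<int> RefreshAsync(CancellationToken cancellationToken)
    {
        var refreshed = 0;
        foreach (var deviceId in Snapshot())
        {
            await PingAsync(deviceId, cancellationToken);
            refreshed++;
        }
        return refreshed;
    }

    // Stand-in for real I/O; the delay observes the same token.
    private static Task PingAsync(string deviceId, CancellationToken cancellationToken)
        => Task.Delay(TimeSpan.FromMilliseconds(1), cancellationToken);
}
```

A snapshot taken before a later Register call keeps its original contents, so a reader iterating it cannot be disturbed by a writer.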

Part 2 — Fixing Bugs and Logging Gaps

The audit surfaced bugs ranging from the embarrassing to the genuinely dangerous. We will walk through two representative examples and then describe the logging work, which turned out to have the highest operational return.

The “task is always faulted” bug

A senior engineer on the team had written a background task launcher that was supposed to log a critical message if the task failed. It looked something like this:

task.ContinueWith(t =>
{
    if (t.IsFaulted || t.IsCompleted)
    {
        _logger.LogCritical("Background task faulted! {ex}", t.Exception);
    }
});

Claude flagged this immediately. ContinueWith only runs after the antecedent completes, which means t.IsCompleted is always true inside the callback. Every normal, successful completion was being logged as a critical failure with a null exception. Worse, the parent health-check loop was evaluating IsFaulted || IsCompleted on the returned continuation task, which reaches a completed state whether or not the antecedent faulted, so the health loop was restarting the task on every one-second tick. A silent restart storm had been running in production for months, masked by the “critical” log spam nobody read anymore.

The fix was small:

task.ContinueWith(t =>
{
    if (t.IsFaulted)
    {
        _logger.LogCritical(t.Exception, "Background task faulted");
    }
}, TaskContinuationOptions.OnlyOnFaulted);

return task; // return the original, not the continuation

★ Insight ─────────────────────────────────────

The key was returning the original task rather than the continuation. The health-check loop needs to observe the real task’s fault status, not a continuation that always completes normally.
TaskContinuationOptions.OnlyOnFaulted is belt-and-suspenders: even if someone later changes the predicate, the continuation will only fire on the fault path.

─────────────────────────────────────────────────
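
Both observations are easy to confirm in isolation. The sketch below is a standalone check, not code from the audited service; ContinuationDemo and its method names are invented for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class ContinuationDemo
{
    // Reports what the original predicate saw for a task that succeeded.
    public static (bool IsCompleted, bool IsFaulted) ObserveSuccessfulTask()
    {
        (bool IsCompleted, bool IsFaulted) seen = default;

        Task.CompletedTask
            .ContinueWith(t => { seen = (t.IsCompleted, t.IsFaulted); })
            .Wait();

        return seen;
    }

    // With OnlyOnFaulted, the continuation of a successful task never runs;
    // the continuation task itself ends up canceled.
    public static bool FaultOnlyContinuationIsCanceled()
    {
        Task continuation = Task.CompletedTask.ContinueWith(
            _ => { },
            TaskContinuationOptions.OnlyOnFaulted);

        try
        {
            continuation.Wait();
        }
        catch (AggregateException)
        {
            // Waiting on a canceled task throws; only the final status matters here.
        }

        return continuation.IsCanceled;
    }
}
```

For a successful antecedent, the first method reports IsCompleted as true and IsFaulted as false, which is exactly the case the old predicate logged as critical. The second method shows why the continuation is the wrong thing to hand to a health check: after a successful run it is canceled, not completed successfully.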

The “unit mismatch” bug

Another entry from the audit:

// waitTime is in seconds
await Task.Delay(waitTime);

The int overload of Task.Delay expects milliseconds. A value intended to wait thirty seconds was waiting thirty milliseconds. This bug had lived in startup code for over a year. Nobody noticed because the system mostly worked, but a handful of intermittent initialization failures finally got attributed to it once Claude surfaced the issue alongside the documentation comment that explicitly said the unit was seconds.

await Task.Delay(TimeSpan.FromSeconds(waitTime));

We took the lesson and asked Claude to scan every Task.Delay call site. It found two more with ambiguous units and converted all of them to TimeSpan-based calls as a policy.
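
One way to keep the unit from drifting back into a comment is to carry it in the type. The sketch below assumes a hypothetical StartupOptions record and StartupDelay helper; neither exists in the original code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical options type: the unit lives in the type, not in a comment.
public sealed record StartupOptions(TimeSpan WaitTime);

public static class StartupDelay
{
    // Callers cannot pass a bare number, so seconds and milliseconds cannot be confused.
    public static Task WaitAsync(StartupOptions options, CancellationToken cancellationToken = default)
        => Task.Delay(options.WaitTime, cancellationToken);

    // Shows how Task.Delay(int) interprets a bare number: as milliseconds.
    public static TimeSpan AsInterpretedByIntOverload(int waitTime)
        => TimeSpan.FromMilliseconds(waitTime);
}
```

With this shape, a configuration value of thirty seconds is written as TimeSpan.FromSeconds(30) at the point where it is read, and every later call site inherits the unit.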

The logging overhaul

Logging was the area with the highest return on effort. The pattern was painfully consistent: important operations had no logs, unimportant operations had too many, and nothing was correlated. An operator investigating a failed command in the middle of the night had no way to trace a single request from the user interface through the coordinator into the device and back.

    BEFORE                                   AFTER
    ─────────────────                        ────────────────────────────
    [10:42:01] INFO  Cmd received            [10:42:01] INFO  CorrelationId=a7f2...
    [10:42:01] INFO  Processing                              Cmd received
    [10:42:03] WARN  Something happened      [10:42:01] INFO  CorrelationId=a7f2...
    [10:42:05] INFO  Done                                    Processing {CommandId} for {DeviceId}
                                             [10:42:03] WARN  CorrelationId=a7f2...
    Which command? Which device?                             TransportError at step {Step}
    No way to join the dots.                 [10:42:05] INFO  CorrelationId=a7f2...
                                                             Completed {CommandId} in {ElapsedMs}ms

We asked Claude to produce a logging plan. The plan identified 38 gaps across seven categories: missing correlation IDs, unlogged services, SafeFireAndForget callbacks with no error handler, missing audit trails on user-facing operations, log level misuse, missing high-performance LoggerMessage source-generator usage, and minimal test coverage for logging behavior.

The most impactful change was introducing correlation IDs at the message boundary and threading them through the pipeline via Serilog’s LogContext (LogContext.PushProperty in the Serilog.Context namespace). A single command could now be followed from hub entry, through the command handler, into domain events, and back to the response, all tagged with the same identifier.
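
The underlying idea is an ambient value that follows the async flow. The sketch below illustrates it with AsyncLocal from the base class library only; it is not Serilog’s API, and CorrelationContext and CommandPipeline are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an ambient log context; this is not Serilog's API.
public static class CorrelationContext
{
    private static readonly AsyncLocal<string?> _current = new();

    public static string? Current => _current.Value;

    // Sets the ID for the current async flow and restores the previous one on dispose.
    public static IDisposable Push(string correlationId)
    {
        var previous = _current.Value;
        _current.Value = correlationId;
        return new Restore(previous);
    }

    private sealed class Restore : IDisposable
    {
        private readonly string? _previous;
        public Restore(string? previous) => _previous = previous;
        public void Dispose() => _current.Value = _previous;
    }
}

public static class CommandPipeline
{
    // Simulates entry, handler, and response; every stage reads the same ID.
    public static async Task<string> HandleAsync(string correlationId)
    {
        using (CorrelationContext.Push(correlationId))
        {
            await Task.Yield(); // resume on another continuation
            return await FormatLogLineAsync("Completed");
        }
    }

    private static async Task<string> FormatLogLineAsync(string message)
    {
        await Task.Delay(1);
        return $"CorrelationId={CorrelationContext.Current} {message}";
    }
}
```

The inner method never receives the ID as a parameter, yet it still formats the line with it. That is the property that lets a logging library tag every event in a request without changing method signatures.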

Just as important were the places where the system was failing silently. One startup path loaded the in-memory device cache using SafeFireAndForget(). If that task threw during initialization, the service would start “successfully” but every later gateway lookup would fail because the cache was empty. Another path fired safety-related operations without any durable audit trail. These were not style issues. They were production forensics failures.

A representative fix for silent fire-and-forget failures (obfuscated):

// Before — exception silently swallowed
initializationTask.SafeFireAndForget();

// After — critical startup failure is loud
initializationTask.SafeFireAndForget(ex =>
    logger.LogCritical(ex,
        "Initial data fetch failed — service started with empty cache"));

★ Insight ─────────────────────────────────────

The SafeFireAndForget helper is popular in mobile and server .NET code because it lets you call async methods from sync contexts. But its default behavior — swallow everything — is a silent failure generator. Every high-value call site needed an onException handler, and some startup-path failures deserved LogCritical, not LogError.
The biggest logging win was not “more logs.” It was better logs: correlation, identifiers, and audit trails on operations that operators actually care about.

─────────────────────────────────────────────────
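
For readers who have not seen such a helper, here is a minimal sketch of a fire-and-forget extension that refuses to run without an error handler. It is written for this article; the SafeFireAndForget used in the services may differ in signature and defaults.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class FireAndForgetExtensions
{
    // The handler is mandatory, so a call site cannot opt into silence by omission.
    public static void FireAndForget(this Task task, Action<Exception> onException)
    {
        if (task is null) throw new ArgumentNullException(nameof(task));
        if (onException is null) throw new ArgumentNullException(nameof(onException));

        _ = ObserveAsync(task, onException);
    }

    private static async Task ObserveAsync(Task task, Action<Exception> onException)
    {
        try
        {
            await task.ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            // Failures reach the caller-supplied handler instead of disappearing.
            onException(ex);
        }
    }
}
```

Argument validation happens in the synchronous wrapper, so a missing handler fails at the call site rather than inside the background work. If the handler itself throws, that exception goes unobserved, so handlers should be limited to logging.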

We added logging assertions around key paths and used them to catch regressions, but this was not a codebase with exhaustive log-level test coverage. The useful lesson was narrower: once logging becomes part of your operational contract, it deserves tests just like any other behavior.

Part 3 — Eliminating Race Conditions

This was the longest phase and the one with the most learning. Race conditions are the pathology that legacy .NET codebases are most prone to, because C# makes it very easy to share a Dictionary or a bool between threads without anything shouting at you.

One of the concurrency audits found 44 race conditions across a service and its related components. They fell into six archetypes:

  ARCHETYPE                                       COUNT
  ─────────────────────────────────────────────   ─────
  1. Non-thread-safe collection shared across      12
     SignalR handlers, timers, and UI thread
  2. Plain bool / enum flag read and written        9
     from different threads (no volatile)
  3. Timer callback racing with Dispose or          7
     reassignment of the same timer field
  4. async void in timer / event contexts           6
     with no try-catch
  5. Check-then-act on state that can change        6
     across an await (TOCTOU)
  6. Fire-and-forget background loops with          4
     no cancellation or restart monitoring

Those categories appeared across the broader set of audits as well. We address the first three below.

Archetype 1 — Unprotected shared collection

Pattern we found, obfuscated:

public readonly Dictionary<DeviceId, DeviceState> DeviceCollection = new();

// Written from hub callback thread
DeviceCollection.Add(id, state);

// Enumerated from UI-bound property getter
public IEnumerable<DeviceState> AllDevices => DeviceCollection.Values;

// Cleared from state machine event
DeviceCollection.Clear();

Three threads, zero synchronization. The first time a user opened the device list while a state transition fired, we got an InvalidOperationException from the enumerator. In a test environment it was easy to reproduce; in production it took months.
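
The enumerator failure can be reproduced deterministically on a single thread, which is useful as a regression test. The sketch below is a stand-in for the production race, with invented names; in the real service the mutation came from another thread, which makes the failure intermittent.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerationRepro
{
    // Returns true if adding to the dictionary mid-enumeration throws.
    public static bool AddDuringEnumerationThrows()
    {
        var devices = new Dictionary<int, string> { [1] = "online", [2] = "offline" };
        try
        {
            foreach (var state in devices.Values)
            {
                // Stands in for the hub callback writing while the UI enumerates.
                devices.Add(devices.Count + 1, state);
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

The Add call invalidates the enumerator, so the next iteration throws InvalidOperationException.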

The fix had two reasonable shapes. For dictionaries where we needed cheap concurrent access, ConcurrentDictionary was often the right answer. For dictionaries where we wanted atomic “swap the whole thing” semantics — for example, replacing the entire device list at login — we used an ImmutableDictionary and a single-reference swap:

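// Requires: using System.Collections.Immutable; using System.Threading;
private ImmutableDictionary<DeviceId, DeviceState> _devices =
    ImmutableDictionary<DeviceId, DeviceState>.Empty;

// Readers capture one immutable snapshot; Volatile.Read guards against a stale reference.
public ImmutableDictionary<DeviceId, DeviceState> Devices => Volatile.Read(ref _devices);

public void ReplaceAll(IEnumerable<KeyValuePair<DeviceId, DeviceState>> items)
{
    // Build the new dictionary first, then publish it with a single atomic swap.
    Interlocked.Exchange(ref _devices, items.ToImmutableDictionary());
}

public void AddOrUpdate(DeviceId id, DeviceState state)
{
    // Retries a compare-and-swap until the update is applied.
    ImmutableInterlocked.AddOrUpdate(ref _devices, id, state, (_, _) => state);
}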

★ Insight ─────────────────────────────────────

ImmutableInterlocked.AddOrUpdate gives you atomic updates without a lock: it retries a compare-and-swap on the field until the update lands, so concurrent writers do not overwrite each other. The reader gets a consistent snapshot because the dictionary reference they captured is literally immutable.
This was not a universal replacement for locks. In some audited code, a plain lock was simpler and safer because readers and writers also needed to coordinate side effects like event raises or timer replacement.

─────────────────────────────────────────────────
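
The snapshot property is easiest to see from the reader’s side. The sketch below uses a hypothetical DeviceStore with int keys and string values in place of the obfuscated DeviceId and DeviceState types.

```csharp
using System.Collections.Immutable;
using System.Threading;

// Hypothetical store; int and string stand in for DeviceId and DeviceState.
public sealed class DeviceStore
{
    private ImmutableDictionary<int, string> _devices =
        ImmutableDictionary<int, string>.Empty;

    // Each read returns whichever immutable instance is current at that moment.
    public ImmutableDictionary<int, string> Devices => Volatile.Read(ref _devices);

    public void AddOrUpdate(int id, string state) =>
        ImmutableInterlocked.AddOrUpdate(ref _devices, id, state, (_, _) => state);

    // Captures a snapshot, then writes, and reports both views.
    public static (int SnapshotCount, int LiveCount) Demo()
    {
        var store = new DeviceStore();
        store.AddOrUpdate(1, "online");

        var snapshot = store.Devices;      // reader captures the reference once
        store.AddOrUpdate(2, "offline");   // writer publishes a new dictionary

        return (snapshot.Count, store.Devices.Count);
    }
}
```

The captured snapshot still holds one entry after the second write, while a fresh read sees two. A reader enumerating the snapshot therefore cannot hit the enumerator exception described earlier.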

Archetype 2 — Plain flag across threads

The smallest possible bug with the largest possible impact:

private static bool _isBusy;

public static void BeginWork()
{
    _isBusy = true;
    Task.Run(() =>
    {
        while (_isBusy)  // compiler may hoist this read out of the loop
        {
            DoStep();
        }
    });
}

public static void StopWork() => _isBusy = false;

The caller signals StopWork, the background task may never see the write, and the app hangs. The fix is either volatile or, better, a CancellationTokenSource:

private static CancellationTokenSource? _cts;

public static void BeginWork()
{
    _cts = new CancellationTokenSource();
    var token = _cts.Token;
    Task.Run(() =>
    {
        while (!token.IsCancellationRequested)
        {
            DoStep();
        }
    }, token);
}

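public static void StopWork()
{
    // Take ownership of the source atomically so concurrent callers
    // cannot cancel or dispose the same instance twice.
    var cts = Interlocked.Exchange(ref _cts, null);
    cts?.Cancel();
    cts?.Dispose();
}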

★ Insight ─────────────────────────────────────

A CancellationToken has the cross-thread memory semantics baked in — the reader always sees the cancelled state after Cancel() returns. You do not have to think about volatile because you delegated that worry to the framework.
A secondary benefit: CancellationToken composes. You can pass it to Task.Delay, HttpClient.SendAsync, database calls, and loop checks with a single mechanism. A volatile bool cannot do that.

─────────────────────────────────────────────────
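
The composition point can be shown with a loop that both checks the token and awaits a delay. The sketch below uses an invented Worker class; the delay stands in for whatever the real step awaits.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Worker
{
    // One token stops both the loop check and the awaited delay.
    public static async Task<int> RunAsync(TimeSpan stepInterval, CancellationToken token)
    {
        var steps = 0;
        try
        {
            while (!token.IsCancellationRequested)
            {
                steps++;
                await Task.Delay(stepInterval, token);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown: Task.Delay observed the same token.
        }
        return steps;
    }
}
```

Even with a one-hour step interval, cancelling the token ends the loop promptly, because the pending delay is cancelled rather than waited out. A volatile bool would leave the loop asleep until the delay elapsed.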

Archetype 3 — Timer callback racing its own field

This one was the most subtle. A service had a reusable timer field that was reassigned on restart:

_heartbeatTimer = new Timer(HeartbeatInterval);
_heartbeatTimer.Elapsed += OnHeartbeatElapsed;
_heartbeatTimer.AutoReset = false;
_heartbeatTimer.Start();

If the caller invoked this twice quickly, the first timer’s Elapsed handler could still fire after the field had been reassigned. The handler then mutated the “current” timer state even though it was the previous timer’s callback. Worse, the old timer was orphaned: our code no longer held a reference to it, but it had never been stopped or disposed, so it was still capable of running its callback one more time.

The fix was to wrap the swap in a lock and eagerly stop-and-dispose the prior timer:

private readonly object _timerLock = new();
private Timer? _heartbeatTimer;

private void ReplaceHeartbeatTimer(TimeSpan interval)
{
    lock (_timerLock)
    {
        if (_heartbeatTimer is not null)
        {
            _heartbeatTimer.Stop();
            _heartbeatTimer.Elapsed -= OnHeartbeatElapsed;
            _heartbeatTimer.Dispose();
        }
        _heartbeatTimer = new Timer(interval.TotalMilliseconds)
        {
            AutoReset = false,
        };
        _heartbeatTimer.Elapsed += OnHeartbeatElapsed;
        _heartbeatTimer.Start();
    }
}

We also audited every Timer.Elapsed handler we could find for async void lambdas. An exception thrown out of async void can tear down the process or vanish into an unobserved failure path depending on context. In either case, it is unacceptable in infrastructure code. The fix was to wrap the handler and surface the failure explicitly:

_disconnectTimer.Elapsed += async (_, _) =>
{
    try
    {
        await PutOfflineAsync();
        if (_publisher is not null)
        {
            await _publisher.PublishAllEventsAsync(this);
        }
    }
    catch (Exception ex)
    {
        _onError?.Invoke(ex);
    }
};

The _onError callback was injected at construction time, which kept the domain layer free of an ILogger dependency but still let failures be surfaced by the application layer. That single pattern — inject an Action at the seam — let us rescue dozens of silent failure sites across both services.

The burndown

We did not fix all 44. We explicitly chose to leave one. It was a static integer increment in a path that was already serialized by the SignalR hub pipeline — meaning a race was theoretically possible but not reachable given how the method was called. We documented the reasoning and moved on. Not every bug is worth fixing, but every bug is worth understanding.

Part 4 — Reducing Technical Debt

“Technical debt” is a vague term, so we tried to reduce it to observable indicators. We tracked five, all derived from audit output:

  Indicator                           Start   Mid   Later
  ─────────────────────────────────  ──────  ────  ────
  Hardcoded secrets in source           3     0     0
  Files commented out in entirety       4     1     0
  TODO/FIXME comments                  28    19    11
  Handlers throwing NotImplemented      5     2     0
  Services without ILogger injection    4     1     0

The secret audit was the single highest-leverage activity we did. Claude found a hardcoded GitHub personal access token in nuget.config — a file not covered by .gitignore. We rotated the token, moved it to a CI secret, and added the file to .gitignore within the same hour. An AES key was hardcoded in a crypto helper; we moved it to environment-variable-backed configuration and generated a random IV per encryption instead of the all-zeros IV that had shipped.

The remaining eleven TODO comments were all triaged. Each one either got a ticket, got a documentation comment explaining why we were leaving it, or was deleted because the thing it was pointing at no longer existed.

Testing Strategy Throughout

For the risky changes, the workflow was:

    ┌──────────────────────────────────────────────────────┐
    │  1. Write the failing test that captures the defect. │
    │  2. Ask Claude to propose the minimal fix.           │
    │  3. Apply the fix, run the test, watch it go green.  │
    │  4. Run the full suite to catch regressions.         │
    └──────────────────────────────────────────────────────┘

For race condition fixes this was not trivial. Concurrency bugs do not reliably reproduce. We leaned on two techniques:

Deterministic stress harnesses. For collection-access issues, we wrote tests that spun up dozens of tasks all hammering the same API, then asserted invariants. Before the fix, these tests failed within ten iterations. After, they ran ten thousand iterations clean.
Injectable time. For timer-related issues, we replaced System.Timers.Timer with a small abstraction that accepted a test clock, so the test could advance time manually and observe every callback fire in a known order.
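
The injectable-time idea is language-neutral, so here is a minimal sketch of it in Python rather than the .NET abstraction we actually shipped. FakeClock, call_later, and the Heartbeat component are hypothetical names invented for illustration; the point is only that time moves when the test says so, and callbacks fire in deadline order.

```python
import heapq
import itertools
from typing import Callable


class FakeClock:
    """Test clock: time only moves when the test calls advance()."""

    def __init__(self) -> None:
        self.now = 0.0
        self._queue: list = []          # heap of (deadline, sequence, callback)
        self._seq = itertools.count()   # tie-breaker keeps same-deadline order stable

    def call_later(self, delay: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), callback))

    def advance(self, seconds: float) -> None:
        deadline = self.now + seconds
        # Fire every callback whose deadline falls inside the advanced window.
        while self._queue and self._queue[0][0] <= deadline:
            due, _, callback = heapq.heappop(self._queue)
            self.now = due
            callback()
        self.now = deadline


class Heartbeat:
    """Hypothetical component under test: goes offline if not pinged in time."""

    def __init__(self, clock: FakeClock, timeout: float) -> None:
        self._clock = clock
        self._timeout = timeout
        self._generation = 0
        self.online = True
        self._arm()

    def ping(self) -> None:
        self.online = True
        self._arm()

    def _arm(self) -> None:
        self._generation += 1
        generation = self._generation
        self._clock.call_later(self._timeout, lambda: self._expire(generation))

    def _expire(self, generation: int) -> None:
        # Callbacks scheduled before the latest ping are stale and do nothing.
        if generation == self._generation:
            self.online = False


clock = FakeClock()
heartbeat = Heartbeat(clock, timeout=5.0)
clock.advance(3.0)
heartbeat.ping()       # re-arms the timeout at t = 3
clock.advance(3.0)     # the original t = 5 callback fires but is stale
still_online = heartbeat.online
clock.advance(2.0)     # the re-armed t = 8 callback fires
```

Because nothing sleeps, a test like this runs in microseconds and produces the same callback order every time.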

A representative stress test (obfuscated):

[Fact]
public async Task Concurrent_AddOrUpdate_does_not_throw_or_lose_data()
{
    var service = new DeviceRegistry();
    const int WRITERS = 16;
    const int ITERATIONS = 500;

    var writers = Enumerable.Range(0, WRITERS).Select(w => Task.Run(() =>
    {
        for (int i = 0; i < ITERATIONS; i++)
        {
            var id = new DeviceId(w * ITERATIONS + i);
            service.AddOrUpdate(id, new DeviceState(id, online: true));
        }
    })).ToArray();

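    // Enumerate while the writers run: unsynchronized enumeration is what
    // throws InvalidOperationException. Stop once the writers finish so a
    // failed writer cannot leave the reader spinning forever.
    var allWriters = Task.WhenAll(writers);
    var reader = Task.Run(() =>
    {
        while (!allWriters.IsCompleted)
        {
            foreach (var _ in service.Devices) { }
        }
    });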

    await Task.WhenAll(writers.Concat(new[] { reader }));

    service.Devices.Should().HaveCount(WRITERS * ITERATIONS);
}

Before the fix, this test intermittently threw InvalidOperationException from the reader. After the synchronization change, it ran clean repeatedly.

Not every defect was amenable to strict test-first development. Some issues were better handled as:

a small reproducer plus code review,
a logging assertion on a failure path,
or an integration test added after the fix when the seam became testable.

The practical lesson was not “TDD solves legacy systems.” It was narrower: if you are changing concurrency, startup, or observability logic, you need some repeatable proof that the system got safer.

What Worked, What Surprised Us

Three things worked far better than expected.

The audit was the unlock. The moment we had a single markdown document listing every issue by severity, with file and line references, the work became tractable. Without it we would have spent the whole project arguing about priorities.

Claude’s plans were often better than our gut priorities. The recommended fix order consistently put safety-critical issues ahead of ergonomics, even when a developer might have reached for the more satisfying refactor first. Following the plan rather than the vibe saved us from merging a “nice” change while a real bug was still live.

Small, tight edits compounded. The average fix was under twenty lines. The average PR was under two hundred. We deliberately resisted the temptation to bundle changes. Small PRs reviewed fast; fast review meant more PRs per week; more PRs per week meant the backlog actually burned down instead of drifting.

Two things surprised us.

Logging gave the best ROI of any category. We initially ranked logging as a medium-priority cleanup. In practice, the correlation-ID work and startup-failure visibility changed how fast we could debug real incidents.

The residual race conditions were the right call to leave alone. We originally planned to hit zero. Once the count dropped into the single digits, the remaining ones were all in paths that were serialized by upstream mechanisms or would only fire under contention we could not realistically reproduce. Fixing them would have churned a lot of code for no measurable benefit. “Document and move on” turned out to be a valid move.

A Playbook You Can Steal

If you are staring down a legacy codebase of your own, here is the compact version of the playbook:

Ask Claude to audit a single module. Start narrow. Get the markdown output.
Tag every issue with severity. Safety-critical first, data-integrity next, everything else by effort.
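Write the failing test before the fix whenever the seam is testable, even for one-line changes. Where it is not, capture a small reproducer and add the test as soon as the fix makes the seam reachable.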
Keep PRs small. One issue, one PR. Your reviewers will thank you and your burndown will be honest.
Regenerate the audit monthly. New issues creep in. Catching them when they are one line old is cheap.
Invest in logging early. Correlation IDs and structured log tests repay their cost within weeks.
Delete more than you write. Commented-out files, stale TODOs, unused branches — none of them are getting better with age.

The core realization is that Claude is not a replacement for engineering judgment. It is a tireless pair for the parts of engineering that humans are worst at: the inventory, the tabular bookkeeping, the “did we check every Dictionary in the codebase?” grind. Pairing Claude’s completeness with a human’s priority-setting turns the dreaded “tech debt week” into something closer to a steady drumbeat of small, confident improvements.

If I were distilling this into one rule for engineering teams, it would be this: use Claude to widen the search space, not to waive review. Let it find the patterns, draft the boring fixes, and keep the backlog honest. But keep the decisions about severity, test depth, and merge readiness with engineers who understand the system’s actual failure modes.

And every time one of those improvements lands — every time a race condition that used to page the on-call engineer at three in the morning stops paging anyone — you feel the debt getting lighter. That is the whole point.

Written from the trenches of a real mission-critical .NET codebase. Names, domains, and code samples have been obfuscated; the patterns and lessons are exactly as we found them.

Reverb: A Semantic Cache That Knows When Its Answers Go Stale

2026-04-22T07:00:00-04:00

Caching LLM responses seems, at first glance, like a simple optimization. Record the prompt, record the answer, serve the answer next time the same prompt comes in. In practice it is a surprisingly deep problem, and the two standard approaches both fail in characteristic ways. Exact-match caches miss on anything short of a byte-identical prompt, which is almost never how users actually ask questions. TTL-based caches serve confidently-stale answers for hours after the underlying knowledge base has changed — the classic hallucination vector dressed up as “we cached it.”

Reverb is a Go library and standalone service that addresses both failure modes. It combines a two-tier cache (exact SHA-256 match, then embedding-cosine similarity) with knowledge-aware invalidation: every cached entry tracks the source documents it was derived from, and a change-data-capture pipeline evicts entries by causality when their sources change. TTLs become a backstop, not the primary correctness mechanism.

Two-tiered approach

The exact-match tier is cheap and essential: a SHA-256 hash of the normalized prompt plus namespace and model ID, looked up in a store. Sub-millisecond latency, perfect precision, zero false positives. It catches retries, duplicated user requests, and programmatic callers that issue the same prompt on a schedule. In production workloads this tier alone typically handles 20–40% of traffic, depending on how much of the workload is human-in-the-loop.

The semantic tier is where it gets interesting. Two users phrasing the same question differently — “how do I reset my password?” vs. “password reset help” — should get the same answer. The tier computes an embedding for the incoming prompt, searches a vector index for top-k nearest neighbors above a configurable cosine-similarity threshold (0.95 by default), and returns the closest hit. Latency climbs to ~50ms, which is still one to two orders of magnitude faster than actually calling the LLM, and recall improves substantially.

The fallthrough contract is the part that makes it work: exact misses do not fail, they degrade to a semantic lookup. Semantic misses do not fail, they degrade to a real LLM call that then writes back through both tiers. Three states — exact hit, semantic hit, miss — all with correct fallback.
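
The fallthrough contract is easier to see in code. What follows is a minimal Python sketch, not Reverb's Go implementation: TwoTierCache, toy_embed, and fake_llm are hypothetical names, the normalization rule (lowercase, collapse whitespace) is an assumption, and the semantic tier is a brute-force scan rather than a real vector index.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TwoTierCache:
    embed: Callable[[str], Vector]                # hypothetical embedding function
    threshold: float = 0.95
    exact: dict = field(default_factory=dict)     # sha256 key -> answer
    semantic: dict = field(default_factory=dict)  # (namespace, model) -> [(vector, answer)]

    @staticmethod
    def _key(namespace: str, model: str, prompt: str) -> str:
        normalized = " ".join(prompt.lower().split())  # assumed normalization
        raw = "\x00".join((namespace, model, normalized))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def lookup(self, namespace: str, model: str, prompt: str):
        key = self._key(namespace, model, prompt)
        if key in self.exact:                     # tier 1: exact hash match
            return "exact", self.exact[key]
        query = self.embed(prompt)                # tier 2: nearest neighbor above threshold
        best_score, best_answer = self.threshold, None
        for vector, answer in self.semantic.get((namespace, model), []):
            score = cosine(query, vector)
            if score >= best_score:
                best_score, best_answer = score, answer
        if best_answer is not None:
            return "semantic", best_answer
        return "miss", None

    def store(self, namespace: str, model: str, prompt: str, answer: str) -> None:
        self.exact[self._key(namespace, model, prompt)] = answer
        self.semantic.setdefault((namespace, model), []).append((self.embed(prompt), answer))

    def get_or_call(self, namespace: str, model: str, prompt: str, llm: Callable[[str], str]):
        tier, answer = self.lookup(namespace, model, prompt)
        if tier == "miss":                        # miss: real call, then write back to both tiers
            answer = llm(prompt)
            self.store(namespace, model, prompt, answer)
        return tier, answer


def toy_embed(text: str) -> Vector:
    """Toy keyword-count embedder, for illustration only."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in ("password", "reset", "billing")]


llm_calls = []


def fake_llm(prompt: str) -> str:
    llm_calls.append(prompt)
    return "answer for: " + prompt


cache = TwoTierCache(embed=toy_embed)
tiers = [
    cache.get_or_call("docs", "model-a", prompt, fake_llm)[0]
    for prompt in (
        "How do I reset my password?",
        "how do I  reset my password?",
        "password reset help",
        "billing question",
    )
]
```

Each call lands in exactly one of the three states, and only the misses reach the model.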

Architecture

Reverb is built around clean interfaces for each pluggable component, which is what lets it scale down to an in-memory dev setup and up to Redis plus HNSW plus NATS-driven CDC without code changes. The top-level flow:

Notice that the invalidation path and the lookup path share no state beyond the store itself.

CDC events can fire at any time — a webhook from your CMS, a NATS JetStream message, a polling loop against a content API.
The invalidation engine consults the lineage index to figure out which specific cache entries to evict. Every other cached entry keeps its hit rate.

Two interesting design choices are:

the two-tier fallthrough, which means the cache has a meaningful answer for most queries, not just byte-identical ones
the lineage-based invalidation, which means stale-knowledge hallucinations stop being an accepted cost of caching

Neither is a novel technique in isolation — CDN cache tags and hierarchical CPU caches use similar approaches. The novelty is in recognizing that LLM responses are derived data with explicit sources, and that derived-data systems have known-correct invalidation disciplines that work just as well when the derivation is a transformer inference.

Lineage as the first-class concept

When you Store() an entry, you hand Reverb a list of sources: the documents that contributed to the LLM’s answer. Each source is a (source_id, content_hash) pair. The lineage index maintains a bidirectional mapping: source IDs to the set of cache entries they contributed to, and cache entries to the set of sources they depend on. When a CDC listener reports a change for source_id = "doc:password-guide", the engine asks the lineage index for all dependent entries and walks through them:

If the source has been deleted (zero hash), invalidate every dependent entry.
If the source still exists but the content_hash differs from the stored value, invalidate.
If the content hash matches (the webhook fired but nothing actually changed), do nothing. Idempotency is free.

That is also the contract the application has to honor. Reverb does not infer provenance by itself. Your retrieval layer, tool wrapper, or orchestration code must tell it which source documents actually contributed to the answer. If you omit a source, Reverb cannot invalidate on that source’s change; if you over-attach unrelated sources, you will evict too aggressively. The cache is only as causally correct as the lineage you record at write time.

Compare this to the naive alternative — TTL-based invalidation, tuned conservatively at, say, 6 hours. During those 6 hours, the cache can serve any number of answers derived from a document that changed 5 minutes after the entry was cached. The user experiences a confident, fluent, completely wrong answer. With lineage-based invalidation, the moment your CMS pushes the webhook, the relevant cache entries are gone.

The operational sequence is short and predictable:

At t0, the application stores an answer plus sources=[("doc:password-guide", hash_v1)].
At t1, the source document changes and the CMS emits a CDC event carrying hash_v2.
At t2, Reverb looks up every cache entry linked to doc:password-guide.
At t3, entries whose stored hash is still hash_v1 are evicted from the store and vector index.
At t4, the next lookup misses the cache and regenerates against fresh knowledge.
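
The invalidation rules and the t0 to t4 sequence above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Reverb's Go code: LineageIndex, record, and on_change are hypothetical names, and None stands in for the "zero hash" that marks a deleted source.

```python
from collections import defaultdict
from typing import Optional


class LineageIndex:
    """Bidirectional map between source documents and cache entries."""

    def __init__(self) -> None:
        self.entries_by_source: dict = defaultdict(set)  # source_id -> {entry_id}
        self.sources_by_entry: dict = {}                 # entry_id -> {source_id: content_hash}

    def record(self, entry_id: str, sources: list) -> None:
        """Register an entry with its (source_id, content_hash) pairs at write time."""
        self.sources_by_entry[entry_id] = dict(sources)
        for source_id, _ in sources:
            self.entries_by_source[source_id].add(entry_id)

    def on_change(self, source_id: str, new_hash: Optional[str]) -> list:
        """Apply a CDC event and return the evicted entry IDs.

        new_hash=None means the source was deleted (assumed stand-in for a zero hash).
        """
        evicted = []
        for entry_id in sorted(self.entries_by_source.get(source_id, ())):
            stored_hash = self.sources_by_entry[entry_id][source_id]
            if new_hash is None or new_hash != stored_hash:
                evicted.append(entry_id)
            # Matching hash: the event fired but nothing changed, so do nothing.
        for entry_id in evicted:
            self._drop(entry_id)
        return evicted

    def _drop(self, entry_id: str) -> None:
        # Remove the entry from both directions of the mapping.
        for source_id in self.sources_by_entry.pop(entry_id, {}):
            self.entries_by_source[source_id].discard(entry_id)


index = LineageIndex()
index.record("e1", [("doc:password-guide", "hash_v1")])                          # t0
index.record("e2", [("doc:password-guide", "hash_v1"), ("doc:billing", "h1")])
index.record("e3", [("doc:billing", "h1")])

unchanged = index.on_change("doc:password-guide", "hash_v1")  # no-op event
changed = index.on_change("doc:password-guide", "hash_v2")    # t1 to t3
replayed = index.on_change("doc:password-guide", "hash_v2")   # duplicate delivery
deleted = index.on_change("doc:billing", None)                # source removed
```

A real deployment would also delete the evicted IDs from the store and the vector index; the sketch only tracks which entries must go.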

This is not a new idea in the abstract — database query caches have done tuple-level invalidation for decades, and CDN cache tag invalidation is a production pattern at scale. The contribution is noticing that LLM response caches have exactly the same dependency structure and applying the same discipline.

The pluggable-backends discipline

Reverb exposes four interfaces, each with two or more implementations:

embedding.Provider — OpenAI, Ollama, or a deterministic fake for tests. The fake (fake.New(n)) is a hash-based embedder that produces stable vectors for stable inputs, which makes integration tests reproducible without requiring an API key. This is the kind of detail that signals the library was written by someone who actually runs tests in CI.
vector.Index — a brute-force flat index (O(n)) and an HNSW index (approximately logarithmic search in practice). You start with flat, and when you outgrow it you swap in HNSW with no other code changes.
store.Store — memory for dev, Redis for production, BadgerDB for embedded use cases. The conformance subpackage ships a shared test suite that every store implementation must pass, which is how the interface stays honest over time.
cdc.Listener — webhook, polling, NATS. Each is a different architectural fit: webhook for push-based CMS integrations, polling for systems you cannot modify, NATS for high-volume event streams.
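
To make the deterministic fake embedder concrete, here is a hedged Python analog of the idea. It is not the Go fake.New(n) implementation; fake_embed and its hashing scheme are assumptions chosen only to show the property that matters, which is stable vectors for stable inputs with no network call.

```python
import hashlib
import math
import struct


def fake_embed(text: str, dims: int = 8) -> list:
    """Derive a unit-length vector from SHA-256 digests of the input text."""
    values = []
    counter = 0
    while len(values) < dims:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Each 4-byte chunk of the digest becomes one component in [-1, 1].
        for (raw,) in struct.iter_unpack(">I", digest):
            values.append(raw / 0xFFFFFFFF * 2.0 - 1.0)
        counter += 1
    values = values[:dims]
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]


first = fake_embed("how do I reset my password?")
second = fake_embed("how do I reset my password?")
other = fake_embed("password reset help")
```

A hash-based embedder like this carries no semantic signal: identical inputs map to identical vectors, and different inputs land far apart regardless of meaning. That is what you want for reproducible plumbing tests, and not what you want for measuring recall.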

The interface-driven design makes Reverb realistic to adopt: start with all-in-memory (zero infrastructure), move to Redis plus HNSW when you outgrow a single process, swap the CDC listener when your source-of-truth changes. None of those migrations need to touch the application code.

Deployment shapes

Reverb runs as three things, depending on how you want to use it:

A Go library, linked directly into an application. Fastest path, lowest latency, no extra process to manage. The pkg/reverb facade is the full public API.
A standalone HTTP server (cmd/reverb). Language-agnostic REST API under /v1/ — lookup, store, invalidate, entries/{id}, stats, healthz. This is the path if your application is in Python or TypeScript and you want to cache centrally.
A standalone gRPC server, same operations as the HTTP API but with protobuf-defined contracts in pkg/server/proto/reverb.proto. Clients in any language can generate their own stubs.

The HTTP and gRPC servers share the same underlying Client, so you can deploy both protocols side-by-side from the same binary and pick whichever your calling environment prefers.

Where this fits

I think semantic caching is about to become table stakes for production LLM systems in the same way that ordinary HTTP caching became table stakes for the web in the 2000s. The cost pressure is enormous (every cache hit is an LLM call that did not happen), and the latency improvement is user-perceptible. But “cache LLM responses” is the easy version of the problem. The hard version is “cache LLM responses correctly, even when the world the LLM is reasoning about changes out from under the cache.” That is the problem Reverb is built to solve.

Reverb handles the knowledge freshness dimension of agent reliability. For the trust side — knowing which agents to rely on based on observed behavior — see MultiTrust. For detecting when agents get stuck waiting on each other, see Tangle.

MultiTrust: Subjective Logic as a Runtime for Multi-Agent Trust

2026-04-22T06:00:00-04:00

In multi-agent systems, trust is a valuable asset: it lets agents reason about future collaboration, coordination, and planning. Most “trust score” implementations in agentic systems are a single float between 0 and 1. That number is doing two jobs at once (representing how much positive evidence an agent has accumulated, and representing how confident the system is in that judgment), and it collapses them into a value that makes the two indistinguishable. A brand-new agent with no history and a seasoned agent that has run 10,000 tasks with an even win/loss record both land at 0.5. The scalar has no room to say “I don’t know yet.”

MultiTrust fixes this by reaching for the right math. It represents trust as a Subjective Logic opinion — a triple of (belief, disbelief, uncertainty) that sums to one — and exposes the whole machinery as an MCP server, so any Model Context Protocol-aware agent can consult it as a standard tool call.

What Subjective Logic buys you

Subjective Logic, developed by Audun Jøsang in the early 2000s, is a probabilistic logic designed precisely for reasoning under uncertainty where the uncertainty itself must be represented. An opinion looks like this:

opinion = Opinion(
    belief      = 0.60,   # evidence supports trusting the agent
    disbelief   = 0.12,   # evidence against
    uncertainty = 0.28,   # we don't have enough data to be sure
    base_rate   = 0.50,   # prior: how trustworthy is a "typical" agent?
)
# belief + disbelief + uncertainty == 1.0 (invariant)

The projected trust score, the value you use to make a gating decision, is belief + uncertainty × base_rate. This is the clever bit. A vacuous opinion (0, 0, 1) projects to base_rate: in the absence of evidence, you fall back to the population prior. As evidence accumulates, uncertainty shrinks, and the projection converges on the observed success fraction, positive / (positive + negative). You get cold-start behavior and seasoned-agent behavior from the same formula, with no special-casing.

Under the hood, evidence maps to opinions through the Beta distribution:

belief      = positive / (positive + negative + W)
disbelief   = negative / (positive + negative + W)
uncertainty = W        / (positive + negative + W)

where W is a prior weight (typically 2). Every call to submit_evidence() is an update to the positive/negative counters; the opinion recomputes deterministically. The mapping itself has no magic numbers beyond W; the only time-dependent knob is the decay half-life applied when a score is read.

That sounds abstract until you compare cold start against real history. With base_rate = 0.5 and W = 2, the mapping makes the distinction explicit:

  Agent state               Evidence (positive, negative)   Opinion (b, d, u)     Projected trust
  ───────────────────────  ──────────────────────────────  ────────────────────  ───────────────
  Brand-new agent           (0, 0)                          (0.00, 0.00, 1.00)    0.50
  Early promising run       (3, 0)                          (0.60, 0.00, 0.40)    0.80
  Mixed but well-observed   (50, 50)                        (0.49, 0.49, 0.02)    0.50

The important case is the first versus the third row. Both might project to roughly 0.5, but they mean opposite things. The brand-new agent is at 0.5 because the system has no evidence and is falling back to the prior. The seasoned but inconsistent agent is at 0.5 because the system has a lot of evidence and that evidence is genuinely split. A scalar score hides that difference; the opinion keeps it visible.
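
The table can be reproduced directly from the two formulas. This is a small standalone sketch: the Opinion fields mirror the snippet above, while from_evidence and projected are hypothetical helper names, not confirmed MultiTrust API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected(self) -> float:
        # Projection: belief plus the share of uncertainty granted by the prior.
        return self.belief + self.uncertainty * self.base_rate


def from_evidence(positive: float, negative: float,
                  base_rate: float = 0.5, prior_weight: float = 2.0) -> Opinion:
    """Map evidence counts to an opinion using the prior weight W."""
    total = positive + negative + prior_weight
    return Opinion(positive / total, negative / total, prior_weight / total, base_rate)


rows = {
    "Brand-new agent": (0, 0),
    "Early promising run": (3, 0),
    "Mixed but well-observed": (50, 50),
}
for label, (positive, negative) in rows.items():
    opinion = from_evidence(positive, negative)
    print(f"{label:<24} b={opinion.belief:.2f} d={opinion.disbelief:.2f} "
          f"u={opinion.uncertainty:.2f} projected={opinion.projected():.2f}")
```

The first and third rows print the same projection, and the uncertainty column is the only place the difference shows up.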

Architecture

MultiTrust is organized around a single async orchestrator, TrustManager, with pluggable backends for storage, evidence accumulation, and exposure. The MCP server is one of several entry points — you can also use the library directly, gate async functions with decorators, or export/import snapshots between environments.

The flow is deliberately one-directional:

Callers submit observations as Evidence records (agent, authority, positive count, negative count, an optional rule name).
The TrustManager fuses those into Subjective Logic opinions using the canonical operators — cumulative fusion for independent authorities, averaging for redundant ones.
Opinions are persisted in the trust store.
When asked for a trust score, the manager applies time decay (opinions drift toward vacuous at a configurable half-life), projects the current opinion, and returns the scalar.
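
Two steps in that flow benefit from a worked sketch: cumulative fusion and time decay. The fusion function below follows the standard Subjective Logic cumulative fusion operator for two opinions that share a base rate. The decay function is only one plausible scheme (scale the implied evidence counts by a half-life factor) and is an assumption, not a description of how MultiTrust implements decay; the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5


def cumulative_fuse(a: Opinion, b: Opinion) -> Opinion:
    """Cumulative fusion of two opinions from independent authorities."""
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        raise ValueError("both opinions have zero uncertainty; this sketch skips that limit case")
    return Opinion(
        belief=(a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        disbelief=(a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        uncertainty=(a.uncertainty * b.uncertainty) / k,
        base_rate=a.base_rate,  # assumes both opinions share a base rate
    )


def decay(opinion: Opinion, elapsed: float, half_life: float,
          prior_weight: float = 2.0) -> Opinion:
    """Assumed decay scheme: halve the implied evidence every half_life."""
    if opinion.uncertainty == 0:
        return opinion  # infinite implied evidence; left untouched in this sketch
    keep = 0.5 ** (elapsed / half_life)
    positive = prior_weight * opinion.belief / opinion.uncertainty * keep
    negative = prior_weight * opinion.disbelief / opinion.uncertainty * keep
    total = positive + negative + prior_weight
    return Opinion(positive / total, negative / total, prior_weight / total, opinion.base_rate)


validator = Opinion(0.60, 0.00, 0.40)   # equivalent to evidence (3, 0) with W = 2
monitor = Opinion(0.25, 0.25, 0.50)     # equivalent to evidence (1, 1) with W = 2
fused = cumulative_fuse(validator, monitor)
aged = decay(validator, elapsed=30.0, half_life=30.0)
```

Fusing the two example opinions gives the same result as pooling their evidence into (4, 1), which is the sense in which cumulative fusion treats the authorities as independent observers. Under the assumed decay, one half-life turns (3, 0) into (1.5, 0), so belief falls and uncertainty rises.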

The EvidenceLedger is the piece that pulls its weight in production. It stores the individual observations that contributed to an opinion, with authority IDs and rule names. When something goes wrong and you need to defend a trust decision — why did we route this request to agent X? — explain_trust() produces a structured breakdown showing which authorities and which rules moved the score, by how much, and when.

A representative explanation looks less like a mystery score and more like an audit trail:

agent: fact-checker
current opinion: b=0.31 d=0.46 u=0.23 base_rate=0.50
projected trust: 0.425
top contributors:
  - validator / factual_consistency : -0.18  (7 negative observations)
  - latency_monitor / timeout_rate  : -0.05  (3 degraded responses)
  - editor / accepted_corrections   : +0.04  (2 successful recoveries)
decision at threshold 0.60: blocked

That is the practical advantage of carrying belief, disbelief, uncertainty, authority, and rule names all the way through the runtime: when the system changes its behavior, you can inspect the reason instead of reverse-engineering it from a number.

A motivating example

The repository ships a hallucination_firewall.py demo that captures the intended use case. A research pipeline has a fact-checker agent whose accuracy silently degrades — perhaps the underlying model was updated, perhaps a prompt regression slipped in, perhaps the content it’s checking has drifted out of its training distribution. Each failed fact-check is submitted as negative evidence against the agent. Within a dozen or so observations, the opinion shifts enough that is_trusted(threshold=0.6) returns false, and the orchestrator gates the fact-checker out of the pipeline before its mistakes reach the final answer.

The critical thing is that this happens gradually and mathematically, not through a hand-tuned heuristic. The same framework handles the other direction too — agents recover trust as they accumulate positive evidence, and the time-decay mechanism ensures ancient evidence stops dominating current behavior.

If you are building a multi-agent system where different agents have different reliability profiles — and in practice, every non-trivial multi-agent system has this — you eventually need a way to represent and reason about that. Rolling a scalar score is the obvious first move, and it will be wrong in the three places that matter: cold start, recovery after degradation, and explainability. Subjective Logic is a two-decades-old, well-studied framework that gets all three right. MultiTrust is a small, modern, MCP-native implementation of it. The combination of principled math and standard-protocol exposure is, I think, the shape this category of tool should take.

MultiTrust addresses the trust dimension of multi-agent reliability. Two companion pieces cover adjacent failure modes: Tangle detects deadlocks and livelocks when agents form circular waits, and Reverb ensures cached LLM responses don’t go stale when the underlying knowledge changes.

Tangle: Deadlock and Livelock Detection for LangGraph Agents

2026-04-22T05:00:00-04:00

Multi-agent LLM workflows are, from a concurrency standpoint, small distributed systems. They hold resources, they wait on each other, and — like every other distributed system we have ever built — they can get stuck. The failure mode is worse than an outright crash: no exception is raised, no timer fires, no agent knows anything is wrong. The workflow just stops producing tokens. The operator sees a spinner.

Tangle is a small Python library that catches this class of failure in real time for LangGraph workflows (and, via OpenTelemetry, for anything else). For deadlocks, it reuses an idea that has been sitting in operating-systems textbooks since 1972, the Wait-For Graph, and applies it at the agent layer, where the same topology has quietly reappeared. For livelocks, the current implementation is narrower: it provides repeated-pattern detection over message digests.

The failure mode

Consider a four-agent pipeline: researcher → writer → reviewer → editor. Each agent, under certain states, may wait on output from another. Introduce a back-edge — say, editor → researcher for a re-research pass — and the dependency graph now contains a cycle. If every agent in the cycle is simultaneously in its waiting state, none of them can advance. None of them will ever advance. The workflow is deadlocked.

Livelock is the subtler sibling: no circular wait, but two agents bounce the same rejected message back and forth forever. A reviewer rejects a draft, the writer revises in a way that changes nothing material, the reviewer rejects the revision. Tokens keep being spent. Progress is zero.

In practice, a deadlock trace looks like this:

researcher waiting for writer
writer waiting for reviewer
reviewer waiting for editor
editor waiting for researcher   # closing edge; cycle exists now

At that moment, you do not need a timeout to guess the workflow is stuck. The structure itself is enough: every agent in the cycle is waiting on another agent in the same cycle, so no further progress is possible without external intervention.

Both failures are detectable in principle. The question is how to detect them cheaply enough that instrumentation doesn’t dominate the workflow’s own cost.

Architecture

Tangle separates event ingestion from detection from resolution. The three stages are deliberately independent — you can swap SDK hooks for OTLP spans, switch cycle detection to livelock detection per event type, and chain resolvers in any order. The shape of the system:

Events flow in from one of three sources. Each event is a small, typed record (e.g., REGISTER, WAIT_FOR, RELEASE, SEND, CANCEL, COMPLETE). They hit TangleMonitor.process_event(), which updates the Wait-For Graph and dispatches to the appropriate detector: WAIT_FOR events touch cycle detection, SEND events touch livelock pattern matching. When either fires, the resolver chain runs in order and halts on the first resolver that succeeds.

Why a Wait-For Graph?

The Wait-For Graph (WFG) is one of those classical constructs that keeps reappearing in disguise. Holt described it in 1972 for kernel deadlock detection. Database engines use it for transaction lock cycles. Distributed lock managers like Chubby and ZooKeeper reason about it implicitly. The insight in Tangle is that an LLM agent holding a conversational turn is, for the purposes of progress analysis, isomorphic to a process holding a resource. Same graph, different vertices.

That matters because cycle detection on a WFG is a well-understood problem with well-understood complexity. Tangle uses two complementary algorithms:

Incremental DFS on edge-add. When a new WAIT_FOR edge is added, walk forward along existing edges from the target to see whether you can reach the source; if you can, the new edge closes a cycle. O(V+E) worst case, but in practice tiny because multi-agent graphs are shallow.
Periodic Kahn’s-algorithm scan over the whole graph. A topological sort that fails is a cycle that exists. This is the belt-and-suspenders pass that catches anything the incremental detector might race against during concurrent edits.
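
Both passes fit in a short sketch. This is not Tangle's code; WaitForGraph, add_wait, release, and has_cycle are hypothetical names, and the example replays the four-agent trace from earlier.

```python
from collections import defaultdict, deque
from typing import Optional


class WaitForGraph:
    def __init__(self) -> None:
        self.edges: dict = defaultdict(set)  # waiter -> agents it waits on

    def add_wait(self, waiter: str, target: str) -> Optional[list]:
        """Add waiter -> target; return the cycle if this edge closes one."""
        self.edges[waiter].add(target)
        path = self._find_path(target, waiter)   # incremental DFS from the target
        return [waiter] + path if path is not None else None

    def release(self, waiter: str, target: str) -> None:
        self.edges[waiter].discard(target)

    def _find_path(self, start: str, goal: str) -> Optional[list]:
        stack, parent = [start], {start: None}
        while stack:
            node = stack.pop()
            if node == goal:
                path = []
                while node is not None:          # rebuild the path via parent links
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in self.edges.get(node, ()):
                if nxt not in parent:
                    parent[nxt] = node
                    stack.append(nxt)
        return None

    def has_cycle(self) -> bool:
        """Full scan with Kahn's algorithm: a failed topological sort means a cycle."""
        nodes = set(self.edges) | {t for targets in self.edges.values() for t in targets}
        indegree = {node: 0 for node in nodes}
        for targets in self.edges.values():
            for target in targets:
                indegree[target] += 1
        ready = deque(node for node in nodes if indegree[node] == 0)
        visited = 0
        while ready:
            node = ready.popleft()
            visited += 1
            for target in self.edges.get(node, ()):
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
        return visited != len(nodes)


graph = WaitForGraph()
graph.add_wait("researcher", "writer")
graph.add_wait("writer", "reviewer")
graph.add_wait("reviewer", "editor")
before = graph.has_cycle()
cycle = graph.add_wait("editor", "researcher")   # closing edge
after = graph.has_cycle()
```

The incremental check names the agents in the cycle, which is what a resolver needs; the periodic scan only answers yes or no, which is enough for a safety net.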

Livelock is different — no cycle appears in the graph. Instead, Tangle fingerprints each SEND event’s message payload with xxhash (chosen for speed over cryptographic strength — this is a signal, not a security claim) and keeps the last N digests in a ring buffer. When the same digest reappears more than livelock_min_repeats times in the window, the detector fires. No semantic understanding of the message is required; identical repeated content is the signal.
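
A minimal sketch of the digest window, with two stated assumptions: hashlib.blake2b stands in for xxhash so the example needs only the standard library, and "more than livelock_min_repeats times" is read as "the digest occurs more than min_repeats times within the window". LivelockDetector and observe are hypothetical names.

```python
import hashlib
from collections import Counter, deque


class LivelockDetector:
    def __init__(self, window: int = 16, min_repeats: int = 3) -> None:
        self.window = deque(maxlen=window)   # ring buffer of recent digests
        self.counts = Counter()              # occurrences of each digest in the window
        self.min_repeats = min_repeats

    def observe(self, payload: bytes) -> bool:
        """Record one SEND payload; return True when it looks like a livelock."""
        # Stand-in for xxhash: a short non-cryptographic-purpose fingerprint.
        digest = hashlib.blake2b(payload, digest_size=8).digest()
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]          # about to fall out of the ring buffer
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.window.append(digest)
        self.counts[digest] += 1
        return self.counts[digest] > self.min_repeats


detector = LivelockDetector(window=8, min_repeats=3)
signals = [detector.observe(b"rejected: tone is too informal") for _ in range(4)]
```

Keeping the counts alongside the ring buffer makes each observation constant-time, so the check adds little to the cost of a SEND.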

That distinction matters operationally. Deadlock detection here is structural: if a Wait-For Graph contains a cycle, the workflow is blocked in a precise, checkable sense. Livelock detection is heuristic: Tangle is looking for repeated message patterns that strongly suggest non-progress, not proving a theorem about all possible non-progress states. That is still a useful line to draw in production. You can treat deadlocks as mechanically certain and livelocks as high-confidence warnings that deserve intervention.

The LangGraph integration

What makes this practical for day-to-day use is that instrumentation is two decorators:

from tangle import TangleConfig, TangleMonitor
from tangle.integrations.langgraph import (
    tangle_node,
    tangle_conditional_edge,
)

config = TangleConfig(
    resolution="cancel_youngest",
    livelock_min_repeats=3,
)
monitor = TangleMonitor(config=config, on_detection=print)

@tangle_node(monitor, agent_id="reviewer")
def reviewer(state):
    return {"feedback": review(state["draft"])}

@tangle_conditional_edge(monitor, from_agent="reviewer")
def route_after_review(state):
    if state["feedback"] == "approved":
        return "__end__"
    return "writer"   # back-edge — potential loop

The decorators emit REGISTER, SEND, WAIT_FOR, and RELEASE events transparently. Existing LangGraph code keeps working. Tracking is activated per-invocation by threading a tangle_workflow_id through the state dict, so you can roll out detection to a subset of production traffic without changing the graph definition.

For non-LangGraph workflows (or for multi-language stacks), Tangle can reconstruct the same events from OpenTelemetry spans, which means any tracing instrumentation you already have becomes deadlock-aware for free.

Resolution, not just detection

Detection without a response is an alert that wakes someone up at 3am. Tangle ships several built-in resolvers, for example:

Alert — the cheap default. Hand a structured Detection to a callback, let the application decide.
Cancel youngest — kill the most recently joined agent in the cycle. In practice this is the right default for review/revise loops: it breaks the cycle with minimal loss of context.
Tiebreaker prompt injection — for livelocks, inject a system message that explicitly names the repeated pattern and asks the agent to change tack. Cheaper than restarting the workflow.
Escalate — POST the detection to an external webhook for human or upstream-service intervention.

Resolvers are configured as a chain that executes in order and stops on the first success. Configure it once; the per-detection behavior emerges from the config, not from scattered try/except blocks in agent code.
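The "run in order, stop on first success" behavior can be sketched in a few lines of plain Python. This is not Tangle's implementation or API. The Detection dataclass, the resolver functions, and run_chain are hypothetical stand-ins, and the resolvers only report whether they would succeed instead of cancelling or prompting anything.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Detection:
    """Hypothetical stand-in for a structured detection record."""
    kind: str                                        # "deadlock" or "livelock"
    agents: List[str] = field(default_factory=list)  # ordered oldest to youngest


Resolver = Callable[[Detection], bool]


def tiebreaker_prompt(detection: Detection) -> bool:
    # Only applies to livelocks; report failure so the chain moves on.
    return detection.kind == "livelock"


def cancel_youngest(detection: Detection) -> bool:
    # Succeeds when there is an agent to cancel (the last one to join).
    return bool(detection.agents)


def escalate(detection: Detection) -> bool:
    # Last resort; in a real system this would notify a human or a service.
    return True


def run_chain(detection: Detection, resolvers: List[Resolver]) -> Optional[str]:
    """Try resolvers in order; return the name of the first that succeeds."""
    for resolver in resolvers:
        try:
            if resolver(detection):
                return resolver.__name__
        except Exception:
            # A failing resolver should not stop later ones from running.
            continue
    return None


chain = [tiebreaker_prompt, cancel_youngest, escalate]
deadlock = Detection(kind="deadlock", agents=["writer", "reviewer"])
livelock = Detection(kind="livelock", agents=["writer", "reviewer"])
print(run_chain(deadlock, chain))
print(run_chain(livelock, chain))
```

In this sketch, the same chain handles the deadlock with cancel_youngest and the livelock with tiebreaker_prompt, without any branching in the agent code itself.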

Where this fits

If you are running LangGraph in production, especially with conditional edges or multi-agent negotiation patterns, you have almost certainly hit a workflow that hung. The standard mitigation — a coarse wall-clock timeout — is a blunt instrument: it catches deadlocks eventually, but at the cost of cancelling any slow-but-healthy run that exceeds the budget. Tangle’s contribution is to give the same workflow a structural reason to cancel (a cycle in the WFG, a repeated digest pattern) rather than a temporal one (we waited too long). That distinction matters at scale, because it decouples correctness from tail latency.

The approach generalizes past LangGraph. Any system where autonomous components exchange messages and occasionally wait on each other (agent frameworks, workflow orchestrators, multi-model ensembles) has the same failure modes. Tangle is an early, careful implementation of what I suspect will become a valuable tool for building reliable, fault-tolerant agentic infrastructure: progress monitors that treat liveness as a first-class property, not a property you check by inference after everything has already gone quiet.

Tangle covers the liveness dimension of multi-agent reliability — detecting when workflows stop making progress. For the trust layer — modeling which agents are reliable based on accumulated evidence — see MultiTrust. For ensuring cached responses stay fresh when source knowledge changes, see Reverb.
