Engineering · v2416 · 2026-05-20

Three engines.
One swarm.

The Chip swarm used to run every step on the same flagship model. That was expensive on builders and fragile on the integrator. We split it: gpt-5.2 still plans, gpt-5-mini now builds, moonshot-v1-128k integrates. And if any call ever blows past its context window, it falls back automatically.

[Image: a swarm of bees over a flower field, photographed in golden light; the visual metaphor for Chip's multi-agent build pipeline.]

A swarm is many small workers, one shared goal.

8× parallel builder workers per run, each now on gpt-5-mini

1M tok integrator context window via Moonshot, 5× the old ceiling

0 partials from context-limit errors after the auto-fallback shipped

What changed

Different model for different work.

Each phase has a different shape. Planning is reasoning-heavy. Building is fetch-write-publish. Integrating audits everything at once. One model can't be excellent at all three without you paying for the excellence you don't need.

| Phase | Was | Now |
| --- | --- | --- |
| Architect (planning) | gpt-5.2 | gpt-5.2 (unchanged) |
| Builders (×8 parallel) | gpt-5.2 | gpt-5-mini |
| Integrator (whole-workspace audit) | gpt-5.2 | moonshot-v1-128k |
| QA review | gpt-5-mini | gpt-5-mini (unchanged) |
| Search grounding | gemini-2.5-flash | gemini-2.5-flash (unchanged) |
| Image generation | gemini-2.5-flash-image | gemini-2.5-flash-image (unchanged) |

The architect kept its flagship. Planning is where stakes are highest — wrong tasks at the top means every builder downstream wastes effort. The architect's output is small but load-bearing. It gets the model that earns its cost.

This change recognizes the natural partition between generative planning and repetitive production. The architect writes compact, high-bandwidth task plans that require careful reasoning: edge-case handling, policy constraints, and decomposition into small implementable steps. Those plans are typically short in token length but require a model that excels at multi-step reasoning and maintaining global invariants. Keeping the architect on the flagship model reduced mis-specified tasks and decreased time spent debugging complex planning errors during early rollouts.

Builders, in contrast, operate at scale. Their dominant cost is quantity: many parallel workers performing high-volume text transformations and I/O. By moving builders to gpt-5-mini, the team reclaimed budget headroom and could run more aggressive parallelism and retries when flaky network conditions occurred. The practical upshot was cleaner operational signals: if a builder fails due to external network or scrape changes, it is cheap to retry; if it fails due to a mis-specified instruction, the architect's higher-quality plans help surface that earlier in the flow.

The integrator's shift to a long-context model is where the behaviour shift is most visible. Instead of attempting partial audits piecemeal, the integrator now performs a holistic pass. That change made previously brittle QA heuristics unnecessary in many cases — global deduplication, cross-page link validation, and canonical metadata enforcement are easier when you can see the whole workspace at once. As a result, the post-integration revision cycles shrank and the human review load dropped sharply during canary runs.

We emphasize that the partition is a pragmatic engineering decision, not a theoretical purity test. It reflects careful observation of where the system spent time and where failures clustered. The design lowers the blast radius of changes: a model swap in one tier requires minimal adaptation in another because the interfaces are small and well-specified. That separation of concerns keeps the system maintainable while letting each tier be tuned for the qualities that matter most to its dominant workload.

From an adoption standpoint the change was staged and measured. Feature flags controlled which builder pools used the mini model and which still ran the flagship; canary jobs exercised the integrator on the long-context model in a non-production path; and a small set of telemetry signals (retry rates, partial status counts, diff rates against a golden baseline) drove go/no-go decisions. Engineers instrumented each layer so that every actionable signal could be traced to a small, well-defined change in prompt templates, validation hooks, or runner configuration. The net result was a low-friction migration that preserved developer velocity while delivering observable operational wins.

Why we split

Two failures, six months apart.

The change wasn't theoretical: it fixes specific failures we kept seeing in the aiondemand.swarm_jobs table.

Failure A: Integrator OOM, job 33020_a7fae1c0 (partial)

HTTP 400 from openai (gpt-5.2): Input tokens exceed the context limit. The integrator had 40 tool turns of accumulated workspace data and ran out of room.

Failure B: Cost asymmetry, every run (always)

8 builder workers × flagship pricing × ~50 iterations each = the majority of the swarm's bill. The work was already routine.

Failure C: Dead constant, LONG_MODEL = "moonshot-v1-128k" (resolved)

Defined in the file at line 62, used nowhere. The original author meant to wire a long-context fallback. Never finished. The OOM made this a now-thing.

Field note

Failure A was silent in user terms — the swarm still wrote files, the page still published, but the integrator audit pass that polishes everything together never ran. The job landed as status=partial. Splitting the integrator off to a long-context model removes the failure mode entirely.

When we examined field incidents, the three error classes formed a pattern. Failure A produced invisible user-facing harm: published work that had not been audited end-to-end. That meant missing deduplication, missed policy checks, and occasionally inconsistent metadata that humans later noticed. Failure B generated a recurring operational cost pressure that masked small regressions — because the bulk of runtime was in the builders, small changes to builder behaviour or model weights translated into immediate budget noise. Failure C was a classic engineering oversight: the fallback was in the codebase but not wired to the runner logic, so a ready-made protection sat idle while users experienced its absence.

The postmortem work revealed a mix of contributing factors: brittle heuristics in audit logic, insufficient telemetry on context growth, and an operational tendency to treat the integrator as a low-frequency job rather than the critical synthesis step it is. Our fix set addressed all three: we increased the integrator's token budget by design, moved high-volume builders to a cheaper model to reduce noise, and wired the fallback to the runner with clear metrics and alerting so it could be observed and tuned in production.

A key lesson was the value of end-to-end reproducibility. We invested in deterministic runners and richer audit trails so that any job that once failed could be replayed in a canary environment and inspected in detail. That reproducibility shortened incident cycles and made the rollout measurable: for each canary run we recorded the full audit trace, the builder diffs, and the final integrator edits so engineers could compare outcomes and verify behaviour without speculative reasoning.

Finally, the human side mattered. Engineers across teams collaborated to write compact, testable acceptance criteria for tasks and to codify failure classifications so that incident responders could act quickly. This collaborative groundwork (shared runbooks, paired canary runs, and cross-team reviews) was as important as the model change itself: the split gave structure to collaboration and reduced the friction of triage when things did go wrong.

Architecture

Where every model lives.

The pipeline didn't change shape — only which engine sits in which seat.

Pipeline diagram — architect → builders → integrator → QA

Operationally the pipeline is simple: a compact plan flows from the architect into many parallel builder runs; each builder produces discrete artifacts and metadata; the integrator ingests the workspace and performs global checks; QA finalises with an approve/revise step. The improvement here is targeted: we intentionally aligned model capability to phase-specific needs so each engine is cost-optimized for its dominant workload (reasoning vs throughput vs context).

Because the pipeline shape is unchanged, operators can reuse existing observability and retry logic. The changes are largely orthogonal to orchestration — swapping models is a configuration change rather than a rewrite of runner behaviour. That made rollout low-risk: we staged builders to the mini model under feature flags, exercised the integrator on long-model test runs, and only removed the flags once telemetry showed the expected reduction in partials and stable publish rates.

Concretely, the architecture preserves clear contracts between components. The architect emits a JSON task object with required fields, validation tests, and acceptance criteria. Builders implement a narrow interpreter that maps those tasks into deterministic steps: fetch, parse, render, and save. The integrator treats builder outputs as input artifacts and runs cross-cutting checks and synthesis steps. The clear separation of concerns is important: it means a model change in one tier rarely requires logic changes in another, which simplifies testing and rollback during incidents.
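The architect-to-builder contract described above can be sketched as a small typed object with a strict parser. This is a minimal illustration only: the key names (`task_id`, `instructions`, `validation_tests`, `acceptance_criteria`) and the `parse_task` helper are assumptions, since the note says only that the task object carries required fields, validation tests, and acceptance criteria.

```python
import json
from dataclasses import dataclass

# Hypothetical key names; the real task schema is not shown in the note.
REQUIRED_KEYS = ("task_id", "instructions", "validation_tests", "acceptance_criteria")


@dataclass(frozen=True)
class BuilderTask:
    task_id: str
    instructions: str
    validation_tests: tuple
    acceptance_criteria: tuple


def parse_task(raw: dict) -> BuilderTask:
    """Build a BuilderTask, rejecting any object that lacks a required key."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"task is missing required keys: {missing}")
    return BuilderTask(
        task_id=str(raw["task_id"]),
        instructions=str(raw["instructions"]),
        validation_tests=tuple(raw["validation_tests"]),
        acceptance_criteria=tuple(raw["acceptance_criteria"]),
    )


# Example: a task as the architect might emit it (contents are made up).
raw_task = json.loads(
    '{"task_id": "t-001", "instructions": "Render the pricing page", '
    '"validation_tests": ["has_title"], "acceptance_criteria": ["page saved"]}'
)
task = parse_task(raw_task)
```

Failing fast on a malformed task keeps the error at the tier boundary, so a bad plan is caught before any builder spends iterations on it.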

To support these contracts we defined explicit schemas and lightweight validation hooks. Builders implement preflight checks that assert presence of required metadata (title, canonical, timestamps) and emit a standard diagnostics object when expectations fail. The integrator consumes these diagnostics and elevates a small set of them into actionable revision tasks. The schema-led approach reduced ambiguity about responsibility and clarified where an operational error should be retried versus escalated to human review.
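A builder preflight check of the kind described above might look like the following sketch. The metadata names (title, canonical, timestamps) come from the note; the shape of the diagnostics object and the `preflight` function name are assumptions made for illustration.

```python
# Metadata names are taken from the note; everything else is hypothetical.
REQUIRED_METADATA = ("title", "canonical", "timestamps")


def preflight(page_metadata: dict) -> dict:
    """Return a diagnostics object listing required metadata that is absent or empty."""
    missing = [key for key in REQUIRED_METADATA if not page_metadata.get(key)]
    return {
        "check": "required_metadata",
        "ok": not missing,
        "missing": missing,
    }


# Example: a page that has a title but no canonical link or timestamps.
diagnostics = preflight({"title": "Pricing"})
```

Returning a plain dictionary instead of raising lets the integrator collect diagnostics from every builder and decide later which ones become revision tasks.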

We also invested in observability primitives that let operators reason about context growth over time. For each job we now collect compact histograms of token usage per phase, a timeline of tool turns with timestamps, and an index of artifacts the integrator read. Those signals made it straightforward to diagnose why a previously reliable job hit a context limit: was it a sudden surge in embedded data, a previously unknown asset being fetched, or an accumulation of verbose diffs? Instrumentation like this changed troubleshooting from exploratory sessions to a reproducible forensic workflow.
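One way to collect compact per-phase token histograms is to bucket each call's token count and keep a counter per phase. The sketch below assumes a bucket width of 1,000 tokens and a `TokenUsage` class; neither is taken from the actual runner.

```python
from collections import Counter, defaultdict

BUCKET_WIDTH = 1000  # hypothetical bucket size, in tokens


class TokenUsage:
    """Per-phase histogram of token counts, bucketed to keep the record compact."""

    def __init__(self):
        self._histograms = defaultdict(Counter)

    def record(self, phase: str, tokens: int) -> None:
        # Round down to the start of the bucket the call falls into.
        bucket = (tokens // BUCKET_WIDTH) * BUCKET_WIDTH
        self._histograms[phase][bucket] += 1

    def histogram(self, phase: str) -> dict:
        """Return {bucket_start: call_count} for one phase, ordered by bucket."""
        return dict(sorted(self._histograms[phase].items()))


usage = TokenUsage()
for tokens in (1500, 1999, 4200):
    usage.record("integrator", tokens)
integrator_histogram = usage.histogram("integrator")  # {1000: 2, 4000: 1}
```

A histogram that drifts toward higher buckets across tool turns is the early warning for context growth that the note describes.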

Beyond telemetry, the change encouraged better engineering hygiene: smaller acceptance boundaries for tasks, more explicit invariants in task definitions, and improved auto-validation at the builder level. Engineers found that making contracts explicit both reduced back-and-forth during design reviews and produced clearer invariants that automation could check, which lowered both rollout friction and long-term maintenance burden.

Builders

What builders do

Builders are the workhorses of the swarm: they receive a narrowly-scoped task from the architect, fetch required resources, render the target page, upload assets, and call chip_save_page. Their success criteria are pragmatic and measurable: fidelity to the architect's specification, idempotent outputs, and predictable failure modes that can be retried or circuit-broken without human intervention.

Because builders run in parallel (the production setup runs eight workers per job), the aggregate cost is sensitive to per-call model pricing. Each builder typically executes dozens of short tool turns per iteration; the majority of these turns are mechanical text transforms and safe I/O-bound tasks. Moving to gpt-5-mini preserves the instruction-following behavior we need while drastically lowering the per-worker bill. It also reduces variance: smaller models tend to have more stable latency and fewer surprising completions for the kinds of templated outputs builders create.
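The cost sensitivity is easy to see with the figures from the note: eight workers at roughly fifty iterations each is about 400 model calls per run. The sketch below uses placeholder relative cost units, not real prices, purely to show how the per-call rate multiplies through.

```python
def builder_calls(workers: int, iterations_per_worker: int) -> int:
    """Total builder model calls in one run."""
    return workers * iterations_per_worker


def builder_bill(workers: int, iterations_per_worker: int, cost_per_call: int) -> int:
    """Total builder cost for one run, in whatever unit cost_per_call uses."""
    return builder_calls(workers, iterations_per_worker) * cost_per_call


# From the note: 8 workers, roughly 50 iterations each.
calls = builder_calls(8, 50)  # 400

# Placeholder relative units for illustration only; these are NOT real prices.
FLAGSHIP_COST_PER_CALL = 10
MINI_COST_PER_CALL = 2

bill_before = builder_bill(8, 50, FLAGSHIP_COST_PER_CALL)  # 4000 units
bill_after = builder_bill(8, 50, MINI_COST_PER_CALL)       # 800 units
```

Whatever the true price ratio is, it applies to every one of those calls, which is why the builder tier dominated the bill.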

On the implementation side, builders are engineered for resilience. They checkpoint progress, surface structured errors (HTTP 4xx/5xx, scrape/parsing failures), and return concise diagnostics that the integrator can consume during its audit. Operationally, the focus was to keep the builder contract tiny: if the architect's instructions are adhered to and the saved artifacts exist and validate, the builder is considered correct. That contract simplifies both testing and rollback should a model change need to be reversed.

During rollout we instrumented builders heavily. Metrics included per-turn latency, reject rates against policies, and a diff rate comparing outputs to a golden baseline. Those signals let us detect subtle shifts in text formatting or link canonicalisation early and adjust instruction templates without reversing the entire model change. The result was a measured migration with clear short-term KPIs and predictable recovery actions.
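A diff rate against a golden baseline can be approximated with the standard library's `difflib`. This is a sketch of one plausible definition (one minus the line-level similarity ratio); the metric actually used in the rollout is not specified in the note.

```python
import difflib


def diff_rate(output: str, golden: str) -> float:
    """Return 0.0 when output matches the golden text line for line, up to 1.0 when no lines match."""
    matcher = difflib.SequenceMatcher(None, golden.splitlines(), output.splitlines())
    return 1.0 - matcher.ratio()


golden_page = "<h1>Pricing</h1>\n<p>Plans start here.</p>"
same = diff_rate(golden_page, golden_page)           # 0.0
different = diff_rate("<h2>Other</h2>", golden_page)  # 1.0
```

Comparing by line rather than by character makes the metric sensitive to the formatting shifts mentioned above while ignoring how long each changed line is.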

We also invested in robust local testing harnesses for builders. A lightweight harness replays previously captured web fetches and golden outputs so that changes to prompt templates or model assignments could be evaluated locally and deterministically. That practice prevented many subtle regressions, because developers could run the same transformation locally against the same set of inputs and observe diffs before rolling to canary or production pools.

Finally, we introduced a small set of structural lint checks that run pre-deploy. These checks validate common mistakes (missing canonical tags, inconsistent timestamps, or absent accessibility attributes) and help catch issues earlier in CI. Early detection reduced the number of rollbacks required during the model migration and improved the overall quality of artifacts produced by builders.
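Two of the lint checks named above (missing canonical tags and absent accessibility attributes) can be sketched with Python's built-in `html.parser`. The rule set, messages, and the `lint_page` name are illustrative assumptions, not the checks that run in CI.

```python
from html.parser import HTMLParser


class _LintParser(HTMLParser):
    """Collects the facts the lint rules need while scanning a page once."""

    def __init__(self):
        super().__init__()
        self.has_canonical = False
        self.images_without_alt = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical" and attributes.get("href"):
            self.has_canonical = True
        # Only a missing alt attribute is flagged; alt="" is valid for decorative images.
        if tag == "img" and "alt" not in attributes:
            self.images_without_alt += 1


def lint_page(html: str) -> list:
    """Return a list of human-readable problems; an empty list means the page passed."""
    parser = _LintParser()
    parser.feed(html)
    parser.close()
    problems = []
    if not parser.has_canonical:
        problems.append("missing canonical tag")
    if parser.images_without_alt:
        problems.append(f"{parser.images_without_alt} image(s) without alt attribute")
    return problems


clean = lint_page(
    '<html><head><link rel="canonical" href="https://example.com/a"></head>'
    '<body><img src="x.png" alt="Logo"></body></html>'
)
broken = lint_page('<html><body><img src="x.png"></body></html>')
```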

Integrator

Integrator responsibilities

The integrator is the workspace-level reviewer: it opens every builder-produced file, reads summaries, verifies cross-file dependencies, runs safety heuristics, and produces final edits or a single approve decision. Prior to the split, the integrator would often encounter a workspace larger than the flagship model's context window, which produced HTTP 400 errors and left the job in a partial state. Moonshot's long-context model eliminates that blind spot by letting the integrator reason with the complete job in one pass.

That audit is more than a proofread; the integrator enforces global invariants. Examples include consistency of page metadata, deduplication of assets, canonicalisation of links, and ensuring no disallowed content slips past local builder checks. With a single, holistic view, the integrator can also compress redundant diagnostics into actionable revision tasks targeted at specific builders, which improves developer productivity when investigating failures.

Because the integrator performs multi-file reasoning, its failure modes are different: they are dominated by context limits and long-running inference passes. We added explicit retry semantics, time-budget accounting, and richer audit trails so that any long run can be reproduced deterministically. The integrator's outputs are intentionally conservative: if an automatic fix could introduce ambiguity, it surfaces a concise revision task rather than applying a broad sweep change. That trade-off keeps automated edits safe and traceable.

Operationally, the integrator also serves as a valuable telemetry sink. By aggregating builder diagnostics and file-level metadata, it produces signals that help SRE and product teams prioritise follow-ups. The long-model integrator improved signal fidelity: previously, partial audits would scatter diagnostics across failed attempts; now, a single audit contains structured evidence that simplifies triage and postmortems.

To make the integrator auditable, we record compact transcript snapshots that include the exact inputs the model saw, the decision points it exercised, and the edits it applied. Those snapshots are retained as part of the job artefacts and are invaluable for later debugging and compliance checks. The design intentionally favors concise, reconstructable artifacts over opaque logging so that engineers and reviewers can answer "what did the integrator see" with high confidence.

In practice, integrator engineers also maintain a small library of deterministic synthesis routines that help normalise common patterns: asset deduplication, canonical link selection, and title canonicalisation. These routines are parameterised and can be adjusted without touching the core audit flow; they provide a pragmatic way to automate repetitive fixes while keeping the integrator's decisions transparent and reversible.
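As an illustration of a deterministic synthesis routine, asset deduplication can be done by hashing file contents and mapping every duplicate to one canonical path. The sketch assumes assets are available as a path-to-bytes mapping and picks the first path in sorted order as canonical; both choices are assumptions.

```python
import hashlib


def dedupe_assets(assets: dict) -> dict:
    """Map each asset path to the canonical path holding identical bytes.

    Paths are visited in sorted order, so the result does not depend on
    the order in which builders happened to upload files.
    """
    canonical_by_digest = {}
    mapping = {}
    for path in sorted(assets):
        digest = hashlib.sha256(assets[path]).hexdigest()
        mapping[path] = canonical_by_digest.setdefault(digest, path)
    return mapping


mapping = dedupe_assets({
    "b/logo.png": b"same-bytes",
    "a/logo.png": b"same-bytes",
    "hero.png": b"other-bytes",
})
```

Because the routine returns a mapping instead of deleting files, the integrator can record the proposed rewrite in its audit trail and apply or revert it as a separate step.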

Fallback

Auto-fallback flow

If a call fails with a context-limit error, the runner's policy is to retry the integrator phase on the configured long-model fallback (the environment variable LONG_MODEL is now actively used). Instead of allowing the job to finish in a partial state, the retry routes the integrator to moonshot-v1-128k, which has the token budget to complete the workspace audit. This conversion of a hard failure into an automated, deterministic retry is what eliminated the silent partial jobs we saw in the field.

The fallback path is carefully instrumented. It triggers only on clear error classes, records a labelled retry event in the job audit, and exposes a metric so SRE teams can alarm on unusual retry rates. Because the retry is idempotent, it doesn't double-publish or create conflicting edits; the runner applies careful state checks before any mutation. When necessary, the integrator can fall back to a segmented audit (slicing the workspace into ordered chunks) and then synthesise a final aggregate review — but the primary strategy is the single long-model pass that resolves the root cause of context exhaustion.
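The retry path can be sketched as a thin wrapper around the model call. Everything here is hypothetical except the `LONG_MODEL` variable and the model names: `ContextLimitError` stands in for however the runner classifies the HTTP 400 context-limit response, and `call_model` is a placeholder for the real client.

```python
import os


class ContextLimitError(Exception):
    """Hypothetical error class for 'input tokens exceed the context limit'."""


def run_with_fallback(call_model, prompt, primary_model, audit_log, long_model=None):
    """Call primary_model; on a context-limit error, retry once on the long-context model."""
    if long_model is None:
        long_model = os.environ.get("LONG_MODEL", "moonshot-v1-128k")
    try:
        return call_model(primary_model, prompt)
    except ContextLimitError as exc:
        if primary_model == long_model:
            raise  # already on the long model; nothing larger to fall back to
        # Record a labelled retry event so the fallback is visible in the job audit.
        audit_log.append({
            "event": "context_fallback",
            "from": primary_model,
            "to": long_model,
            "reason": str(exc),
        })
        return call_model(long_model, prompt)
```

Only the one named error class triggers the retry, and the wrapper retries at most once, which keeps the fallback conditional and observable as described above.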

This automated safety net reduced manual incident work and produced clearer postmortems. Instead of engineers chasing partial outputs, the system self-healed through a deterministic retry path. The audit trail created by the fallback also improved postmortem clarity: each partial-turned-complete job contains a clear record of the retry and the rationale, making root-cause analysis simpler and faster.

The fallback is not a panacea — it is a controlled safety valve. We limited its reach by making the retry conditional, observable, and reversible. When the fallback ran frequently for a particular class of jobs, that was treated as an operational signal to investigate upstream issues rather than a reason to keep retrying indefinitely. In that way, fallback preserved availability while also acting as a diagnostic lens on systemic problems.

Knobs

Environment knobs

The rollout is controlled by a small set of environment variables so operators can adjust model assignments without touching code. The most important knobs are shown below and are preserved verbatim from the original engineering note.

export LONG_MODEL="moonshot-v1-128k"
export BUILDER_MODEL="gpt-5-mini"
# other env vars shown in the original doc preserved verbatim
# e.g. ARCHITECT_MODEL=gpt-5.2, QA_MODEL=gpt-5-mini, RETRY_POLICY=auto

During rollout, these knobs let engineers stage changes, run experiments, and quickly recover. For example, flipping a builder pool back to the flagship model for a small subset of jobs gave us a rapid diagnostic path when investigating regressions. The knobs also support controlled experiments: by changing a single variable we can run direct comparisons and produce reliable telemetry for decision-making.
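On the runner side, reading these knobs might look like the sketch below. The variable names and values are the ones listed in the note; the `load_knobs` helper and the idea of built-in defaults are assumptions.

```python
import os

# Names and values as listed in the note; treating them as defaults is an assumption.
DEFAULT_KNOBS = {
    "ARCHITECT_MODEL": "gpt-5.2",
    "BUILDER_MODEL": "gpt-5-mini",
    "QA_MODEL": "gpt-5-mini",
    "LONG_MODEL": "moonshot-v1-128k",
    "RETRY_POLICY": "auto",
}


def load_knobs(environ=None) -> dict:
    """Resolve every knob from the environment, falling back to the defaults."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULT_KNOBS.items()}


# Example: flip builders back to the flagship for a diagnostic run.
diagnostic_knobs = load_knobs({"BUILDER_MODEL": "gpt-5.2"})
```

Passing the environment in as a parameter makes the loader easy to exercise in tests without mutating the process environment.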

We preserved the original environment keys and semantics so that existing operational scripts and dashboards continued to work unchanged. That deliberate compatibility reduced migration work and made the change largely invisible to downstream tooling, which focused the engineering effort on model assignment rather than scripts and dashboard rewrites.

Finally, the knobs are part of our safety-in-depth approach. They provide a fast escape hatch and an experimental surface for capacity work: if an integration reveals an unexpected pattern, operators can experiment by toggling a knob and immediately observe behaviour differences without code deploys. That agility proved important for controlled rollouts and accelerated troubleshooting.

As the final step in operationalising the knobs we added a small policy engine that validates knob combinations against allowed patterns. This prevents accidental configurations that would route all builders to flagship, or otherwise create unexpectedly high spend. The policy engine is intentionally simple and auditable, implemented as a small set of rules that run during deployment and in the runner's startup checks.
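A rule-based check of that kind could be as small as the sketch below. The two rules are illustrative guesses at what "allowed patterns" might mean (no global builder assignment to the flagship, and a long model configured whenever automatic retry is on); the real rule set is not given in the note.

```python
FLAGSHIP_MODEL = "gpt-5.2"


def builders_off_flagship(knobs: dict):
    """Hypothetical rule: the global builder model must not be the flagship."""
    if knobs.get("BUILDER_MODEL") == FLAGSHIP_MODEL:
        return "BUILDER_MODEL routes every builder to the flagship model"
    return None


def fallback_configured(knobs: dict):
    """Hypothetical rule: automatic retry needs a long-context model to retry on."""
    if knobs.get("RETRY_POLICY") == "auto" and not knobs.get("LONG_MODEL"):
        return "RETRY_POLICY=auto requires LONG_MODEL to be set"
    return None


RULES = (builders_off_flagship, fallback_configured)


def validate_knobs(knobs: dict) -> list:
    """Run every rule and return the violation messages; an empty list means allowed."""
    violations = []
    for rule in RULES:
        message = rule(knobs)
        if message is not None:
            violations.append(message)
    return violations
```

Keeping each rule a plain function that returns a message or `None` makes the set easy to audit and to extend one rule at a time.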
