ADR-0001 — DocPipe — one long-lived Chromium pool over per-request browsers

Context

DocPipe v1 hung under sustained load. Each render spawned a short-lived Chromium child via chromedp; orphaned grandchildren accumulated until fork() failed. The hang appeared only under production-scale traffic and was invisible in low-volume soak runs.

Decision

Run one long-lived Chromium process per service instance. Per-request tabs derive from a shared parent context. A semaphore caps concurrent renders. A supervisor goroutine probes Chromium every 30 seconds and recycles it on failure or after a configurable render count. Add tini as PID 1 so orphaned grandchildren are reaped.

A soak test gates every release: 1000 sequential renders plus 50-wide parallel waves for 10 minutes. It asserts zero zombie processes, RSS within 100 MB of baseline, and analytics totals that match what was sent.

Consequences

Positive:

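The class of hang that killed v1 is structurally impossible: renders no longer spawn browsers, each service instance runs exactly one long-lived Chromium, and the supervisor owns its lifecycle.
Memory and FD usage become bounded and predictable: the semaphore caps open tabs, and the render-count recycle limits how long any leak can grow.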
The soak test is a real regression gate, not a smoke check — production-scale load runs before every release.

Negative:

Concurrency is hard-capped by the semaphore. A noisy caller can hold slots up to the timeout and starve others. A weighted-fair queue over the semaphore is the right next step but is deliberately out of v2 scope.
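A Chromium crash causes a brief blip while the supervisor restarts it. This is acceptable for the current SLO but may not be for tighter budgets later.
Restart cost scales with the number of tabs open at the time: every in-flight render is lost with the process, so a restart at peak hurts most. Coarse but survivable.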