Context
DocPipe v1 hung under sustained load. Each render spawned a short-lived Chromium child via chromedp; orphaned grandchildren accumulated until fork() failed. The hang appeared only under production-scale traffic and was invisible in low-volume soak testing.
Decision
Run one long-lived Chromium process per service instance. Per-request tabs derive from a shared parent context. A semaphore caps concurrent renders. A supervisor goroutine probes Chromium every 30 seconds and recycles it on failure or after a configurable render count. Add tini as PID 1 so orphaned grandchildren are reaped.
A soak test gates every release: 1000 sequential renders plus 50-wide parallel waves for 10 minutes, asserting zero zombie processes, RSS within 100 MB of baseline, and analytics totals that match what was sent.
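The three assertions could be checked along these lines. This is a sketch under stated assumptions: `SoakResult`, its fields, and `Check` are hypothetical names; "within 100 MB" is read here as growth of at most 100 MiB above baseline; and zombie detection assumes Linux, where the process state follows the parenthesised command name in `/proc/<pid>/stat` and `Z` marks a zombie.

```go
package main

import (
	"fmt"
	"strings"
)

// isZombie reports whether a /proc/<pid>/stat line describes a zombie.
// The line looks like "pid (comm) state ..."; comm may itself contain
// spaces or parentheses, so the state is located after the last ')'.
func isZombie(statLine string) bool {
	i := strings.LastIndexByte(statLine, ')')
	if i < 0 {
		return false
	}
	fields := strings.Fields(statLine[i+1:])
	return len(fields) > 0 && fields[0] == "Z"
}

// SoakResult holds the measurements taken at the end of a soak run.
// The type and field names are hypothetical.
type SoakResult struct {
	Zombies     int   // zombie processes observed
	BaselineRSS int64 // bytes, measured before the run
	FinalRSS    int64 // bytes, measured after the run
	Sent        int   // analytics events sent
	Recorded    int   // analytics events recorded
}

// maxRSSGrowth is the allowed growth over baseline (assumed to mean 100 MiB).
const maxRSSGrowth = 100 << 20

// Check returns one message per failed assertion; empty means the gate passes.
func (r SoakResult) Check() []string {
	var failures []string
	if r.Zombies != 0 {
		failures = append(failures, fmt.Sprintf("%d zombie processes", r.Zombies))
	}
	if growth := r.FinalRSS - r.BaselineRSS; growth > maxRSSGrowth {
		failures = append(failures, fmt.Sprintf("RSS grew by %d bytes", growth))
	}
	if r.Sent != r.Recorded {
		failures = append(failures,
			fmt.Sprintf("analytics mismatch: sent %d, recorded %d", r.Sent, r.Recorded))
	}
	return failures
}

func main() {
	fmt.Println("zombie:", isZombie("4242 (chrome) Z 1 4242 4242 0 -1"))

	r := SoakResult{BaselineRSS: 400 << 20, FinalRSS: 450 << 20, Sent: 1000, Recorded: 1000}
	fmt.Println("failures:", r.Check())
}
```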
Consequences
Positive:
- The class of hang that killed v1 is structurally impossible: each service instance runs exactly one long-lived Chromium instance, and the supervisor owns its lifecycle.
- Memory and FD usage become bounded and predictable; one fewer ops surprise.
- The soak test is a real regression gate, not a smoke check — production-scale load runs before every release.
Negative:
- Concurrency is hard-capped by the semaphore. A noisy caller can hold slots up to the timeout and starve others. A weighted-fair queue over the semaphore is the right next step but is deliberately out of v2 scope.
- A Chromium crash causes a brief blip while the supervisor restarts it. This is acceptable for the current SLO but may not be under tighter budgets later.
- Restart cost scales with peak-time tab count. Coarse but survivable.