Postmortem: DocPipe v1 hung under load

Status: Resolved · DocPipe v2 (2026-05) supersedes v1. Severity: Sev-2 — service unavailable until manual container restart. Format: Blameless postmortem, modelled on Google SRE Chapter 15.

Summary

DocPipe v1 — a single-binary HTML-to-PDF service — hung intermittently under sustained load. The hang manifested as healthy /healthz responses while the conversion endpoint stopped completing requests. Recovery required a manual container restart. The root cause was a combination of orphaned Chromium grandchildren (no PID-1 reaper) and per-request Chromium spawns that leaked file descriptors over time. DocPipe v2 replaces the spawn model with one long-lived Chromium process plus tini as PID 1; a soak test now gates every release.

Impact

Duration: Recurrent over the last 6 weeks of v1's life. Each episode lasted 5–25 minutes before manual intervention.
Surface: The CASCK job portal — POST /api/html-to-pdf returns hung — which downstream blocked admit-card generation during recruitment windows.
Users affected: Estimated low-thousands per incident window, all routed through the same internal portal.
Data loss: None. Requests timed out cleanly on the client side; no partial PDFs persisted.

Timeline

All times approximate, local (UTC+6).

T-0 (steady state): Service serving 5–20 RPS, p95 800 ms. No alerts.
T+0 — first symptom: A request to /api/html-to-pdf does not complete inside the client's 30 s timeout. Healthz still 200.
T+5 min: Two subsequent requests time out. On-call paged.
T+8 min: docker exec ... ps shows ~80 zombie processes parented to the Go binary. fork() is failing. Logs show resource temporarily unavailable on Chromium spawn.
T+12 min: Container restarted. Service recovers.
Recurrence interval: Variable — once per ~24 h under sustained load; faster under bursts.

Root causes

Three problems compounded:

No PID-1 reaper. The Go binary ran as PID 1 in the container and did not reap orphaned grandchildren. chromedp spawns Chromium as a child; Chromium spawns helper processes (renderer, GPU, utility) as grandchildren. When the parent Chromium exits abnormally, those grandchildren are reparented to PID 1, which never calls wait() — they accumulate as zombies. Eventually fork() returns EAGAIN and new conversions cannot start.
Per-request Chromium spawn. Every conversion launched a fresh chromedp.NewContext(...) rooted at a new exec.Command. Under steady load this multiplied (1) by request rate.
Hang masquerading as healthy. /healthz was a static 200 OK. It did not probe the browser, so the service reported healthy while the hot path was deadlocked behind a failing fork().

Trigger

The combination existed from day one. It only manifested past a request-rate threshold (~8 RPS sustained for ~20 minutes) where zombie accumulation outpaced the rare cases of Chromium clean shutdown.

Resolution (DocPipe v2)

Replaced per-request spawn with one long-lived Chromium process plus a supervisor goroutine that probes the browser every 30 s and recycles it on failure or after a configurable render count. Per-request tabs are now derived from a shared parent context. A semaphore caps concurrent tabs so a slow renderer cannot starve others.
Added tini as PID 1. Application binaries make poor PID 1s. tini costs nothing, lives in a couple of KB, and reaps grandchildren correctly.
/readyz now reflects actual browser health. A failing browser produces a non-200 readyz so the orchestrator pulls the pod out of rotation rather than continuing to send requests to a deadlocked instance.
Soak test as a release gate. Every release runs 1000 sequential conversions plus 50-wide parallel waves for 10 minutes, asserting zero zombie processes inside the container, RSS within 100 MB of baseline, and analytics totals matching what was sent. The v1 hang doesn't manifest at low volume, so unit tests alone could not have caught it. The soak proves the fix at production scale before every deploy.

Detection

What we had:

A static health endpoint that always passed.
Client-side timeout alerts (which is how we found out).

What we have now:

/readyz probes the actual browser process.
Analytics expose per-status-code counts + p95/p99 latency. A latency spike or sudden absence of 200s alerts oncall via the existing dashboard.
The soak test runs in CI on every release. A regression of the underlying class of bug fails the build, not production.

Lessons learned

Health checks must probe what matters. A 200 OK static handler is worse than no health check — it masks the failure mode and breaks orchestrator self-healing.
PID 1 is a real concern even in small containers. If a process can become PID 1, decide explicitly who reaps zombies. tini is the right answer for almost every application image.
Spawn-per-request is a smell when the spawned process is heavyweight. Chromium falls in that bucket. A bounded pool gives predictable memory + FD usage and removes a whole class of orchestration bug.
Soak tests beat unit tests for time-and-resource-correlated failures. Run the same workload-shape for long enough at production-ish concurrency, then assert the invariants (zero zombies, bounded RSS, totals match).

Action items

✅ Replace spawn model with long-lived Chromium pool (DocPipe v2).
✅ Add tini as PID 1 to the container image.
✅ /readyz reflects browser health.
✅ Soak test as release gate.
✅ Public stats dashboard exposes p95/p99 and rolling 1 h / 24 h windows.
☐ Weighted-fair queueing over the semaphore — noisy callers can hold slots up to the timeout and starve neighbours. Out of v2 scope; tracked.
☐ T-digest behind the existing Observe(int64) interface, swappable when the fixed-bucket histograms stop being good enough.