EduQueue Deep Dive: The Asynq Worker Pool, ChromeDP PDF Pipeline & Live-Reload Settings

Companion post to the UIU EduQueue platform overview. This is the long-form version of "how the worker pool actually works."

The HTTP layer of EduQueue is small on purpose. Its job is to validate, enqueue, and return a 202. The interesting code lives downstream of that, in the Asynq worker pool that drives every PDF render, every email send, every analytics event, and every cron-style observability tick.

This post walks through that pool — sizing, idempotency, the dual-renderer PDF pipeline, the presigned-URL distribution layer, the gocron observability stack, and the Redis pub/sub SettingsService that lets me change worker behavior in production without a redeploy.

The pool, sized

Worker init lives in server/worker/main.go:86-94:

Setting	Value	Reasoning
Global concurrency	`10`	Bound on Postgres + Redis connection use. The DB pool is sized to `max_open=20`; 10 workers leaves headroom for the HTTP layer.
PDF semaphore	`2`	ChromeDP allocates ~150–250MB resident per render, so two concurrent renders need roughly 300–500MB. That is about the most a 512MB container can hold alongside the rest of the worker process; a third render would not fit.
Bulk email concurrency	`5`	Sits below the 10/min Azure cap with headroom for retries inside the same window.
Email queue tick	`30s`	Long enough to amortize the rate-limit math (one tick = one window check); short enough to feel live to the operator.
Worker monitor HTTP	`:8775`	Out-of-band so health endpoints don't share a goroutine pool with task work.

The PDF semaphore is the tightest hard limit I pin below the global concurrency, and it exists for a different reason than the email cap: ChromeDP is the only resource where the cost of over-subscription is a Linux OOM-kill rather than a backed-up queue. Asymmetric cost ⇒ asymmetric guard.
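A minimal sketch of that guard, assuming it is built as a buffered-channel semaphore (the source does not show the actual mechanism). The names `pdfSlots`, `withPDFSlot`, and `peakConcurrency` are hypothetical, and a short sleep stands in for the ChromeDP render.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// pdfSlots is a counting semaphore: a buffered channel with one slot per
// allowed concurrent render. The capacity of 2 mirrors the table above.
var pdfSlots = make(chan struct{}, 2)

// withPDFSlot blocks until a render slot is free (or ctx is cancelled),
// runs render, and always releases the slot afterwards.
func withPDFSlot(ctx context.Context, render func() error) error {
	select {
	case pdfSlots <- struct{}{}:
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-pdfSlots }()
	return render()
}

// peakConcurrency launches n fake renders and reports the highest number
// that were ever running at the same time.
func peakConcurrency(n int) int32 {
	var running, peak int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = withPDFSlot(context.Background(), func() error {
				cur := atomic.AddInt32(&running, 1)
				for { // record a new peak if this render raised it
					old := atomic.LoadInt32(&peak)
					if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
						break
					}
				}
				time.Sleep(10 * time.Millisecond) // stand-in for a ChromeDP render
				atomic.AddInt32(&running, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	// Ten tasks arrive at once, but the peak never exceeds the slot count.
	fmt.Println("peak concurrent renders:", peakConcurrency(10))
}
```

Tasks beyond the second simply wait on the channel, so over-subscription shows up as queue latency instead of memory pressure.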

Registered tasks and their idempotency keys

Task type	Handler	Retries	Timeout	Idempotency key
`pdf:generate`	`HandlePDFTask`	2	50s	`SubscriptionEmail(StudentID, ExamName, Type)`
`email:bulk`	`HandleBulkEmailTask`	2	—	De-duped by `EMAIL_COOLDOWN_HOURS`
`campaign:custom-send`	`HandleCustomCampaignSend`	2	—	`EmailCampaign.ID` + recipient set
`monitor:health-check`	gocron @ 5min	—	—	—
`monitor:daily-report`	gocron @ 00:01 UTC	—	—	—
`monitor:weekly-report`	gocron @ Sun 00:05	—	—	—

The single most important property of every task: idempotent by construction. Re-running the same pdf:generate either no-ops (the cooldown row already exists) or produces a byte-identical PDF in R2 under the same routines/{studentID}/{filename}-{uniqueID}.pdf key. This is what lets me docker compose down/up mid-batch and not lose anyone's routine.
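One way to picture "idempotent by construction" is a deterministic key plus a run-once check. The sketch below is an assumption about the shape, not the platform's code: `taskKey`, `seen`, and `runOnce` are hypothetical names, the student ID is made up, and an in-memory map stands in for the cooldown row in Postgres.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// taskKey derives a deterministic key from the fields the table lists for
// pdf:generate. The same inputs always give the same key, so a re-enqueue
// can be detected and skipped.
func taskKey(studentID, examName, kind string) string {
	// The unit separator (0x1f) keeps ("ab","c") distinct from ("a","bc").
	raw := strings.Join([]string{studentID, examName, kind}, "\x1f")
	sum := sha256.Sum256([]byte(raw))
	return "pdf:generate:" + hex.EncodeToString(sum[:8])
}

// seen is an in-memory stand-in for the cooldown row.
type seen map[string]bool

// runOnce executes work only the first time a key is presented and reports
// whether it ran.
func (s seen) runOnce(key string, work func()) bool {
	if s[key] {
		return false // already processed: no-op
	}
	work()
	s[key] = true
	return true
}

func main() {
	done := seen{}
	key := taskKey("011221001", "Midterm Spring", "routine")
	renders := 0
	done.runOnce(key, func() { renders++ })
	done.runOnce(key, func() { renders++ }) // the same task replayed after a restart
	fmt.Println("renders:", renders)        // prints "renders: 1"
}
```

In a multi-process deployment the check-and-mark step would need to be atomic in the shared store (for example a unique constraint), which the map here does not model.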

There is no dead-letter queue in the traditional sense. Once a task exhausts MaxRetry(2), Asynq moves it to its own archived state, where it is surfaced on the operator dashboard's "stuck queue" panel. Manual replay is one click.

The PDF pipeline

PDFs are generated by the worker, never by the request handler. The HTTP layer's role is to validate, enqueue a pdf:generate task, and return a 202 with a queued message.

Two renderers, on purpose

Renderer	When it's chosen	p50 latency	Knobs
ChromeDP (headless Chromium)	Layouts that need real CSS, web fonts, JS	~3.5s	A4 viewport, 0.3/0.2/0.2/0.2-inch margins, 5s post-nav settle, 50s ctx timeout, `--no-sandbox --disable-dev-shm-usage --disable-gpu --headless`
Maroto (native Go PDF)	Tabular reports, simple invoices	~300ms	No Chromium dependency; ~10× faster

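The ChromeDP path lives in helper/pdf_html_v2.go:20-100. The Chrome flags matter inside Docker: --no-sandbox is required because Chrome's sandbox needs privileges that the unprivileged worker container does not have; --disable-dev-shm-usage makes Chrome write its shared memory files to /tmp instead of /dev/shm, which Docker caps at 64MB by default and which Chrome can exhaust and crash on under load.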

The 5-second post-navigation sleep is a settle window for web fonts and CSS animations. It is empirically calibrated, which is to say: I tried 1s and got broken layouts, tried 2s and got intermittently broken layouts, settled on 5s after a week of test runs.

R2 distribution, not S3 storage

PDFs land on Cloudflare R2 at routines/{studentID}/{filename}-{uniqueID}.pdf. The metadata is intentional:

Cache-Control: public, max-age=604800

A week of edge caching means a re-fetch of the same routine is served from the Cloudflare POP nearest the user, so the originating R2 bucket sees only a small fraction of the reads. R2 egress is free even on a cache miss.

Distribution to the user is via presigned URLs, never raw bucket reads. The database stores the R2 object key, never the public URL. Each user-facing link is signed at request time:

req, err := presignClient.PresignGetObject(ctx, input,
    func(opts *s3.PresignOptions) {
        opts.Expires = constant.S3PDFExpiry // default 24h
    })
if err != nil {
    return "", err
}
url := req.URL // presignClient comes from s3.NewPresignClient

This shape gives me four properties at once:

Time-bounded leakage. A leaked URL stops working in 24 hours.
No bucket exposure. The R2 bucket is private; only signed URLs work.
Per-request audit. I can grep logs for who minted which URL.
Cheap rotation. Rotating R2 credentials invalidates every signed URL in flight without a database change.

PIN-gated access

A presigned URL is necessary but not sufficient. The user must also prove ownership with a PIN:

Field	Purpose
`PIN`	6-digit, server-generated, required to mint a presigned URL
`AccessCount`	Incremented per fetch, capped by `MAX_ROUTINE_ACCESS_COUNT` (default 5)
`ExpiresAt`	Hard cutoff (`ROUTINE_ACCESS_WINDOW_HOUR`, default 168h)
`EduQueueBlockedStudent`	SQL-side blocklist consulted in the access query

PIN, view counter, and expiry are three independent guards. Any one of them can fail open — say, a database trigger that doesn't fire — and the other two still hold. That is the platform-engineering instinct: defense in depth where the cost of a leak is asymmetric.
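The three guards can be sketched as independent checks that must all pass before a presigned URL is minted. This is a hedged illustration: the `RoutineAccess` struct, the error values, and `authorize` are hypothetical, the PIN is made up, and the blocklist lookup (which the table places in SQL) is left out.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
	"time"
)

// RoutineAccess mirrors the fields in the table above; the struct itself is
// hypothetical.
type RoutineAccess struct {
	PIN         string
	AccessCount int
	ExpiresAt   time.Time
}

var (
	ErrBadPIN  = errors.New("pin mismatch")
	ErrExpired = errors.New("access window closed")
	ErrTooMany = errors.New("access count exhausted")
)

// authorize applies the three guards one after another. Only when all of
// them pass does it bump the counter; the caller would then mint the URL.
func authorize(a *RoutineAccess, pin string, maxAccess int, now time.Time) error {
	// Constant-time comparison avoids leaking PIN digits through timing.
	if subtle.ConstantTimeCompare([]byte(a.PIN), []byte(pin)) != 1 {
		return ErrBadPIN
	}
	if !now.Before(a.ExpiresAt) {
		return ErrExpired
	}
	if a.AccessCount >= maxAccess {
		return ErrTooMany
	}
	a.AccessCount++
	return nil
}

func main() {
	now := time.Now()
	a := &RoutineAccess{PIN: "482913", ExpiresAt: now.Add(168 * time.Hour)}
	for i := 1; i <= 6; i++ {
		// With a cap of 5, the sixth fetch is refused.
		fmt.Println(i, authorize(a, "482913", 5, now))
	}
}
```

Because each check returns on its own, a bug in one of them leaves the other two untouched, which is the independence the paragraph above relies on.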

Cache-driven products on top of the same pool

The same worker pool that renders PDFs also keeps the currency module's cache layers warm. The currency converter at /currency is a public read-heavy product that never goes to upstream Open Exchange Rates on a hot path — the worker hydrates Redis on a schedule, and a Postgres ExchangeRateData table provides a warm fallback when both cache and upstream miss.

Figure: EduQueue currency converter page showing a USD-to-BDT result of 123.16, the quick-convert grid, currency comparison bars, batch converter, historical trend chart, popular currency pairs, and the live exchange-rate table.

The currency converter is one of the products driven by this worker pool. Three cache layers (Redis hot → Postgres warm → upstream API cold) mean the page always renders something, even when upstream is down.

The pattern matters more than the product: a worker pool that handles long-running tasks is also the right place to keep latency-sensitive caches warm. Same goroutine pool, same Postgres connection budget, same observability surface. No second daemon.
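The tiered read path can be sketched as an ordered walk over sources. This is an illustration under assumptions: `RateSource`, `mapSource`, and `lookup` are hypothetical names, in-memory maps stand in for Redis, Postgres, and the upstream API, and the tier order follows the hot, warm, cold sequence in the figure caption.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrMiss = errors.New("rate not found")

// RateSource is a hypothetical interface; each tier would get its own
// implementation.
type RateSource interface {
	Name() string
	Rate(pair string) (float64, error)
}

// mapSource is an in-memory stand-in used here for all three tiers.
type mapSource struct {
	name  string
	rates map[string]float64
}

func (m mapSource) Name() string { return m.name }

func (m mapSource) Rate(pair string) (float64, error) {
	if r, ok := m.rates[pair]; ok {
		return r, nil
	}
	return 0, ErrMiss
}

// lookup walks the tiers in order and returns the first hit together with
// the name of the tier that served it.
func lookup(pair string, tiers ...RateSource) (float64, string, error) {
	for _, t := range tiers {
		if r, err := t.Rate(pair); err == nil {
			return r, t.Name(), nil
		}
	}
	return 0, "", ErrMiss
}

func main() {
	redis := mapSource{"redis", map[string]float64{}} // cache is cold
	postgres := mapSource{"postgres", map[string]float64{"USD/BDT": 123.16}}
	upstream := mapSource{"upstream", map[string]float64{}} // API is down
	rate, tier, err := lookup("USD/BDT", redis, postgres, upstream)
	fmt.Println(rate, tier, err) // prints "123.16 postgres <nil>"
}
```

The worker's scheduled hydration would write into the first two tiers, so the request path rarely gets past the first one.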

The SettingsService — live reload via Redis pub/sub

Every value that I might want to change in production at 2am without a redeploy lives in EduQueueSettings and is fronted by SettingsService (worker/settings_service.go:24-152).

The service:

Loads the table into an in-memory map[string]string on init.
Exposes GetBool, GetInt, GetDuration helpers with env-default fallbacks.
Subscribes to the Redis channel eduqueue:settings:update.
Updates its in-memory cache on every published key:value message.

The HTTP layer's PUT /api/eduqueue/admin/settings writes to Postgres and then publishes:

rdb.Publish(ctx, "eduqueue:settings:update",
    fmt.Sprintf("%s:%s", key, value))

…and within the next moment, every worker process honors the new value. No restart. No deploy. No race.
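The receiving side of that message can be sketched as follows. It is a trimmed-down, hypothetical stand-in for SettingsService (the `Settings` type and its methods are illustrative names), assuming the payload is exactly the `key:value` string published above. Splitting on the first colon only matters, because a value may itself contain a colon.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// Settings is a hypothetical, simplified settings cache.
type Settings struct {
	mu   sync.RWMutex
	vals map[string]string
}

func NewSettings() *Settings { return &Settings{vals: map[string]string{}} }

// Apply handles one pub/sub payload of the form "key:value". It splits on
// the first colon only, so values such as "aws,azure,http" or "22:00" survive.
func (s *Settings) Apply(msg string) error {
	key, value, ok := strings.Cut(msg, ":")
	if !ok || key == "" {
		return fmt.Errorf("malformed settings message %q", msg)
	}
	s.mu.Lock()
	s.vals[key] = value
	s.mu.Unlock()
	return nil
}

// get returns the raw string for key and whether it was present.
func (s *Settings) get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.vals[key]
	return v, ok
}

// GetInt returns the stored value, or def when the key is missing or does
// not parse as an integer.
func (s *Settings) GetInt(key string, def int) int {
	if raw, ok := s.get(key); ok {
		if n, err := strconv.Atoi(raw); err == nil {
			return n
		}
	}
	return def
}

// GetBool follows the same fallback rule as GetInt.
func (s *Settings) GetBool(key string, def bool) bool {
	if raw, ok := s.get(key); ok {
		if b, err := strconv.ParseBool(raw); err == nil {
			return b
		}
	}
	return def
}

func main() {
	s := NewSettings()
	fmt.Println(s.GetInt("email_rate_limit_per_minute", 30)) // prints 30 (default)
	_ = s.Apply("email_rate_limit_per_minute:12")
	fmt.Println(s.GetInt("email_rate_limit_per_minute", 30)) // prints 12
}
```

In the worker, `Apply` would be called from the goroutine that reads the Redis subscription. Redis pub/sub does not replay messages to a subscriber that was disconnected, so reloading the table on reconnect is a sensible complement to this path.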

Things wired through SettingsService:

Key	Default	Effect
`email_processor_enabled`	`true`	Pauses the email tick on the next 30s boundary
`email_rate_limit_per_minute`	`30`	Caps total send-rate independent of provider limits
`tracking_enabled`	`true`	Toggles open/click pixel injection (privacy override)
`provider_chain_order`	`aws,azure,http`	Live re-orders the failover chain

The provider_chain_order toggle is the one I love most. When AWS SES throttles me during exam week, I can demote it to tertiary from my phone, in three taps, while Postgres is still under load.
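A sketch of how a comma-separated order string might drive failover. The `Sender` type, `parseChain`, and `sendWithFailover` are hypothetical, and stub functions stand in for the real providers; only the `aws,azure,http` format comes from the table.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Sender is a hypothetical per-provider send function.
type Sender func(to, body string) error

var errThrottled = errors.New("throttled")

// parseChain turns "aws,azure,http" into an ordered list, dropping blanks
// and names that have no registered sender.
func parseChain(order string, known map[string]Sender) []string {
	var chain []string
	for _, name := range strings.Split(order, ",") {
		name = strings.TrimSpace(name)
		if _, ok := known[name]; ok {
			chain = append(chain, name)
		}
	}
	return chain
}

// sendWithFailover tries each provider in order and stops at the first
// success. On total failure it returns the last provider's error.
func sendWithFailover(chain []string, known map[string]Sender, to, body string) (string, error) {
	err := errors.New("no providers configured")
	for _, name := range chain {
		if err = known[name](to, body); err == nil {
			return name, nil
		}
		err = fmt.Errorf("%s: %w", name, err)
	}
	return "", err
}

func main() {
	known := map[string]Sender{
		"aws":   func(to, body string) error { return errThrottled },
		"azure": func(to, body string) error { return nil },
		"http":  func(to, body string) error { return nil },
	}
	// Default order, then the order after demoting aws to tertiary.
	for _, order := range []string{"aws,azure,http", "azure,http,aws"} {
		used, err := sendWithFailover(parseChain(order, known), known, "student@example.com", "hi")
		fmt.Println(order, "->", used, err)
	}
}
```

Both orders end up sending through azure here, but the demoted order gets there without first spending an attempt on the throttled provider.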

Observability — Discord as a stack

EduQueue does not run Datadog. EduQueue does not run Grafana. The observability stack is a gocron scheduler in the worker (worker/monitoring.go:22-47) emitting structured messages to a Discord webhook.

Cadence	What it emits	Discord behavior
Every 5 min	Health check vs CPU 80% / Mem 85% / Disk 90% / API 2s	Silent if green; @here on breach
Daily 00:01 UTC	24h roll-up: subscribers, sends, opens, clicks, top campaigns	Posted to #eduqueue-reports
Weekly Sun 00:05	Long-form report	Chunked at 2000-char Discord limit (`monitoring.go:1028-1102`)
On-demand	`POST /api/v1/monitor/trigger` with `{"job_name": "..."}`	Manual report fire

This is intentional. Discord is a notification surface I check anyway. The chunking, the threshold logic, and the on-demand trigger together cover 95% of what I would want from a paid observability tool, at zero monthly cost. When the platform earns paid observability, the upgrade path is a Vector or OpenTelemetry collector tailing the same webhook payload format.
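The threshold logic and the chunking can both be sketched in a few lines. This is an assumption-laden illustration, not the contents of monitoring.go: `Sample`, `breaches`, `alert`, and `chunk` are hypothetical names, thresholds are treated as strict greater-than, and the 2000-character limit is counted in runes.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Sample is a hypothetical snapshot of the four signals the health check reads.
type Sample struct {
	CPUPercent, MemPercent, DiskPercent float64
	APILatency                          time.Duration
}

// breaches compares a sample against the thresholds in the table and returns
// one line per violated limit.
func breaches(s Sample) []string {
	var out []string
	if s.CPUPercent > 80 {
		out = append(out, fmt.Sprintf("CPU %.0f%% > 80%%", s.CPUPercent))
	}
	if s.MemPercent > 85 {
		out = append(out, fmt.Sprintf("Mem %.0f%% > 85%%", s.MemPercent))
	}
	if s.DiskPercent > 90 {
		out = append(out, fmt.Sprintf("Disk %.0f%% > 90%%", s.DiskPercent))
	}
	if s.APILatency > 2*time.Second {
		out = append(out, fmt.Sprintf("API %s > 2s", s.APILatency))
	}
	return out
}

// alert returns "" when everything is green (stay silent) and an @here
// message when at least one threshold is breached.
func alert(s Sample) string {
	b := breaches(s)
	if len(b) == 0 {
		return ""
	}
	return "@here " + strings.Join(b, "; ")
}

// chunk splits a report into pieces of at most limit runes, preferring to
// break after the last newline inside each window so lines stay intact.
func chunk(report string, limit int) []string {
	if limit <= 0 {
		return nil
	}
	var parts []string
	runes := []rune(report)
	for len(runes) > limit {
		cut := limit
		for i := limit; i > 0; i-- {
			if runes[i-1] == '\n' {
				cut = i
				break
			}
		}
		parts = append(parts, string(runes[:cut]))
		runes = runes[cut:]
	}
	if len(runes) > 0 {
		parts = append(parts, string(runes))
	}
	return parts
}

func main() {
	fmt.Printf("%q\n", alert(Sample{CPUPercent: 42}))
	fmt.Println(alert(Sample{CPUPercent: 91, APILatency: 3 * time.Second}))
	report := strings.Repeat("line of report text\n", 250) // 5000 characters
	for i, part := range chunk(report, 2000) {
		fmt.Println("chunk", i+1, "runes:", len([]rune(part)))
	}
}
```

Each chunk would be posted as its own webhook message, in order.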

The metrics endpoint (GET /api/v1/monitor/metrics) returns the JSON the Discord posts are built from — GoRoutines, HeapAlloc, DBConnections, plus the per-task Asynq counters. Anyone with the session token can pull it; the payload is small enough to render inline in the operator dashboard.

What I'd change

Three things this subsystem will want before the next scale milestone:

Per-queue isolation. Right now all task types share one Asynq queue. Pulling pdf:generate into its own queue with its own concurrency knob would prevent a slow ChromeDP run from blocking an email retry.
Structured task tracing. Every task gets a UUID; carrying it through to the Discord report and the analytics layer would let me reconstruct any single user's journey in one query.
Worker autoscaling. Right now I run a fixed 1-replica worker. A simple "add a replica when queue depth > N" loop in the deploy workflow would give me free elasticity without Kubernetes.

None of these are urgent. All are queued.

Part of the UIU EduQueue platform case study.