Companion post to the UIU EduQueue platform overview. This is the long-form version of "how the worker pool actually works."
The HTTP layer of EduQueue is small on purpose. Its job is to validate, enqueue, and return a 202. The interesting code lives downstream of that, in the Asynq worker pool that drives every PDF render, every email send, every analytics event, and every cron-style observability tick.
This post walks through that pool — sizing, idempotency, the dual-renderer PDF pipeline, the presigned-URL distribution layer, the gocron observability stack, and the Redis pub/sub SettingsService that lets me change worker behavior in production without a redeploy.
The pool, sized
Worker init lives in server/worker/main.go:86-94:
| Setting | Value | Reasoning |
|---|---|---|
| Global concurrency | 10 | Bound on Postgres + Redis connection use. The DB pool is sized to max_open=20; 10 workers leaves headroom for the HTTP layer. |
| PDF semaphore | 2 | ChromeDP allocates ~150–250MB resident per render. Two concurrent renders therefore need roughly 300–500MB, which is as much as a 512MB container can give them alongside the rest of the worker process. |
| Bulk email concurrency | 5 | Sits below the 10/min Azure cap with headroom for retries inside the same window. |
| Email queue tick | 30s | Long enough to amortize the rate-limit math (one tick = one window check); short enough to feel live to the operator. |
| Worker monitor HTTP | :8775 | Out-of-band so health endpoints don't share a goroutine pool with task work. |
The PDF semaphore is the only place I pin a hard internal limit below the global concurrency. ChromeDP is the only resource where the cost of over-subscription is a Linux OOM-kill rather than a backed-up queue. Asymmetric cost ⇒ asymmetric guard.
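A minimal sketch of the guard described above, assuming the semaphore is a buffered channel (the source does not say how it is implemented). The names pdfSlots and peakConcurrency are hypothetical; the simulation only shows that ten workers never overlap more than two renders.

```go
package main

import (
	"fmt"
	"sync"
)

// pdfSlots is a counting semaphore: a buffered channel whose capacity is the
// maximum number of renders allowed to run at once.
type pdfSlots chan struct{}

func newPDFSlots(n int) pdfSlots { return make(pdfSlots, n) }

// run blocks until a slot is free, runs render, then releases the slot.
func (s pdfSlots) run(render func()) {
	s <- struct{}{}        // acquire
	defer func() { <-s }() // release, even if render panics
	render()
}

// peakConcurrency starts `workers` goroutines that all ask to render and
// reports the highest number of renders that were in flight together.
func peakConcurrency(workers, slots int) int {
	sem := newPDFSlots(slots)
	var (
		mu            sync.Mutex
		running, peak int
		wg            sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem.run(func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()

				// A real render would happen here.

				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// 10 workers (global concurrency) sharing 2 render slots.
	fmt.Println("peak concurrent renders:", peakConcurrency(10, 2))
}
```

Workers that cannot get a slot simply wait, so over-subscription shows up as queue latency rather than memory pressure.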
Registered tasks and their idempotency keys
| Task type | Handler | Retries | Timeout | Idempotency key |
|---|---|---|---|---|
| pdf:generate | HandlePDFTask | 2 | 50s | SubscriptionEmail(StudentID, ExamName, Type) |
| email:bulk | HandleBulkEmailTask | 2 | — | De-duped by EMAIL_COOLDOWN_HOURS |
| campaign:custom-send | HandleCustomCampaignSend | 2 | — | EmailCampaign.ID + recipient set |
| monitor:health-check | gocron @ 5min | — | — | — |
| monitor:daily-report | gocron @ 00:01 UTC | — | — | — |
| monitor:weekly-report | gocron @ Sun 00:05 | — | — | — |
The single most important property of every task: idempotent by construction. Re-running the same pdf:generate either no-ops (the cooldown row already exists) or produces a byte-identical PDF in R2 under the same routines/{studentID}/{filename}-{uniqueID}.pdf key. This is what lets me docker compose down/up mid-batch and not lose anyone's routine.
There is no dead-letter queue in the traditional sense — Asynq's MaxRetry(2) parks failed tasks in its own archived state, where they're surfaced on the operator dashboard's "stuck queue" panel. Manual replay is one click.
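The no-op-on-retry behavior can be sketched as follows. This is an illustration under assumptions: the key is derived by hashing the three fields from the table, and an in-memory onceStore stands in for the cooldown row in Postgres. Neither is taken from the EduQueue code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"sync"
)

// idempotencyKey derives a stable key from (StudentID, ExamName, Type).
// The hashing scheme is an assumption made for this sketch.
func idempotencyKey(studentID, examName, kind string) string {
	// NUL separators keep ("ab", "c") and ("a", "bc") from colliding.
	joined := strings.Join([]string{studentID, examName, kind}, "\x00")
	sum := sha256.Sum256([]byte(joined))
	return hex.EncodeToString(sum[:])
}

// onceStore is a hypothetical stand-in for the cooldown row.
type onceStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newOnceStore() *onceStore { return &onceStore{seen: make(map[string]bool)} }

// do runs work only the first time a key is presented and reports whether it
// ran. A failed run is not recorded, so a retry is free to try again.
// The lock is held during work, which serializes callers; fine for a sketch.
func (s *onceStore) do(key string, work func() error) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[key] {
		return false, nil
	}
	if err := work(); err != nil {
		return false, err
	}
	s.seen[key] = true
	return true, nil
}

func main() {
	store := newOnceStore()
	key := idempotencyKey("student-42", "midterm", "routine")
	for attempt := 1; attempt <= 2; attempt++ {
		ran, _ := store.do(key, func() error { return nil })
		fmt.Printf("attempt %d ran=%v\n", attempt, ran)
	}
}
```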
The PDF pipeline
PDFs are generated by the worker, never by the request handler. The HTTP layer's role is to validate, enqueue a pdf:generate task, and return a 202 with a queued message.
Two renderers, on purpose
| Renderer | When it's chosen | p50 latency | Knobs |
|---|---|---|---|
| ChromeDP (headless Chromium) | Layouts that need real CSS, web fonts, JS | ~3.5s | A4 viewport, 0.3/0.2/0.2/0.2-inch margins, 5s post-nav settle, 50s ctx timeout, --no-sandbox --disable-dev-shm-usage --disable-gpu --headless |
| Maroto (native Go PDF) | Tabular reports, simple invoices | ~300ms | No Chromium dependency; ~10× faster |
The ChromeDP path lives in helper/pdf_html_v2.go:20-100. The Docker flags matter: --no-sandbox is required because the worker container lacks the privileges Chromium's sandbox needs to start; --disable-dev-shm-usage makes Chromium write its shared-memory files to /tmp instead of /dev/shm, which Docker caps at 64MB by default and which Chromium can exhaust under load, crashing the render.
The 5-second post-navigation sleep is a settle window for web fonts and CSS animations. It is empirically calibrated, which is to say: I tried 1s and got broken layouts, tried 2s and got intermittently broken layouts, settled on 5s after a week of test runs.
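One way to combine a fixed settle window with an overall timeout is to make the wait context-aware, so a render that has already blown its budget does not sleep the full window. This is a sketch, not the code in helper/pdf_html_v2.go; settle, renderWithBudget and timedOut are hypothetical names, and navigate stands in for the ChromeDP navigation step.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// settle waits for d but returns early if ctx is cancelled or times out.
func settle(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// renderWithBudget runs navigate and the settle window under one deadline.
func renderWithBudget(parent context.Context, budget, settleFor time.Duration,
	navigate func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	if err := navigate(ctx); err != nil {
		return err
	}
	return settle(ctx, settleFor)
}

// timedOut reports whether a no-op navigation plus the settle window
// exceeded the budget. Durations are given in milliseconds.
func timedOut(budgetMs, settleMs int) bool {
	noop := func(context.Context) error { return nil }
	err := renderWithBudget(context.Background(),
		time.Duration(budgetMs)*time.Millisecond,
		time.Duration(settleMs)*time.Millisecond, noop)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("generous budget timed out:", timedOut(2000, 1))
	fmt.Println("tight budget timed out:", timedOut(1, 2000))
}
```

With the values from the table, the call would be a 50s budget and a 5s settle window.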
R2 distribution, not S3 storage
PDFs land on Cloudflare R2 at routines/{studentID}/{filename}-{uniqueID}.pdf. The metadata is intentional:
Cache-Control: public, max-age=604800
A week of edge caching means a re-fetch of the same routine is served from Cloudflare's POP nearest the user. The originating R2 bucket sees one read per million students, and R2's egress is free even when it doesn't.
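The key layout and the header value are both simple string templates. A small sketch, with hypothetical helper names objectKey and cacheControl, shows where 604800 comes from (7 days × 24 hours × 3600 seconds):

```go
package main

import (
	"fmt"
	"time"
)

// oneWeek is the edge-cache lifetime used in the Cache-Control header.
const oneWeek = 7 * 24 * time.Hour

// objectKey builds routines/{studentID}/{filename}-{uniqueID}.pdf.
func objectKey(studentID, filename, uniqueID string) string {
	return fmt.Sprintf("routines/%s/%s-%s.pdf", studentID, filename, uniqueID)
}

// cacheControl renders a public Cache-Control value with max-age in seconds.
func cacheControl(maxAge time.Duration) string {
	return fmt.Sprintf("public, max-age=%d", int(maxAge.Seconds()))
}

func main() {
	fmt.Println(objectKey("2024001", "routine", "a1b2c3")) // routines/2024001/routine-a1b2c3.pdf
	fmt.Println(cacheControl(oneWeek))                     // public, max-age=604800
}
```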
Distribution to the user is via presigned URLs, never raw bucket reads. The database stores the R2 object key, never the public URL. Each user-facing link is signed at request time:
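```go
// presignClient is an *s3.PresignClient built with s3.NewPresignClient.
// PresignGetObject returns a presigned request plus an error; the link is req.URL.
req, err := presignClient.PresignGetObject(ctx, input,
	func(opts *s3.PresignOptions) {
		opts.Expires = constant.S3PDFExpiry // default 24h
	})
```
This shape gives me four properties at once: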
- Time-bounded leakage. A leaked URL stops working in 24 hours.
- No bucket exposure. The R2 bucket is private; only signed URLs work.
- Per-request audit. I can grep logs for who minted which URL.
- Cheap rotation. Rotating R2 credentials invalidates every signed URL in flight without a database change.
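Two of these properties, expiry and invalidation on credential rotation, follow from how signed links work in general. The sketch below is a deliberately simplified HMAC scheme, not AWS Signature V4 and not what R2 does internally; sign and verify are hypothetical helpers used only to make the mechanism visible.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// sign returns a hex HMAC over the object key and its expiry (Unix seconds).
func sign(secret []byte, objectKey string, expires int64) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(objectKey + "\n" + strconv.FormatInt(expires, 10)))
	return hex.EncodeToString(mac.Sum(nil))
}

// verify accepts a link only if it has not expired and the signature matches
// under the secret that is currently in use.
func verify(secret []byte, objectKey string, expires, now int64, sig string) bool {
	if now > expires {
		return false // time-bounded leakage
	}
	want := sign(secret, objectKey, expires)
	return hmac.Equal([]byte(want), []byte(sig))
}

func main() {
	secret := []byte("example-secret")
	key := "routines/42/routine-abc.pdf"
	expires := int64(1_700_086_400)
	sig := sign(secret, key, expires)

	fmt.Println(verify(secret, key, expires, expires-60, sig)) // true: still inside the window
	fmt.Println(verify(secret, key, expires, expires+60, sig)) // false: expired
}
```

Because the signature depends on the secret, swapping the secret makes every outstanding signature fail verification without touching stored object keys.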
PIN-gated access
A presigned URL is necessary but not sufficient. The user must also prove ownership with a PIN:
| Field | Purpose |
|---|---|
| PIN | 6-digit, server-generated, required to mint a presigned URL |
| AccessCount | Incremented per fetch, capped by MAX_ROUTINE_ACCESS_COUNT (default 5) |
| ExpiresAt | Hard cutoff (ROUTINE_ACCESS_WINDOW_HOUR, default 168h) |
| EduQueueBlockedStudent | SQL-side blocklist consulted in the access query |
PIN, view counter, and expiry are three independent guards. Any one of them can fail open — say, a database trigger that doesn't fire — and the other two still hold. That is the platform-engineering instinct: defense in depth where the cost of a leak is asymmetric.
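The independence of the guards is easiest to see when each one is a separate check with its own error. A sketch under assumptions: the real check runs in SQL, the struct below only mirrors the field names from the table, and ExpiresAt is modeled as Unix seconds for brevity.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// routineAccess mirrors the table's fields; the types are assumptions.
type routineAccess struct {
	PIN         string
	AccessCount int
	ExpiresAt   int64 // Unix seconds
}

var (
	errBlocked   = errors.New("student is blocked")
	errBadPIN    = errors.New("wrong PIN")
	errExpired   = errors.New("access window has closed")
	errExhausted = errors.New("access count exhausted")
)

// authorize applies each guard independently and counts the fetch only when
// all of them pass.
func authorize(a *routineAccess, pin string, now int64, maxCount int, blocked bool) error {
	switch {
	case blocked:
		return errBlocked
	case subtle.ConstantTimeCompare([]byte(a.PIN), []byte(pin)) != 1:
		return errBadPIN
	case now >= a.ExpiresAt:
		return errExpired
	case a.AccessCount >= maxCount:
		return errExhausted
	}
	a.AccessCount++
	return nil
}

func main() {
	a := &routineAccess{PIN: "123456", ExpiresAt: 1000}
	for i := 1; i <= 6; i++ {
		fmt.Printf("fetch %d: %v\n", i, authorize(a, "123456", 10, 5, false))
	}
}
```

Only after authorize returns nil would a presigned URL be minted.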
Cache-driven products on top of the same pool
The same worker pool that renders PDFs also keeps the currency module's cache layers warm. The currency converter at /currency is a public read-heavy product that never goes to upstream Open Exchange Rates on a hot path — the worker hydrates Redis on a schedule, and a Postgres ExchangeRateData table provides a warm fallback when both cache and upstream miss.

The pattern matters more than the product: a worker pool that handles long-running tasks is also the right place to keep latency-sensitive caches warm. Same goroutine pool, same Postgres connection budget, same observability surface. No second daemon.
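On the read path, the layered lookup reduces to "first layer that has the value wins". A sketch with in-memory maps standing in for Redis and the ExchangeRateData table; rateLayer, fromMap and lookupRate are hypothetical names, and the rates are placeholder numbers.

```go
package main

import "fmt"

// rateLayer answers a lookup for a currency pair, reporting a miss with false.
type rateLayer func(pair string) (float64, bool)

// fromMap adapts a map into a layer. Here it stands in for Redis or Postgres.
func fromMap(m map[string]float64) rateLayer {
	return func(pair string) (float64, bool) {
		rate, ok := m[pair]
		return rate, ok
	}
}

// lookupRate tries each layer in order and returns the first hit together
// with the index of the layer that served it (-1 on a total miss).
func lookupRate(pair string, layers ...rateLayer) (rate float64, layer int, ok bool) {
	for i, l := range layers {
		if r, hit := l(pair); hit {
			return r, i, true
		}
	}
	return 0, -1, false
}

func main() {
	redis := fromMap(map[string]float64{"USD/EUR": 0.9})
	postgres := fromMap(map[string]float64{"USD/EUR": 0.89, "USD/BDT": 110})

	rate, layer, ok := lookupRate("USD/BDT", redis, postgres)
	fmt.Println(rate, layer, ok) // 110 1 true: served by the fallback layer
}
```

The upstream API never appears in the layer list; in this design only the scheduled worker talks to it.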
The SettingsService — live reload via Redis pub/sub
Every value that I might want to change in production at 2am without a redeploy lives in EduQueueSettings and is fronted by SettingsService (worker/settings_service.go:24-152).
The service:
- Loads the table into an in-memory map[string]string on init.
- Exposes GetBool, GetInt, GetDuration helpers with env-default fallbacks.
- Subscribes to the Redis channel eduqueue:settings:update.
- Updates its in-memory cache on every published key:value message.
The HTTP layer's PUT /api/eduqueue/admin/settings writes to Postgres and then publishes:
```go
rdb.Publish(ctx, "eduqueue:settings:update",
	fmt.Sprintf("%s:%s", key, value))
```
Within moments, every subscribed worker process honors the new value. No restart. No deploy. No race.
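The subscriber side can be sketched without Redis: what matters is how one key:value payload is parsed and applied to the in-memory map. This is not the code in worker/settings_service.go. The settings type and apply method are hypothetical, the helpers take an explicit default instead of reading the environment, and email_tick is an invented key used only to exercise GetDuration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
	"time"
)

type settings struct {
	mu     sync.RWMutex
	values map[string]string
}

func newSettings(initial map[string]string) *settings {
	copied := make(map[string]string, len(initial))
	for k, v := range initial {
		copied[k] = v
	}
	return &settings{values: copied}
}

// apply handles one payload of the form "key:value". It splits on the first
// colon only, so values that contain colons themselves are preserved.
func (s *settings) apply(msg string) bool {
	key, value, ok := strings.Cut(msg, ":")
	if !ok || key == "" {
		return false
	}
	s.mu.Lock()
	s.values[key] = value
	s.mu.Unlock()
	return true
}

func (s *settings) get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.values[key]
	return v, ok
}

// GetBool returns def when the key is missing or not a valid boolean.
func (s *settings) GetBool(key string, def bool) bool {
	if raw, ok := s.get(key); ok {
		if v, err := strconv.ParseBool(raw); err == nil {
			return v
		}
	}
	return def
}

// GetInt returns def when the key is missing or not a valid integer.
func (s *settings) GetInt(key string, def int) int {
	if raw, ok := s.get(key); ok {
		if v, err := strconv.Atoi(raw); err == nil {
			return v
		}
	}
	return def
}

// GetDuration returns def when the key is missing or not a valid duration.
func (s *settings) GetDuration(key string, def time.Duration) time.Duration {
	if raw, ok := s.get(key); ok {
		if v, err := time.ParseDuration(raw); err == nil {
			return v
		}
	}
	return def
}

func main() {
	s := newSettings(map[string]string{"email_rate_limit_per_minute": "30"})
	s.apply("email_processor_enabled:false")
	fmt.Println(s.GetBool("email_processor_enabled", true)) // false
	fmt.Println(s.GetInt("email_rate_limit_per_minute", 0)) // 30
}
```

Splitting on the first colon matters for this wire format: a value such as a URL would otherwise be truncated at its own colon.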
Things wired through SettingsService:
| Key | Default | Effect |
|---|---|---|
| email_processor_enabled | true | Pauses the email tick on the next 30s boundary |
| email_rate_limit_per_minute | 30 | Caps total send-rate independent of provider limits |
| tracking_enabled | true | Toggles open/click pixel injection (privacy override) |
| provider_chain_order | aws,azure,http | Live re-orders the failover chain |
The provider_chain_order toggle is the one I love most. When AWS SES throttles me during exam week, I can demote it to tertiary from my phone, in three taps, while Postgres is still under load.
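A failover chain driven by a comma-separated setting might look like the sketch below. The sender type, parseChain and sendWithFailover are hypothetical, and the providers are stubs rather than real SES or Azure clients.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// sender delivers one message and returns an error if the provider refuses it.
type sender func(to string) error

var errThrottled = errors.New("throttled")

// parseChain turns "aws,azure,http" into an ordered list of provider names.
func parseChain(order string) []string {
	var names []string
	for _, part := range strings.Split(order, ",") {
		if name := strings.TrimSpace(part); name != "" {
			names = append(names, name)
		}
	}
	return names
}

// sendWithFailover tries providers in the configured order and returns the
// name of the first one that accepts the message.
func sendWithFailover(order string, providers map[string]sender, to string) (string, error) {
	lastErr := errors.New("provider chain is empty")
	for _, name := range parseChain(order) {
		send, ok := providers[name]
		if !ok {
			lastErr = fmt.Errorf("%s: unknown provider", name)
			continue
		}
		if err := send(to); err != nil {
			lastErr = fmt.Errorf("%s: %w", name, err)
			continue
		}
		return name, nil
	}
	return "", lastErr
}

func main() {
	providers := map[string]sender{
		"aws":   func(string) error { return errThrottled },
		"azure": func(string) error { return nil },
		"http":  func(string) error { return nil },
	}
	used, err := sendWithFailover("aws,azure,http", providers, "student@example.test")
	fmt.Println(used, err) // azure <nil>
}
```

Demoting a throttled provider is then just a matter of publishing a new order string; the senders themselves do not change.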
Observability — Discord as a stack
EduQueue does not run Datadog. EduQueue does not run Grafana. The observability stack is a gocron scheduler in the worker (worker/monitoring.go:22-47) emitting structured messages to a Discord webhook.
| Cadence | What it emits | Discord behavior |
|---|---|---|
| Every 5 min | Health check vs CPU 80% / Mem 85% / Disk 90% / API 2s | Silent if green; @here on breach |
| Daily 00:01 UTC | 24h roll-up: subscribers, sends, opens, clicks, top campaigns | Posted to #eduqueue-reports |
| Weekly Sun 00:05 | Long-form report | Chunked at 2000-char Discord limit (monitoring.go:1028-1102) |
| On-demand | POST /api/v1/monitor/trigger with {"job_name": "..."} | Manual report fire |
This is intentional. Discord is a notification surface I check anyway. The chunking, the threshold logic, and the on-demand trigger together cover 95% of what I would want from a paid observability tool, at zero monthly cost. When the platform earns paid observability, the upgrade path is a Vector or OpenTelemetry collector tailing the same webhook payload format.
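Chunking a long report for a 2000-character message limit can be sketched as below. This is not the code at monitoring.go:1028-1102; chunkMessage is a hypothetical helper that counts Unicode code points and prefers to break at a newline so that report lines stay intact.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkMessage splits msg into pieces of at most limit characters. When a
// newline exists inside the window, the cut is made just after the last one.
func chunkMessage(msg string, limit int) []string {
	if limit <= 0 {
		return nil
	}
	runes := []rune(msg)
	var chunks []string
	for len(runes) > limit {
		cut := limit
		for i := limit - 1; i > 0; i-- {
			if runes[i] == '\n' {
				cut = i + 1 // keep the newline with the preceding chunk
				break
			}
		}
		chunks = append(chunks, string(runes[:cut]))
		runes = runes[cut:]
	}
	if len(runes) > 0 {
		chunks = append(chunks, string(runes))
	}
	return chunks
}

func main() {
	report := strings.Repeat("weekly report line\n", 300)
	for i, part := range chunkMessage(report, 2000) {
		fmt.Printf("chunk %d: %d characters\n", i+1, len([]rune(part)))
	}
}
```

Concatenating the chunks in order reproduces the original report, so nothing is lost at the boundaries.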
The metrics endpoint (GET /api/v1/monitor/metrics) returns the JSON the Discord posts are built from — GoRoutines, HeapAlloc, DBConnections, plus the per-task Asynq counters. Anyone with the session token can pull it; the payload is small enough to render inline in the operator dashboard.
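A payload with those three fields can be assembled from the Go runtime alone. The sketch below is an assumption about shape, not the endpoint's actual schema: the JSON field names are copied from the prose, the per-task Asynq counters are omitted, and the connection count is passed in (in a real service it could come from sql.DB.Stats().OpenConnections).

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

// metrics holds the process-level numbers named in the text.
type metrics struct {
	GoRoutines    int    `json:"GoRoutines"`
	HeapAlloc     uint64 `json:"HeapAlloc"`
	DBConnections int    `json:"DBConnections"`
}

// snapshot reads goroutine and heap figures from the runtime.
func snapshot(dbOpenConnections int) metrics {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return metrics{
		GoRoutines:    runtime.NumGoroutine(),
		HeapAlloc:     m.HeapAlloc,
		DBConnections: dbOpenConnections,
	}
}

// snapshotJSON renders the snapshot as the JSON body of a response.
func snapshotJSON(dbOpenConnections int) (string, error) {
	body, err := json.Marshal(snapshot(dbOpenConnections))
	return string(body), err
}

func main() {
	body, err := snapshotJSON(3)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(body)
}
```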
What I'd change
Three things this subsystem will want before the next scale milestone:
- Per-queue isolation. Right now all task types share one Asynq queue. Pulling pdf:generate into its own queue with its own concurrency knob would prevent a slow ChromeDP run from blocking an email retry.
- Structured task tracing. Every task gets a UUID; carrying it through to the Discord report and the analytics layer would let me reconstruct any single user's journey in one query.
- Worker autoscaling. Right now I run a fixed 1-replica worker. A simple "add a replica when queue depth > N" loop in the deploy workflow would give me free elasticity without Kubernetes.
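The autoscaling rule in the last item is a one-line calculation once the threshold is fixed. A sketch, with desiredReplicas as a hypothetical helper and perReplica playing the role of N:

```go
package main

import "fmt"

// desiredReplicas returns how many workers a given queue depth calls for:
// one replica per perReplica queued tasks, rounded up, clamped to [min, max].
func desiredReplicas(queueDepth, perReplica, minReplicas, maxReplicas int) int {
	if perReplica <= 0 {
		return minReplicas
	}
	want := (queueDepth + perReplica - 1) / perReplica // ceiling division
	if want < minReplicas {
		want = minReplicas
	}
	if want > maxReplicas {
		want = maxReplicas
	}
	return want
}

func main() {
	for _, depth := range []int{0, 100, 101, 1000} {
		fmt.Printf("depth %d -> %d replica(s)\n", depth, desiredReplicas(depth, 100, 1, 4))
	}
}
```

The upper clamp is what keeps a burst from scaling workers past the Postgres connection budget discussed at the top of the post.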
None of these are urgent. All are queued.
Part of the UIU EduQueue platform case study.
