Companion post to the UIU EduQueue platform overview. This post is the long-form version of "how email actually works" inside EduQueue. If you only want the platform tour, read that one first.
In EduQueue, email is the product. A late routine is a missed exam. A double-send is a panicked student. A silently dropped bounce is a sender-reputation crater that takes weeks to climb out of. Everything else in the system can degrade gracefully — email cannot. So the email subsystem is, by line count, smaller than the analytics layer; by design effort, it is the largest part of the codebase.
This post is the audit trail for every choice in it.
The EmailProvider interface
The whole subsystem hinges on one interface (server/worker/email_worker.go:26-202):
```go
type EmailProvider interface {
	Name() string
	Send(ctx context.Context, msg *EmailMessage) error
	RateLimits() []RateLimitConfig
}

type RateLimitConfig struct {
	Limit    int
	Duration time.Duration
}
```
Three implementations live behind it:
| Implementation | Wraps | Stated cap | Latency p50 |
|---|---|---|---|
| AWSEmailProvider | AWS SES SDK v2 | Negotiated quota | ~150ms |
| AzureEmailProvider | Azure Communication Services SDK | 10/min | ~250ms |
| HTTPEmailProvider | Custom Cloudflare Worker (azure-email-worker.monzim.workers.dev) | 10/min | ~100ms |
The RateLimits() method is the contract. A provider that lies about its limits is a provider that gets the system rate-limited and then asks for forgiveness. So I treat the declared limit as ground truth and enforce it client-side.
The rate-limit math
For each provider, the worker maintains a Redis sorted set keyed by provider name. Each successful send appends the current timestamp as both score and member; the dispatcher counts members within the trailing window before issuing a send.
In rough pseudo-Go:
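```go
// canSend reports whether provider p is under every rate limit it declares.
// rdb is the shared Redis client.
func (s *EmailService) canSend(ctx context.Context, p EmailProvider) bool {
	key := "ratelimit:" + p.Name()
	for _, lim := range p.RateLimits() {
		cutoff := time.Now().Add(-lim.Duration).UnixNano()
		// Trim entries that have aged out of the window, then count what is left.
		// All limits share one key, so this is only exact for a single window per provider.
		rdb.ZRemRangeByScore(ctx, key, "0", strconv.FormatInt(cutoff, 10))
		// The error is dropped in this sketch; on failure count is 0 and the check passes.
		count, _ := rdb.ZCard(ctx, key).Result()
		if int(count) >= lim.Limit {
			return false
		}
	}
	return true
}
```
Two properties matter: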
- The window slides. A static counter that resets every minute is gameable: 10 sends at :59, 10 more at :00, congratulations, you just sent 20 inside one minute. The sorted-set approach measures the last 60 seconds at the moment of the check.
- The check is atomic enough. Under contention, ZRemRangeByScore + ZCard + ZADD racing with itself can let a small N of extra sends through. For a system with a Limit=10/min cap and one worker, that variance is irrelevant. For a system 100x bigger I'd move it into a single Lua script.
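The boundary problem in the first bullet is easy to reproduce without Redis. The sketch below is an in-memory stand-in, not the worker's code: `slidingWindow`, `fixedWindow`, and `boundaryBurst` are hypothetical names, and a plain slice of timestamps plays the role of the sorted set.

```go
package main

import (
	"fmt"
	"time"
)

// slidingWindow is an in-memory stand-in for the Redis sorted set: it keeps
// the timestamps of recent sends in ascending order.
type slidingWindow struct {
	limit  int
	window time.Duration
	sends  []time.Time
}

// allow reports whether a send at time now fits under the limit and, if so, records it.
func (w *slidingWindow) allow(now time.Time) bool {
	cutoff := now.Add(-w.window)
	i := 0
	for i < len(w.sends) && !w.sends[i].After(cutoff) {
		i++ // aged out of the trailing window (the ZREMRANGEBYSCORE step)
	}
	w.sends = w.sends[i:]
	if len(w.sends) >= w.limit { // the ZCARD step
		return false
	}
	w.sends = append(w.sends, now) // the ZADD step
	return true
}

// fixedWindow is the naive alternative: a counter that resets at each window boundary.
type fixedWindow struct {
	limit  int
	window time.Duration
	bucket int64
	count  int
}

func (f *fixedWindow) allow(now time.Time) bool {
	b := now.UnixNano() / int64(f.window)
	if b != f.bucket {
		f.bucket, f.count = b, 0
	}
	if f.count >= f.limit {
		return false
	}
	f.count++
	return true
}

// boundaryBurst attempts 10 sends at 12:00:59 and 10 more at 12:01:00 and
// returns how many each limiter let through.
func boundaryBurst() (sliding, fixed int) {
	sw := &slidingWindow{limit: 10, window: time.Minute}
	fw := &fixedWindow{limit: 10, window: time.Minute}
	base := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	for _, at := range []time.Time{base.Add(59 * time.Second), base.Add(60 * time.Second)} {
		for i := 0; i < 10; i++ {
			if sw.allow(at) {
				sliding++
			}
			if fw.allow(at) {
				fixed++
			}
		}
	}
	return sliding, fixed
}

func main() {
	sliding, fixed := boundaryBurst()
	fmt.Printf("sliding=%d fixed=%d\n", sliding, fixed)
}
```

With a 10/min cap, the sliding window admits 10 of the 20 attempts while the fixed counter admits all 20, because its bucket rolls over between the two bursts.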
When canSend() returns false for the active provider, the dispatcher iterates to the next provider in the chain. When all providers say no, the message stays on the EmailQueue table for the next 30-second tick of ProcessEmailQueue (email_worker.go:32-48).
The failover decision
Failover triggers on three signatures:
| Signature | Source | Action |
|---|---|---|
| Local rate-limit hit | canSend() == false | Try next provider, do not log a failure |
| Provider 4xx (429, 403) | Send() error | Try next provider, log EmailEvent(failed, reason) |
| Provider 5xx / network | Send() error | Try next provider, log + Discord alert if 3 in 5min |
When the chain exhausts, the EmailMessage row's Status flips to failed (not bounced — those are different) and the row stays eligible for the next tick. Asynq's MaxRetry(2) then governs the upper bound on attempts. Beyond that, the row is parked and surfaced on the operator dashboard's "stuck queue" panel.
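The failover table can be sketched as a single loop over the chain. This is an illustration under assumptions, not the worker's dispatcher: `provider`, `orderChain`, `dispatch`, and `demo` are hypothetical, the rate-limit result is a plain boolean field, and failure events are collected as strings instead of EmailEvent rows. Discord alerting and the Asynq retry bound are left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type EmailMessage struct{ To string }

// provider is a hypothetical, trimmed-down stand-in for EmailProvider plus
// the result of its local rate-limit check.
type provider struct {
	name    string
	canSend bool  // what canSend() would return right now
	sendErr error // what Send() would return; nil means accepted
}

func (p provider) Send(ctx context.Context, msg *EmailMessage) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	return p.sendErr
}

var errChainExhausted = errors.New("all providers declined or failed")

// orderChain arranges providers according to a list of names, the way a
// runtime setting might; names with no matching provider are skipped.
func orderChain(order []string, byName map[string]provider) []provider {
	chain := make([]provider, 0, len(order))
	for _, name := range order {
		if p, ok := byName[name]; ok {
			chain = append(chain, p)
		}
	}
	return chain
}

// dispatch walks the chain in order. It returns the name of the provider
// that accepted the message and the failure events logged along the way.
func dispatch(ctx context.Context, chain []provider, msg *EmailMessage) (string, []string, error) {
	var events []string
	for _, p := range chain {
		if !p.canSend {
			continue // local rate-limit hit: try the next provider, log nothing
		}
		if err := p.Send(ctx, msg); err != nil {
			events = append(events, fmt.Sprintf("failed:%s:%v", p.name, err))
			continue
		}
		return p.name, events, nil
	}
	return "", events, errChainExhausted
}

// demo runs one message through a chain in which "aws" is rate-limited,
// "azure" returns an error, and "http" accepts.
func demo(order []string) (string, []string, error) {
	byName := map[string]provider{
		"aws":   {name: "aws", canSend: false},
		"azure": {name: "azure", canSend: true, sendErr: errors.New("503 service unavailable")},
		"http":  {name: "http", canSend: true},
	}
	return dispatch(context.Background(), orderChain(order, byName), &EmailMessage{To: "student@example.com"})
}

func main() {
	used, events, err := demo([]string{"aws", "azure", "http"})
	fmt.Println(used, events, err)
}
```

Because the order arrives as a slice of names, reordering the chain is a data change: calling `demo` with `[]string{"http", "aws"}` sends through `http` first without touching `dispatch`.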
The single most important property: the chain order is data, not code. Re-ordering providers does not require a redeploy — it's a setting in the live-reload SettingsService. When AWS SES throttles me during exam week, I can demote it to tertiary from my phone.
Bounces — the hidden tax
Bounces are the part of email infrastructure that nobody talks about until it bites them. EduQueue handles them at /api/webhooks/email-bounce, which routes through tracking.IsHardBounceError() (server/tracking/tracking.go:291-368):
```go
var hardBouncePatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\b550\b`),
	regexp.MustCompile(`(?i)mailbox not found`),
	regexp.MustCompile(`(?i)no such user`),
	regexp.MustCompile(`(?i)recipient address rejected`),
	// ...
}
```
A hard bounce kicks AutoUnsubscribe() (tracking.go:319-368):
- Subscription deactivation. Subscription.Active = false, BounceReason = "<source>:<reason>", UnsubscribedAt = now().
- Blocklist insert. A row in EduQueueBlockedStudent so the next WHERE clause in email:bulk excludes them at the SQL layer.
- Event emission. An EmailEvent row of type bounced for the analytics path.
- Idempotency stamp. AlreadyDone = true so the same payload, replayed, becomes a no-op.
That last step is non-negotiable. Email infrastructure replays webhooks. A non-idempotent unsubscribe handler will eventually re-block a re-subscribed user. The AlreadyDone flag makes the operation idempotent: same input, same end state, every time.
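A small in-memory sketch shows why the replay case matters. It is not the code in tracking.go: `store`, `newStore`, and `resubscribe` are hypothetical, only the four patterns quoted above are included, and keying the AlreadyDone stamp by a webhook payload ID is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Only the four patterns quoted in the post; the real list is longer.
var hardBouncePatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\b550\b`),
	regexp.MustCompile(`(?i)mailbox not found`),
	regexp.MustCompile(`(?i)no such user`),
	regexp.MustCompile(`(?i)recipient address rejected`),
}

func isHardBounce(reason string) bool {
	for _, re := range hardBouncePatterns {
		if re.MatchString(reason) {
			return true
		}
	}
	return false
}

// store is a hypothetical in-memory stand-in for the Subscription,
// EduQueueBlockedStudent, and EmailEvent tables.
type store struct {
	active  map[string]bool // studentID -> Subscription.Active
	blocked map[string]bool // studentID -> present in the blocklist
	done    map[string]bool // payloadID -> AlreadyDone (the keying is an assumption)
	events  []string
}

func newStore() *store {
	return &store{active: map[string]bool{}, blocked: map[string]bool{}, done: map[string]bool{}}
}

// autoUnsubscribe applies the four steps at most once per webhook payload
// and reports whether it changed anything.
func (s *store) autoUnsubscribe(payloadID, studentID, reason string) bool {
	if s.done[payloadID] || !isHardBounce(reason) {
		return false
	}
	s.active[studentID] = false
	s.blocked[studentID] = true
	s.events = append(s.events, "bounced:"+studentID)
	s.done[payloadID] = true
	return true
}

// resubscribe models a student opting back in after the bounce was resolved.
func (s *store) resubscribe(studentID string) {
	s.active[studentID] = true
	delete(s.blocked, studentID)
}

func main() {
	s := newStore()
	s.active["s1"] = true
	fmt.Println(s.autoUnsubscribe("evt-1", "s1", "550 5.1.1 No such user")) // first delivery
	s.resubscribe("s1")
	fmt.Println(s.autoUnsubscribe("evt-1", "s1", "550 5.1.1 No such user")) // replayed webhook
	fmt.Println(s.active["s1"], s.blocked["s1"], len(s.events))
}
```

The replayed payload returns false and leaves the re-subscribed student active; without the `done` check the second call would block them again and emit a duplicate event.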
The blocklist is checked at the SQL layer rather than in application code because I cannot trust that every code path that sends email knows about the blocklist. The constraint goes in the schema, the query joins through it, and there is no way to forget.
Open & click tracking
EduQueue tracks opens with a 1×1 transparent GIF and clicks with rewritten redirects. The implementation lives in tracking.InjectHTML() (server/tracking/tracking.go:156-209).
The open pixel
```html
<img
  src="https://api.eduqueue.monzim.com/api/t/o/{messageID}.gif"
  width="1"
  height="1"
  border="0"
  style="display: block; border: none; outline: none; text-decoration: none"
/>
```
The pixel is appended immediately before </body> in every outbound HTML email. The handler:
- Validates the messageID shape (UUID v4).
- Increments EmailMessage.OpenCount.
- Appends an EmailEvent(MessageID, opened, OccurredAt = now()).
- Returns the GIF with Cache-Control: no-store, no-cache, must-revalidate.
The cache header matters more than the rest combined. Without it, a caching image proxy such as Gmail's can serve the pixel from its own cache, so repeat opens never reach the handler and you record one open per email instead of one per actual open. This is the difference between an analytics layer that works and one that lies.
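The four handler steps fit in a few lines of net/http. The sketch below is an assumption-laden stand-in for the real handler: `openTracker` and `fetch` are hypothetical, opens are counted in a map instead of EmailMessage.OpenCount and EmailEvent rows, and requests are issued in-process with httptest.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
	"sync"
)

// pixelGIF is a 1x1 transparent GIF, decoded once at start-up.
var pixelGIF, _ = base64.StdEncoding.DecodeString("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

// uuidV4 checks the shape of a version-4 UUID.
var uuidV4 = regexp.MustCompile(`(?i)^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$`)

// openTracker counts opens in memory; a real handler would update the
// message row and append an event instead.
type openTracker struct {
	mu    sync.Mutex
	opens map[string]int
}

func (t *openTracker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimSuffix(strings.TrimPrefix(r.URL.Path, "/api/t/o/"), ".gif")
	if !uuidV4.MatchString(id) {
		http.NotFound(w, r) // reject anything that is not UUID-shaped
		return
	}
	t.mu.Lock()
	t.opens[id]++
	t.mu.Unlock()
	w.Header().Set("Content-Type", "image/gif")
	w.Header().Set("Cache-Control", "no-store, no-cache, must-revalidate")
	w.Write(pixelGIF)
}

// fetch issues one in-process GET and returns the status code and the
// Cache-Control header of the response.
func fetch(t *openTracker, path string) (int, string) {
	rec := httptest.NewRecorder()
	t.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Header().Get("Cache-Control")
}

func main() {
	tracker := &openTracker{opens: map[string]int{}}
	id := "3f2b8c1e-9a4d-4e6f-8b2a-1c3d5e7f9a0b"
	fetch(tracker, "/api/t/o/"+id+".gif")
	status, cacheControl := fetch(tracker, "/api/t/o/"+id+".gif")
	fmt.Println(status, cacheControl, tracker.opens[id])
}
```

Each request that reaches the handler bumps the counter, which is exactly what the no-store header is meant to guarantee once a proxy sits in between.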
The click rewriter
Every outbound <a href="X"> is rewritten to:
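```
/api/t/c/{messageID}?u={base64url(X)}
```
The base64-url encoding is deliberate: it keeps the original URL from appearing in plain text in referrer headers, server logs, and CDN access lines that might index it. It is obfuscation rather than secrecy, since anyone holding the link can decode it.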
The redirect handler decodes the URL, validates scheme and host (tracking.go:237-257) against an allowlist to prevent open-redirect abuse, increments EmailMessage.ClickCount, appends an EmailEvent(clicked), then issues a 302 to the decoded URL.
Open-redirect validation is the kind of thing that's only obvious after you've seen a tracking endpoint weaponized in a phishing campaign. EduQueue's allowlist is conservative — only domains under monzim.com, eduqueue.monzim.com, and a small set of known partner domains.
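A rough sketch of the rewrite and the validation step, under assumptions: `trackedLink`, `resolveTarget`, and `roundTrip` are hypothetical names, the allowlist holds only monzim.com (partner domains are omitted), unpadded base64url is assumed, and the scheme check accepts http and https.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// allowedDomains is illustrative; the real allowlist also has partner domains.
var allowedDomains = []string{"monzim.com"}

var errBlockedTarget = errors.New("redirect target not allowed")

// trackedLink builds the rewritten href for one link in one message.
func trackedLink(messageID, target string) string {
	return "/api/t/c/" + messageID + "?u=" + base64.RawURLEncoding.EncodeToString([]byte(target))
}

// resolveTarget decodes the u parameter and returns the URL to redirect to,
// or an error if the scheme or host fails validation.
func resolveTarget(encoded string) (string, error) {
	raw, err := base64.RawURLEncoding.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	u, err := url.Parse(string(raw))
	if err != nil {
		return "", err
	}
	if u.Scheme != "https" && u.Scheme != "http" {
		return "", errBlockedTarget
	}
	// Compare the parsed hostname, never the raw string: this defeats
	// tricks such as https://monzim.com@evil.example/.
	host := strings.ToLower(u.Hostname())
	for _, d := range allowedDomains {
		if host == d || strings.HasSuffix(host, "."+d) {
			return string(raw), nil
		}
	}
	return "", errBlockedTarget
}

// roundTrip rewrites target and resolves it again, as the redirect handler would.
func roundTrip(target string) (string, error) {
	link := trackedLink("msg-1", target)
	return resolveTarget(link[strings.Index(link, "?u=")+3:])
}

func main() {
	fmt.Println(roundTrip("https://eduqueue.monzim.com/routine?term=spring"))
	fmt.Println(roundTrip("https://monzim.com.evil.example/login"))
}
```

The suffix check uses a leading dot on purpose, so `monzim.com.evil.example` and `notmonzim.com` both fail while any subdomain of an allowed domain passes.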
Templates, versioning, and promotion slots
Templates live in EmailTemplate (server/models/email_templates.go:16-30):
| Column | Purpose |
|---|---|
| Slug | Stable identifier (e.g. exam_routine, welcome_email) |
| Version | Monotonic per slug |
| IsActive | Exactly one row per slug carries true; older versions retained for rollback |
| HTMLBody / TextBody | Rendered with Go's html/template at send time |
| PromotionSlots | JSONB; lets ops drop in a banner or footer block per send without touching the template |
A (slug, version) unique index is what makes the rollback story safe. To revert a bad template, the admin console flips IsActive from version N to version N-1 — no migration, no redeploy, no editing in production.
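The rollback flip can be sketched over an in-memory slice. `templateRow`, `activate`, and `activeVersion` are hypothetical names; in a database the same two steps would presumably run inside one transaction so the "exactly one active row" invariant never breaks mid-update.

```go
package main

import (
	"errors"
	"fmt"
)

type templateRow struct {
	Slug     string
	Version  int
	IsActive bool
}

var errNoSuchVersion = errors.New("no such template version")

// activate makes (slug, version) the only active row for its slug. It checks
// that the target exists first, so a bad request leaves every row untouched.
func activate(rows []templateRow, slug string, version int) error {
	found := false
	for _, r := range rows {
		if r.Slug == slug && r.Version == version {
			found = true
			break
		}
	}
	if !found {
		return errNoSuchVersion
	}
	for i := range rows {
		if rows[i].Slug == slug {
			rows[i].IsActive = rows[i].Version == version
		}
	}
	return nil
}

// activeVersion returns the active version for slug, or 0 if there is none.
func activeVersion(rows []templateRow, slug string) int {
	for _, r := range rows {
		if r.Slug == slug && r.IsActive {
			return r.Version
		}
	}
	return 0
}

func main() {
	rows := []templateRow{
		{"exam_routine", 1, false},
		{"exam_routine", 2, false},
		{"exam_routine", 3, true},
		{"welcome_email", 1, true},
	}
	fmt.Println(activate(rows, "exam_routine", 2), activeVersion(rows, "exam_routine"))
	fmt.Println(activate(rows, "exam_routine", 9), activeVersion(rows, "exam_routine"))
}
```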
Campaigns vs. routines
Two send modes share the same pipeline but discriminate on EmailCampaign.Type:
| Mode | Type | Lifecycle | When body is built |
|---|---|---|---|
| Routine send | "exam_routine" | Auto-generated per trimester | At pdf:generate task time, per-recipient |
| Custom campaign | "custom" | draft → scheduled → sending → sent | Admin pre-renders Subject/HTMLBody/TextBody before queue |
The lifecycle distinction is the operationally important one. A custom campaign in draft is editable; in scheduled it has a timestamp and the worker won't pick it up early; in sending the queue is draining; in sent it's frozen for analytics. The transitions are gated server-side — the UI never owns campaign state.
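One way to gate those transitions server-side is a small lookup table. This is a sketch, not the project's model: `Campaign`, `advance`, `editable`, and `due` are hypothetical, and it assumes the lifecycle only moves forward (whether a scheduled campaign can return to draft is not stated here).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type CampaignStatus string

const (
	Draft     CampaignStatus = "draft"
	Scheduled CampaignStatus = "scheduled"
	Sending   CampaignStatus = "sending"
	Sent      CampaignStatus = "sent"
)

// nextStatus lists the single legal successor of each state; Sent has none.
var nextStatus = map[CampaignStatus]CampaignStatus{
	Draft:     Scheduled,
	Scheduled: Sending,
	Sending:   Sent,
}

var errBadTransition = errors.New("transition not allowed")

type Campaign struct {
	Status      CampaignStatus
	ScheduledAt time.Time
}

// advance moves the campaign to the requested state only if that state is
// the legal successor of the current one.
func (c *Campaign) advance(to CampaignStatus) error {
	next, ok := nextStatus[c.Status]
	if !ok || next != to {
		return errBadTransition
	}
	c.Status = to
	return nil
}

// editable reports whether the campaign body may still change.
func (c *Campaign) editable() bool { return c.Status == Draft }

// due reports whether the worker may start sending at time now.
func (c *Campaign) due(now time.Time) bool {
	return c.Status == Scheduled && !now.Before(c.ScheduledAt)
}

func main() {
	c := &Campaign{Status: Draft}
	fmt.Println(c.advance(Sending)) // skipping "scheduled" is rejected
	c.ScheduledAt = time.Now().Add(time.Hour)
	fmt.Println(c.advance(Scheduled), c.due(time.Now()))
}
```

Keeping the table on the server means a stale or tampered UI can ask for any state it likes and still only get the one legal move.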
Cooldowns
A subscriber can request the same routine multiple times. Without a cooldown, every retry is a fresh PDF render and a fresh email. The cooldown lives at the row level:
```
SubscriptionEmail(StudentID, ExamName, Type, TriggerAt)
```
Bulk send queries skip recipients whose latest TriggerAt is within EMAIL_COOLDOWN_HOURS (default 24). The cooldown is a send-side check, not a request-side check, so a hammering retry loop is absorbed at the worker, not bounced at the API.
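The cooldown filter, sketched in application code for clarity even though the real check lives in the bulk-send query. `cooldown`, `eligible`, and `demo` are hypothetical helpers, the fallback to 24 on a missing or invalid value is an assumption, and the match is on ExamName only.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// cooldown parses the raw value of EMAIL_COOLDOWN_HOURS, falling back to 24
// when it is empty, malformed, or negative (the fallback rule is an assumption).
func cooldown(raw string) time.Duration {
	h, err := strconv.Atoi(raw)
	if err != nil || h < 0 {
		h = 24
	}
	return time.Duration(h) * time.Hour
}

// SubscriptionEmail mirrors the row shape quoted in the post.
type SubscriptionEmail struct {
	StudentID string
	ExamName  string
	Type      string
	TriggerAt time.Time
}

// eligible returns the candidates whose most recent send for examName is
// older than the cooldown, plus those with no send on record.
func eligible(candidates []string, history []SubscriptionEmail, examName string, now time.Time, cd time.Duration) []string {
	latest := map[string]time.Time{}
	for _, h := range history {
		if h.ExamName == examName && h.TriggerAt.After(latest[h.StudentID]) {
			latest[h.StudentID] = h.TriggerAt
		}
	}
	var out []string
	for _, id := range candidates {
		last, seen := latest[id]
		if !seen || now.Sub(last) >= cd {
			out = append(out, id)
		}
	}
	return out
}

// demo filters three students: s1 was emailed 2 hours ago, s2 30 hours ago,
// and s3 never.
func demo(rawHours string) []string {
	now := time.Date(2024, 5, 1, 9, 0, 0, 0, time.UTC)
	history := []SubscriptionEmail{
		{"s1", "midterm", "exam_routine", now.Add(-2 * time.Hour)},
		{"s2", "midterm", "exam_routine", now.Add(-30 * time.Hour)},
	}
	return eligible([]string{"s1", "s2", "s3"}, history, "midterm", now, cooldown(rawHours))
}

func main() {
	fmt.Println(demo(""))   // default cooldown
	fmt.Println(demo("1"))  // one-hour cooldown
	fmt.Println(demo("48")) // two-day cooldown
}
```

With the default cooldown, s1 is skipped and s2 and s3 go out; a repeat request from s1 costs nothing beyond the skipped row.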
What I'd change
Three things this subsystem will need before the next 10x of users:
- Lua-scripted rate limiter. The current sorted-set check has a small race window. A single-script ZREMRANGEBYSCORE + ZCARD + ZADD makes it strictly correct.
- Provider health scores. Right now failover is binary per-attempt. A weighted EWMA of recent error rates would let me down-rank a flaky provider for an hour without yanking it from the chain.
- Per-template warm-up. New templates send to a small canary set first to catch render bugs before a 10k-recipient blast.
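To make the health-score idea concrete, here is one possible shape for it. Everything in the sketch is an assumption: the names `health`, `observe`, `rank`, and `demo`, the smoothing factor of 0.2, and the rule that a provider is demoted only while its smoothed error rate sits above 0.3. The update is the usual EWMA, rate = alpha * x + (1 - alpha) * rate, with x = 1 for a failed send and 0 for a success.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	alpha       = 0.2 // weight of the newest observation (arbitrary choice)
	demoteAbove = 0.3 // error rate above which a provider is down-ranked (arbitrary choice)
)

// health tracks an exponentially weighted moving average of send failures.
type health struct {
	name    string
	errRate float64
}

// observe folds one send outcome into the moving average.
func (h *health) observe(failed bool) {
	x := 0.0
	if failed {
		x = 1.0
	}
	h.errRate = alpha*x + (1-alpha)*h.errRate
}

// penalty is zero for healthy providers, so they keep their configured order.
func (h *health) penalty() float64 {
	if h.errRate <= demoteAbove {
		return 0
	}
	return h.errRate
}

// rank returns provider names with unhealthy providers moved to the back,
// worst last. The input slice is left unmodified.
func rank(chain []*health) []string {
	sorted := append([]*health(nil), chain...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].penalty() < sorted[j].penalty()
	})
	names := make([]string, len(sorted))
	for i, h := range sorted {
		names[i] = h.name
	}
	return names
}

// demo records consecutive failures, then successes, against the primary
// provider and returns the resulting order.
func demo(failures, successes int) []string {
	chain := []*health{{name: "aws"}, {name: "azure"}, {name: "http"}}
	for i := 0; i < failures; i++ {
		chain[0].observe(true)
	}
	for i := 0; i < successes; i++ {
		chain[0].observe(false)
	}
	return rank(chain)
}

func main() {
	fmt.Println(demo(1, 0)) // one blip: order unchanged
	fmt.Println(demo(3, 0)) // sustained failures: primary demoted
	fmt.Println(demo(3, 3)) // recovery: primary restored
}
```

The provider never leaves the chain; it just sinks while its smoothed error rate is high and floats back as successes pull the average down.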
None of these are urgent. All of them will become urgent the day they're not done.
Part of the UIU EduQueue platform case study.
