EduQueue Deep Dive: Multi-Provider Email Failover, Bounce Idempotency & Tracking Pixels

A platform-engineering deep dive into UIU EduQueue's email subsystem — the pluggable EmailProvider interface, per-provider sliding-window rate limiting in Redis, the AWS SES → Azure CS → custom Cloudflare Worker fallback chain, idempotent bounce-webhook handling, and the cache-defeating tracking pixel that survives Gmail's image proxy.

Azraf Al Monzim

Companion post to the UIU EduQueue platform overview. This post is the long-form version of "how email actually works" inside EduQueue. If you only want the platform tour, read that one first.

In EduQueue, email is the product. A late routine is a missed exam. A double-send is a panicked student. A silently dropped bounce is a sender-reputation crater that takes weeks to climb out of. Everything else in the system can degrade gracefully — email cannot. So the email subsystem is, by line count, smaller than the analytics layer; by design effort, it is the largest part of the codebase.

This post is the audit trail for every choice in it.


The EmailProvider interface

The whole subsystem hinges on one interface (server/worker/email_worker.go:26-202):

type EmailProvider interface {
    Name() string
    Send(ctx context.Context, msg *EmailMessage) error
    RateLimits() []RateLimitConfig
}
 
type RateLimitConfig struct {
    Limit    int
    Duration time.Duration
}

Three implementations live behind it:

| Implementation | Wraps | Stated cap | Latency p50 |
| --- | --- | --- | --- |
| AWSEmailProvider | AWS SES SDK v2 | Negotiated quota | ~150ms |
| AzureEmailProvider | Azure Communication Services SDK | 10/min | ~250ms |
| HTTPEmailProvider | Custom Cloudflare Worker (azure-email-worker.monzim.workers.dev) | 10/min | ~100ms |

The RateLimits() method is the contract. A provider that lies about its limits is a provider that gets the system rate-limited and then asks for forgiveness. So I treat the declared limit as ground truth and enforce it client-side.
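
To make the contract concrete, here is a minimal sketch of a provider that satisfies the interface. `stdoutProvider` and the fields of `EmailMessage` are hypothetical stand-ins (the real struct is not shown in this post); only the interface and `RateLimitConfig` are taken from above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// EmailMessage is a hypothetical stand-in; the real struct is not shown in the post.
type EmailMessage struct {
	To, Subject, HTMLBody string
}

type RateLimitConfig struct {
	Limit    int
	Duration time.Duration
}

type EmailProvider interface {
	Name() string
	Send(ctx context.Context, msg *EmailMessage) error
	RateLimits() []RateLimitConfig
}

// stdoutProvider is an illustrative provider that "sends" by printing.
type stdoutProvider struct{}

// Compile-time check that stdoutProvider satisfies the interface.
var _ EmailProvider = stdoutProvider{}

func (stdoutProvider) Name() string { return "stdout" }

func (stdoutProvider) Send(ctx context.Context, msg *EmailMessage) error {
	if err := ctx.Err(); err != nil {
		return err // respect cancellation before doing any work
	}
	fmt.Printf("to=%s subject=%q\n", msg.To, msg.Subject)
	return nil
}

// RateLimits declares the cap that the dispatcher enforces client-side.
func (stdoutProvider) RateLimits() []RateLimitConfig {
	return []RateLimitConfig{{Limit: 10, Duration: time.Minute}}
}

func main() {
	p := stdoutProvider{}
	_ = p.Send(context.Background(), &EmailMessage{To: "student@example.com", Subject: "Exam routine"})
	fmt.Println(p.Name(), p.RateLimits()[0].Limit)
}
```

The `var _ EmailProvider = ...` line is a cheap guard: a provider that drifts from the interface fails at compile time rather than at the first send.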


The rate-limit math

For each provider, the worker maintains a Redis sorted set keyed by provider name. Each successful send appends the current timestamp as both score and member; the dispatcher counts members within the trailing window before issuing a send.

In rough pseudo-Go:

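// canSend reports whether p is under every limit it declares.
// All windows share one key, which is exact for a single-window provider;
// a provider with several windows would need one key per window.
func (s *EmailService) canSend(ctx context.Context, p EmailProvider) bool {
    key := "ratelimit:" + p.Name()
    for _, lim := range p.RateLimits() {
        cutoff := time.Now().Add(-lim.Duration).UnixNano()
        // Drop sends that have aged out of the trailing window.
        if err := rdb.ZRemRangeByScore(ctx, key, "0", strconv.FormatInt(cutoff, 10)).Err(); err != nil {
            return false // fail closed: if Redis is unreachable, do not send
        }
        count, err := rdb.ZCard(ctx, key).Result()
        if err != nil || int(count) >= lim.Limit {
            return false
        }
    }
    return true
}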

Two properties matter:

  1. The window slides. A static counter that resets every minute is gameable — 10 sends at :59, 10 more at :00, congratulations, you just sent 20/minute. The sorted-set approach measures the last 60 seconds at the moment of the check.
  2. The check is atomic enough. ZRemRangeByScore, ZCard, and ZADD run as separate commands, so two dispatchers racing through them can both pass the check and overshoot the limit by a small N. For a system with a Limit=10/min cap and one worker, that variance is irrelevant. For a system 100x bigger I'd move it into a single Lua script.
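
The boundary-burst argument is easy to check without Redis. The sketch below is an in-memory stand-in for the sorted set, not the production limiter: `slidingWindow` and `admittedAcrossBoundary` are illustrative names, and the check and the record are folded into one `allow` call for brevity.

```go
package main

import (
	"fmt"
	"time"
)

// slidingWindow keeps the timestamps of recent sends, oldest first.
type slidingWindow struct {
	limit  int
	window time.Duration
	sends  []time.Time
}

// allow trims entries older than the window, then admits the send
// only if fewer than limit sends remain inside it.
func (w *slidingWindow) allow(now time.Time) bool {
	cutoff := now.Add(-w.window)
	i := 0
	for i < len(w.sends) && !w.sends[i].After(cutoff) {
		i++ // same role as ZREMRANGEBYSCORE: drop aged-out entries
	}
	w.sends = w.sends[i:]
	if len(w.sends) >= w.limit {
		return false
	}
	w.sends = append(w.sends, now) // same role as ZADD after a send
	return true
}

// admittedAcrossBoundary fires `burst` sends at :59 and `burst` more one
// second later, and returns how many a per-minute sliding window admits.
func admittedAcrossBoundary(limit, burst int) int {
	w := &slidingWindow{limit: limit, window: time.Minute}
	first := time.Date(2024, 1, 1, 9, 0, 59, 0, time.UTC)
	admitted := 0
	for _, at := range []time.Time{first, first.Add(time.Second)} {
		for i := 0; i < burst; i++ {
			if w.allow(at) {
				admitted++
			}
		}
	}
	return admitted
}

func main() {
	fmt.Println(admittedAcrossBoundary(10, 10)) // prints 10
}
```

A counter that resets on the minute would admit all 20 of those sends; the sliding window admits 10 and rejects the second burst.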

When canSend() returns false for the active provider, the dispatcher iterates to the next provider in the chain. When all providers say no, the message stays on the EmailQueue table for the next 30-second tick of ProcessEmailQueue (email_worker.go:32-48).


The failover decision

Failover triggers on three signatures:

| Signature | Source | Action |
| --- | --- | --- |
| Local rate-limit hit | canSend() == false | Try next provider, do not log a failure |
| Provider 4xx (429, 403) | Send() error | Try next provider, log EmailEvent(failed, reason) |
| Provider 5xx / network | Send() error | Try next provider, log + Discord alert if 3 in 5min |

When the chain exhausts, the EmailMessage row's Status flips to failed (not bounced — those are different) and the row stays eligible for the next tick. Asynq's MaxRetry(2) then governs the upper bound on attempts. Beyond that, the row is parked and surfaced on the operator dashboard's "stuck queue" panel.
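
The walk over the chain can be sketched as a short loop. This is an illustration of the rules in the table, not the real dispatcher: `dispatch`, `fake`, the trimmed `provider` interface, and the provider names are all hypothetical, and the Discord alert threshold is left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type EmailMessage struct{ To string } // hypothetical minimal message

// provider is a trimmed-down view of EmailProvider for this sketch.
type provider interface {
	Name() string
	Send(ctx context.Context, msg *EmailMessage) error
}

var errChainExhausted = errors.New("all providers rate-limited or failing")

// dispatch walks the chain in order. A local rate-limit hit is skipped
// silently; a Send error is logged and the next provider is tried.
func dispatch(ctx context.Context, chain []provider, canSend func(provider) bool,
	msg *EmailMessage, logFailure func(name string, err error)) (string, error) {
	for _, p := range chain {
		if !canSend(p) {
			continue // local limit: not a failure, nothing to log
		}
		if err := p.Send(ctx, msg); err != nil {
			logFailure(p.Name(), err)
			continue
		}
		return p.Name(), nil
	}
	return "", errChainExhausted
}

// fake is a test double whose Send returns a fixed error (or nil).
type fake struct {
	name string
	err  error
}

func (f fake) Name() string                              { return f.name }
func (f fake) Send(context.Context, *EmailMessage) error { return f.err }

// demo runs a chain where "ses" is rate-limited locally and "azure" returns a 5xx.
func demo() (sentBy string, failures []string) {
	chain := []provider{
		fake{name: "ses"},
		fake{name: "azure", err: errors.New("503 service unavailable")},
		fake{name: "worker"},
	}
	canSend := func(p provider) bool { return p.Name() != "ses" }
	logFailure := func(name string, err error) { failures = append(failures, name) }
	sentBy, _ = dispatch(context.Background(), chain, canSend, &EmailMessage{To: "a@example.com"}, logFailure)
	return sentBy, failures
}

func main() {
	sentBy, failures := demo()
	fmt.Println(sentBy, failures) // prints: worker [azure]
}
```

Only the provider that actually returned an error shows up in the failure log; the rate-limited one is skipped without a trace, which keeps the failure metrics honest.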

The single most important property: the chain order is data, not code. Re-ordering providers does not require a redeploy — it's a setting in the live-reload SettingsService. When AWS SES throttles me during exam week, I can demote it to tertiary from my phone.
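
A sketch of what "order is data" can look like in practice, assuming the setting is a plain list of provider names. `orderChain` and the names `ses`, `azure`, and `worker` are hypothetical; the real SettingsService shape is not shown in this post.

```go
package main

import "fmt"

// orderChain returns the registered provider names in the order given by the
// setting. Unknown names are skipped, and registered providers missing from
// the setting are appended at the end so a typo cannot drop a provider.
func orderChain(setting []string, registered []string) []string {
	known := make(map[string]bool, len(registered))
	for _, name := range registered {
		known[name] = true
	}
	seen := make(map[string]bool, len(registered))
	var chain []string
	for _, name := range setting {
		if known[name] && !seen[name] {
			chain = append(chain, name)
			seen[name] = true
		}
	}
	for _, name := range registered {
		if !seen[name] {
			chain = append(chain, name)
		}
	}
	return chain
}

func main() {
	registered := []string{"ses", "azure", "worker"}
	// Demote "ses" to tertiary by editing the setting, not the code.
	fmt.Println(orderChain([]string{"azure", "worker", "ses"}, registered))
}
```

Appending the leftovers is a deliberate choice for this sketch: a mistyped setting degrades to "wrong order" rather than "provider silently missing".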


Bounces — the hidden tax

Bounces are the part of email infrastructure that nobody talks about until it bites them. EduQueue handles them at /api/webhooks/email-bounce, which routes through tracking.IsHardBounceError() (server/tracking/tracking.go:291-368):

var hardBouncePatterns = []*regexp.Regexp{
    regexp.MustCompile(`(?i)\b550\b`),
    regexp.MustCompile(`(?i)mailbox not found`),
    regexp.MustCompile(`(?i)no such user`),
    regexp.MustCompile(`(?i)recipient address rejected`),
    // ...
}
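
A minimal sketch of how such a pattern list is applied, using only the four patterns shown above (the elided ones are not guessed at). `isHardBounce` is an illustrative name, not the real `tracking.IsHardBounceError()`.

```go
package main

import (
	"fmt"
	"regexp"
)

var hardBouncePatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\b550\b`),
	regexp.MustCompile(`(?i)mailbox not found`),
	regexp.MustCompile(`(?i)no such user`),
	regexp.MustCompile(`(?i)recipient address rejected`),
}

// isHardBounce reports whether the provider's error text matches any pattern.
func isHardBounce(reason string) bool {
	for _, re := range hardBouncePatterns {
		if re.MatchString(reason) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isHardBounce("550 5.1.1 User unknown")) // true
	fmt.Println(isHardBounce("452 4.2.2 Mailbox full")) // false: a soft bounce
	fmt.Println(isHardBounce("queued as 5501"))         // false: \b keeps 550 from matching inside 5501
}
```

The word boundaries around `550` matter: without them, any queue ID or timestamp containing those digits would trigger an unsubscribe.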

A hard bounce kicks AutoUnsubscribe() (tracking.go:319-368):

  1. Subscription deactivation. Subscription.Active = false, BounceReason = "<source>:<reason>", UnsubscribedAt = now().
  2. Blocklist insert. A row in EduQueueBlockedStudent so the next WHERE clause in email:bulk excludes them at the SQL layer.
  3. Event emission. An EmailEvent row of type bounced for the analytics path.
  4. Idempotency stamp. AlreadyDone = true so the same payload, replayed, becomes a no-op.

That last step is non-negotiable. Email infrastructure replays webhooks. A non-idempotent unsubscribe handler will eventually re-block a re-subscribed user. The AlreadyDone flag makes the operation idempotent: the same payload produces the same end state no matter how many times it arrives.
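
The replay scenario can be sketched with an in-memory store. Everything here is an assumption for illustration: the `store` type, keying idempotency on a payload ID, and returning `AlreadyDone` on a result struct are guesses at the shape, and a real implementation would run the four steps in one database transaction.

```go
package main

import "fmt"

// store is an in-memory stand-in for the tables named above.
type store struct {
	active    map[string]bool // Subscription.Active by student ID
	blocked   map[string]bool // EduQueueBlockedStudent
	events    []string        // EmailEvent rows of type bounced
	processed map[string]bool // webhook payload IDs already handled
}

type result struct{ AlreadyDone bool }

func newStore() *store {
	return &store{
		active:    map[string]bool{},
		blocked:   map[string]bool{},
		processed: map[string]bool{},
	}
}

// autoUnsubscribe applies the four steps once per payload ID. A replay of
// the same payload returns AlreadyDone without touching any state.
func (s *store) autoUnsubscribe(payloadID, studentID, reason string) result {
	if s.processed[payloadID] {
		return result{AlreadyDone: true}
	}
	s.active[studentID] = false
	s.blocked[studentID] = true
	s.events = append(s.events, "bounced:"+studentID+":"+reason)
	s.processed[payloadID] = true
	return result{}
}

// resubscribe models a student opting back in after fixing their mailbox.
func (s *store) resubscribe(studentID string) {
	s.active[studentID] = true
	delete(s.blocked, studentID)
}

func main() {
	s := newStore()
	s.autoUnsubscribe("evt-1", "stu-42", "ses:550 no such user")
	s.resubscribe("stu-42")
	r := s.autoUnsubscribe("evt-1", "stu-42", "ses:550 no such user") // webhook replay
	fmt.Println(r.AlreadyDone, s.active["stu-42"], len(s.events))     // prints: true true 1
}
```

Without the `processed` check, the replay in `main` would deactivate and re-block a student who had already opted back in.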

The blocklist is checked at the SQL layer rather than in application code because I cannot trust that every code path that sends email knows about the blocklist. The constraint goes in the schema, the query joins through it, and there is no way to forget.


Open & click tracking

EduQueue tracks opens with a 1×1 transparent GIF and clicks with rewritten redirects. The implementation lives in tracking.InjectHTML() (server/tracking/tracking.go:156-209).

The open pixel

<img
  src="https://api.eduqueue.monzim.com/api/t/o/{messageID}.gif"
  width="1"
  height="1"
  border="0"
  style="display: block; border: none; outline: none; text-decoration: none"
/>

The pixel is appended immediately before </body> in every outbound HTML email. The handler:

  1. Validates the messageID shape (UUID v4).
  2. Increments EmailMessage.OpenCount.
  3. Appends an EmailEvent(MessageID, opened, OccurredAt = now()).
  4. Returns the GIF with Cache-Control: no-store, no-cache, must-revalidate.

The cache header matters more than the rest combined. Without it, Gmail's image proxy can cache the GIF and serve repeat views from its own copy, so you record only the first open of each email instead of every open. This is the difference between an analytics layer that works and one that lies.
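
The four handler steps can be sketched with the standard library alone. This is not the real handler: `openPixel` and `probe` are illustrative names, `recordOpen` is a hypothetical hook standing in for the OpenCount increment and the EmailEvent insert, and the path parsing assumes the URL shape shown in the `<img>` tag.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
)

// pixelGIF is a minimal transparent 1x1 GIF89a.
var pixelGIF = []byte("GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff" +
	"\x21\xf9\x04\x01\x00\x00\x00\x00\x2c\x00\x00\x00\x00\x01\x00\x01\x00\x00" +
	"\x02\x02\x44\x01\x00\x3b")

// uuidV4 matches the textual form of a version-4 UUID.
var uuidV4 = regexp.MustCompile(`(?i)^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$`)

// openPixel returns a handler for /api/t/o/{messageID}.gif.
func openPixel(recordOpen func(messageID string)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := strings.TrimSuffix(strings.TrimPrefix(r.URL.Path, "/api/t/o/"), ".gif")
		if !uuidV4.MatchString(id) {
			http.NotFound(w, r) // reject malformed IDs before touching storage
			return
		}
		recordOpen(id)
		w.Header().Set("Content-Type", "image/gif")
		w.Header().Set("Cache-Control", "no-store, no-cache, must-revalidate")
		w.Write(pixelGIF)
	}
}

// probe issues one in-process GET and reports the status code, the
// Cache-Control header, and how many opens were recorded.
func probe(path string) (status int, cacheControl string, opens int) {
	h := openPixel(func(string) { opens++ })
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Header().Get("Cache-Control"), opens
}

func main() {
	status, cc, opens := probe("/api/t/o/3f2b8c1e-9d4a-4c7b-a1e2-5f6d7c8b9a0e.gif")
	fmt.Println(status, cc, opens)
}
```

Validating the ID shape first means a scanner hitting random paths never generates a database write.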

The click rewriter

Every outbound <a href="X"> is rewritten to:

/api/t/c/{messageID}?u={base64url(X)}

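The base64-url encoding is deliberate: it keeps the original URL from appearing in plain text in referrer headers, server logs, and CDN access lines that might index it, and it sidesteps query-string escaping problems. It is obfuscation rather than secrecy, since anyone holding the link can decode it.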

The redirect handler decodes the URL, validates scheme and host (tracking.go:237-257) against an allowlist to prevent open-redirect abuse, increments EmailMessage.ClickCount, appends an EmailEvent(clicked), then issues a 302 to the decoded URL.

Open-redirect validation is the kind of thing that's only obvious after you've seen a tracking endpoint weaponized in a phishing campaign. EduQueue's allowlist is conservative — only domains under monzim.com, eduqueue.monzim.com, and a small set of known partner domains.
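
A sketch of the encode, decode, and validate round trip. The function names are illustrative, unpadded base64url (`RawURLEncoding`) is an assumption about the wire format, and the allowlist here holds only `monzim.com` because the partner domains are not listed in this post.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// allowedSuffixes is a hypothetical allowlist; partner domains are omitted.
var allowedSuffixes = []string{"monzim.com"}

// encodeTarget produces the value of the u query parameter.
func encodeTarget(target string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(target))
}

// trackedLink builds the rewritten href for one original URL.
func trackedLink(messageID, target string) string {
	return "/api/t/c/" + messageID + "?u=" + encodeTarget(target)
}

// resolveTarget decodes the u parameter and rejects anything that is not an
// http(s) URL on an allowlisted host, which is what blocks open redirects.
func resolveTarget(u string) (string, error) {
	raw, err := base64.RawURLEncoding.DecodeString(u)
	if err != nil {
		return "", err
	}
	parsed, err := url.Parse(string(raw))
	if err != nil {
		return "", err
	}
	if parsed.Scheme != "https" && parsed.Scheme != "http" {
		return "", errors.New("scheme not allowed")
	}
	host := strings.ToLower(parsed.Hostname())
	for _, s := range allowedSuffixes {
		// Match the domain itself or a true subdomain, never "evilmonzim.com".
		if host == s || strings.HasSuffix(host, "."+s) {
			return parsed.String(), nil
		}
	}
	return "", errors.New("host not allowed")
}

func main() {
	fmt.Println(trackedLink("msg-1", "https://eduqueue.monzim.com/routine"))
	fmt.Println(resolveTarget(encodeTarget("https://eduqueue.monzim.com/routine")))
	fmt.Println(resolveTarget(encodeTarget("https://evilmonzim.com/login")))
}
```

The suffix check compares against `"." + domain` on purpose. A bare `HasSuffix(host, "monzim.com")` would wave through lookalike registrations.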


Templates, versioning, and promotion slots

Templates live in EmailTemplate (server/models/email_templates.go:16-30):

| Column | Purpose |
| --- | --- |
| Slug | Stable identifier (e.g. exam_routine, welcome_email) |
| Version | Monotonic per slug |
| IsActive | Exactly one row per slug carries true; older versions retained for rollback |
| HTMLBody / TextBody | Rendered with Go's html/template at send time |
| PromotionSlots | JSONB; lets ops drop in a banner or footer block per send without touching the template |

A (slug, version) unique index is what makes the rollback story safe. To revert a bad template, the admin console flips IsActive from version N to version N-1 — no migration, no redeploy, no editing in production.
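
The rollback flip can be sketched over an in-memory slice of rows. `templateRow`, `rollback`, and `activeVersion` are illustrative names; against a real database the two updates would sit inside one transaction so there is never a moment with zero or two active rows.

```go
package main

import (
	"errors"
	"fmt"
)

type templateRow struct {
	Slug     string
	Version  int
	IsActive bool
}

// rollback flips IsActive from the current version N of slug to N-1.
// It changes nothing unless both rows exist.
func rollback(rows []templateRow, slug string) error {
	active, previous := -1, -1
	for i, r := range rows {
		if r.Slug == slug && r.IsActive {
			active = i
		}
	}
	if active < 0 {
		return errors.New("no active version for " + slug)
	}
	for i, r := range rows {
		if r.Slug == slug && r.Version == rows[active].Version-1 {
			previous = i
		}
	}
	if previous < 0 {
		return errors.New("no earlier version to roll back to")
	}
	rows[active].IsActive = false
	rows[previous].IsActive = true
	return nil
}

// activeVersion returns the active version for slug, or 0 if none is active.
func activeVersion(rows []templateRow, slug string) int {
	for _, r := range rows {
		if r.Slug == slug && r.IsActive {
			return r.Version
		}
	}
	return 0
}

func main() {
	rows := []templateRow{
		{Slug: "exam_routine", Version: 1},
		{Slug: "exam_routine", Version: 2, IsActive: true},
		{Slug: "welcome_email", Version: 1, IsActive: true},
	}
	fmt.Println(rollback(rows, "exam_routine"), activeVersion(rows, "exam_routine"))
}
```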


Campaigns vs. routines

Two send modes share the same pipeline but discriminate on EmailCampaign.Type:

| Mode | Type | Lifecycle | When body is built |
| --- | --- | --- | --- |
| Routine send | "exam_routine" | Auto-generated per trimester | At pdf:generate task time, per-recipient |
| Custom campaign | "custom" | draft → scheduled → sending → sent | Admin pre-renders Subject/HTMLBody/TextBody before queue |

The lifecycle distinction is the operationally important one. A custom campaign in draft is editable; in scheduled it has a timestamp and the worker won't pick it up early; in sending the queue is draining; in sent it's frozen for analytics. The transitions are gated server-side — the UI never owns campaign state.
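
Server-side gating of the custom-campaign lifecycle can be sketched as a small transition table. This assumes transitions only move forward along the chain in the table above; whether the real system allows, say, scheduled back to draft is not stated, and `transition` and `editable` are illustrative names.

```go
package main

import "fmt"

type status string

const (
	draft     status = "draft"
	scheduled status = "scheduled"
	sending   status = "sending"
	sent      status = "sent"
)

// next holds the only legal successor of each state (forward-only assumption).
var next = map[status]status{
	draft:     scheduled,
	scheduled: sending,
	sending:   sent,
}

// transition returns an error unless `to` is the legal successor of `from`.
func transition(from, to status) error {
	if want, ok := next[from]; !ok || want != to {
		return fmt.Errorf("illegal transition %s -> %s", from, to)
	}
	return nil
}

// editable reports whether campaign content may still be changed.
func editable(s status) bool { return s == draft }

func main() {
	fmt.Println(transition(draft, scheduled)) // <nil>
	fmt.Println(transition(draft, sending))   // illegal transition draft -> sending
	fmt.Println(transition(sent, draft))      // illegal transition sent -> draft
}
```

Because the table lives on the server, a stale or tampered UI can request any state it likes and still cannot skip a step.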


Cooldowns

A subscriber can request the same routine multiple times. Without a cooldown, every retry is a fresh PDF render and a fresh email. The cooldown lives at the row level:

SubscriptionEmail(StudentID, ExamName, Type, TriggerAt)

Bulk send queries skip recipients whose latest TriggerAt is within EMAIL_COOLDOWN_HOURS (default 24). The cooldown is a send-side check, not a request-side check — so a hammering retry loop is absorbed at the worker, not bounced at the API.
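
The send-side check reduces to a timestamp comparison. The sketch below does it in memory rather than in SQL; `recipient`, `eligible`, and `parseCooldown` are illustrative names, and falling back to 24 hours on a missing or invalid value is an assumption about how EMAIL_COOLDOWN_HOURS is read.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// parseCooldown turns the raw EMAIL_COOLDOWN_HOURS value into a duration,
// falling back to the 24-hour default when it is empty or invalid.
func parseCooldown(raw string) time.Duration {
	hours, err := strconv.Atoi(raw)
	if err != nil || hours <= 0 {
		return 24 * time.Hour
	}
	return time.Duration(hours) * time.Hour
}

type recipient struct {
	StudentID     string
	LastTriggerAt time.Time // zero value means "never sent"
}

// eligible returns the students whose latest send is outside the cooldown.
func eligible(rs []recipient, now time.Time, cooldown time.Duration) []string {
	var out []string
	for _, r := range rs {
		if r.LastTriggerAt.IsZero() || now.Sub(r.LastTriggerAt) >= cooldown {
			out = append(out, r.StudentID)
		}
	}
	return out
}

// demo applies the default cooldown to three recipients.
func demo() []string {
	now := time.Date(2024, 1, 2, 12, 0, 0, 0, time.UTC)
	rs := []recipient{
		{StudentID: "stu-1", LastTriggerAt: now.Add(-2 * time.Hour)},  // inside cooldown
		{StudentID: "stu-2", LastTriggerAt: now.Add(-30 * time.Hour)}, // outside cooldown
		{StudentID: "stu-3"},                                          // never sent
	}
	return eligible(rs, now, parseCooldown(""))
}

func main() {
	fmt.Println(demo()) // prints: [stu-2 stu-3]
}
```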


What I'd change

Three things this subsystem will need before the next 10x of users:

  1. Lua-scripted rate limiter. The current sorted-set check has a small race window. A single-script ZREMRANGEBYSCORE + ZCARD + ZADD makes it strictly correct.
  2. Provider health scores. Right now failover is binary per-attempt. A weighted EWMA of recent error rates would let me down-rank a flaky provider for an hour without yanking it from the chain.
  3. Per-template warm-up. New templates send to a small canary set first to catch render bugs before a 10k-recipient blast.
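
As a sketch of the second item, an EWMA error rate fits in a few lines. Nothing like this exists in the current codebase; `health`, `observe`, `rank`, and the provider names are hypothetical, and this version decays per observation rather than per unit of time, so "for an hour" would need a time-based variant.

```go
package main

import (
	"fmt"
	"sort"
)

// health tracks a smoothed error rate for one provider.
type health struct {
	alpha   float64 // weight of the newest observation, 0 < alpha <= 1
	errRate float64 // smoothed error rate in [0, 1]
}

// observe folds one send result into the moving average.
func (h *health) observe(failed bool) {
	x := 0.0
	if failed {
		x = 1.0
	}
	h.errRate = h.alpha*x + (1-h.alpha)*h.errRate
}

// rank returns the chain sorted by ascending error rate. The sort is stable,
// so providers with equal scores keep their configured order.
func rank(chain []string, errRate map[string]float64) []string {
	out := append([]string(nil), chain...)
	sort.SliceStable(out, func(i, j int) bool {
		return errRate[out[i]] < errRate[out[j]]
	})
	return out
}

func main() {
	h := &health{alpha: 0.5}
	for _, failed := range []bool{true, false, true} {
		h.observe(failed)
	}
	fmt.Println(h.errRate) // prints 0.625

	scores := map[string]float64{"ses": h.errRate, "azure": 0, "worker": 0.25}
	fmt.Println(rank([]string{"ses", "azure", "worker"}, scores)) // prints: [azure worker ses]
}
```

The flaky provider drops to the back of the chain but stays in it, and a run of successes pulls its score back down without any manual reset.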

None of these are urgent. All of them will become urgent the day they're not done.


Part of the UIU EduQueue platform case study.

Tags:
Go
Email
AWS SES
Azure Communication Services
Cloudflare Workers
Redis
Rate Limiting
Webhooks
Idempotency
Tracking Pixels
Platform Engineering
Deep Dive
EduQueue
Azraf Al Monzim
