ADR-0002 · 2026-04 · Accepted

EduQueue — multi-provider email failover (AWS SES → Azure CS → CF Worker)

Related project: /projects/uiu-eduqueue-platform

Context

EduQueue runs the email-routine and notification path for UIU's student platform. Outbound email is the single most visible failure mode: a missed routine is a missed exam day. Relying on a single provider has bitten us twice, once when AWS SES hit its throughput-quota cap during enrollment week and once during an Azure regional outage.

Decision

We send through a multi-provider failover chain, ordered by cost first and latency second:

  1. AWS SES (primary). Cheapest, fastest for our region, well-instrumented.
  2. Azure Communication Services (secondary). Different control plane, different DNS, different physical region.
  3. Custom Cloudflare Worker relay (tertiary). An HTTP relay we control end to end, sized to carry only critical-path traffic when the first two providers are degraded at the same time.
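
The ordering above can be sketched as a loop that walks the chain and stops at the first provider that accepts the message. This is a minimal illustration, not the production code: the `Message` and `Provider` types, `ErrProviderDown`, and the provider names are assumptions introduced for the example, and the stub send functions stand in for real SDK or HTTP calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical outbound email; the real payload shape is not given in this ADR.
type Message struct {
	To, Subject, Body string
}

// Provider is an assumed abstraction over one link in the failover chain.
type Provider struct {
	Name string
	Send func(Message) error
}

// ErrProviderDown stands in for any provider-side failure (quota cap, regional outage).
var ErrProviderDown = errors.New("provider unavailable")

// SendWithFailover tries each provider in chain order and returns the name of
// the first one that accepts the message. If every provider fails, it returns
// all collected errors joined together.
func SendWithFailover(chain []Provider, msg Message) (string, error) {
	if len(chain) == 0 {
		return "", errors.New("empty provider chain")
	}
	var errs []error
	for _, p := range chain {
		if err := p.Send(msg); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", p.Name, err))
			continue // fall through to the next provider
		}
		return p.Name, nil
	}
	return "", errors.Join(errs...)
}

func main() {
	down := func(Message) error { return ErrProviderDown }
	ok := func(Message) error { return nil }

	// Simulate an SES incident: the primary fails, the secondary accepts.
	chain := []Provider{
		{Name: "aws-ses", Send: down},
		{Name: "azure-cs", Send: ok},
		{Name: "cf-worker", Send: ok},
	}
	used, err := SendWithFailover(chain, Message{To: "student@example.edu", Subject: "Exam routine"})
	fmt.Println(used, err)
}
```

Keeping the chain as an ordered slice means reordering or adding a provider is a data change rather than a control-flow change.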

Every send is wrapped in an Asynq task whose retries are bounded by the chain depth. The append-only EmailEvent table records every attempt, provider, and status, so we can replay the chain after the fact.
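
As a rough sketch of what "append-only" and "replay" could look like, the example below models the EmailEvent table as an in-memory log that only supports appending and reading. The field names, the `Status` values, and the `DeliveredBy` helper are assumptions made for illustration; the real schema is not given in this ADR, and the Asynq task wiring is left out so the example stays standard-library only.

```go
package main

import (
	"fmt"
	"time"
)

// Status is an assumed enumeration of attempt outcomes.
type Status string

const (
	StatusFailed    Status = "failed"
	StatusDelivered Status = "delivered"
)

// EmailEvent is one row of the append-only table; field names are hypothetical.
type EmailEvent struct {
	MessageID string
	Attempt   int
	Provider  string
	Status    Status
	At        time.Time
}

// EventLog is an in-memory stand-in for the table. It exposes Append and
// read-only queries, and deliberately has no update or delete method.
type EventLog struct {
	events []EmailEvent
}

// Append records one send attempt.
func (l *EventLog) Append(e EmailEvent) {
	l.events = append(l.events, e)
}

// DeliveredBy replays the log for one message and reports which provider
// delivered it and how many attempts were recorded up to that point.
func (l *EventLog) DeliveredBy(messageID string) (provider string, attempts int, ok bool) {
	for _, e := range l.events {
		if e.MessageID != messageID {
			continue
		}
		attempts++
		if e.Status == StatusDelivered {
			return e.Provider, attempts, true
		}
	}
	return "", attempts, false
}

func main() {
	log := &EventLog{}
	now := time.Now()
	log.Append(EmailEvent{MessageID: "m-1", Attempt: 1, Provider: "aws-ses", Status: StatusFailed, At: now})
	log.Append(EmailEvent{MessageID: "m-1", Attempt: 2, Provider: "azure-cs", Status: StatusDelivered, At: now})

	provider, attempts, ok := log.DeliveredBy("m-1")
	fmt.Println(provider, attempts, ok)
}
```

Because failed attempts are kept rather than overwritten, the same log answers both "who delivered this message" and "which providers failed first".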

Consequences

Positive:

  • Provider outages stay invisible to students. Failover completes in seconds and does not need a deploy.
  • Replayability: we know exactly which provider delivered which message and can prove it without a vendor support ticket.
  • Cost stays close to the AWS-SES-only baseline because the fallbacks only carry traffic during incidents.

Negative:

  • SPF, DKIM, and DMARC must stay aligned across three sender domains, which adds recurring calendar overhead.
  • The CF Worker relay is a custom component we own. It needs its own observability and rotation, not just provider dashboards.
  • The fallback chain can hide providers that are slow but not failing. We added a per-provider latency budget on top to surface them.
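
One way to picture the per-provider latency budget is a wrapper that times each send and flags results that exceed the provider's budget, whether or not the send succeeded. The sketch below assumes this measurement-based interpretation; the `Budgets` values, the `SendResult` type, and the `Classify` and `TimedSend` helpers are all hypothetical and are not taken from the EduQueue codebase.

```go
package main

import (
	"fmt"
	"time"
)

// Budgets holds illustrative per-provider latency budgets.
// The real values are not stated in this ADR.
var Budgets = map[string]time.Duration{
	"aws-ses":   500 * time.Millisecond,
	"azure-cs":  800 * time.Millisecond,
	"cf-worker": 1500 * time.Millisecond,
}

// SendResult captures the outcome of one send plus its budget verdict.
type SendResult struct {
	Provider   string
	Elapsed    time.Duration
	Err        error
	OverBudget bool
}

// Classify marks a send as over budget when it took longer than the
// provider's budget. Providers without a configured budget are never flagged.
func Classify(provider string, elapsed time.Duration, err error) SendResult {
	budget, known := Budgets[provider]
	return SendResult{
		Provider:   provider,
		Elapsed:    elapsed,
		Err:        err,
		OverBudget: known && elapsed > budget,
	}
}

// TimedSend measures a single send call and classifies the result.
func TimedSend(provider string, send func() error) SendResult {
	start := time.Now()
	err := send()
	return Classify(provider, time.Since(start), err)
}

func main() {
	// A successful send that took 2s would pass the failover chain silently,
	// but the budget check flags it.
	slow := Classify("aws-ses", 2*time.Second, nil)
	fmt.Printf("%s err=%v overBudget=%v\n", slow.Provider, slow.Err, slow.OverBudget)

	fast := TimedSend("azure-cs", func() error { return nil })
	fmt.Printf("%s err=%v overBudget=%v\n", fast.Provider, fast.Err, fast.OverBudget)
}
```

Separating `Classify` from `TimedSend` keeps the budget rule testable without real clock time; the flagged results could then feed a metric or alert per provider.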