ADR-0002 — EduQueue — multi-provider email failover (AWS SES → Azure CS → CF Worker)

Context

EduQueue runs the email-routine and notification path for UIU's student platform. Outbound email is the single most visible failure mode — a missed routine is a missed exam day. Relying on one provider has bitten us twice: an AWS SES throughput-quota cap during enrollment week, and an Azure regional outage.

Decision

Multi-provider failover chain, ordered by cost-then-latency:

AWS SES — primary. Cheapest, fastest for our region, well-instrumented.
Azure Communication Services — secondary. Different control plane, different DNS, different physical region.
Custom Cloudflare Worker relay — tertiary. Synthetic HTTP relay we control end-to-end, sized to handle critical-path-only traffic if the first two are simultaneously degraded.

Every send is wrapped in an Asynq task whose retry count is tied to the chain depth. The append-only EmailEvent table records every attempt, provider, and status, so we can replay the chain after the fact.

Consequences

Positive:

Provider outages stay invisible to students. Recovery takes seconds and does not require a deploy.
Replayability: we know exactly which provider delivered which message and can prove it without a vendor support ticket.
Cost stays close to the AWS-SES-only baseline because the fallbacks only carry traffic during incidents.

Negative:

Three sender domains whose SPF / DKIM / DMARC records must stay aligned, which adds calendar overhead.
The CF Worker relay is custom code that we own. It needs its own observability and rotation, not just provider dashboards.
The fallback chain hides slow-but-not-failed providers. We added a per-provider latency budget on top to surface those.