Context
EduQueue runs the email-routine and notification path for UIU's student platform. Outbound email is the single most visible failure mode — a missed routine is a missed exam day. Relying on one provider has bitten us twice: an AWS SES throughput-quota cap during enrollment week, and an Azure regional outage.
Decision
Multi-provider failover chain, ordered by cost-then-latency:
- AWS SES — primary. Cheapest, fastest for our region, well-instrumented.
- Azure Communication Services — secondary. Different control plane, different DNS, different physical region.
- Custom Cloudflare Worker relay — tertiary. Synthetic HTTP relay we control end-to-end, sized to handle critical-path-only traffic if the first two are simultaneously degraded.
Every send is wrapped in an Asynq task whose retry count is tied to the chain depth. The append-only EmailEvent table records every attempt, every provider, and every status, so we can replay the chain after the fact.
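One way to read "tied to the chain depth" is one attempt per provider, with the attempt number selecting the provider. The sketch below assumes that reading and models the append-only log in memory. The field names on `EmailEvent`, the `EventLog` class, the `CHAIN` names, and both helper functions are hypothetical. How the attempt number is obtained from the queue is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical provider identifiers, in chain order.
CHAIN = ("aws_ses", "azure_acs", "cf_worker_relay")
# Assumption: one attempt per provider, so retries = chain depth - 1.
MAX_RETRIES = len(CHAIN) - 1


@dataclass(frozen=True)
class EmailEvent:
    message_id: str
    attempt: int   # 0-based attempt number
    provider: str
    status: str    # "delivered" or "failed" in this sketch


class EventLog:
    """In-memory stand-in for an append-only table: rows are never updated."""

    def __init__(self) -> None:
        self._rows: List[EmailEvent] = []

    def append(self, event: EmailEvent) -> None:
        self._rows.append(event)

    def for_message(self, message_id: str) -> List[EmailEvent]:
        return [e for e in self._rows if e.message_id == message_id]


def provider_for_attempt(attempt: int) -> str:
    """Map an attempt number onto the chain; reject attempts past its depth."""
    if not 0 <= attempt <= MAX_RETRIES:
        raise ValueError(f"attempt {attempt} exceeds chain depth {len(CHAIN)}")
    return CHAIN[attempt]


def delivered_by(log: EventLog, message_id: str) -> Optional[str]:
    """Replay the recorded attempts and report which provider delivered."""
    for event in log.for_message(message_id):
        if event.status == "delivered":
            return event.provider
    return None
```

Because rows are only ever appended, the full attempt history for a message stays available, including the failed attempts that preceded the delivery.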
Consequences
Positive:
- Provider outages stay invisible to students. Recovery takes seconds and does not require a deploy.
- Replayability: we know exactly which provider delivered which message and can prove it without a vendor support ticket.
- Cost stays close to the AWS-SES-only baseline because the fallbacks only carry traffic during incidents.
Negative:
- SPF, DKIM, and DMARC must stay aligned across three sender domains, which adds calendar overhead.
- The CF Worker relay is custom code we own. It needs its own observability and rotation, not just provider dashboards.
- The fallback chain hides slow-but-not-failed providers. We added a per-provider latency budget on top to surface those.
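A latency budget check of the kind described in the last point could look like the sketch below. The budget values, the `timed_send` function, and the injectable `clock` parameter are hypothetical, not the values or names used in EduQueue. The point is that a send can succeed and still be flagged.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical per-provider budgets in seconds; real values are not in this record.
LATENCY_BUDGET_SECONDS: Dict[str, float] = {
    "aws_ses": 2.0,
    "azure_acs": 3.0,
    "cf_worker_relay": 5.0,
}


def timed_send(
    provider: str,
    send: Callable[[], bool],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, float, bool]:
    """Run one send and return (ok, elapsed_seconds, over_budget).

    A successful but slow send returns ok=True and over_budget=True, so the
    caller can surface it even though no failover happened.
    """
    start = clock()
    ok = send()
    elapsed = clock() - start
    return ok, elapsed, elapsed > LATENCY_BUDGET_SECONDS[provider]


# Usage with a fake clock: the send "takes" 3 seconds against a 2 second budget.
fake_clock = iter([0.0, 3.0]).__next__
ok, elapsed, over_budget = timed_send("aws_ses", lambda: True, clock=fake_clock)
```

Injecting the clock keeps the check testable without real delays. In production the default monotonic clock would be used.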