EduQueue Deep Dive: OTP-Authenticated Operator Console & Append-Only Email Analytics

Companion post to the UIU EduQueue platform overview. This is the long-form version of "how the operator console actually works" and "where the analytics numbers come from."

The operator console is the part of EduQueue I rely on most and the part most users never see. It's also the part where the platform-engineering decisions show up most clearly — every "should this be a YAML setting or a Redis pub/sub message" call gets made in this layer.

This post walks through the auth model, the analytics schema, the on-demand rollups, and the live-control surfaces that turn the dashboard from a viewer into a control plane.

Auth — passwordless, Redis-backed, single-use

There is no password anywhere in the operator console. The full login flow:

1. Operator hits  /admin-dashboard
2. Frontend POSTs /api/eduqueue/admin/request-otp
3. Server generates 6-digit OTP via crypto/rand
4. Server stores  eduqueue_otp:{otp} → "valid"  in Redis (TTL 10m)
5. Server posts the OTP to a Discord webhook  (out-of-band channel)
6. Operator copies OTP into the UI
7. Frontend POSTs /api/eduqueue/admin/verify-otp { otp }
8. Server DELs the OTP key (single-use), generates a session token
9. Server stores  eduqueue_session:{token} → "valid"  in Redis (TTL 24h)
10. Token returned to frontend, stored in localStorage, sent via X-Session-Token

OTP generation lives in eduqueue_admin.go:generateEduQueueOTP. The token is sha256(otp + nanoTimestamp) truncated and base64-encoded.

The constants:

Const	Value	Why
`eduQueueOTPPrefix`	`"eduqueue_otp:"`	Namespacing for safe Redis sharing
`eduQueueOTPTTL`	`10 * time.Minute`	Long enough for a copy-paste, short enough for security
`eduQueueSessionPrefix`	`"eduqueue_session:"`	Same namespacing rule
`eduQueueSessionTTL`	`24 * time.Hour`	One day of operator work, then re-auth

Why server-side sessions, not JWTs

I deliberately chose Redis sessions over JWTs:

Revocation is a DEL. No token denylist, no key-rotation game.
TTL is the source of truth. Redis evicts; I don't have to.
No secrets in the cookie. The token is opaque.
Server-side, not browser-side. A stolen token without DB access is useless.

The trade-off is one Redis round-trip per admin request. For a console that an operator hits a few hundred times per day, that's negligible. JWTs would matter at 100k req/sec. They do not matter at 100 req/sec.

Discord as the OTP channel

OTPs flow through a Discord webhook because:

The channel I check anyway. I'm in Discord all day.
Out of band from email. If email is broken, I can still get into the console to fix it.
Audit trail for free. Every OTP request shows up in #eduqueue-ops with timestamp, requester IP, and the 10-minute warning.

The webhook payload includes the requester IP so I can spot a wrong actor immediately.

The analytics schema — append-only events, on-demand rollups

The hardest decision in the analytics layer was whether to pre-aggregate. I chose not to. Here's the schema and the reasoning.

Table	Shape	Cardinality	Read pattern
`EmailMessage`	Per-send record (UUID, status, OpenCount, ClickCount, CampaignID)	1 per recipient × campaign	KPI denominators, drill-down
`EmailEvent`	Append-only (MessageID, EventType, OccurredAt)	10–100× `EmailMessage`	Time-series, funnel
`EmailCampaign`	Aggregate (TotalSent, TotalOpened, TotalClicked, TotalBounced)	1 per campaign	Overview cards
`EmailRateLimitTracker`	Per-provider window counter	Small (one row per provider × window)	Live operator console

The discipline is: EmailEvent is the only mutating writer for engagement signals. EmailMessage.OpenCount and EmailMessage.ClickCount are denormalized counters maintained by the same handler that writes the event row, in the same transaction. No background job recomputes them. That keeps the read-side fast and the write-side simple.

The query layer

KPIs reduce to grouped SUMs against EmailCampaign:

SELECT
  campaign_id,
  SUM(total_sent)   AS sent,
  SUM(total_opened) AS opens,
  SUM(total_clicked) AS clicks,
  SUM(total_bounced) AS bounces
FROM email_campaigns
WHERE created_at > now() - interval '30 days'
GROUP BY campaign_id;

Drill-down for a single campaign joins EmailMessage to EmailEvent:

SELECT date_trunc('hour', e.occurred_at) AS hour,
       e.event_type,
       count(*) AS n
FROM email_events e
JOIN email_messages m ON m.id = e.message_id
WHERE m.campaign_id = $1
GROUP BY hour, e.event_type
ORDER BY hour;

These are not pre-aggregated. With covering indexes on (campaign_id, occurred_at) and (message_id, event_type), the queries return in tens of milliseconds at current cardinality. The upgrade path when this stops being true is a MATERIALIZED VIEW refreshed by a cron task — not a rewrite.

Why on-demand wins at this scale

Pre-aggregation is the classic "cleverness debt." It costs you:

An extra writer. Now every event hits two places.
A reconciliation problem. If the rollup drifts, which side is right?
Schema rigidity. The rollup encodes today's questions; tomorrow's analytics question requires a backfill.

The append-only schema with on-demand queries trades query latency (which is currently fine) for maximum schema flexibility (which is always valuable). When traffic warrants the trade-off, materialized views are a one-migration upgrade — not a rewrite.

The campaign builder

Custom campaigns share the email pipeline but discriminate on EmailCampaign.Type = "custom". They have their own state machine:

draft  ─► scheduled  ─► sending  ─► sent
  │          │            │           
  ▼          ▼            ▼           
cancel    cancel       cancel

State	What's allowed	What blocks transition
`draft`	Edit Subject, HTMLBody, TextBody, recipient query	—
`scheduled`	Set `SendAt` in the future; worker won't pick up early	—
`sending`	The queue is draining; per-recipient rows being written	No edits
`sent`	Frozen for analytics	No edits, no resend (must clone)

The transitions are gated server-side. The UI never owns campaign state. This is the kind of invariant that seems like overkill until the day a half-ready campaign accidentally fires because a button race condition didn't disable on mouseup.

The template editor

Campaigns and templates use the same EmailTemplate table ((slug, version) unique index, IsActive boolean). Editing a template:

Inserts a new row with Version = current + 1, IsActive = false.
Renders a preview against sample data via POST /api/eduqueue/admin/email-templates/preview-draft.
On approve, flips the new version's IsActive to true and the previous version's to false in a single transaction.

Rollback is a single SQL UPDATE — flip IsActive from N back to N-1. No migration, no redeploy, no editing in production.

The live-control surfaces

The operator console exposes the parts of the platform that are too dangerous for a YAML config and too operationally important to require a redeploy.

Capability	Endpoint	Effect
Pause all sends	`POST /api/eduqueue/admin/email/pause`	Sets `email_processor_enabled=false` via Redis pub/sub; workers honor on next 30s tick
Send a test routine	`POST /api/eduqueue/admin/email/test`	Renders + sends without touching analytics
Inspect the queue	`GET /api/eduqueue/admin/email-queue`	Pending / in-flight / failed snapshot
Reset rate limits	`DELETE /api/eduqueue/admin/rate-limits`	Truncates `EmailRateLimitTracker`, publishes `eduqueue:ratelimit:clear`
Live settings	`GET/PUT /api/eduqueue/admin/settings`	Hot-reload via SettingsService + `eduqueue:settings:update` pub/sub
Block a student	`POST /api/eduqueue/admin/blocked-students`	Inserts into `EduQueueBlockedStudent`; SQL-side filter applied immediately

Why this matters

A console that only displays state is a graveyard. A console that controls state is a platform. Every row in the table above represents a decision I would otherwise have to make by editing code, redeploying, and waiting — at 2am, during exam week, while students are watching.

The pause toggle is the one I use most. It exists because the worst day of EduQueue's life was an accidental triple-send to 4,000 subscribers in 90 seconds. The post-mortem said: needs a kill switch the operator can hit from a phone. So it has one.

The "stuck queue" panel

Asynq's MaxRetry(2) parks failed tasks in an archived state. The console exposes them:

Field	Source
Task type	`asynq.TaskInfo.Type`
Last error	`asynq.TaskInfo.LastErr`
Failed at	`asynq.TaskInfo.LastFailedAt`
Payload	Decoded from `asynq.TaskInfo.Payload`
Replay button	`POST /api/eduqueue/admin/email-queue/replay/{id}`

The replay path is intentionally manual. Auto-retry beyond MaxRetry(2) is the path to infinite loops; manual replay forces me to look at the error first. In platform engineering, a slow path you take consciously beats a fast path you forgot you wired.

What I'd change

Three things on the roadmap for this subsystem:

WebAuthn alongside OTP. OTP-via-Discord works for me; a hardware key would scale to a small ops team without giving anyone my Discord.
Saved analytics views. Right now every drill-down is built from scratch. A SavedQuery table with a few canonical KPIs would speed up daily report-building.
Audit log. Every operator action should write a row I can grep. Currently I rely on Discord history and Redis TTLs — fine for one operator, fragile for two.

None of these are urgent. All become important the day a second person logs in.

Part of the UIU EduQueue platform case study.