Companion post to the UIU EduQueue platform overview. This is the long-form version of "how the operator console actually works" and "where the analytics numbers come from."
The operator console is the part of EduQueue I rely on most and the part most users never see. It's also the part where the platform-engineering decisions show up most clearly — every "should this be a YAML setting or a Redis pub/sub message" call gets made in this layer.
This post walks through the auth model, the analytics schema, the on-demand rollups, and the live-control surfaces that turn the dashboard from a viewer into a control plane.
Auth — passwordless, Redis-backed, single-use
There is no password anywhere in the operator console. The full login flow:
1. Operator hits /admin-dashboard
2. Frontend POSTs /api/eduqueue/admin/request-otp
3. Server generates 6-digit OTP via crypto/rand
4. Server stores eduqueue_otp:{otp} → "valid" in Redis (TTL 10m)
5. Server posts the OTP to a Discord webhook (out-of-band channel)
6. Operator copies OTP into the UI
7. Frontend POSTs /api/eduqueue/admin/verify-otp { otp }
8. Server DELs the OTP key (single-use), generates a session token
9. Server stores eduqueue_session:{token} → "valid" in Redis (TTL 24h)
10. Token returned to frontend, stored in localStorage, sent via X-Session-TokenOTP generation lives in eduqueue_admin.go:generateEduQueueOTP. The token is sha256(otp + nanoTimestamp) truncated and base64-encoded.
The constants:
| Const | Value | Why |
|---|---|---|
eduQueueOTPPrefix | "eduqueue_otp:" | Namespacing for safe Redis sharing |
eduQueueOTPTTL | 10 * time.Minute | Long enough for a copy-paste, short enough for security |
eduQueueSessionPrefix | "eduqueue_session:" | Same namespacing rule |
eduQueueSessionTTL | 24 * time.Hour | One day of operator work, then re-auth |
Why server-side sessions, not JWTs
I deliberately chose Redis sessions over JWTs:
- Revocation is a
DEL. No token denylist, no key-rotation game. - TTL is the source of truth. Redis evicts; I don't have to.
- No secrets in the cookie. The token is opaque.
- Server-side, not browser-side. A stolen token without DB access is useless.
The trade-off is one Redis round-trip per admin request. For a console that an operator hits a few hundred times per day, that's negligible. JWTs would matter at 100k req/sec. They do not matter at 100 req/sec.
Discord as the OTP channel
OTPs flow through a Discord webhook because:
- The channel I check anyway. I'm in Discord all day.
- Out of band from email. If email is broken, I can still get into the console to fix it.
- Audit trail for free. Every OTP request shows up in #eduqueue-ops with timestamp, requester IP, and the 10-minute warning.
The webhook payload includes the requester IP so I can spot a wrong actor immediately.
The analytics schema — append-only events, on-demand rollups
The hardest decision in the analytics layer was whether to pre-aggregate. I chose not to. Here's the schema and the reasoning.
| Table | Shape | Cardinality | Read pattern |
|---|---|---|---|
EmailMessage | Per-send record (UUID, status, OpenCount, ClickCount, CampaignID) | 1 per recipient × campaign | KPI denominators, drill-down |
EmailEvent | Append-only (MessageID, EventType, OccurredAt) | 10–100× EmailMessage | Time-series, funnel |
EmailCampaign | Aggregate (TotalSent, TotalOpened, TotalClicked, TotalBounced) | 1 per campaign | Overview cards |
EmailRateLimitTracker | Per-provider window counter | Small (one row per provider × window) | Live operator console |
The discipline is: EmailEvent is the only mutating writer for engagement signals. EmailMessage.OpenCount and EmailMessage.ClickCount are denormalized counters maintained by the same handler that writes the event row, in the same transaction. No background job recomputes them. That keeps the read-side fast and the write-side simple.
The query layer
KPIs reduce to grouped SUMs against EmailCampaign:
SELECT
campaign_id,
SUM(total_sent) AS sent,
SUM(total_opened) AS opens,
SUM(total_clicked) AS clicks,
SUM(total_bounced) AS bounces
FROM email_campaigns
WHERE created_at > now() - interval '30 days'
GROUP BY campaign_id;Drill-down for a single campaign joins EmailMessage to EmailEvent:
SELECT date_trunc('hour', e.occurred_at) AS hour,
e.event_type,
count(*) AS n
FROM email_events e
JOIN email_messages m ON m.id = e.message_id
WHERE m.campaign_id = $1
GROUP BY hour, e.event_type
ORDER BY hour;These are not pre-aggregated. With covering indexes on (campaign_id, occurred_at) and (message_id, event_type), the queries return in tens of milliseconds at current cardinality. The upgrade path when this stops being true is a MATERIALIZED VIEW refreshed by a cron task — not a rewrite.
Why on-demand wins at this scale
Pre-aggregation is the classic "cleverness debt." It costs you:
- An extra writer. Now every event hits two places.
- A reconciliation problem. If the rollup drifts, which side is right?
- Schema rigidity. The rollup encodes today's questions; tomorrow's analytics question requires a backfill.
The append-only schema with on-demand queries trades query latency (which is currently fine) for maximum schema flexibility (which is always valuable). When traffic warrants the trade-off, materialized views are a one-migration upgrade — not a rewrite.
The campaign builder
Custom campaigns share the email pipeline but discriminate on EmailCampaign.Type = "custom". They have their own state machine:
draft ─► scheduled ─► sending ─► sent
│ │ │
▼ ▼ ▼
cancel cancel cancel | State | What's allowed | What blocks transition |
|---|---|---|
draft | Edit Subject, HTMLBody, TextBody, recipient query | — |
scheduled | Set SendAt in the future; worker won't pick up early | — |
sending | The queue is draining; per-recipient rows being written | No edits |
sent | Frozen for analytics | No edits, no resend (must clone) |
The transitions are gated server-side. The UI never owns campaign state. This is the kind of invariant that seems like overkill until the day a half-ready campaign accidentally fires because a button race condition didn't disable on mouseup.
The template editor
Campaigns and templates use the same EmailTemplate table ((slug, version) unique index, IsActive boolean). Editing a template:
- Inserts a new row with
Version = current + 1,IsActive = false. - Renders a preview against sample data via
POST /api/eduqueue/admin/email-templates/preview-draft. - On approve, flips the new version's
IsActiveto true and the previous version's to false in a single transaction.
Rollback is a single SQL UPDATE — flip IsActive from N back to N-1. No migration, no redeploy, no editing in production.
The live-control surfaces
The operator console exposes the parts of the platform that are too dangerous for a YAML config and too operationally important to require a redeploy.
| Capability | Endpoint | Effect |
|---|---|---|
| Pause all sends | POST /api/eduqueue/admin/email/pause | Sets email_processor_enabled=false via Redis pub/sub; workers honor on next 30s tick |
| Send a test routine | POST /api/eduqueue/admin/email/test | Renders + sends without touching analytics |
| Inspect the queue | GET /api/eduqueue/admin/email-queue | Pending / in-flight / failed snapshot |
| Reset rate limits | DELETE /api/eduqueue/admin/rate-limits | Truncates EmailRateLimitTracker, publishes eduqueue:ratelimit:clear |
| Live settings | GET/PUT /api/eduqueue/admin/settings | Hot-reload via SettingsService + eduqueue:settings:update pub/sub |
| Block a student | POST /api/eduqueue/admin/blocked-students | Inserts into EduQueueBlockedStudent; SQL-side filter applied immediately |
Why this matters
A console that only displays state is a graveyard. A console that controls state is a platform. Every row in the table above represents a decision I would otherwise have to make by editing code, redeploying, and waiting — at 2am, during exam week, while students are watching.
The pause toggle is the one I use most. It exists because the worst day of EduQueue's life was an accidental triple-send to 4,000 subscribers in 90 seconds. The post-mortem said: needs a kill switch the operator can hit from a phone. So it has one.
The "stuck queue" panel
Asynq's MaxRetry(2) parks failed tasks in an archived state. The console exposes them:
| Field | Source |
|---|---|
| Task type | asynq.TaskInfo.Type |
| Last error | asynq.TaskInfo.LastErr |
| Failed at | asynq.TaskInfo.LastFailedAt |
| Payload | Decoded from asynq.TaskInfo.Payload |
| Replay button | POST /api/eduqueue/admin/email-queue/replay/{id} |
The replay path is intentionally manual. Auto-retry beyond MaxRetry(2) is the path to infinite loops; manual replay forces me to look at the error first. In platform engineering, a slow path you take consciously beats a fast path you forgot you wired.
What I'd change
Three things on the roadmap for this subsystem:
- WebAuthn alongside OTP. OTP-via-Discord works for me; a hardware key would scale to a small ops team without giving anyone my Discord.
- Saved analytics views. Right now every drill-down is built from scratch. A
SavedQuerytable with a few canonical KPIs would speed up daily report-building. - Audit log. Every operator action should write a row I can grep. Currently I rely on Discord history and Redis TTLs — fine for one operator, fragile for two.
None of these are urgent. All become important the day a second person logs in.
Part of the UIU EduQueue platform case study.
