Architecture at a glance
There is a specific flavour of dread that arrives when you put a public API on the internet and then read your logs. My portfolio is a single Cloudflare Worker, and it exposes more than a landing page: there's an MCP endpoint at /api/mcp, a pageview tracker at /api/track-event, an authenticated webhook API at /api/webhook/*, and an admin surface behind OAuth. Most of that is fine until one badly-behaved agent decides my MCP server is its personal scratchpad and sends a few thousand requests before lunch.
So I needed rate limiting. Not the vague "Cloudflare protects you" kind, but actual, per-API-key, enforce-this-number-globally rate limiting that I control in code. What I learned is that the easy answer and the correct answer are two different products, and the gap between them is exactly one Durable Object.
The shape I landed on: cheap, sloppy-tolerant routes lean on the built-in Workers rate-limit binding; sensitive, must-be-accurate routes go through a Durable Object per key running a sliding window counter. One rule underneath both: never let the limiter become slower or less reliable than the thing it's protecting.
The two-tier design: the cheap path takes the free per-colo binding; the sensitive path stops at a per-key Durable Object. The reader's response never waits on a limit decision.
The constraints
Four non-negotiables framed the whole design:
- Accurate where it counts. A key capped at 60 writes/minute means 60 globally, not 60-ish per data center.
- Cheap. This is a personal site. The limiter cannot cost more than the site.
- Low latency on the hot path. A reader fetching a blog post must never wait on a rate-limit decision.
- Fail open, loudly. If the limiter breaks, requests get through and I get told. An outage in the bouncer shouldn't close the club.
The easy button, and why it leaks
Cloudflare shipped a Rate Limiting binding for Workers (it went GA in September 2025), and it is genuinely lovely to use. You declare a namespace in wrangler.jsonc:
{
  "ratelimits": [
    {
      "name": "TRACK_LIMITER",
      "namespace_id": "1001",
      "simple": { "limit": 100, "period": 60 } // period must be 10 or 60, nothing else
    }
  ]
}
…and then a decision is one await away:
const { success } = await env.TRACK_LIMITER.limit({ key: clientIp });
if (!success) {
  return new Response("Too Many Requests", { status: 429 });
}
No Redis, no round trip you can feel. The counters are cached on the same machine your Worker runs on and reconciled in the background, so limit() returns without waiting on the network. For stopping one IP from spamming my pageview endpoint, this is perfect and I use it exactly there.
But read Cloudflare's own description twice, because it's the whole story:
Straight from the docs: the binding is "permissive, eventually consistent, and intentionally designed to not be used as an accurate accounting system." And: "for each unique key you pass to your rate limiting binding, there is a unique limit per Cloudflare location."
Per location. That's the trap. Cloudflare has data centers all over the planet, and a request from a caller is served by whichever one is nearest. If your limit is 60/minute and a determined caller's traffic lands across three colos, each colo independently counts to 60 and waves them through. The effective ceiling isn't 60; it's 60 times however many data centers they reach.
Figure 1: per-location counters are never reconciled with each other; a key intended for 60/min clears ~180/min across three colos.
For pageview spam, who cares: "eventually, approximately, mostly" is the correct amount of effort. But the webhook API is how content gets written to my site. If I tell an integration "you get 60 writes a minute," I mean it, everywhere, as one number. Eventually consistent and per-location is the wrong tool for a global invariant. I needed a single place in the world that knows the true count for a key. That's the literal job description of a Durable Object.
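To make the per-location arithmetic concrete, here is a toy simulation. It is not how the binding works internally; it only assumes, as the docs describe, that each location keeps an independent count per key. `PerLocationLimiter`, `simulate`, and the colo codes are made up for illustration.

```typescript
// Toy model: one independent counter per (location, key) pair.
// Assumption: locations never share counts, matching the documented
// "unique limit per Cloudflare location" behaviour.
class PerLocationLimiter {
  private counts = new Map<string, number>();
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns true while this location still has budget for the key.
  allow(location: string, key: string): boolean {
    const slot = `${location}:${key}`;
    const used = this.counts.get(slot) ?? 0;
    if (used >= this.limit) return false;
    this.counts.set(slot, used + 1);
    return true;
  }
}

// One caller, the same key, with traffic landing on several locations.
function simulate(limit: number, locations: string[], attemptsPerLocation: number): number {
  const limiter = new PerLocationLimiter(limit);
  let allowed = 0;
  for (const location of locations) {
    for (let i = 0; i < attemptsPerLocation; i++) {
      if (limiter.allow(location, "key:abc")) allowed++;
    }
  }
  return allowed;
}

const allowedInOneMinute = simulate(60, ["LHR", "FRA", "IAD"], 100);
console.log(allowedInOneMinute); // 180 = 60 × 3 locations
```

The limit argument never changed; only the number of locations did, and the ceiling scaled with it.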
Durable Objects as the consistency anchor
A Durable Object is a tiny stateful server with a name. Two properties make it perfect here:
- It's a singleton. Every request for a given object ID is routed to the same instance, wherever it lives. There is exactly one RATE_LIMITER object for the name "key:abc", anywhere on Earth.
- It's single-threaded with strongly-consistent storage. Requests to one object are processed one at a time against private storage. No locks, no Lua, no compare-and-swap dance: the race condition that haunts every Redis rate limiter simply cannot occur, because there is no concurrency to race.
The keying trick is what makes this scale instead of melt: I derive the object ID from the API key itself. Each key gets its own object, so the work fans out across thousands of independent single-threaded servers automatically. No sharding logic, no central counter: the namespace is the shard map.
import type { RateLimiterDO } from "~/lib/rate-limiter-do";

async function enforceLimit(env: Env, apiKey: string) {
  // one object per key → naturally sharded, each key fully isolated.
  const id = env.RATE_LIMITER.idFromName(`key:${apiKey}`);
  const stub = env.RATE_LIMITER.get(id);
  // RPC straight to the object; it owns the only true counter for this key.
  return stub.check({ limit: 60, windowMs: 60_000 });
}
That stub.check(...) is a remote call to wherever the object lives, so it does cost a round trip: the price of a globally-correct answer. The trade I made on constraint #3 is that I only pay it on writes. Reads (every blog page) are served from content baked into the Worker at build time and never touch the limiter at all. The hot path stays hot; only the rare, mutating path stops at the bouncer.
Picking the algorithm: sliding window counter
Inside the object, the actual counting has to be both correct and cheap, because I'm paying for the object's storage and wall-clock time. There are three usual suspects, and only one is right.
Fixed window counter is the naive one: a counter per minute, reset on the minute. It's trivial and it has a notorious flaw: the boundary burst. A caller can fire the full limit at 12:00:59 and the full limit again at 12:01:01, and in two seconds they've sent double the limit, because the window snapped over between them. For a write API, that "double" can be the difference between fine and a thundering herd.
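The boundary burst is easy to reproduce with a few lines. This is a minimal sketch of a fixed window counter, with timestamps passed in by hand instead of read from a clock; the class name and the numbers are illustrative only.

```typescript
// Naive fixed-window counter: one count per window, reset when the window changes.
class FixedWindowCounter {
  private windowStart = -1;
  private count = 0;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(nowMs: number): boolean {
    const start = Math.floor(nowMs / this.windowMs) * this.windowMs;
    if (start !== this.windowStart) {
      this.windowStart = start;
      this.count = 0; // the hard reset that makes the burst possible
    }
    if (this.count >= this.limit) return false;
    this.count++;
    return true;
  }
}

const fixedWindow = new FixedWindowCounter(60, 60_000);
let burstAllowed = 0;
// 60 requests one second before the window boundary...
for (let i = 0; i < 60; i++) if (fixedWindow.allow(59_000)) burstAllowed++;
// ...and 60 more one second after it.
for (let i = 0; i < 60; i++) if (fixedWindow.allow(61_000)) burstAllowed++;
console.log(burstAllowed); // 120 admitted within two seconds
```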
Sliding window log fixes accuracy by storing the timestamp of every request and counting how many fall inside the trailing window. It's exact. It's also expensive: at any volume you're storing and scanning thousands of timestamps per key, and on a billed-per-row storage backend that's real money for a vanity-precise number.
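For comparison, a sliding window log under the same two bursts might look like the sketch below. It keeps timestamps in an in-memory array, which is an assumption made for brevity; a real version would have to persist them, and that is where the cost comes from.

```typescript
// Sliding window log: remember the timestamp of every accepted request.
class SlidingWindowLog {
  private timestamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(nowMs: number): boolean {
    // Drop entries that have aged out of the trailing window.
    const cutoff = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(nowMs);
    return true;
  }

  // Storage grows with the limit: one entry per accepted request.
  size(): number {
    return this.timestamps.length;
  }
}

const windowLog = new SlidingWindowLog(60, 60_000);
let logAllowed = 0;
for (let i = 0; i < 60; i++) if (windowLog.allow(59_000)) logAllowed++;
for (let i = 0; i < 60; i++) if (windowLog.allow(61_000)) logAllowed++;
console.log(logAllowed); // 60: the second burst is rejected
console.log(windowLog.size()); // 60 timestamps held for a single key
```

The count is exact, but the state per key is as large as the limit itself.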
Sliding window counter is the one Cloudflare itself reaches for at scale, and it's the sweet spot. You keep just two integers per key (the count in the current fixed window and the count in the previous one), and you weight the previous window by how much of it still falls inside the trailing window:
weight = (windowMs - elapsedInCurrent) / windowMs
estimated = previousCount * weight + currentCount
Figure 2: two counters, one multiply (est = prev·weight + curr). No log, no boundary burst.
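A quick worked example of the weighting, using the formula exactly as written above. The function name `estimate` and the sample counts are made up for illustration.

```typescript
// estimated = previousCount * weight + currentCount
// where weight is the share of the previous window still inside the trailing window.
function estimate(
  previousCount: number,
  currentCount: number,
  elapsedInCurrent: number,
  windowMs: number,
): number {
  const weight = (windowMs - elapsedInCurrent) / windowMs;
  return previousCount * weight + currentCount;
}

// 15 s into a 60 s window, 75% of the previous window still overlaps the trailing minute.
console.log(estimate(40, 20, 15_000, 60_000)); // 40 × 0.75 + 20 = 50

// At the instant a window rolls over, the previous count carries full weight,
// so a caller who just spent 60 is still treated as being at 60.
console.log(estimate(60, 0, 0, 60_000)); // 60
```

The second call is the fixed-window boundary case: the counter does not snap back to zero, it decays as the old window slides out of view.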
The sliding window counter kills the boundary burst without storing a single timestamp, and the approximation is tight: Cloudflare has reported a total error rate around 0.003% from exactly this two-counter trick. Here's the whole limiter, living inside the object:
import { DurableObject } from "cloudflare:workers";

// Extending DurableObject is what lets the Worker call check() over RPC.
export class RateLimiterDO extends DurableObject<Env> {
  async check({ limit, windowMs }: { limit: number; windowMs: number }) {
    const now = Date.now();
    const windowStart = Math.floor(now / windowMs) * windowMs;
    // strongly-consistent read; no other request can interleave with us.
    const s = (await this.ctx.storage.get<Slot>("slot")) ?? {
      windowStart,
      current: 0,
      previous: 0,
    };
    if (s.windowStart !== windowStart) {
      // window rolled over: today's "previous" is the old "current".
      const gap = windowStart - s.windowStart;
      s.previous = gap === windowMs ? s.current : 0; // skipped a window? prev is dead.
      s.current = 0;
      s.windowStart = windowStart;
    }
    const elapsed = now - windowStart;
    const weight = (windowMs - elapsed) / windowMs;
    const estimated = s.previous * weight + s.current;
    if (estimated >= limit) {
      const retryAfter = Math.ceil((windowMs - elapsed) / 1000);
      return { allowed: false, remaining: 0, retryAfter };
    }
    s.current += 1;
    await this.ctx.storage.put("slot", s);
    return {
      allowed: true,
      // clamp: a fractional estimate just under the limit would otherwise floor to -1.
      remaining: Math.max(0, Math.floor(limit - estimated - 1)),
      retryAfter: 0,
    };
  }
}

type Slot = { windowStart: number; current: number; previous: number };
Two counters and a window start, one multiply, and a put. The single-threaded guarantee is doing the heavy lifting: that read-modify-write is atomic for free, where in Redis it would take a hand-written Lua script specifically to stop two requests reading the same count.
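Because the limiter reads the clock and storage directly, it is awkward to exercise in a unit test. One option, sketched here as an assumption rather than as how the site is actually tested, is to pull the arithmetic into a pure function that takes the stored slot and a timestamp. `checkSlot` and `Decision` are hypothetical names; the logic mirrors the sliding window counter described above.

```typescript
type Slot = { windowStart: number; current: number; previous: number };
type Decision = { allowed: boolean; remaining: number; retryAfter: number };

// Pure version of the check: no storage, no Date.now().
// Returns the decision plus the slot the caller should persist.
function checkSlot(
  stored: Slot | undefined,
  now: number,
  limit: number,
  windowMs: number,
): { decision: Decision; slot: Slot } {
  const windowStart = Math.floor(now / windowMs) * windowMs;
  const slot: Slot = stored ? { ...stored } : { windowStart, current: 0, previous: 0 };

  if (slot.windowStart !== windowStart) {
    // Adjacent window: carry the count over. Anything older is dead.
    const gap = windowStart - slot.windowStart;
    slot.previous = gap === windowMs ? slot.current : 0;
    slot.current = 0;
    slot.windowStart = windowStart;
  }

  const elapsed = now - windowStart;
  const estimated = slot.previous * ((windowMs - elapsed) / windowMs) + slot.current;

  if (estimated >= limit) {
    const retryAfter = Math.ceil((windowMs - elapsed) / 1000);
    return { decision: { allowed: false, remaining: 0, retryAfter }, slot };
  }

  slot.current += 1;
  const remaining = Math.max(0, Math.floor(limit - estimated - 1));
  return { decision: { allowed: true, remaining, retryAfter: 0 }, slot };
}

// Drive it with hand-picked timestamps: limit of 3 per 60 s.
let slotState: Slot | undefined;
const outcomes: boolean[] = [];
for (const now of [1_000, 2_000, 3_000, 4_000]) {
  const out = checkSlot(slotState, now, 3, 60_000);
  slotState = out.slot;
  outcomes.push(out.decision.allowed);
}
console.log(outcomes); // [ true, true, true, false ]

// Exactly at the rollover the previous window still counts in full: refused.
const atBoundary = checkSlot(slotState, 60_000, 3, 60_000);
console.log(atBoundary.decision.allowed); // false

// Halfway through the next window the old count has decayed to 1.5: allowed.
const halfway = checkSlot(atBoundary.slot, 90_000, 3, 60_000);
console.log(halfway.decision.allowed); // true
```

With this shape, the object's method would shrink to a read, a call to the pure function, and a write.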
Why two rate limiters, not one
Running both the platform binding and a Durable Object felt like a smell at first: surely one mechanism is cleaner? But they're priced and shaped for opposite jobs, and matching the route to the tool is the actual design:
| | Platform binding | Durable Object |
|---|---|---|
| Consistency | Per-colo, eventually consistent | Global, strongly consistent |
| Latency | ~0 (local cache) | One round trip to the object |
| Cost | Free | Billed per request + duration |
| Accuracy | Approximate | Exact-ish (0.003% error) |
| Right for | Pageview / IP spam | Per-key API quotas |
The pageview tracker fires on every navigation. Putting that through a Durable Object would mean a billed object request for every reader on every page, paying for global precision to answer a question ("is this one IP spamming me?") where "approximately" is the correct answer. So /api/track-event uses the free, local, per-colo binding and I sleep fine.
The webhook and MCP write paths are rare and consequential, so they earn the round trip and the bill. The non-obvious decision isn't "use Durable Objects"; it's refusing to use them for the 95% of traffic that doesn't need them. Precision is something you spend money on deliberately, route by route.
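Route-by-route selection can be expressed as a small lookup that runs before any limiter is touched. The sketch below is hypothetical: the paths come from this post, but the function name, the `Tier` type, and the rule that every non-GET request to /api/mcp counts as a write are assumptions made for illustration.

```typescript
type Tier = "none" | "binding" | "durable-object";

// Decide which limiter, if any, a request should pay for.
function limiterTier(method: string, pathname: string): Tier {
  const isWrite = !["GET", "HEAD", "OPTIONS"].includes(method.toUpperCase());

  // High-volume, low-stakes: approximate per-colo counting is enough.
  if (pathname === "/api/track-event") return "binding";

  // Content writes: the quota has to hold globally.
  if (pathname.startsWith("/api/webhook/")) return "durable-object";

  // Assumption: treat any non-GET MCP request as a write.
  if (pathname === "/api/mcp") return isWrite ? "durable-object" : "binding";

  // Everything else (blog pages, assets) never waits on a limiter.
  return "none";
}

console.log(limiterTier("GET", "/blog/some-post")); // "none"
console.log(limiterTier("POST", "/api/track-event")); // "binding"
console.log(limiterTier("POST", "/api/webhook/posts")); // "durable-object"
```

Keeping the decision in one function makes the cost of each route visible in a single place.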
The bug: when one Durable Object became the bottleneck
Here's the one that cost me an evening. The per-key design is lovely when there's a key. But /api/mcp and /api/track-event also serve unauthenticated traffic, where there's no API key to derive an object ID from. My first instinct was the obvious one:
// looked reasonable. was a landmine.
const objectName = apiKey ? `key:${apiKey}` : "anonymous";
Every anonymous request in the world now resolved to a single object named "anonymous". And a Durable Object has a soft limit of about 1,000 requests per second (it's one single-threaded server, after all). The moment an agent started hammering the unauthenticated MCP endpoint, all of that traffic funneled into one object, which sailed past its throughput ceiling and started returning errors. The cruel part: those errors had nothing to do with my rate limits. Legitimate anonymous users were getting failures not because they were over the limit, but because the bouncer I'd built was itself a single overwhelmed turnstile.
I'd recreated the exact fragmentation problem from Figure 1, just inverted: instead of too many counters, I'd built one counter that everything stampeded toward.
The fix is to shard the anonymous traffic back out, on purpose, by hashing the client IP into a fixed number of buckets:
const ANON_SHARDS = 16;

function limiterName(apiKey: string | null, clientIp: string): string {
  if (apiKey) return `key:${apiKey}`;
  // spread anon load across N objects so no single one saturates.
  // fnv1a is tiny and deterministic - same IP → same shard.
  const shard = fnv1a(clientIp) % ANON_SHARDS;
  return `anon:${shard}`;
}
Figure 3: the hot-object failure, and the shard-by-hash fix. The cost is that the anonymous limit is now approximate per shard, which, for anonymous traffic, is exactly the precision it deserves.
The lesson that stuck: a Durable Object's strong consistency is bounded by one object's throughput. A globally-correct counter is also a global bottleneck if everyone shares it. Keying per-API-key gave me sharding for free precisely because keys spread the load; the bug was the one route where I forgot to give the load anything to spread across.
Gotcha: Sharding trades accuracy for throughput. With 16 anonymous shards, the "global anonymous limit" is really 16 independent limits. That's fine when the thing you're limiting (anon abuse) doesn't need an exact number, but it's a knob to turn consciously, not a default to forget.
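The sharding snippet calls an fnv1a helper that the post does not show. A common 32-bit FNV-1a looks like the sketch below; it is offered as one plausible implementation, not necessarily the one the site uses. It hashes UTF-16 code units, which matches the byte-wise algorithm for ASCII input such as IP addresses. `shardFor` is a hypothetical wrapper added for the demo.

```typescript
// 32-bit FNV-1a: XOR each character in, then multiply by the FNV prime.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return hash >>> 0; // force unsigned so the modulo below is never negative
}

const ANON_SHARDS = 16;

function shardFor(clientIp: string): number {
  return fnv1a(clientIp) % ANON_SHARDS;
}

// Deterministic: the same IP always lands on the same shard.
console.log(shardFor("203.0.113.7") === shardFor("203.0.113.7")); // true

// Rough look at the spread over a /24 of documentation addresses.
const perShard = new Array<number>(ANON_SHARDS).fill(0);
for (let host = 0; host < 256; host++) {
  perShard[shardFor(`203.0.113.${host}`)]++;
}
console.log(perShard);
```

Since every shard enforces the limit on its own, the worst-case anonymous total is the per-shard limit times ANON_SHARDS.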
Edge cases & gotchas
A few more things that bit, or nearly did:
- Latency is geographic. A Durable Object lives in one data center near the first request that created it, with location hints as suggestions, not guarantees. A caller on the far side of the planet pays that distance on every check. I keep it off the read path entirely, so only write callers feel it, and writes are already slow-by-nature.
- Fail open, deliberately. If the object call throws (deploy blip, transient overload), the Worker allows the request and emits a metric rather than 500-ing. Per constraint #4, a broken limiter must not become a broken API. The honest cost: during a limiter outage, you aren't rate limited. I'd rather under-enforce for a minute than hand a caller a wall of errors over my own bug.
- The period: 10 | 60 straitjacket. The platform binding only accepts a 10- or 60-second window. Want "300 per 5 minutes"? You can't, not with the binding. That constraint alone pushes anything with a custom window onto the Durable Object, which takes any windowMs you hand it.
- No cleanup job needed. Because the slot rolls forward on read, there's no cron sweeping stale keys. Untouched objects simply hibernate and stop billing duration. An alarm could evict truly-dead keys to reclaim storage, but at two integers per key, I haven't needed to bother.
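The fail-open rule from the list above fits in a small wrapper. This is a sketch under assumptions: `checkOrFailOpen`, the `degraded` flag, and the `reportFailure` callback are hypothetical names, and the actual metric call is left to the caller.

```typescript
type Decision = {
  allowed: boolean;
  remaining: number;
  retryAfter: number;
  degraded?: boolean; // true when the limiter itself failed
};

// Run a limiter check; if it throws, allow the request and report the failure
// instead of turning a limiter outage into an API outage.
async function checkOrFailOpen(
  check: () => Promise<Decision>,
  reportFailure: (error: unknown) => void,
): Promise<Decision> {
  try {
    return await check();
  } catch (error) {
    reportFailure(error); // e.g. emit a metric or a log line
    return { allowed: true, remaining: 0, retryAfter: 0, degraded: true };
  }
}

// Usage with the per-key object would look roughly like:
//   const decision = await checkOrFailOpen(
//     () => stub.check({ limit: 60, windowMs: 60_000 }),
//     (err) => console.error("rate limiter unavailable", err),
//   );
```

The `degraded` flag lets the caller skip the rate-limit headers when the numbers are not real.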
Reading it back
A rate limiter that just says "no" is hostile. A good one tells the caller exactly where they stand, so I return the standard headers on every limited route; the remaining and retryAfter come straight out of the object's response:
function rateLimitHeaders(
  r: { remaining: number; retryAfter: number },
  limit: number,
) {
  const h = new Headers();
  h.set("X-RateLimit-Limit", String(limit));
  h.set("X-RateLimit-Remaining", String(r.remaining));
  if (r.retryAfter > 0) h.set("Retry-After", String(r.retryAfter)); // seconds
  return h;
}
For observing the system itself, every 429 is written to Workers Analytics Engine with the route and whether the key was authenticated, so I can ask "who's actually hitting limits" without cracking open object storage:
SELECT blob1 AS route, COUNT() AS rejections
FROM rate_limit_events
WHERE timestamp > NOW() - INTERVAL '24' HOUR AND double1 = 429
GROUP BY route ORDER BY rejections DESC;
Tradeoffs, and what I'd change
The honest ledger, because every approach has one:
- Cost is real but tiny. Durable Object requests bill at $0.15 per million past the free million-a-month, plus duration at $12.50 per million GB-seconds (with 400,000 GB-s free monthly). My write volume is low enough that the limiter rounds to zero against the $5/month Workers Paid floor. At high write volume this math flips: a write-heavy API doing millions of checks a day should price it out, because every check is a billed object request.
- A round trip per write. Unavoidable for a globally-correct answer. The mitigation is architectural (keep it off reads), not magical.
- Per-key hot spots are still possible. One key doing 1,000+ writes a second would saturate its own object. For my traffic that's absurd; for a real product you'd shard hot keys the same way I sharded anon.
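The request side of the cost bullet is simple enough to put into a function. This sketch uses only the request price and free allowance quoted above and deliberately ignores duration billing, so treat the output as a floor. The function name and the traffic figures are illustrative.

```typescript
// Figures as quoted in this post; check current pricing before relying on them.
const REQUEST_PRICE_PER_MILLION_USD = 0.15;
const FREE_REQUESTS_PER_MONTH = 1_000_000;

// Request charges only. Duration (GB-seconds) is billed separately and not modelled here.
function monthlyRequestCostUsd(checksPerDay: number, days = 30): number {
  const total = checksPerDay * days;
  const billable = Math.max(0, total - FREE_REQUESTS_PER_MONTH);
  return (billable / 1_000_000) * REQUEST_PRICE_PER_MILLION_USD;
}

// A personal site: 1,000 checks a day stays inside the free million.
console.log(monthlyRequestCostUsd(1_000)); // 0

// A write-heavy API: 3 million checks a day is 90M a month, 89M of it billable.
console.log(monthlyRequestCostUsd(3_000_000).toFixed(2)); // "13.35"
```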
What I'd change: I'd reach for the platform binding first, every time, and only graduate a route to a Durable Object when I can name the specific invariant the per-colo model breaks. I built the DO path first because it's more interesting, which is exactly backwards from the cost-conscious order.
Takeaways
Three things worth carrying to your own edge:
- "Distributed" and "consistent" are a dial, not a default. The cheap, fast, eventually-consistent answer is correct far more often than it feels. Spend strong consistency only where a real invariant demands it.
- A Durable Object is a global lock you can name. That makes per-key quotas, counters, and coordination almost trivial, but the same singleton that gives you correctness gives you a single throughput ceiling. Shard before you need to.
- Match the mechanism to the route, not the app. The best design here wasn't one rate limiter; it was the discipline to run two and send each request to the cheaper one that's still correct enough.
If you're on Cloudflare and you've been avoiding "real" rate limiting because it sounded like standing up Redis and praying over Lua scripts: it isn't. It's one object with a name, two integers, and the good sense to only call it when the number actually has to be true.
