Rate Limiting Your API Is Not Just for Big Platforms
by Eric Hanson, Backend Developer at Clean Systems Consulting
The assumption that causes the incident
Rate limiting gets treated as a scalability feature — something you add when you have enough traffic for it to matter. This is wrong. The incidents that rate limiting prevents are not caused by legitimate high-traffic scenarios; they are caused by bugs, misconfigurations, and malicious actors that can affect a small API just as easily as a large one.
A developer puts a retry loop in their code with no backoff and deploys it pointing at your API. A misconfigured cron job fires every 100ms instead of every 100 seconds. A credential stuffing bot tries 10,000 username/password combinations against your login endpoint. None of these require your API to be large for them to cause a problem.
What to rate limit and at what granularity
Rate limiting is not one limit — it is a set of limits applied at different scopes:
Per-IP, unauthenticated endpoints: The first line of defense against bots and credential stuffing. Apply before authentication processing to reduce CPU cost of handling attack traffic.
Per-API-key or per-user, authenticated endpoints: Prevents a single bad client from consuming shared infrastructure. Allows you to give different limits to different tiers.
Per-endpoint, for sensitive operations: Password reset, email verification, 2FA attempts, and bulk export endpoints need tighter limits than general API endpoints regardless of who is calling them.
Global limits: A ceiling on total request processing capacity. Protects the service as a whole when per-client limits are not sufficient (e.g., a large number of legitimate clients each making many requests simultaneously).
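These scopes compose: a request must clear every limit that applies to it. A minimal in-memory sketch of the layering (the limiter class, scope names, and numbers are all illustrative; production code would keep these counters in shared storage):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Minimal in-memory fixed-window counter for one scope (illustration only)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)

    def allow(self, key):
        # One counter per (key, window index); old windows are simply abandoned.
        bucket = (key, int(time.time() // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit

# One limiter per scope; the numbers are hypothetical, not recommendations.
ip_limiter = FixedWindowLimiter(limit=100, window_seconds=60)     # per-IP, pre-auth
user_limiter = FixedWindowLimiter(limit=1000, window_seconds=60)  # per-user, post-auth
login_limiter = FixedWindowLimiter(limit=5, window_seconds=60)    # sensitive endpoint

def allow_request(ip, user, endpoint):
    """Check scopes from coarsest to finest; any failure rejects the request."""
    if not ip_limiter.allow(ip):
        return False
    if user is not None and not user_limiter.allow(user):
        return False
    if endpoint == "/login" and not login_limiter.allow(ip):
        return False
    return True
```

Checking coarse limits first means attack traffic is rejected with the least work done.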
The algorithms
Token bucket: Each client has a bucket that fills at a fixed rate (e.g., 100 tokens per minute). Each request consumes one token. If the bucket is empty, the request is rejected. Allows short bursts up to the bucket capacity.
Leaky bucket: Requests enter a queue. The queue drains at a fixed rate. Smooths out bursts — better for protecting downstream systems that cannot handle spikes.
Fixed window counter: Count requests in a fixed time window (e.g., 1 minute). Simple to implement and understand, but vulnerable to boundary exploitation: a client can make N requests at the end of minute 1 and N more at the start of minute 2, doubling effective throughput at the window boundary.
Sliding window log: Records the timestamp of each request and rejects any that would exceed the limit within the rolling window. Accurate but memory-intensive at high volume.
Sliding window counter: A hybrid that approximates the sliding window log using two fixed window counts, weighting the previous window by how far the current window has progressed. Practical for high-scale implementations. (The simpler INCR-plus-EXPIRE pattern in Redis, often cited here, is actually a fixed window counter, not a sliding one.)
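As a concrete sketch of the token bucket, here is a minimal single-process version (class name and parameters are illustrative; it refills continuously rather than once per minute):

```python
import time

class TokenBucket:
    """Token bucket: refills at refill_rate tokens/sec, bursts up to capacity."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # start full: a new client may burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket with capacity 5 and refill rate 1 allows a burst of 5 immediate requests, then settles to one request per second.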
For most APIs, token bucket gives the right tradeoff: burst tolerance with overall rate enforcement. Keep the state in Redis if you have multiple API server instances (local in-memory state does not work in a distributed setup). The simplest Redis pattern is a fixed window counter built on INCR and EXPIRE — not a true token bucket, but often a good starting point:

import time

import redis

r = redis.Redis()  # reuse one client; creating a connection per call is wasteful

def check_rate_limit(client_id: str, limit: int, window_seconds: int) -> bool:
    """Fixed window counter: allow at most `limit` requests per window."""
    # One key per client per window; the integer division rotates the key.
    key = f"rate:{client_id}:{int(time.time() // window_seconds)}"
    current = r.incr(key)
    if current == 1:
        # Set expiry on first use; 2x the window comfortably outlives the window.
        r.expire(key, window_seconds * 2)
    return current <= limit
The response headers that make rate limits usable
When you rate limit, tell clients about it. These headers are not yet an RFC (an IETF draft exists), but the X-RateLimit-* convention popularized by GitHub and Stripe is the de facto standard:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 342
X-RateLimit-Reset: 1745658000
Include these on every response, not just when limits are exceeded. A developer can then build adaptive clients that throttle themselves before hitting the limit.
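Computing the header values is a few lines; rate_limit_headers is a hypothetical helper, and the reset timestamp here assumes fixed windows aligned to the Unix epoch:

```python
import time

def rate_limit_headers(limit, used, window_seconds):
    """Build X-RateLimit-* headers for the current fixed window (sketch)."""
    remaining = max(0, limit - used)
    now = int(time.time())
    # Reset = Unix timestamp of the start of the next window.
    reset = (now // window_seconds + 1) * window_seconds
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset),
    }
```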
When a limit is exceeded, return 429 Too Many Requests with a Retry-After header:
HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1745658000
A well-behaved client reads Retry-After and backs off. A client with exponential backoff logic can self-heal. Without these headers, clients either hammer your API or implement arbitrary backoff that may not align with your reset boundaries.
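On the client side, honoring these headers is a small loop; `send` here is a stand-in for whatever HTTP call the client actually makes, and the retry and delay numbers are illustrative:

```python
import random
import time

def request_with_backoff(send, max_retries=5):
    """Call send() -> (status, headers, body), backing off on 429 (sketch)."""
    delay = 1.0
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        # Prefer the server's Retry-After; fall back to jittered exponential backoff.
        retry_after = headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay + random.uniform(0, delay)
        time.sleep(wait)
        delay = min(delay * 2, 60.0)
    return status, headers, body  # still rate limited after all retries
```

Without a Retry-After header the client falls back to doubling delays with jitter, which avoids synchronized retry stampedes across many clients.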
Where to implement it
API gateway layer (Kong, AWS API Gateway, Nginx with limit_req_module): offloads rate limiting before traffic reaches your application servers. The right choice for high-traffic scenarios where you want to shed load before it reaches your backend.
Application middleware: gives you access to authenticated identity for per-user limits that the gateway cannot determine before auth processing. Combine both layers: coarse limits at the gateway, fine-grained per-user limits in the application.
Third-party rate limiting services (Upstash Rate Limit, Cloudflare Rate Limiting): useful if you do not own the gateway layer or want to avoid building Redis infrastructure.
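At the gateway layer, Nginx's limit_req_module makes the coarse per-IP limit a few lines of config; the zone name, rates, and upstream below are illustrative:

```nginx
# 10 req/s per client IP, tracked in a 10 MB shared-memory zone.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    location /api/ {
        # Allow short bursts of 20 extra requests; reject the rest immediately.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://app_backend;
    }
}
```

This is the coarse layer only; per-user and per-endpoint limits still belong in the application, where authenticated identity is known.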
The operational side
Set alerts on 429 response rate. A sudden spike means either a legitimate client is misbehaving or you are under attack — both require action. Log the client identity (API key or IP) with every rate limit rejection. This data is your primary tool for diagnosing whether a limit is too tight or a client needs help fixing their integration.
Document your limits clearly. Developers who know the limits build adaptive clients. Developers who discover limits through 429 errors in production build brittle clients.