Rate Limiting Is Not Just for Big Companies
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Incident That Didn't Need to Happen
A developer at a partner company had a bug in their integration code. A loop that was supposed to call your API once per minute was calling it every 10 milliseconds. By the time anyone noticed, they'd sent 50,000 requests in eight minutes: roughly 100 req/sec against a service whose normal load was 5. Database connection pool exhausted. Every other customer experiencing errors. Incident postmortem: "add rate limiting."
This is not a story about malicious actors or billion-dollar scale. It's a story about a mundane bug in client code that took down a service with no defenses against overload. Rate limiting would have contained the damage to the misbehaving client. Everyone else would have been unaffected.
What Rate Limiting Actually Protects Against
Client bugs: As above. Loops with missing backoff, retries without jitter, misconfigured polling intervals — these happen routinely in integration development. A client that accidentally calls your API at 1000x its intended rate should not be able to take down your service.
Runaway automation: A script someone wrote to bulk-process data, run without throttling, can easily generate legitimate-looking requests at volumes that exceed your capacity.
Intentional abuse: API scraping, credential stuffing (trying username/password combinations at scale), account enumeration. Rate limiting is not a complete defense against these, but it raises the cost significantly.
Cascading failures: When a downstream dependency is slow, retry logic can multiply request volume. Rate limiting at the entry point caps how much retry traffic can hit the system, preventing the retry storm from compounding the degradation.
The Algorithms
Fixed window: Count requests in a time bucket (e.g., per minute). Simple to implement and reason about. Has a boundary vulnerability: a client can send N requests at 11:59:59 and N more at 12:00:01 — 2N requests in two seconds while staying within the per-minute limit.
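For concreteness, here is a minimal in-memory sketch of a fixed-window counter in Java (class and field names are illustrative, not from any particular library):

import java.util.HashMap;
import java.util.Map;

class FixedWindowLimiter {
    private final int limit;          // max requests per client per window
    private final long windowMillis;  // window length, e.g. 60_000 for per-minute
    private final Map<String, Integer> counts = new HashMap<>();
    private long windowStart = System.currentTimeMillis();

    FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean allow(String clientId) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            counts.clear();        // new window: every counter resets at once
            windowStart = now;     // this hard reset is the boundary weakness
        }
        return counts.merge(clientId, 1, Integer::sum) <= limit;
    }
}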
Sliding window log: Keep a log of timestamps for each client's requests. Count requests in the rolling window. Accurate but memory-intensive at scale.
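A sketch of the log approach for a single client follows; the deque holds one timestamp per admitted request, which is exactly where the memory cost comes from (names are illustrative):

import java.util.ArrayDeque;
import java.util.Deque;

class SlidingWindowLog {
    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowLog(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean allow() {
        long now = System.currentTimeMillis();
        // evict timestamps that have aged out of the rolling window
        while (!timestamps.isEmpty() && now - timestamps.peekFirst() >= windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < limit) {
            timestamps.addLast(now);  // one entry stored per request
            return true;
        }
        return false;
    }
}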
Sliding window counter: Approximate the sliding window using two adjacent fixed windows. Accurate to within a small error margin, memory-efficient. This is the approach used by many production rate limiters.
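A single-client sketch of the approximation, assuming windows aligned to fixed boundaries; the previous window's count is weighted by how much of it still overlaps the rolling window (names are illustrative):

class SlidingWindowCounter {
    private final int limit;
    private final long windowMillis;
    private long currentWindowStart;
    private int currentCount = 0;
    private int previousCount = 0;

    SlidingWindowCounter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.currentWindowStart = (System.currentTimeMillis() / windowMillis) * windowMillis;
    }

    synchronized boolean allow() {
        long now = System.currentTimeMillis();
        long windowStart = (now / windowMillis) * windowMillis;  // aligned bucket
        if (windowStart != currentWindowStart) {
            // the old count only carries over if the windows are adjacent
            previousCount = (windowStart - currentWindowStart == windowMillis) ? currentCount : 0;
            currentCount = 0;
            currentWindowStart = windowStart;
        }
        double intoWindow = (double) (now - windowStart) / windowMillis;
        double estimate = previousCount * (1.0 - intoWindow) + currentCount;
        if (estimate + 1 <= limit) {
            currentCount++;
            return true;
        }
        return false;
    }
}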
Token bucket: A bucket fills at a constant rate up to a maximum capacity. Each request consumes one token. If the bucket is empty, the request is rejected. Allows bursting up to the bucket capacity while enforcing an average rate over time. This is the model used by AWS API Gateway, Stripe, and most major APIs.
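A minimal single-client token bucket sketch; the names and the nanosecond bookkeeping are illustrative choices, not any vendor's implementation:

class TokenBucket {
    private final double capacity;     // maximum burst size
    private final double refillPerSec; // sustained average rate
    private double tokens;
    private long lastRefill = System.nanoTime();

    TokenBucket(double capacity, double refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;        // start full, so an initial burst is allowed
    }

    synchronized boolean allow() {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefill) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;             // each request consumes one token
            return true;
        }
        return false;                  // bucket empty: reject
    }
}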
Leaky bucket: Requests enter a queue at any rate; they exit the queue at a fixed rate. Smooths traffic bursts into a constant output rate. Appropriate for scenarios where you need consistent output throughput, not just input rate limiting.
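A rough sketch of the queue-and-drain model using a scheduled executor (capacity and drain rate are illustrative; a real version would also handle shutdown and very high drain rates):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class LeakyBucket {
    private final BlockingQueue<Runnable> queue;

    LeakyBucket(int capacity, int drainPerSecond) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        ScheduledExecutorService drain = Executors.newSingleThreadScheduledExecutor();
        // pull one queued request per tick: bursts in, constant rate out
        drain.scheduleAtFixedRate(() -> {
            Runnable next = queue.poll();
            if (next != null) next.run();
        }, 0, 1_000_000 / drainPerSecond, TimeUnit.MICROSECONDS);
    }

    boolean submit(Runnable request) {
        return queue.offer(request);   // false means overflow: reject the request
    }
}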
Implementation: Where Rate Limiting Lives
At the application layer: Libraries like Resilience4j on the JVM, and rate-limiter middleware for Express, FastAPI, and similar frameworks, provide in-process rate limiting. The limitation: state lives per instance. In a horizontally scaled service, a client can exceed the intended limit by spreading requests across instances, so accurate enforcement across instances requires a shared backend such as Redis.
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import java.time.Duration;

RateLimiter rateLimiter = RateLimiter.of("api-calls",
        RateLimiterConfig.custom()
                .limitRefreshPeriod(Duration.ofSeconds(1))
                .limitForPeriod(100)            // 100 requests per second
                .timeoutDuration(Duration.ZERO) // fail immediately if limit exceeded
                .build());

if (!rateLimiter.acquirePermission()) {
    // RateLimitExceededException is application-defined; map it to HTTP 429 at the boundary
    throw new RateLimitExceededException("Rate limit exceeded");
}
At the API gateway layer: AWS API Gateway, Kong, nginx with the limit_req module, Envoy — these handle rate limiting before requests reach your application. Ideal for per-client or per-endpoint limits. State is managed by the gateway infrastructure. This is the lowest-overhead option for simple cases.
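As a sketch of what this looks like in practice, an nginx limit_req configuration along the following lines enforces a per-IP limit and returns 429 on rejection (the zone name, rate, burst, and upstream are illustrative values):

# one shared 10 MB zone of per-IP counters, refilled at 100 req/sec
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=100r/s;

server {
    location /api/ {
        limit_req zone=per_ip burst=20 nodelay;  # absorb small bursts without queuing delay
        limit_req_status 429;                    # default is 503; 429 is the right signal
        proxy_pass http://app_backend;           # placeholder upstream
    }
}

Note that limit_req implements the leaky-bucket model described above, so the burst parameter plays the role of the queue capacity.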
Distributed rate limiting with Redis: For accurate cross-instance rate limiting in application code, use Redis with atomic operations. A Lua script (Redis executes scripts atomically) or a module like redis-cell can implement a token bucket that every instance shares, as sketched below.
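A sketch of that pattern with the Jedis client, embedding an illustrative token-bucket script; the key layout, parameter choices, and the script itself are assumptions for illustration, not redis-cell's API:

import java.util.List;
import redis.clients.jedis.Jedis;

public class RedisTokenBucket {
    // runs atomically in Redis, so concurrent instances cannot double-spend tokens;
    // caller-supplied time means instance clock skew affects accuracy
    private static final String SCRIPT = String.join("\n",
        "local rate = tonumber(ARGV[1])      -- tokens per second",
        "local capacity = tonumber(ARGV[2])  -- maximum burst",
        "local now = tonumber(ARGV[3])       -- caller-supplied time in ms",
        "local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')",
        "local tokens = tonumber(state[1]) or capacity",
        "local ts = tonumber(state[2]) or now",
        "tokens = math.min(capacity, tokens + (now - ts) / 1000 * rate)",
        "local allowed = 0",
        "if tokens >= 1 then tokens = tokens - 1; allowed = 1 end",
        "redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)",
        "redis.call('PEXPIRE', KEYS[1], math.ceil(capacity / rate * 2000))",
        "return allowed");

    public static boolean allow(Jedis jedis, String clientId, int rate, int capacity) {
        Object result = jedis.eval(SCRIPT,
            List.of("ratelimit:" + clientId),
            List.of(String.valueOf(rate), String.valueOf(capacity),
                    String.valueOf(System.currentTimeMillis())));
        return Long.valueOf(1L).equals(result);
    }
}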
The Response That Matters
When a request is rate limited, respond with HTTP 429 (Too Many Requests) and include:
- Retry-After: 60 (seconds until the client may retry)
- X-RateLimit-Limit: 100 (the limit)
- X-RateLimit-Remaining: 0 (requests remaining in the current window)
- X-RateLimit-Reset: 1714000000 (Unix timestamp when the window resets)
A client that receives a 429 with proper headers can implement correct backoff. A client that receives a 500 with no guidance will retry immediately, compounding the problem.
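A client-side sketch of that correct behavior using Java's built-in HttpClient, assuming the delta-seconds form of Retry-After; the default wait and the single-retry policy are illustrative:

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

static HttpResponse<String> sendWithBackoff(HttpClient client, HttpRequest request)
        throws Exception {
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() == 429) {
        long waitSeconds = response.headers()
            .firstValue("Retry-After")   // seconds until retry is permitted
            .map(Long::parseLong)        // assumes delta-seconds, not the HTTP-date form
            .orElse(30L);                // fallback when the header is absent
        Thread.sleep(waitSeconds * 1_000);
        response = client.send(request, HttpResponse.BodyHandlers.ofString()); // one retry
    }
    return response;
}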
What to Limit and How to Scope It
Per-client (API key or IP address) rate limiting is the baseline. Consider also:
- Per-endpoint limits for expensive operations (report generation, bulk exports)
- Global limits on anonymous traffic to protect against scraping
- Differentiated limits by subscription tier if you have paying customers with different entitlements
The Practical Takeaway
If your API is in production without rate limiting, add it this week — not this quarter. Start with per-IP and per-API-key limits at your API gateway layer. Choose limits based on your current p99 capacity with headroom: if you handle 1,000 req/sec today and want to protect against overload, a per-client limit of 100 req/sec leaves room for 10 well-behaved clients before you're at capacity. Instrument 429 responses in your metrics dashboard so rate-limiting events are visible.