Rate Limiting Your API Is Not Just for Big Platforms

by Eric Hanson, Backend Developer at Clean Systems Consulting

The assumption that causes the incident

Rate limiting gets treated as a scalability feature — something you add when you have enough traffic for it to matter. This is wrong. The incidents that rate limiting prevents are not caused by legitimate high-traffic scenarios; they are caused by bugs, misconfigurations, and malicious actors that can affect a small API just as easily as a large one.

A developer puts a retry loop in their code with no backoff and deploys it pointing at your API. A misconfigured cron job fires every 100ms instead of every 100 seconds. A credential stuffing bot tries 10,000 username/password combinations against your login endpoint. None of these require your API to be large for them to cause a problem.

What to rate limit and at what granularity

Rate limiting is not one limit — it is a set of limits applied at different scopes:

Per-IP, unauthenticated endpoints: The first line of defense against bots and credential stuffing. Apply before authentication processing to reduce CPU cost of handling attack traffic.

Per-API-key or per-user, authenticated endpoints: Prevents a single bad client from consuming shared infrastructure. Allows you to give different limits to different tiers.

Per-endpoint, for sensitive operations: Password reset, email verification, 2FA attempts, and bulk export endpoints need tighter limits than general API endpoints regardless of who is calling them.

Global limits: A ceiling on total request processing capacity. Protects the service as a whole when per-client limits are not sufficient (e.g., a large number of legitimate clients each making many requests simultaneously).

The algorithms

Token bucket: Each client has a bucket that fills at a fixed rate (e.g., 100 tokens per minute). Each request consumes one token. If the bucket is empty, the request is rejected. Allows short bursts up to the bucket capacity.

Leaky bucket: Requests enter a queue. The queue drains at a fixed rate. Smooths out bursts — better for protecting downstream systems that cannot handle spikes.

Fixed window counter: Count requests in a fixed time window (e.g., 1 minute). Simple to implement and understand (in Redis, a single INCR plus an EXPIRE per window), but vulnerable to boundary exploitation: a client can make N requests at the end of minute 1 and N more at the start of minute 2, doubling effective throughput at the window boundary.

Sliding window log: Records the timestamp of each request and rejects any that would exceed the limit within the rolling window. Accurate but memory-intensive at high volume.

Sliding window counter: A hybrid that approximates the sliding window log by weighting the previous and current fixed-window counts according to the position of the current moment within the window. Practical for high-scale implementations; Cloudflare has described using this approach for its edge rate limiting.

For most APIs, token bucket gives the right tradeoff: burst tolerance with overall rate enforcement. If you run multiple API server instances you need shared state (local in-memory counters cannot see each other's traffic), and Redis is the usual choice. A fully atomic Redis token bucket requires a Lua script; the fixed-window counter below is the simpler pattern most teams start with:

import redis
import time

# One shared client; opening a new connection per check wastes sockets
r = redis.Redis()

def check_rate_limit(client_id: str, limit: int, window_seconds: int) -> bool:
    # Key is scoped to the current fixed window, e.g. "rate:client42:29094300"
    key = f"rate:{client_id}:{int(time.time() // window_seconds)}"
    current = r.incr(key)
    if current == 1:
        # First request in this window: set a TTL so stale keys age out.
        # Twice the window is a comfortable margin past the window's end.
        r.expire(key, window_seconds * 2)
    return current <= limit

The response headers that make rate limits usable

When you rate limit, tell clients about it. The de facto standard headers (an IETF draft for standardized RateLimit fields exists, but the X-RateLimit-* convention popularized by GitHub and Stripe is what most clients expect):

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 342
X-RateLimit-Reset: 1745658000

Include these on every response, not just when limits are exceeded. A developer can then build adaptive clients that throttle themselves before hitting the limit.
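Generating the trio from limiter state is mechanical; a small helper keeps the values consistent across endpoints (field names follow the convention above; the function signature is illustrative):

```python
def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict[str, str]:
    # HTTP header values are text, so everything is stringified;
    # remaining is clamped so clients never see a negative count
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
```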

When a limit is exceeded, return 429 Too Many Requests with a Retry-After header:

HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1745658000

A well-behaved client reads Retry-After and backs off. A client with exponential backoff logic can self-heal. Without these headers, clients either hammer your API or implement arbitrary backoff that may not align with your reset boundaries.
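On the client side, honoring Retry-After with exponential backoff as the fallback might be sketched like this (header parsing simplified; names and defaults are illustrative):

```python
import random

def backoff_seconds(headers: dict[str, str], attempt: int,
                    base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying a 429 response."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.isdigit():
        # The server said exactly when capacity returns: trust it
        return float(retry_after)
    # Otherwise fall back to capped exponential backoff with jitter,
    # so a fleet of clients does not retry in lockstep
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
```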

Where to implement it

API gateway layer (Kong, AWS API Gateway, Nginx with limit_req_module): offloads rate limiting before traffic reaches your application servers. The right choice for high-traffic scenarios where you want to shed load before it reaches your backend.
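At the Nginx layer, for example, a per-IP limit via limit_req_module might look like the following (zone name, rates, and upstream are illustrative):

```nginx
# Shared-memory zone keyed by client IP; 10 MB holds state for roughly 160k addresses
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    location /api/ {
        # Tolerate bursts of 20 requests above the steady rate, without queuing delay
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;  # default is 503; 429 matches the convention above
        proxy_pass http://backend;
    }
}
```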

Application middleware: gives you access to authenticated identity for per-user limits that the gateway cannot determine before auth processing. Combine both layers: coarse limits at the gateway, fine-grained per-user limits in the application.

Third-party rate limiting services (Upstash Rate Limit, Cloudflare Rate Limiting): useful if you do not own the gateway layer or want to avoid building Redis infrastructure.

The operational side

Set alerts on 429 response rate. A sudden spike means either a legitimate client is misbehaving or you are under attack — both require action. Log the client identity (API key or IP) with every rate limit rejection. This data is your primary tool for diagnosing whether a limit is too tight or a client needs help fixing their integration.

Document your limits clearly. Developers who know the limits build adaptive clients. Developers who discover limits through 429 errors in production build brittle clients.
