
Everything You Need To Know About Designing A Rate Limiter - Part 2

Master your next system design interview by learning how to design a high-performance, scalable rate limiter.

Building a rate limiter, continued

Last time, we started to discuss how we can build a rate limiter that satisfies the following criteria:

  • Takes a request based on some parameters and determines whether that request should be allowed or dropped.

  • Highly-available and fault-tolerant.

  • Performant and does not add significant latency to the system.

  • Able to scale out to handle concurrent requests from millions of users.

  • Enforces rate limits consistently (i.e. if we only allow 5 requests per minute, the user should not be able to make 6 requests in that minute).

In this article, we will discuss some common rate-limiting algorithms and walk through how we can use them to implement a rate limiter that satisfies the above requirements.

How should we do the rate-limiting?

There are a few common algorithms we can use for our rate limiting. Each of them has a different implementation and will come with its own set of pros and cons for our final rate-limiting system. The approaches we will discuss include:

  • Fixed window counter.

  • Sliding window counter.

  • Token Bucket.

  • Leaky Bucket.

  • Sliding window log.

Fixed window counter

This approach is one of the simplest and easiest to implement. We take the current fixed window (the current minute, hour, or whatever interval we want to rate limit on) and count the number of requests that fall within it. Once the rate limit is reached, no more requests are allowed. Once time moves on to the next window (i.e. a different minute), the counter restarts.

func AllowRequest(userID int64, limit int64) bool {
	window := time.Minute
	currMin := time.Now().Truncate(window).Unix()

	// One counter per user per minute
	key := fmt.Sprintf("limit_%d_%d", userID, currMin)

	// Use a distributed lock for isolation.
	// In practice, since we are already using Redis, SetNX is easy and will work.
	// However, SetNX has some problems, so we must accept that under some
	// conditions our lock may fail. This means that, rarely, an additional
	// request may be allowed for a user, which in most cases seems acceptable.
	// Evaluate whether that is acceptable to you, and if not, consider something
	// with stronger guarantees such as Apache ZooKeeper.
	// I will make an article about distributed locking soon, so stay tuned!
	lock := acquireLock(userID)
	defer lock.release()

	// INCR returns the counter's new value, so we do not need a separate GET
	val := redisClient.Incr(key)
	redisClient.Expire(key, window)

	return val <= limit
}

Whilst this approach is simple to implement and uses constant memory space, the main drawback is that we only consider the current interval at a time. Consider the situation where we allow 100 requests during any 1 given minute. During the first minute we allow a max of 100 requests, and then during the second minute, we allow 100 requests, and so on.

However, what if we wait until 1 minute 59 seconds, make all of our 100 requests, then at the 2-minute mark make another 100 requests? Within a couple of seconds we have made 200 requests, which is double our limit, and our service may not be able to bear it.

Now, we either need to make sure our service can handle bursts of up to double the rate limit, or we need to halve the maximum number of requests we allow during any given interval.

Sliding window counter

This approach is similar to the previous one, except that it addresses the boundary problem by also taking the previous window's count into account, weighted by how much of that window still overlaps with the sliding window ending now.

func GetKey(userID int64, min int64) string {
	return fmt.Sprintf("limit_%d_%d", userID, min)
}

func AllowRequest(userID int64, limit int64) bool {
	window := time.Minute // Window must be > 0
	now := time.Now()

	currMin := now.Truncate(window).Unix()
	prevMin := now.Add(-window).Truncate(window).Unix()

	keyCurr := GetKey(userID, currMin)
	keyPrev := GetKey(userID, prevMin)

	// Use a distributed lock for isolation
	lock := acquireLock(userID)
	defer lock.release()

	// INCR returns the new counter value for the current window
	currVal := redisClient.Incr(keyCurr)
	redisClient.Expire(keyCurr, 2 * window)

	prevVal := redisClient.Get(keyPrev)

	// How many seconds have elapsed since the start of the current window
	currOverlap := now.Unix() - currMin
	windowSecs := int64(window / time.Second)

	// Estimate the request count over the sliding window: the full current
	// count, plus the previous window's count weighted by how much of that
	// window still overlaps with the sliding window
	weighted := currVal + prevVal*(windowSecs-currOverlap)/windowSecs

	return weighted <= limit
}

This approach is still simple to implement and stores only two counters per user, but it is not perfect. The weighted estimate assumes requests in the previous window were spread out evenly, so it still does not fully account for bursts within the window, leading to over- or under-throttling. For example, if all 100 of the previous minute's requests arrived in its final second, the estimate halfway through the current minute only attributes 50 of them to the sliding window.

Token Bucket

Token Bucket uses a different approach from the previous two. Instead of keeping a single counter per time window, we use the metaphor of a bucket holding a certain number of tokens: each request consumes a token, and the bucket is refilled with tokens at a constant rate.

With this approach, we can support bursts of requests better while still bounding the overall rate: we may consume the tokens in the bucket as quickly as we like, as long as tokens remain, and over the long run we are limited only by how fast the bucket refills.

For this algorithm, we need to set some additional parameters: the capacity of the bucket and the token refill rate. For example, a capacity of 100 tokens with a refill rate of 1 token per second allows a burst of up to 100 requests at once, while sustaining an average of at most 1 request per second.

From an implementation perspective, we will flip the problem: the bucket starts empty, each allowed request adds one item to it, and items drain out at the refill rate. You can think of this as inverting the bucket, so instead of consuming tokens, we only allow requests while there is still space. The two formulations are equivalent.

For this algorithm, we need to keep track of the time of the last update and drain the bucket based on the refill rate and the elapsed time, which uses O(1) memory. We could also make use of other native Redis structures here, but for now, to keep it simple, we will store the state in Redis as JSON.

type History struct {
	UpdatedAt int64   `json:"updated_at"`
	Used      float64 `json:"used"` // how full the (inverted) bucket currently is
}

func AllowRequest(userID int64, capacity int64, refillPerSecond float64) bool {
	keyCurr := fmt.Sprintf("limit_%d", userID)

	// Use a distributed lock for isolation
	lock := acquireLock(userID)
	defer lock.release()

	body := redisClient.Get(keyCurr)
	history := History{}
	json.Unmarshal(body, &history)

	now := time.Now().Unix()

	// Drain the bucket for the time elapsed since the last update.
	// Clamp the time diff at 0 to guard against clock skew, and never let
	// the bucket level drop below 0.
	elapsed := float64(max(now-history.UpdatedAt, 0))
	history.Used = max(history.Used-refillPerSecond*elapsed, 0)

	// Reject if adding this request would overfill the bucket
	if history.Used+1 > float64(capacity) {
		return false
	}
	history.Used += 1
	history.UpdatedAt = now

	// After this much time, the bucket would have fully drained anyway
	// (refillPerSecond must be > 0)
	ttl := time.Duration(history.Used / refillPerSecond * float64(time.Second))
	newBody, _ := json.Marshal(history)
	redisClient.Set(keyCurr, newBody, ttl)

	return true
}

Leaky Bucket

Leaky Bucket uses the analogy of a bucket that is filled with water and leaks at a constant rate. The idea behind this algorithm is traffic shaping: regardless of how many requests arrive, they are processed at no more than a constant maximum rate (you may, of course, still receive fewer), which smooths out spikes and helps improve the stability of your services.

Conceptually, we start with an empty bucket with a hole in the bottom. We add requests to the bucket with some capacity, and once the capacity is reached, requests spill over (i.e. they are dropped). At the same time, given some constant leak rate, requests are processed.

There is one problem, however. This rate limiter works off a push-based model: the rate limiter stores the requests and pushes them downstream at a fixed rate. That means we need to build more logic into the rate limiter, increasing its complexity, rather than simply deciding whether to allow or drop a request.

One approach we could take is to reuse our existing token bucket implementation, with the refill rate set to the leak rate, and to store the request at the same time. A continuously running background job could then pull up to the leak rate's worth of requests and send them downstream. This keeps us in a push-based model.

A better approach is to publish each accepted request to a message queue such as Kafka and let our downstream services consume requests at their own rate, while the rate limiter still smooths out the traffic. We can then set the refill rate to the maximum consumption rate our downstream service can handle.

With this approach, we have converted the problem from a push-based model to a pull-based model, where the service pulls messages at its own rate, and the rate limiter prevents too many messages from entering the system at any given moment.
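To make this concrete, here is a minimal sketch of the downstream side of this pull-based model. It assumes the rate limiter publishes accepted requests to a Kafka topic named "requests", and it uses the segmentio/kafka-go and golang.org/x/time/rate libraries; the broker address, topic, and group ID are placeholders, not part of the design above.

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
	"golang.org/x/time/rate"
)

func consumeAtOwnRate(ctx context.Context, maxPerSecond float64) {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // placeholder broker address
		Topic:   "requests",                 // hypothetical topic the rate limiter publishes to
		GroupID: "downstream-service",
	})
	defer reader.Close()

	// Pace ourselves: process at most maxPerSecond messages per second,
	// regardless of how bursty the ingress traffic was
	limiter := rate.NewLimiter(rate.Limit(maxPerSecond), 1)

	for {
		if err := limiter.Wait(ctx); err != nil {
			return // context cancelled
		}
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			return
		}
		log.Printf("processing request: %s", msg.Value)
	}
}

The key design point is that the consumption rate lives in the downstream service, not in the rate limiter, so the service can never be pushed faster than it can handle.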

However, with this design, we have introduced some challenges. Our system is now asynchronous. This might be fine for some use cases, but others might require a synchronous response to be returned to the user, in which case this kind of rate limiter might not be the best approach.

Here is an implementation we can use to achieve our goal, reusing the code from the token bucket algorithm:

func allowRequestTokenBucket(userID int64, capacity int64, refillPerSecond float64) bool {	
	// Reuse the same code from the token bucket algorithm
}

func EnqueueRequest(request interface{}, userID int64, capacity int64, leakRate float64) bool {
	// First check if we can send the request
	allowed := allowRequestTokenBucket(userID, capacity, leakRate)
	if !allowed {
		return false
	}
	
	// Enqueue the message for async processing
	return enqueueRequest(request)
}

With this implementation, you will need to consider some failure scenarios. Say the rate limiter fails: should we drop the request? This kind of thinking applies to any of the previous approaches too. Also, what happens if the rate limiter allows the request and consumes a token from the bucket, but enqueuing the message fails?

You can change the order of the updates based on your desired behaviour (i.e. enqueue the message first and then update the rate limiter), or use a distributed transaction to make the two operations atomic. But even then, we cannot guarantee that the request will be successfully processed. These questions are all worth thinking about when designing distributed systems. It should be noted that this same problem applies to all of the other rate limiters we have discussed.

Sliding window log

Unlike the token bucket, which allows bursts up to its capacity and refills at a constant rate, a sliding window log guarantees that a user can make at most a given number of requests during any interval of the length you specify. When we need to be exact about the number of requests offered to users over some window, this is the algorithm we should use.

How does it work? As the name suggests, we keep a log of all requests within the given timeframe and count how many requests were made in the window ending at the current time. If this count has reached the allowed maximum, we do not let the user make any more requests.

The algorithm we will use goes as follows:

  1. Look at all previous requests made that have a creation time > now - window.

  2. If the count is greater than or equal to the maximum allowed during the window, drop the request.

  3. Otherwise, add the request to the list with the timestamp of now and allow the request.

Compared to the other algorithms we have seen, we need to keep track of our request history, so this algorithm has O(n) time and memory complexity, where n is the maximum number of requests allowed within the window, as we may need to walk through the stored entries to evict expired ones, even if we keep a running count of active requests.

We can use Redis sorted sets for an efficient implementation.

func AllowRequest(userID int64, maxWindowSize int64, window time.Duration) bool {
	keyCurr := fmt.Sprintf("limit_%d", userID)

	// Use a distributed lock for isolation
	lock := acquireLock(userID)
	defer lock.release()

	now := time.Now()

	// Remove all entries with a timestamp at or before the start of the window
	windowStart := now.Add(-window).Unix()
	redisClient.ZRemRangeByScore(keyCurr, "-inf", fmt.Sprint(windowStart))

	// Get the current element count
	count := redisClient.ZCard(keyCurr)

	if count >= maxWindowSize {
		return false
	}

	// Add the current request: the score is the timestamp in seconds, and the
	// member must be unique per request, so we use a nanosecond timestamp
	redisClient.ZAdd(keyCurr, now.Unix(), fmt.Sprint(now.UnixNano()))

	// Set an expiry on the whole set to reclaim memory after the window ends
	redisClient.Expire(keyCurr, window)

	return true
}

One thing we should be aware of is that clock skew between our different nodes (each of which calls time.Now() locally) could lead to inconsistent rate-limiting results. One way to fix this is to use Redis Lua scripting and let Redis supply the time for us, which would also remove the need for the distributed lock, since the script executes atomically.

Furthermore, to make sure the ZAdd and Expire commands are executed atomically, you can wrap them in a MULTI/EXEC transaction or include them in the same Lua script (sketched below), preventing an inconsistent state where entries are added but the expiry is never set. Even without that, if the rate limiter is called again for the same user, the expiry will eventually be set and the stale set removed.
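As an illustration, here is a rough sketch of the sliding window log written as a Redis Lua script and invoked from Go. It assumes go-redis v9 and Redis 5 or later (where scripts are replicated by their effects, so calling TIME before a write is allowed); the function name is made up for the example. Because the script runs atomically on the Redis server and takes its timestamp from Redis itself, it sidesteps both the clock-skew issue and the need for a distributed lock.

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

const slidingWindowScript = `
local key = KEYS[1]
local window = tonumber(ARGV[1]) -- window size in seconds
local limit = tonumber(ARGV[2])  -- max requests per window

-- Take the timestamp from Redis itself: TIME returns {seconds, microseconds}
local t = redis.call('TIME')
local now = tonumber(t[1])

-- Evict entries that have fallen out of the window
redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window)

if redis.call('ZCARD', key) >= limit then
	return 0
end

-- Use the microsecond part to keep members (almost always) unique
redis.call('ZADD', key, now, t[1] .. '-' .. t[2])
redis.call('EXPIRE', key, window)
return 1
`

func AllowRequestScripted(ctx context.Context, rdb *redis.Client, userID int64, limit int64, window time.Duration) (bool, error) {
	key := fmt.Sprintf("limit_%d", userID)
	res, err := rdb.Eval(ctx, slidingWindowScript, []string{key},
		int64(window/time.Second), limit).Int64()
	if err != nil {
		return false, err
	}
	return res == 1, nil
}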

How can we scale this up to ensure high availability and performance?

Thanks to our design, scaling our implementation and ensuring high availability are relatively simple tasks.

Throughout this design, the implementation shows us using Redis, but we have not addressed why we chose Redis over other solutions. For our use case, we need to support a lot of write traffic, and we have strict performance requirements so the rate limiter adds minimal latency to the rest of the system. In addition, we only perform key-value operations. Based on these requirements, we should be thinking of an in-memory key-value store, and Redis is a fantastic, battle-tested choice. It also has a rich ecosystem and built-in support for replication and sharding with Redis Cluster.

To start, we need to look at the read-to-write ratio of our workload. Since every request to our rate limiter involves both reading and writing some data, we can assume these will be roughly equal. Our system also needs read-after-write consistency: we cannot afford stale reads, which (due to replication lag) could allow more requests than permitted. This means all reads and writes for a given key should go through the same node, the master. To ensure scalability with this kind of access pattern, we need to introduce sharding into our Redis cluster.

One pattern we can observe from our data is that we only read and write data using key-value operations (which is why we chose Redis in the first place). This means our keys can be distributed across any of the nodes in our cluster, as long as we have some way of identifying which node holds a particular key.

We will shard our data using a consistent hashing approach, which helps minimize the cost of rebalancing our shards when we add or remove a node. Sharding is a complex topic in distributed systems and critical to understand, but we will consider the details out of scope for this design; I will instead write a separate article soon explaining how it works in detail.

In terms of how we distribute the keys: since all of our key-value lookups are independent, as stated before, we can use a hashing strategy that ensures a uniform distribution of keys, spreading load consistently across all our nodes and minimizing the hot-key problem.

Therefore, the approach we will take for reading or writing any given key in our Redis cluster goes as follows (a small sketch follows the list):

  1. Send the key to our load balancer.

  2. Hash the key.

  3. Place the key on the ring, and move in a clockwise direction until we find the first shard on the ring.

  4. Send the read/write request to the master node of that shard.
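To illustrate steps 2 and 3, here is a minimal, illustrative consistent-hash ring in Go. This is not how Redis Cluster itself assigns keys (it uses hash slots), and the type and function names are made up for the example; virtual nodes and replica placement are omitted for brevity.

import (
	"hash/crc32"
	"sort"
)

// Ring maps hashed keys to nodes: a key is served by the first node
// clockwise from the key's own position on the ring.
type Ring struct {
	positions []uint32          // sorted node positions on the ring
	nodes     map[uint32]string // position -> node address
}

func NewRing(nodes []string) *Ring {
	r := &Ring{nodes: make(map[uint32]string)}
	for _, n := range nodes {
		pos := crc32.ChecksumIEEE([]byte(n))
		r.positions = append(r.positions, pos)
		r.nodes[pos] = n
	}
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
	return r
}

// NodeFor hashes the key and walks clockwise to the first node at or after it.
func (r *Ring) NodeFor(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= h })
	if i == len(r.positions) {
		i = 0 // wrapped past the last node, so go back to the start of the ring
	}
	return r.nodes[r.positions[i]]
}

With this in place, the load balancer (or client library) would call something like NodeFor(key) and send the read or write to the master of that shard. Adding a node only remaps the keys on the arc it takes over, which is the rebalancing benefit mentioned above.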

We can increase the scalability of the system by increasing the granularity of the keys. Right now, our keys exist at a per-user granularity, given how the key is structured. By including additional values in the key, such as the type of request being made, or the service being rate-limited (if we rate limit at a per-service granularity), we make the key more fine-grained, so load is spread more evenly across nodes, helping to scale traffic and avoid hot shards for users who make a lot of requests.

// Original
key := fmt.Sprintf("limit_%d", userID)

// Finer granularity helps to distribute requests across nodes
key := fmt.Sprintf("limit_%d_%s_%s", userID, requestType, service)

When considering high availability and fault tolerance for our rate limiter, we need to add replica nodes so that if our master node goes down, we can switch to a replica and have it take over serving requests. Replication is a complex topic that is critical to distributed systems and deserves its own article soon; for now, we will look at our requirements and identify what will work best for us.

Our goal is to limit the number of requests a user can make, and our Redis storage keeps track of the requests users have made so far. If one of our shards is unavailable, our rate limiter will temporarily be unable to reach that node, and you must decide as a developer: do you drop all requests, or allow all requests? Dropping requests is never good, as we are losing data; imagine dropping purchase requests on an e-commerce site, we would be leaving money on the table. However, allowing all requests has its own problems, as we might overwhelm our downstream services and cause an outage. Clearly, avoiding downtime for the rate limiter itself is the preferred option here, which we can achieve by having a standby replica ready to take over.
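As a sketch of that decision, the wrapper below "fails open": if the backing store is unreachable, it allows the request rather than dropping user traffic, and logs the event so the failure is visible. The checkRateLimit helper is hypothetical and stands in for any of the algorithms above, returning an error when Redis cannot be reached; you could just as easily return false here to fail closed.

func AllowRequestFailOpen(userID int64, limit int64) bool {
	// checkRateLimit is a hypothetical wrapper around one of the algorithms above
	allowed, err := checkRateLimit(userID, limit)
	if err != nil {
		// The shard is unreachable: prefer availability over strict limiting,
		// but log it so we know the limiter is effectively disabled
		log.Printf("rate limiter unavailable, failing open: %v", err)
		return true
	}
	return allowed
}

In either case, the real fix is to keep the limiter available in the first place, which brings us to replication.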

As we need to guarantee read-after-write consistency, we will use single-leader replication: one master node and one or more replicas. Changes can be replicated to a replica either synchronously or asynchronously. Synchronous replication guarantees that data is committed to the replica before we acknowledge the write to the client, so if the master goes down, we lose no data; the trade-off is the additional latency spent waiting for that commit. So we need to ask: is it a big deal if, in the (hopefully rare) case of a failure, we lose a little data and some users get to make a few more requests than they were allowed? If the answer is yes, we should use synchronous replication. If the answer is no, we can use asynchronous replication, where data is copied to the replica in the background. In that case, if the master fails and we fail over to the replica, we keep everything that has already been replicated, but some recent writes may be missing. How much is missing depends on the replication lag, which typically grows when the master is overwhelmed and lacks the resources to replicate.
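Since we are on Redis specifically, it is worth noting that Redis replication is asynchronous by default and does not offer fully synchronous commits. A common mitigation is to bound how far behind replicas may fall before the master stops accepting writes, which limits how much un-replicated data a failover can lose. A minimal redis.conf sketch, assuming Redis 5+ and a single replica (the address below is a placeholder):

# On the replica: follow the master (placeholder address)
replicaof 10.0.0.1 6379

# On the master: stop accepting writes unless at least 1 replica is
# connected and no more than 10 seconds behind, bounding how much
# un-replicated data a failover can lose
min-replicas-to-write 1
min-replicas-max-lag 10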

Note that redundancy comes with an additional cost, as we are paying for memory and resources that do not serve online traffic and are only there as a backup.

Some additional thoughts

Whilst we have solved the problem, at the end of the day dropping requests is not ideal, as we are losing data. One approach we can take is to use asynchronous processing where possible instead. With this approach, we decouple ingestion from processing and can accept as many requests as the user wants to send (within reason), while processing them at a pace our servers are comfortable with so as not to overwhelm them. In the case of the leaky bucket, which is asynchronous by nature, perhaps we could block requests from entering the system but store them in a dead-letter queue, to be automatically enqueued again later when the system is not under such high strain.

Another approach we can take is the most obvious one: scale up our resources in advance. If we know an event is coming that will lead to a surge in traffic, we can pre-scale our servers so they can handle the higher load. With more capacity available, we can raise our rate limits and greatly reduce the chance that a limit is reached and requests are dropped. This is particularly easy in a cloud environment, which most companies run on nowadays. Alternatively, you can consider auto-scaling, which will take care of this for you (be careful, however), and have your maximum rate-limit capacity adjust dynamically with the number of nodes in the system. Auto-scaling is not instant, though, so your rate limiter still needs to absorb spiky traffic while the system scales up to serve the requests.

Finally

Thanks for reading what I have to say about rate limiting and designing one for use at massive scale; I greatly appreciate it! If you’re interested, I’m going to be writing articles about the following topics in the future (just off the top of my head):

  • Sharding and consistent hashing.

  • Replication.

  • Caching.

  • And much more to come…

Also, if you know someone who would be interested in this content (maybe someone preparing for an upcoming big tech software engineering interview), please share it around!