Design a Distributed Unique ID Generator
Ace your next system design interview by learning how to build a low-latency, highly-available, and fault-tolerant unique ID generator.
What is ID Generation and why does it matter?
Most systems require a method for generating globally unique IDs to uniquely identify records or entities within the system.
Common use cases for unique ID generators include user IDs or message IDs in a chat application. When using a primary key in a database, it's often a good idea to use a globally unique ID, as this approach makes it easier to shard data in the future, unlike an auto-incrementing integer, which works well only on a single database node.
In this design, we will discuss the challenges with generating unique IDs at scale in a distributed system, and eventually derive our own implementation which can satisfy our main requirements.
Requirements
To start off, we will define some functional and non-functional requirements for our system.
For functional requirements, when we call our ID generator we want to get a unique ID that we have not seen before.
For non-functional requirements, these are the same as most of our designs:
The system is latency sensitive.
The system should scale horizontally to be able to cope with millions of concurrent requests.
The system should be highly available and fault tolerant.
Generating unique IDs
So now that we know what we need to build, how can we achieve our requirements? You might already have some ideas off the top of your head, so let's start with something very simple and work our way up.
Random number generation
To start, if we want to generate unique IDs, why can't we just pick a random number in a certain range and use that as our ID? If we pick a large enough range, surely this will work, right?
Well, no. Even if you have a huge maximum number, eventually you will repeat the same IDs. And worse, you will likely have no way of knowing until it is too late and something breaks in your system.
Let's say you generate numbers between 0 and 9. If you generate 11 IDs, the pigeonhole principle guarantees that at least two of them must be the same, even in the best case. And that does not even consider the worst case, where you get unlucky and generate the ID 1 eleven times in a row.
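To see how quickly this happens at a more realistic scale, here is a minimal sketch in Python (the range of 2^32 is an arbitrary illustrative choice, not a number from this design): per the birthday problem, a duplicate typically appears after only around 80,000 draws, even though the range holds over four billion values.

```python
import random

# Arbitrary illustrative range: ~4.3 billion possible values.
ID_RANGE = 2**32

seen = set()
count = 0
while True:
    candidate = random.randrange(ID_RANGE)
    count += 1
    if candidate in seen:
        break  # duplicate found
    seen.add(candidate)

# Typically prints a number in the tens of thousands, not billions.
print(f"first duplicate after {count} IDs")
```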

So clearly, this approach will not satisfy our requirements.
Using a timestamp
The next approach we can consider is to use a timestamp. After all, time continuously moves forward, so we should not be able to generate the same ID twice, right?
In theory, yes. In practice, no. Let’s consider a few scenarios.
First, what happens if we attempt to generate an ID at the exact same time? Whether we use seconds, milliseconds, or nanoseconds, that does not stop the rare situation where we call the ID generator at the exact same time and end up with the same ID, causing a conflict in our system. To fix this, we would need some additional way of differentiating these timestamps, such as appending a sequence number so that at timestamp t1 we can have the IDs t1-1 and t1-2.
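As a tiny illustration in Python of why millisecond timestamps alone are not enough, two back-to-back calls easily land in the same millisecond and would therefore produce identical IDs:

```python
import time

# Two IDs generated back-to-back frequently share the same millisecond.
id_a = int(time.time() * 1000)
id_b = int(time.time() * 1000)
print(id_a == id_b)  # often True, which would mean a duplicate ID
```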
Next, we cannot guarantee that all timestamps will be continuously increasing. Why is this the case? First, clocks do not hold time perfectly and can become out of sync between servers. For this we have the NTP protocol, which will occasionally try to correct the clock drift. However, this means that if our clock was initially too fast, after the correction the clock might reset to a smaller value, causing time to go backwards, once again potentially causing a conflict in our system.
To fix this, we could keep track of which timestamps we have seen before so we do not reuse them, but now we have introduced additional storage overhead into our system, plus additional latency to look up the value each time before we use it. This approach could satisfy our high-availability, fault-tolerance, and uniqueness requirements. One thing to note, though: since all our servers are expected to have their clocks roughly synchronised, if we are generating IDs at very high load it is reasonable to assume we will be trying to reuse the same timestamp a lot of the time. Effectively we only get the throughput of the one server with the fastest clock, and the rest of our requests are dropped because the lookup against our ID history shows the ID already exists, resulting in a very high error rate.

While we can see this approach won't work for us, there is one nice benefit: our IDs now have a natural ordering based on their generation time, as the timestamp is built into the ID. This is beneficial because if we use this ID as an index, our IDs will be roughly sorted in chronological order, meaning we can run range queries efficiently without needing to maintain a separate "created at" index. Let's see if we can maintain this property going forward.
Using a counter
The next approach we might consider is using a single counter. After all, if we increment every time we use an ID, we are guaranteed to never get the same ID twice, meaning this will satisfy our functional requirements.
However, in terms of satisfying our non-functional requirements, this might have some challenges. Because we need to maintain a single counter, all of our ID generation must go through a single server. This is clearly not very scalable as the single node is a performance bottleneck, and does not satisfy our high availability constraints as if this single server goes down we cannot generate IDs. To solve this HA constraint, we can introduce synchronous replication where we write to a standby instance before we write to our primary server which will ensure the ID is always increasing (even if the primary node unexpectedly fails after replication but before the response is returned), but this will slow down our system and does not satisfy our scalability requirement. However, if we need to guarantee an absolute ordering of IDs in our system, it turns out this is the only way to do so.

But why can't we simply add more servers? To maintain a consistent maximum value of our ID generator, we need to make sure all writes go to one node. Consider the scenario where we generate an ID from two nodes at the same time using a multi-master setup. Neither has had a chance to replicate its changes, so both return the same ID. Now we have a conflict in our system.

In addition, now we need some way of being able to persist the state of our counter so that the value is always increasing. Otherwise, if our node crashes and we cannot recover the state, we will not know where to restart the counter from and can create some duplicates. If we introduced the timestamp as a prefix, this would help to solve the counter recovery problem, as if our server crashes, when we restart the server time will have moved forward and we can restart the counter from 0.
What we really need is a way of being able to run our counter in parallel on different servers completely independent of others whilst guaranteeing they will never generate the same ID, which would make it very easy to horizontally scale. If we have some way to differentiate the counters such as a unique identifier for each server, we can have different counters on different servers and guarantee that the IDs we generate will never be the same.
Snowflake
Now that we are familiar with some of the challenges of generating such IDs, let's introduce one of the most famous ID generation algorithms: Snowflake (from Twitter).
A Snowflake ID is a 64-bit integer consisting of the following parts:
1 unused bit.
41-bit timestamp representing milliseconds.
10-bit worker ID.
12-bit sequence ID.
There are some variations, particularly around the worker ID and how they are distributed, but from an implementation perspective I think it makes most sense to keep this part as one whole section.
With a 64-bit integer, we have 18,446,744,073,709,551,616 possible unique IDs, which is likely more than we will ever need.
Our Snowflake ID takes the following form, with all of the parts concatenated together into a single 64-bit binary value, which we can then present as a base-10 integer.

You might notice that this ID looks similar to what we have already come up with for the timestamp and counter approaches, and you would be right. However, there is one differentiating factor, and that is the worker ID. For each thread across the different machines (called a worker) that is generating IDs, this worker ID will be unique. This means that all of our workers can operate completely independently and will never generate the same ID as another worker, and we only need to keep track of the timestamp and counter values on a per-worker basis, which does not need to be distributed. By introducing this simple concept of the worker ID, we can now parallelise our ID generation and get low latency (no need to keep track of previous requests), uniqueness, and high availability through horizontal scaling.
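As a minimal sketch (Python, using the bit widths listed above), packing the three fields plus the unused top bit into a single 64-bit value looks roughly like this; the example inputs are arbitrary, and a fuller generation loop appears in the implementation section later.

```python
# Field widths from the Snowflake layout described above.
TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12

def pack_snowflake(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Concatenate the fields into one 64-bit integer; the top bit stays 0."""
    assert 0 <= timestamp_ms < (1 << TIMESTAMP_BITS)
    assert 0 <= worker_id < (1 << WORKER_BITS)
    assert 0 <= sequence < (1 << SEQUENCE_BITS)
    return (
        (timestamp_ms << (WORKER_BITS + SEQUENCE_BITS))
        | (worker_id << SEQUENCE_BITS)
        | sequence
    )

# Arbitrary example values: timestamp offset 1,234,567 ms, worker 42, sequence 7.
print(pack_snowflake(1_234_567, 42, 7))
```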

In terms of fault tolerance, if one of our workers goes down for some reason and is restarted, we need to make sure it will not generate IDs that we have already seen before. To achieve this, we can simply assign it a new worker ID that is not being used by any other machine (we can discuss this more during implementation) and make sure that its timestamp and/or counter values have not been used by a worker before it.
There are some constraints we need to consider, however. First, even though we have a 41-bit timestamp, this assumes we are counting milliseconds from the Unix epoch (back in 1970). We do not live in 1970, which means we waste all of the timestamp values between then and now. To fix this, we can make the timestamp relative to now by taking the difference between the current time and our ID generator's deployment date (some fixed, hardcoded value), ensuring we maximise our ID range. Our IDs will still be relatively ordered, but if we want to run a range query over a particular timestamp, we must introduce some special logic to convert between real time and our custom epoch.
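As a small worked example (the custom epoch below is a hypothetical deployment date), the offset is simply the current time minus the hardcoded epoch, and 41 bits of milliseconds gives roughly 69 years of usable range from that epoch.

```python
import time

# Hypothetical custom epoch: the generator's deployment date in Unix milliseconds.
CUSTOM_EPOCH_MS = 1_700_000_000_000  # roughly November 2023

timestamp_offset = int(time.time() * 1000) - CUSTOM_EPOCH_MS

# 2**41 milliseconds is about 69.7 years of range from the custom epoch.
years_of_range = (1 << 41) / (1000 * 60 * 60 * 24 * 365.25)
print(timestamp_offset, round(years_of_range, 1))
```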
Next, we can see that for our worker ID we only have 10 bits (1024 possible values), meaning at the very maximum we could only run 1024 possible worker nodes in parallel at any given time, and for our sequence ID since we only have 12-bits there is a limit to how many IDs we can generate per millisecond from our timestamp per worker (4096).
Whilst for most use cases this is enough, if we wanted to increase this, we could consider increasing the size of the ID we generate from 64-bit to something like 128-bit. However, most computers and programming languages are primarily designed to handle integers up to 64 bits, so we might need to store this ID as a string, which requires significantly more storage space, in addition to degrading the performance of our database indexes as we cannot fit as many keys into our pages.
If you’re looking for a real implementation including code, Twitter has actually open sourced their own implementation of Snowflake which you can find here.
Managing worker IDs
In terms of our implementation, there are a few things we need to consider. Essentially, our worker ID is what ensures that all of our different threads can generate IDs independently. However, if we start sharing a worker ID between different servers, then when the clocks differ between those servers we might accidentally generate an ID with a timestamp, worker ID, and sequence ID combination that has existed before, leading to a duplicate ID.
To fix this, we can make sure that each machine in our cluster persists the same worker ID by configuring it as an environment variable, or even just hardcoding this on the machine itself. But what if we want to have multiple threads on the same machine generating IDs? For this, we can assign ranges of worker IDs and then allocate them accordingly amongst our threads. We can do this because these threads will all run on the same machine and therefore it is OK for us to dynamically allocate them as the timestamp will always be greater.
But hold on for one second. Can we guarantee that the timestamp will always be greater? We said before that NTP can turn time backwards in some cases to correct a fast clock. However, this is actually a configurable option which we can disable, so that the clock instead runs slower until it matches the current time. In fact, this is recommended in Twitter's own Snowflake implementation from 2010.
So now we know how to guarantee uniquely generated IDs in a scenario where the worker ID stays allocated to the same machine and we guarantee that time cannot go backwards. However, in practice nowadays we usually use Kubernetes to schedule our workloads. Kubernetes by default aims to maximise efficiency, and depending on the configuration can be very aggressive about killing off pods and rescheduling them. What happens if a pod is killed and rescheduled onto another machine with the same worker ID but a slower clock that is behind the original machine's? We might accidentally generate the same ID again.
To fix this, we can introduce a concept called node affinity, which restricts Kubernetes to scheduling your pod to specific nodes. This way, you can guarantee that your pods with the same worker ID will always run on the same node, and guarantee that you will never accidentally go backwards in time.
But what happens if we decide to make the system more elastic, that is, to dynamically reassign worker IDs to other machines? We need to be careful to make sure we do not accidentally go backwards in time, as the clocks between the two machines could be out of sync. The safest way for us to do this is manually, so we can guarantee there will be no conflicts. This might work if we occasionally remove a node and need to rebalance our worker IDs, but it does not meet the requirement of a dynamic system. Depending on the size of our cluster and the reliability of the underlying hardware, we might be able to get away with this as a manual job for our engineers to occasionally perform. However, once we have enough machines in our cluster, we reach a point where they will be failing regularly, making it infeasible to manually check and restart them.
If we assume our clocks will be in sync to within a few milliseconds most of the time, which is reasonable given modern hardware and data centres, the time taken to redistribute our worker IDs should be long enough that we will not reuse the same timestamp for the same worker ID. In addition, we can introduce a startup delay based on our expected clock accuracy. However, in some rare cases, duplicates could still occur. To further reduce the chance of duplicates, we could use atomic clocks, which have a very accurate clock skew bound. However, this would be very complex and seems overkill for most systems.

To build a truly reliable, fault-tolerant system and to relax some of the timestamp guarantee problems we have previously encountered, we should accept that this error can occur occasionally, and handle the situation gracefully when it does. By building our systems in ways that guarantee a consistent snapshot of the data we store, which is then used to derive all other data in the system, along with some defensive programming (e.g. atomically checking whether a value exists before inserting rather than naively upserting), we can eliminate the risk that comes from inserting a duplicate ID into the system, and instead return a failure. Transactions and change data capture are two methods which can help to guarantee this within your system.

Another question is whether we should aim to capture every generated ID to avoid waste. To do this, we could use persistent storage, and then when we want to retrieve some IDs we could delete them from the database. Storing our IDs would also help to prevent duplicates, as we would have an entire history of IDs.
To start, persistent storage means storing data on disk, which adds extra latency compared to an in-memory solution. We also need to lock our data during reads to ensure each ID is handed out only once, and then delete the IDs or mark them used as part of the same transaction. Depending on how we query these IDs in our database (e.g. using an index and starting with the lowest IDs first), this could lead to very high lock contention. In addition, with our 64-bit Snowflake format we can generate a maximum of 4,194,304 IDs per millisecond (1,024 workers × 4,096 sequence numbers), each of which is 64 bits, resulting in roughly 34 GB per second. Over an entire year, this comes out to roughly an exabyte, which is far too expensive to store, especially for IDs which can be generated so cheaply. Since we generate so many IDs per second, it makes more sense to generate them on demand and keep a fixed number of them in memory for fast access. If we needed to support higher throughput, we could increase the size of the ID from 64-bit to 96-bit or 128-bit, allowing for more workers.
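As a quick back-of-envelope check of those numbers (Python):

```python
# Back-of-envelope cost of persisting every possible ID our cluster could emit.
workers = 1 << 10              # 1,024 worker IDs
sequences_per_ms = 1 << 12     # 4,096 sequence numbers per millisecond
ids_per_ms = workers * sequences_per_ms      # 4,194,304
bytes_per_second = ids_per_ms * 8 * 1000     # 64-bit (8-byte) IDs -> ~33.5 GB/s
bytes_per_year = bytes_per_second * 60 * 60 * 24 * 365
print(ids_per_ms, bytes_per_second / 1e9, bytes_per_year / 1e18)  # ~4.2M, ~34 GB/s, ~1.1 EB/yr
```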
Low-Level implementation
Because our system is latency sensitive, we want to ensure we are generating IDs as fast as possible, and that our read operation is very simple. We will design our implementation around this.
Before we deploy any code, we need to make sure our server cannot go backwards in time by configuring the underlying machine to prevent duplicate IDs from the same machine, and instead slow the system clock down until it is in sync again.
When each of our machines first starts, they will retrieve their worker ID range that will be unique to this particular server. For the static approach, we can assign the worker ID range using an environment variable used when starting the server.
For the dynamic approach, we can create entries in ZooKeeper under /range/machine-0 to /range/machine-n, where each machine, based on its suffix, is assigned a fixed block of worker IDs (e.g., machine-0 corresponds to IDs 0-63, machine-1 corresponds to IDs 64-127, and so on). Each server iterates through the nodes in the /range directory and checks whether a node exists under /range/machine-n/locked-k. If it doesn't, the server creates an ephemeral sequential node under /range/machine-n/locked-k and claims the range by holding that lock if its sequential node has the smallest number in the sequence.
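A minimal sketch of this claim procedure might look like the following, using the kazoo ZooKeeper client for Python. The ensemble address, range size, and slot count are assumptions for illustration, and kazoo's Lock recipe uses its own internal node naming rather than the locked-k convention above, but it implements the same "ephemeral sequential node, lowest sequence number wins" scheme.

```python
from kazoo.client import KazooClient

RANGE_SIZE = 64   # worker IDs per machine slot (assumed)
NUM_SLOTS = 16    # 16 slots x 64 IDs covers all 1,024 worker IDs

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ensemble
zk.start()

claimed = None
for slot in range(NUM_SLOTS):
    # The Lock recipe creates an ephemeral sequential node under the path and
    # grants the lock to the holder of the lowest sequence number.
    lock = zk.Lock(f"/range/machine-{slot}", identifier="my-hostname")
    if lock.acquire(blocking=False):
        claimed = range(slot * RANGE_SIZE, (slot + 1) * RANGE_SIZE)
        break

if claimed is None:
    raise RuntimeError("no free worker ID range available")
print(f"claimed worker IDs {claimed.start}-{claimed.stop - 1}")
```

Because the lock node is ephemeral, the range is released automatically if the machine's ZooKeeper session expires, so another machine can later claim it.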
Next, we spin up a set of threads on the machine and assign each one a worker ID. It does not make sense to have more worker IDs than threads on a single machine as the worker ID is the identifier for the thread to ensure the IDs it generates are unique, so we should aim to make it such that the number of worker ID partitions per server exactly matches the maximum number of threads we can have running on a server. As our threads are very CPU bound, we should aim to have the number of threads match the number of available cores for our server.
Each thread will be provided the worker ID as a parameter, and then from there it will need to maintain the last seen timestamp which is initialised as the current time + our expected clock skew range (to minimise the chance of duplicates from clock drift), and the sequence number which is initialised as 0.
For each ID we generate, the pseudocode will be as follows.
Get the current timestamp.
If the current timestamp is less than the previous timestamp (should not happen as we disabled backwards time), return false result.
If the current timestamp is greater than the previous timestamp, store the current timestamp as the previous timestamp and reset the sequence number to 0.
Subtract some constant arbitrary start date from the timestamp (relative to when you first deployed the application), to get the timestamp offset.
Get the current sequence number.
If the current sequence number is greater than or equal to the maximum allowed sequence number, return a false result.
Otherwise, increment the sequence number.
Generate the ID by concatenating 1-bit unused bit (0), the 41-bit timestamp offset, 10-bit worker ID, and 12-bit sequence number into a 64-bit ID, and return it as a base-10 int64.
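Putting those steps together, a minimal single-threaded sketch in Python might look like the following; the custom epoch and expected clock skew are illustrative assumptions.

```python
import time

CUSTOM_EPOCH_MS = 1_700_000_000_000          # assumed deployment date
TIMESTAMP_BITS, WORKER_BITS, SEQUENCE_BITS = 41, 10, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

class SnowflakeWorker:
    def __init__(self, worker_id: int, expected_clock_skew_ms: int = 10):
        assert 0 <= worker_id < (1 << WORKER_BITS)
        self.worker_id = worker_id
        # Start "last seen" ahead of now by the expected skew so a reassigned
        # worker ID cannot immediately reuse a recent timestamp.
        self.last_timestamp = self._now_ms() + expected_clock_skew_ms
        self.sequence = 0

    def _now_ms(self) -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int | None:
        now = self._now_ms()
        if now < self.last_timestamp:
            # Should not happen once backwards clock steps are disabled.
            return None
        if now > self.last_timestamp:
            self.last_timestamp = now
            self.sequence = 0
        elif self.sequence >= MAX_SEQUENCE:
            # Sequence exhausted for this millisecond; the caller can retry.
            return None
        else:
            self.sequence += 1
        offset = self.last_timestamp - CUSTOM_EPOCH_MS
        return (
            (offset << (WORKER_BITS + SEQUENCE_BITS))
            | (self.worker_id << SEQUENCE_BITS)
            | self.sequence
        )

worker = SnowflakeWorker(worker_id=42)
new_id = None
while new_id is None:  # the first few calls fail during the startup delay
    new_id = worker.next_id()
print(new_id)
```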
Because the worker ID is assigned uniquely to a thread, these threads can all work independently without any kind of lock contention to maximise the number of IDs they generate.
While we could just generate all of our IDs on demand whenever a client requests one from the service, it makes more sense to pre-generate them in advance and keep the IDs available in memory. We do this because even though generating a single ID is fast, there is still some overhead involved, and if a client makes a batch request for a large number of IDs we would need to figure out a way of distributing that work across our workers and synchronising the result, with each worker generating its IDs sequentially.
Instead, it is better if each worker continuously produces a fixed number of IDs to a fixed length queue, and then when a client requests a batch of IDs, we just read them off the central queue (think of the classic bounded producer consumer problem). To reduce the amount of lock contention on a single queue, we can have multiple queues that workers can write to randomly to distribute load. We can also take advantage of batch processing, where each worker generates a batch of IDs first, acquires the lock, and uploads their entire batch to reduce the number of lock acquisitions required.
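A rough sketch of that arrangement is below, shown with a stand-in generator so the snippet is self-contained; a single bounded queue.Queue plays the role of the shared buffer, and the multiple-queue and batched-upload optimisations are omitted for brevity.

```python
import itertools
import queue
import threading

# Bounded buffer of pre-generated IDs; the size is an arbitrary assumption.
id_buffer = queue.Queue(maxsize=100_000)

def producer(generate_id) -> None:
    """Continuously fill the buffer; put() blocks once the buffer is full."""
    while True:
        new_id = generate_id()
        if new_id is not None:
            id_buffer.put(new_id)

def get_ids(n: int) -> list[int]:
    """Serve a client batch request by popping n IDs off the shared buffer."""
    return [id_buffer.get() for _ in range(n)]

# Stand-in generator for the demo; in practice this would be a Snowflake
# worker's next_id() running in each producer thread.
counter = itertools.count()
threading.Thread(target=producer, args=(lambda: next(counter),), daemon=True).start()
print(get_ids(5))
```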
For our entire cluster of ID generators, we will attach a load balancer, as it does not matter which server we connect to for generating our IDs. From a client perspective, to help reduce latency further and reduce the number of network trips we make, clients can prefetch and populate their own local cache of unique IDs in batches, and then make another request when the number of local IDs falls below a certain threshold.
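On the client side, such a prefetching cache might look roughly like this hypothetical sketch, where fetch_batch stands in for a real network call to the ID service and the batch size and watermark are arbitrary.

```python
import collections

class IdCache:
    """Client-side cache that refills itself when it runs low on IDs."""

    def __init__(self, fetch_batch, batch_size: int = 1000, low_watermark: int = 100):
        self._fetch_batch = fetch_batch      # callable: n -> list of n IDs
        self._batch_size = batch_size
        self._low_watermark = low_watermark
        self._ids = collections.deque()

    def next_id(self) -> int:
        if len(self._ids) <= self._low_watermark:
            # Refill before we run dry to hide the network round trip.
            self._ids.extend(self._fetch_batch(self._batch_size))
        return self._ids.popleft()

# Stand-in fetch function for the demo; a real client would call the service.
cache = IdCache(fetch_batch=lambda n: list(range(n)))
print(cache.next_id())
```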

You made it!
Thanks everyone for reading my guide on designing and implementing a distributed unique ID generator! I hope you learned at least one thing, and that if asked during an interview or in your day-to-day job, you could now correctly implement a solution whilst understanding and being able to explain the design choices made.
For free access to in-depth systems design content that goes beyond interview prep and teaches you how to actually build systems, subscribe to the newsletter at no cost!
In addition, if you know someone who you think would find this content useful (maybe someone preparing for a big tech job interview), I would greatly appreciate it if you could send this article to them (bonus points if you ask them to subscribe)!