How To Build A Discord Clone - Part 1
Impress your next system design interviewer by learning how to build your very own, highly scalable Discord clone.
What are we building?
Today, we are tackling one of my favorite system design problems: building Discord. I like this one so much because most of us have used Discord before, and it’s fun to think about how it works. Its scale and real-time nature lead to some interesting design decisions.
At a high level, here are the functional requirements of Discord we will be discussing during this design.
Users can add friends.
Users can send messages directly to their friends via DMs.
Users have a history of all messages.
When users log in after being inactive, they receive all new messages.
Users receive notifications for new messages.
Users can send attachments.
Users can see the online status of their friends.
Users can create servers and send chats to everyone on the server.
Here are some of the non-functional requirements:
The system should be able to scale out to handle the massive loads from millions of concurrent users.
The system should be highly available and fault-tolerant.
Users should receive messages quickly in real-time.
Full disclaimer: I’ve never worked at Discord and I do not know specifically how their systems work, but this is how I would design it. All opinions are my own.
How should this work?
Friends
Like in real life, friends are the most important part of the Discord system design. After all, if you don’t have any friends, who are you going to send messages to?
To start, we will need some way of storing user information. To keep this simple, we can store the following information in a user table:
-- User table
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
username VARCHAR(255) UNIQUE,
created_at DATETIME
);
Next, we can have a friend table:
-- Friends table
CREATE TABLE friends (
user_1 BIGINT,
user_2 BIGINT,
created_at DATETIME,
accepted BOOLEAN, -- Used to check if the friend request has been accepted
PRIMARY KEY (user_1, user_2),
FOREIGN KEY (user_1) REFERENCES users(user_id) ON DELETE CASCADE,
FOREIGN KEY (user_2) REFERENCES users(user_id) ON DELETE CASCADE
);
To check whether two users are friends, we simply look for a record with the key (user_1, user_2) in the table. To find all friends of a given user, we need to match the user in either column. Here are both queries:
-- Check if two users are friends
SELECT COUNT(*)
FROM friends
WHERE (user_1 = :user_id_1 AND user_2 = :user_id_2 AND accepted = TRUE)
OR (user_1 = :user_id_2 AND user_2 = :user_id_1 AND accepted = TRUE);
-- Find all friends of a user
SELECT user_1, user_2
FROM friends
WHERE (user_1 = :user_id AND accepted = TRUE)
OR (user_2 = :user_id AND accepted = TRUE);
Because of the composite key, rows are stored on disk ordered by the tuple (user_1, user_2). This makes queries that filter on the user_1 column fast, but filtering on user_2 alone cannot use that ordering and degenerates into a full table scan (slow). To solve this, we need an index on (user_2, user_1).
-- Create an index for efficient lookups on (user_2, user_1)
CREATE INDEX idx_friends ON friends(user_2, user_1);
In terms of the choice of database for this User microservice, the write throughput for this table is expected to be fairly low, since writes essentially come from user signups, and we can reasonably assume that friend additions are similarly infrequent.
However, our reads may have higher throughput requirements: every time a user logs in, we need to show them their list of friends, and other services may check whether two users are friends before performing certain operations. To increase read throughput, we can use single-leader replication, where one master node accepts all writes and asynchronously propagates data to replicas that serve reads. This also provides high availability: if the master goes down, we can promote a replica to master while minimizing data loss. Depending on our traffic, we may also need to introduce caching at some stage to keep the database from becoming overwhelmed and to speed up queries.
If we need to scale up our writes in the worst possible scenario, there are a few approaches we can take. To start, the first approach which is the easiest might be to make our writes asynchronous by introducing a message queue. This might work for the add-friend situation if it does not matter if our friend gets the request at a later point in time. However, what happens if you’re about to play a game and you tell your friend you sent them a request and it doesn’t show up? In this case, we need to keep it synchronous.
In the case we need to scale our writes, the better approach is to partition our data and deploy it on separate shards. This approach will require some modifications to our data and will require some additional logic at the application layer.
To start, we can keep our users table in the same format and partition it randomly based on user ID to ensure uniformity across partitions. Instead of relying on a single database to hand out IDs, we can generate unique IDs with a distributed ID generator such as Snowflake. I will be making an article on Snowflake soon, so stay tuned!
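To make the idea concrete, here is a minimal sketch of a Snowflake-style 64-bit ID generator. This is an illustration, not Discord's or Twitter's actual implementation; the 41/12-bit timestamp/sequence split and the 10-bit worker ID shown here are the commonly cited layout, and the epoch is an arbitrary choice.

```python
import time

class SnowflakeGenerator:
    """Sketch of a Snowflake-style ID: 41 bits of milliseconds since a
    custom epoch, 10 bits of worker ID, 12 bits of per-millisecond sequence.
    IDs from one worker are unique and roughly time-ordered."""

    def __init__(self, worker_id: int, epoch_ms: int = 1_288_834_974_657):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now = int(time.time() * 1000)
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
            if self.sequence == 0:
                # Sequence exhausted for this millisecond; spin until the
                # clock moves forward.
                while now <= self.last_ms:
                    now = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - self.epoch_ms) << 22) | (self.worker_id << 12) | self.sequence
```

Because the high bits are the timestamp, these IDs sort by creation time, which is convenient later for ordering messages.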
For our friends table, partitioning takes slightly more effort. The main reason is that we need a range scan to find all of a user's friends, so ideally all of a user's friends live on the same partition. Because our scan involves two lookups, one on (user_1, user_2) and one on (user_2, user_1), we would need to maintain two indexes over the data. One approach is to write the data twice: one record with a partition key of user_1, and another record with a partition key of user_2. This way we no longer need to search both columns; we can search on the first column alone, knowing each user has all of their friendships recorded under their own key. However, this doubles our data and introduces consistency issues, since we must make two writes that may land on two separate partitions.
Here is an example:
UserID | FriendUserID
1 | 2
1 | 3
2 | 1
3 | 1
We can now find all of user 1's friends by querying for user_1, and they are guaranteed to live on the same partition.
This is often referred to as a Global Secondary Index (GSI) which is supported out of the box by some databases like DynamoDB. GSIs are usually created asynchronously, so there might be some data consistency issues when we retrieve our friends. This is something we will need to take into consideration as part of our design, but if we have to scale our writes and keep them synchronous, we don't have a choice.
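The dual-write idea can be sketched in a few lines. Here an in-memory list per partition stands in for real shards, and the partition count and CRC-based hash are arbitrary choices for illustration.

```python
import zlib

NUM_PARTITIONS = 8
# Each list stands in for a physical shard of the friends table.
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(user_id: int) -> int:
    # Hash partitioning on the partition key for uniform spread.
    return zlib.crc32(str(user_id).encode()) % NUM_PARTITIONS

def add_friendship(user_a: int, user_b: int) -> None:
    # Write the edge twice, once under each user's partition key, so a
    # single-partition scan finds all of a user's friends.
    partitions[partition_for(user_a)].append((user_a, user_b))
    partitions[partition_for(user_b)].append((user_b, user_a))

def friends_of(user_id: int) -> list[int]:
    shard = partitions[partition_for(user_id)]
    return [friend for owner, friend in shard if owner == user_id]
```

In a real system the two appends are writes to two different shards, which is exactly where the consistency concern comes from: one write can succeed while the other fails.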
As part of this design, I think we can stick to using a MySQL database using single leader replication for the time being.

Sending DMs
This will be a core piece of functionality for our service, which is where we will send messages directly from one user to another user if they are friends. For now, we will ignore servers, and just deal with DMs directly.
To start, we need some way of sending and receiving messages from the server. Sending is easy: a regular HTTP POST request works fine. But what about receiving messages from the server? We have two broad approaches: push mode or pull mode. In pull mode, we have two options: short polling or long polling. With short polling, clients spam the server with requests (say, every 10 seconds) to check for new messages, which puts extra strain on the server, particularly if messages are infrequent. It is also not real-time: if a poll returns nothing, we must wait until the next request. Long polling is better, as the client waits on a single open request and the server responds as soon as it has data, but it still strains the server, and each request eventually times out and must be remade.

The problem with polling is that messages may arrive infrequently, so most poll requests return no data, which is very inefficient. Because of this, it makes sense for the server to push messages to us instead. The client connects, the server stores the connection, and when it has something to send, it simply pushes a message over that connection.

We have a few choices for pushing events from the server to the client. One is server-sent events (SSE); however, SSE only supports unidirectional communication from server to client. Since we will be both sending and receiving messages, we should use websockets, which give us a bidirectional connection over TCP between the client and the server. Because Discord also runs in the browser, we will assume clients on every device use the same technology; that said, a raw TCP connection would also work if your client-side technology supports it.
When our user sends a message to the server, if we have some load balancer in front, we need to make sure that our message is forwarded to the right server. We can use some load balancer that acts at layer 4 so that it will continuously forward packets to the same server. However, by doing this, we lose some of the smarts of our application load balancer. We could keep our layer 7 load balancer, but we need some way of consistently routing the same request to the same server, which is where sticky sessions come in, using a consistent hashing approach based on the user's IP address to send to the same machine.

But what happens if we add or remove a server? The server holding our connection might be shut down, or some connections might now be routed to different nodes. This is OK: we just need the client to detect the dead connection, via a simple ping-pong that eventually fails, and then reestablish a new connection, which will route it to a live server. Because of this, we will want our websocket connection servers to be stable, and to avoid adding or removing nodes too frequently.
Since the server pushes events directly to the client, it now needs to maintain some state: namely, which user corresponds to which connection. And since the user sending a message might be connected to a different server than the user receiving it, for our chat application to work correctly we need to route each message to the server the recipient is connected to.

How can we implement this routing? To avoid reinventing the wheel, let's look at layer 2 of the OSI stack, the data link layer, which has been solving this exact problem for decades. At layer 2 we have addressing via MAC addresses, and switches that do the routing. A switch receives a frame on one port and forwards it out of another port to reach the intended destination. But how does the switch know which port leads to that destination? When a switch receives a frame for an unknown destination MAC address, it looks in its internal table, finds nothing, and broadcasts the frame out of all its other ports. At the same time, it looks at the source MAC address and records which port the frame arrived on, since that port must lead back to the sender. This learning happens on every frame, so the table stays up to date. The next time a frame is addressed to that MAC address, the switch has an entry and can forward it directly out of the right port.
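The switch's learn-and-forward behaviour fits in a few lines. This sketch simplifies ports to integers and MAC addresses to strings:

```python
class LearningSwitch:
    """Learns which port each source address lives on; floods frames for
    unknown destinations, forwards known ones out of a single port."""

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.table: dict[str, int] = {}  # MAC address -> port

    def receive(self, in_port: int, src: str, dst: str) -> list[int]:
        """Returns the list of ports the frame is forwarded out of."""
        self.table[src] = in_port  # learn/refresh the sender's port
        if dst in self.table:
            return [self.table[dst]]  # known destination: forward directly
        # Unknown destination: flood out of every port except the ingress.
        return [p for p in range(self.num_ports) if p != in_port]
```

Our routing service plays the role of this table, with user IDs in place of MAC addresses and websocket servers in place of ports.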

We will adopt a similar approach. We can have a separate central routing service that all messages are forwarded to which keeps an in-memory table using Redis that records the last seen server that a particular user ID came from. When the routing service receives the message, it first stores the user ID who sent the message and the server on which it was sent, and then it proceeds to look at the server on which the recipient user ID resides before calling an RPC endpoint on that server to forward the message there.

We could first check if the user ID we are looking for exists on the same node, and this way we don’t have to call our routing service. However, it might be worthwhile always sending messages to this central service for reasons such as authentication, storage, and to make sure we add the entry to the table. We will explore this soon.
There is one big problem though: what happens if the user ID is not in the table? For example, a user who has just logged in and immediately receives a message might never have sent one, so they might not be in the routing table. To solve this, we can forward our lightweight keep-alives to the routing service so the table stays updated, and send one immediately when the client first connects so the routing table is populated as soon as the app starts. We will also keep a short TTL on each record, reset on every keep-alive. To clarify: if the recipient's user ID is not in the table, we simply drop the message rather than forwarding it to a connection.
As this table will handle high read and write throughput, Redis is a great choice, being in-memory and very fast. However, we will need horizontal scalability here, so we must partition the data and shard it across multiple nodes. For this, we can use hash partitioning on the user ID, ensuring a uniform load across all nodes in the cluster.
There is another approach that makes things simpler for us: Redis pub/sub, a lightweight messaging mechanism with at-most-once delivery. With Redis pub/sub, each websocket server subscribes to a lightweight channel for every user ID connected to it, and when we receive a message for a particular user ID, we publish it to that channel and Redis takes care of routing it. It also scales horizontally, since Redis can shard channels across the cluster much like it shards keys. With this approach, we need some additional logic in our websocket servers to manage subscriptions.
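To illustrate the pattern (this is not the actual Redis API; it is a tiny in-process stand-in), here is a pub/sub with the same semantics: at-most-once delivery, fanned out to whoever is subscribed at publish time, with no delivery at all if nobody is listening.

```python
from collections import defaultdict
from typing import Callable

class PubSub:
    """In-process stand-in for Redis pub/sub: at-most-once delivery,
    fan-out to the handlers subscribed at the moment of publish."""

    def __init__(self):
        self.channels: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self.channels[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        handlers = self.channels.get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)  # like Redis PUBLISH, report receiver count
```

In our design, a websocket server would subscribe to a channel such as `user:{user_id}` when that user connects, and unsubscribe when they disconnect.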

However, there is one thing we have not considered. In Discord, multiple clients with the same user ID can be connected at the same time, possibly to different servers. We need to make sure we send messages to all of these clients, and eventually keep track of which messages each client has locally. To fix this, we will introduce a new concept: the client ID, a unique logical identifier (also generated by Snowflake) that refers to each client/device and persists over time within our user service, all associated with a user ID. We will update our routing table to also include the client ID, and then fan out to all client IDs of the user who needs to receive the message.
Note that we can introduce this client ID to our user service similar to the approach we used previously, where we can have the user ID as the partition key and the client ID as the sort key, where both the user ID and the client ID form the primary key, but we can do range queries to find all client IDs for a given user ID.

With this approach, we use a fanout pattern to send messages to all clients belonging to the recipient user ID, which is an efficient lookup since the user ID is also the key we partition on across our Redis cluster. However, if a user has too many clients, this fanout may slow things down, so it might be wise to cap the number of concurrent client IDs a user can have.
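Here is a sketch of the per-user client registry and fanout; the cap of five concurrent clients is an assumed value, and the send callbacks stand in for pushes over real websocket connections.

```python
from collections import defaultdict
from typing import Callable

MAX_CLIENTS_PER_USER = 5  # assumption: cap on concurrent devices

class ClientRegistry:
    """user_id -> {client_id: send callback}; fans one message out to
    every device the user currently has connected."""

    def __init__(self):
        self.clients: dict[int, dict[str, Callable[[str], None]]] = defaultdict(dict)

    def connect(self, user_id: int, client_id: str,
                send: Callable[[str], None]) -> bool:
        if len(self.clients[user_id]) >= MAX_CLIENTS_PER_USER:
            return False  # reject: too many concurrent clients
        self.clients[user_id][client_id] = send
        return True

    def fan_out(self, user_id: int, message: str) -> int:
        """Deliver to every connected client; returns the delivery count."""
        for send in self.clients[user_id].values():
            send(message)
        return len(self.clients[user_id])
```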
What happens if our table goes down, i.e. our Redis crashes? We should have a standby ready to take over to avoid downtime, but in terms of data loss it is not a huge deal, as the table can be rebuilt very quickly. And in the worst case, if a message is not delivered during a crash (which should be very rare), the recipient will eventually be able to pull it again, as we will see soon.
However, to quote Jeff Goldblum in Jurassic Park, “Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.” We have been so busy figuring out how to route messages between users that we haven’t stopped to ask whether a message should be sent in the first place. Our requirements say messages should only flow between friends, but right now we can send a message to anyone, so we need a check before the message is sent.
One simple approach here is to query the User service when we receive a message through our routing service and check if the users are friends. With the read replicas and the cache we set up in the User service, our User service should be able to support this requirement.

Another interesting approach, which reduces the number of calls to our User service, is a ticket scheme similar to the one used by Kerberos. The user first calls the User service when they want to message another user; if the two are friends, the sender receives a signed ticket with a short expiry time. They then submit this ticket with their messages, and the server verifies the ticket before allowing the message through. This relies on digital signatures. With asymmetric cryptography, the User service signs with its private key, and the routing service only needs the public key to verify the signature. However, verifying signatures is computationally expensive, so while this takes load off the User service, it adds load to the routing service. Alternatively, we can use a shared secret key between the routing service and the User service (a symmetric scheme such as an HMAC), but then we must keep that key secure.

In both situations, we need to handle the scenario where the users decide to not be friends anymore. If we use a cached value, for a short amount of time we might accidentally allow some requests to be sent to the user, which might be bad in the hypothetical situation where one user blocks another for some safety reasons. If we use the ticket approach, we have no way of invalidating the ticket apart from changing the signing key, which would invalidate all tickets immediately, which may lead to our User service being overwhelmed if all users simultaneously request a new ticket. Using the User service, we can simply invalidate the cached value of whether those two users are friends, making this a much more scalable approach.
One last thing to consider: have you ever been on the train, sent a message, lost connection, then retried sending the message once you were back online, only to realize it had already gone through, and now you look silly for sending the same message twice? To prevent this, we will introduce idempotency: no matter how many times a send is retried, the message is only delivered once. How we implement this will become clearer in the next sections. For now, we introduce a MessageID, which we fetch before sending and attach to every request. Generating these IDs will also use Snowflake for high throughput.
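A minimal sketch of server-side deduplication by MessageID; in practice the set of seen IDs would live in durable storage with some eviction policy, not in process memory.

```python
class IdempotentDeliverer:
    """Accepts each MessageID at most once; retries of an already
    delivered message are silently dropped."""

    def __init__(self):
        self.seen: set[int] = set()
        self.delivered: list[str] = []

    def deliver(self, message_id: int, body: str) -> bool:
        if message_id in self.seen:
            return False  # duplicate retry: ignore
        self.seen.add(message_id)
        self.delivered.append(body)
        return True
```

Because the client fetches the MessageID before sending, a retry carries the same ID as the original attempt, which is exactly what makes the dedupe possible.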

Keep reading
If you are reading this from the future, part 2 should already be available, where we will continue our design by implementing features such as:
Message persistence.
Client-side implementation.
Receiving messages that were sent to offline users when they come back online.
In part 3, we will be covering:
Message notifications (at a high level).
Sending attachments.
Presence indicators (i.e. active, away, offline).
And finally servers/rooms.
Once again, thank you to everyone who made it here to read, I greatly appreciate it! If you know someone who would be interested in this content (maybe someone preparing for an upcoming big tech software engineering interview), please share it around, as I would be eternally grateful!
Click here for part 2!