Distributed Systems for the Uninitiated
I’m not going to lie — I am drawn to Distributed Systems in a way that few other areas of Computer Science manage. Language Theory, Type Systems and the like, while fun, just don’t have the allure in my mind that Distributed Systems do.
So what are Distributed Systems? The Definition By Andrew Tanenbaum goes like this:
A collection of independent computers that appears to its users as a single coherent system.
While this definition conveys the idea very effectively, I would replace the words independent computers with independent processes.
And what do I mean by processes?
Basically, any unit of computation. This could range from tiny (Erlang lightweight processes) to massive (Cassandra clusters). One subtle point worth mentioning is that these independent processes should not rely on a shared clock. With that out of the way, let’s dive into the basics and start by hammering away at the C (Consistency), A (Availability) and P (Partition Tolerance) in the CAP Theorem. Along the way we’ll also briefly touch upon another useful concept — Durability.
A picture is better than a thousand words, so here goes:
This picture tells the following story:
- There is a DB Cluster running
- At time T0 a client comes along and tells the DB Cluster to Set Name -> “John”
- The DB Cluster says “OK”
- Some time elapses
- Now, at Time T1 (which occurs after T0 and more specifically after the DB said “OK” to the Name change) a client comes along and asks for the Name, i.e. issues a “GET NAME” command
- At this point, if the DB Cluster responds with anything other than “John” it is NOT consistent. If it responds with “John” it IS consistent.
So, what is Consistency? Consistency, is basically a guarantee by the system that when you do a read, you will see the most recent write. In this case, the most recent write was “John” for the Name. So, any request that tries to lookup the value of Name after this write occurred, should return “John” to be considered a truly consistent system.
In other words, if you are a Database and you say “OK” to a write, you are obligated to return the value of that write to all subsequent reads — if you would like to call yourself a Consistent Database.
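That obligation can be sketched in a few lines. This is a toy, single-node, in-memory store (a hypothetical `ConsistentStore` class, not any real database’s API) that trivially honors the guarantee: once `set` returns “OK”, every later `get` sees that write.

```python
# A toy in-memory store illustrating the consistency guarantee:
# once a write is acknowledged, every subsequent read must return it.
# (Illustrative sketch only -- a single node makes this trivially easy;
# the hard part, as we'll see, is keeping this promise across many nodes.)

class ConsistentStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return "OK"   # saying "OK" obligates us to serve this value below

    def get(self, key):
        # A consistent store returns the most recently acknowledged write.
        return self._data.get(key)

store = ConsistentStore()
assert store.set("Name", "John") == "OK"
assert store.get("Name") == "John"   # anything else would be inconsistent
```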
Next up is Partition Tolerance. This one is easy — it basically means that you should be able to “scale your system”.
What does Scaling the System Mean?
Most systems have far more reads than writes, i.e. they have far more people consuming data than producing it. How do you scale this? The typical (and most often used) solution is to throw more read replicas at the cluster.
So, whenever someone does a read they are reading data from a node that is a read replica, i.e. it may not be the node that is responsible for handling writes.
How is Read Replication accomplished?
Most Database Systems abstract this away, but typically when you add a new read replica to your cluster you need to inform the “Leader” (who is handling the writes) as well as the other nodes in the cluster. They then (using some common protocol) exchange messages and the new node eventually “catches up”.
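A minimal way to picture that “catching up” process: the leader keeps an ordered log of writes, streams new entries to known replicas, and replays the whole log to a replica that joins late. The `Leader`/`Replica` classes below are illustrative names, not a real replication protocol.

```python
# A simplified sketch of read replication and catch-up.
# The leader keeps an ordered log of writes; a newly added replica
# replays that log from the beginning until it has "caught up".

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        key, value = entry
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

class Leader:
    def __init__(self):
        self.log = []        # ordered history of every write
        self.replicas = []

    def write(self, key, value):
        entry = (key, value)
        self.log.append(entry)
        for r in self.replicas:   # stream new writes to known replicas
            r.apply(entry)
        return "OK"

    def add_replica(self, replica):
        # Register the new node, then replay the existing log to it.
        self.replicas.append(replica)
        for entry in self.log:
            replica.apply(entry)

leader = Leader()
leader.write("Name", "John")
late_joiner = Replica()
leader.add_replica(late_joiner)   # replays the log: the replica catches up
assert late_joiner.read("Name") == "John"
```

Real systems replicate asynchronously over the network, which is exactly why a replica can briefly lag behind the leader — the seed of the consistency trade-offs discussed above.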
The previous section works in most situations. There are times though when, as the system faces scaling challenges (maybe more users?), writes increase to the point where the system is unable to keep up.
The system starts slowing down. Replication takes longer and the general “snappiness” of the system starts fading away.
The solution to scaling writes often involves another popular distributed system term “Sharding”. Basically, split up a big dataset into smaller datasets (called shards) and have each shard be responsible for handling writes for that smaller dataset.
Breaking apart a dataset into smaller shards is typically done by taking some kind of “key” (this could be a Primary Key on an RDBMS table or an actual Key in a Key-Value Store) and running it through some kind of Hashing Function to figure out which shard is responsible for that piece of data.
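The hashing step can be sketched in a couple of lines. The shard count and key format below are assumptions for illustration; the modulo scheme is the simplest possible choice, and production systems often use consistent hashing instead so that adding a shard doesn’t remap every key.

```python
# Route a key to a shard by hashing it, as described above.
import hashlib

NUM_SHARDS = 4  # assumed cluster size for this sketch

def shard_for(key: str) -> int:
    # A stable hash (unlike Python's built-in hash(), which is
    # randomized per process) keeps routing deterministic across nodes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every node that runs this function sends "user:42" to the same shard.
print(shard_for("user:42"))
```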
Now, since shards typically work with smaller datasets the “writes” should be snappier and more importantly, the corresponding read replicas for those shards should be faster (since the amount of data needed to sync up is less).
Coming full circle, Partition Tolerance means the system keeps working even when network failures split these replicated and sharded nodes from one another. Since most Databases do need to scale via Replication and Sharding, and any multi-node setup will eventually see network failures, it’s almost always a strong requirement.
Moving on from Partition Tolerance, Availability comes next. Availability I think can best be described as “Willingness to Handle a Request”.
What does this mean?
Let’s Assume we have the following cluster running:
A basic three Node cluster with Nodes marked N1, N2 and N3.
Now, assume nodes N2 and N3 die (for whatever reason — file system corruption, network partition etc.)
At this point, a person comes along and asks the Cluster for the most recent Name:
What does the Cluster respond with?
1. Respond with the most recent version of Name that node N1 has
2. Refuse to respond back
Why would it even consider doing #2 (not responding)?
Maybe one of the other nodes (N2 or N3) that are no longer reachable had a “more recent” version of Name, and by responding N1 may be handing back stale data.
So in other words, at this point the cluster has the option of potentially responding with Stale Data or not responding at all.
What are the Availability Implications of this?
By not responding at all, the system is choosing Consistency over Availability, i.e. it’s saying I will either respond back with what I am sure is non-stale data or I will not respond at all.
And there is no correct answer here — as a user of the system maybe you will be okay with stale data as long as you have some data.
Or maybe you absolutely need non-stale data — at which point you will be okay with the system not responding.
So there is a trade-off that needs to be made — do you choose Consistency over Availability or the other way around? And this is a gray area again: maybe up to 1 minute of potentially stale data is okay for increased availability.
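The decision point above can be made concrete. In this sketch, a node that has lost contact with a majority of its peers either refuses reads (choosing Consistency) or serves its possibly stale local copy (choosing Availability). The `Node` class, its quorum rule, and the `prefer_consistency` flag are all illustrative assumptions, not a real database’s behavior.

```python
# A sketch of the Consistency-vs-Availability choice a node faces
# when it cannot reach its peers (e.g. N2 and N3 have died).

class Node:
    def __init__(self, total_nodes, prefer_consistency):
        self.total = total_nodes
        self.reachable = total_nodes          # peers currently visible
        self.prefer_consistency = prefer_consistency
        self.data = {"Name": "John"}          # possibly stale local copy

    def read(self, key):
        quorum = self.total // 2 + 1
        if self.reachable < quorum and self.prefer_consistency:
            # "CP" behavior: refuse rather than risk serving stale data.
            raise RuntimeError("cannot guarantee freshness; try later")
        # "AP" behavior (or a healthy cluster): answer with what we have.
        return self.data.get(key)

cp_node = Node(total_nodes=3, prefer_consistency=True)
cp_node.reachable = 1                 # N2 and N3 are gone
try:
    cp_node.read("Name")              # the CP node refuses to answer
except RuntimeError:
    pass

ap_node = Node(total_nodes=3, prefer_consistency=False)
ap_node.reachable = 1
assert ap_node.read("Name") == "John" # the AP node serves possibly stale data
```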
The final topic worth touching on is Durability, which can be summarized as: when a state change command is issued to your distributed system, at what point does the system respond with “OK”?
Does it respond back when:
1. The Leader has received the state change request
2. All of the Replicas have received the state change
3. The Leader has recorded the state change “on disk” (disk could be any system of record — ranging from memory to a cloud storage system like S3 and each have varying durability characteristics)
4. The Replicas have recorded the state change on disk
In this list, 1 is probably the least durable with 4 being the most.
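These four acknowledgement points can be thought of as a configurable policy. The `AckPolicy` enum and `write` function below are hypothetical, but real systems expose this same dial — for instance Kafka’s `acks` producer setting or Postgres’s `synchronous_commit`.

```python
# Sketching the four acknowledgement points as a write policy:
# the later we say "OK", the more durable the write.
from enum import Enum

class AckPolicy(Enum):
    LEADER_RECEIVED = 1    # least durable: leader has it in memory only
    REPLICAS_RECEIVED = 2
    LEADER_ON_DISK = 3
    REPLICAS_ON_DISK = 4   # most durable: every node has persisted it

def write(key, value, policy, replicas):
    # Level 1: the leader has received the write (could ack right here).
    if policy.value >= AckPolicy.REPLICAS_RECEIVED.value:
        for r in replicas:
            r["memory"] = (key, value)    # replicas hold it in memory
    if policy.value >= AckPolicy.LEADER_ON_DISK.value:
        pass  # the leader would fsync to its system of record here
    if policy.value >= AckPolicy.REPLICAS_ON_DISK.value:
        for r in replicas:
            r["disk"] = (key, value)      # replicas persist it too
    return "OK"   # only now does the client hear "OK"

replicas = [{}, {}]
write("Name", "John", AckPolicy.REPLICAS_ON_DISK, replicas)
assert all(r["disk"] == ("Name", "John") for r in replicas)
```

The trade-off is latency: each additional step delays the “OK”, which is why these systems let you pick the level per workload.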
And that’s it. Hopefully with these terms under your belt you have the necessary linguistic tools to reason about distributed systems (ranging from your good old Postgres cluster all the way to Redis Cluster).