What is a distributed system?
According to Wikipedia: A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.
We will focus on the following four issues with distributed systems
Most distributed systems use some form of asynchronous networking to communicate. Upon waiting for a response to a requests, it is not possible to distinguish whether:
The usual remedy is to set timeouts and retry the request until it succeeds.
In a study of network failures by Microsoft Research[1], they found that:
The data is for a professionally managed data centre by a single company.
On the public cloud, failures may affect thousands of systems in parallel.
Timeouts is a fundamental design choice in asynchronous networks: Ethernet, TCP and most application protocols work with timeouts.
The problem with timeouts is that delays in asynchronous systems are unbounded. This can be due to:
Queues also experience a snowball effect; queues are getting bigger on busy systems.
Timeouts usually follow the exponential back-off rule; we double the time we check for an answer up to an upper bound. More fine-grained approaches use successful request response times to calibrate appropriate timeouts.
In a distributed system, time is the only global constant nodes can rely on to make distributed decisions on ordering problems.
Time in computers is kept in two ways:
System.getCurrentTimeMillis
).System.nanoTime
)Q: Which clock would you use to time a task? Why?
Monotonic clocks are maintained by the OS and rely on HW counters exposed by CPUs. They are (usually!) good for determining order within a node, but each node only has its own notion of time.
NTP can synchronize time across nodes with an accuracy of ms. A modern CPU can execute \(10^6\) instructions (\(\times\) number of cores) in an ms!
Moreover, leap seconds are introduced every now and then; minutes may last for 61 or 59 seconds on occasion
\(\mu s\) accuracy is possible with GPS clocks, but expensive
What is order? A way of arranging items in a way that the following properties are maintained.
When do we need to order?
Lamport introduced the eponymus timestamps in 1978[2]. Lamport timestamps define a partial causal ordering of events.
Invariant: For two events \(a\) and \(b\), if \(a \rightarrow b\), then \(LT(a) < LT(b)\).
Q: The reverse is not true: If \(LT(a) < LT(b)\), then it does not mean that \(a \rightarrow b\)!! Why?
4 nodes exchange events.
Initial state of timestamps: \([A(0), B(0), C(0), D(0)]\)
E1. \(A\) sends to \(C\): \([A(1), B(0), C(0), D(0)]\)
E2. \(C\) receives from \(A\): \([A(1), B(0), C(2), D(0)]\)
E3. \(C\) sends to \(A\): \([A(1), B(0), C(3), D(0)]\)
E4. \(A\) receives to \(C\): \([A(4), B(0), C(3), D(0)]\)
E5. \(B\) receives to \(D\): \([A(4), B(1), C(3), D(0)]\)
E6. \(D\) receives from \(B\): \([A(4), B(1), C(3), D(2)]\)
At this point, \(LT(E6) < LT(E4)\), but it does not mean that \(E6 \rightarrow E4\)! Events 4 and 6 are independent.
Vector clocks[3] can maintain total causal order.
On a system with \(N\) nodes, each node \(i\) maintains a vector \(V_i\) of size \(N\).
They are updated as follows:
Vector clocks are expensive to maintain: they require \(O(n)\) timestamps to be exchanged with each communication.
However, it has been shown[4] that we cannot do better than vector clocks!
In distributed settings, the truth is determined by concensus, which is reached through voting mechanisms.
Achieving a decision in the presence of faulty processes.
It turns out that it is impossible to make a reliable decision.
Consequences: we can only make distributed decisions using either reliable communication or more than 2 parties.
Formulated by Lamport et al.[5], the Byzantine generals problem shaped distributed systems research for the next 40 years.
The formulation is simple:
Most consensus algorithms (e.g. Paxos[6] or Raft[7]), attempt to keep the log module of replicated state machines in sync. Basic properties include:
Raft defines the following server roles:
Raft ensures that: logs are always consistent and that only servers with up-to-date logs can become leader
Terms are used to identify obsolete information
See more details in this visualization
Fischer, Linch and Patterson (FLP) theorem[8]: In an asynchronous network, consensus cannot be reached if at least one node fails in asynchronous networks
A foundational result, proving the impossibility of distributed consensus. The system model the authors assume is fairly restrictive:
In practice however, we can mitigate the consequences, as we are indeed allowed to use both clocks and/or random numbers.
Raft and Paxos work by assuming that the exchanged messages are valid and true (i.e. non-Byzantine). In open distributed systems (e.g. BitCoin) this assumption is not necessarily valid.
Most open distributed systems use public-key cryptography and node registration before they start and sign messages to avoid MITM attacks.
Still, decisions require majority votes, so the \(n/2 + 1\) rule applies.
Consistency refers to the requirement that any given transaction must change affected data only in allowed ways.
Consensus is the basis upon which we build consistency
By Erik Brewer[9]: A distributed system can only provide 2 of the following 3 guarantees
While widely cited, it is only indicative; when the network is working, systems can offer all 3 guarantees. So it can be reduced to either consistent or available when partitioned.
At any time, concurrent reads from any node return the same values. As soon as writes complete successfully, the result is immediately replicated to all nodes in an atomic manner an is made available to reads. In that sense, linearisability is a timing constraint[10].
Q: Is the above system linearisable?
It is not, as while the write operation is in flight, the system cannot return a consistent answer.
All operations last for a time block called a transaction; this involves setting up a connection to a remote system and executing the command.
Q: What does the \(cas(x,3)\) instruction do?
Return an error! Compare and swap instructions should run atomically across the cluster.
Q: What should \(read(x)\) return for client B?
It should return 1. We assume that write transactions get committed imediately when they are executed, so A’s write overwrites D’s earlier write.
Q: What should \(read(x)\) return for client C?
It should return 2. The compare-and-swap transaction started by B may conflict with the one started by D, but the system always maintains a consistent state.
This work is (c) 2017 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.