Distributed databases

Distributed databases in a nutshell

Hey I just met you, the network’s laggy

A distributed database is a distributed system designed to provide read/write access to data, using the relational or other format.

Splitting the data among nodes

We will examine 3 topics in distributed databases:

  • Replication: Keep a copy of the same data on several different nodes.

  • Partitioning: Split the database into smaller subsets and distributed the partitions to different nodes.

  • Transactions: Mechanisms to ensure that data is kept consistent in the database

Q: Usually, distributed databases employ both replication and partitioning at the same time. Why?


MySQL async replication

Why replicate?

With replication, we keep identical copies of the data on different nodes.

D: Why do we need to replicate data?

  • To allow the system to work, even if parts of it are down
  • To have the data (geographically) close to the clients
  • To increase read throughput, by allowing more machines to serve read-only requests

Replication Architectures

In a replicated system, we have two node roles:

  • Leaders or Masters: Nodes that accept writes from clients
  • Followers, Slaves or replicas: Nodes that provide read-only access to data

Depending on how replication is configured, we can see the following architectures

  • Single leader or master-slave: A single leader accepts writes, which are distributed to followers
  • Multi leader or master-master: Multiple leaders accept writes, keep themselves in sync, then update followers
  • Leaderless replication All nodes are peers in the replication network

How does replication work?

The general idea in replicated systems is that when a write occurs in node, this write is distributed to other nodes in either of the two following modes:

  • Synchronously: Writes need to be confirmed by a configurable number of followers before the leader reports success.

  • Asynchronously: The leader reports success immediately after a write was committed to its own disk; followers apply the changes in their own pace.

D: What are the advantages/disadvantages of each replication type?

Statement based replication in databases

The leader ships all write statements to the followers, e.g. all INSERT or UPDATE queries. However, this is problematic:

SET updated_at=NOW()
WHERE id = 10

D: What is the issue with the code above in a replicated context?

Statement based replication is non-deterministic and mostly abandoned.

Write-ahead log based replication

All databases write their data to data structures, such as \(B^+\)-trees. However, before actually modifying the data structure, they write the intended change to an append-only write-ahead log (WAL).

WAL-based replication writes all changes to the leader WAL and also to the followers. The followers apply the WAL entries to get consistent data.

The main problem with WAL-based replication is that it is bound to the implementation of the underlying data structure; if this changes in the leader, the followers stop working.

Postgres and Oracle use WAL-based replication.

Logical-based replication

The database generates a stream of logical updates for each update to the WAL. Logical updates can be:

  • For new records, the values that where inserted
  • For deleted records, their unique id
  • For updated records, their id and the updated values

MongoDB and MySQL use this replication mechanism.

Process for creating a replica

  1. Take a consistent snapshot from the leader
  2. Ship it to the replica
  3. Get an id to the state of the leader’s replication log at the time the snapshot was created
  4. Initialize the replication function to the latest leader id
  5. The replica must retrieve and apply the replication log until it catches up with the leader

A real example: MySQL


| File               | Position |
| mariadb-bin.004252 | 30591477 |
1 row in set (0.00 sec)

Complications with async replication

  • Read-after-write: A client must always be able to read their own writes from any part of the system. Practically, the client can:

    • Read recent writes from the leader
    • Read session data from the leader
  • (non-) Monotonic reads: When reading from multiple replicas concurrently, a stale replica might not return records that the client read from an up to date one.

  • Causality violations: Violations of the happened-before property.

Async replication is fundamentally a consensus problem.

Multi-leader replication

The biggest problem with multi-leader replication are write conflicts. To demonstrate this, let’s think in terms of git:

                                # Clock
# User 1
git clone git://....            # t+1
git add foo.c                   # t+2
##hack hack hack
git commit -a -m 'Hacked v1'    # t+3
git push                        # t+5

# User 2
git clone git://....            # t+2
git add foo.c                   # t+5
##hack hack hack
git commit -m 'Hacked new file' # t+6
git push # fails                # t+7
git pull # CONFLICT             # t+7

If we replace user with leader node, we have exactly the same problem

How to avoid write conflicts?

  • One leader per session If session writes do not interfere (e.g., data are only stored per user), this will avoid issues altogether.

  • Converge to consistent state Apply a last-write-wins policy, ordering writes by timestamp (may loose data) or report conflict to the application and let it resolve it (same as git or Google docs)

  • Use version vectors Modelled after version clocks, they encode happens before relationships at an object level.


Types of partitioning

Why partitioning?

With partitioning, each host contains a fraction of the whole dataset.

The main reason is scalability:

  • Queries can be run in parallel, on parts of the dataset
  • Reads and writes are spread on multiple machines

Partitioning is always combined with replication. The reason is that with partitioning, a node failure will result in irreversible data corruption.

How to partition?

The following 3 strategies are

  • Range partitioning Takes into account the natural order of keys to split the dataset in the required number of partitions. Requires keys to be naturally ordered and keys to be equally distributed across the value range.
  • Hash partitioning Calculates a hash over the each item key and then produces the modulo of this hash to determine the new partition.
  • Custom partitioning Exploits locality or uniqueness properties of the data to calculate the appropriate partition to store the data to. An example would be pre-hashed data (e.g. git commits) or location specific data (e.g. all records from Europe).

Request routing

On partitioned datasets, clients need to be aware of the partitioning scheme in order to direct queries and writes to the appropriate nodes.

To hide partitioning details from the client, most partitioned systems feature a query router component siting between the client and the partitions.

The query router knows the employed partitioning scheme and directs requests to the appropriate partitions.

Partition and replication example: MongoDB

Sharding and replication in MongoDB


Optional reading (not part of the exam material)

Many clients on one database

What could potentially go wrong?

  • Many clients try to update the same data store at the same time, when….
  • the network fails, and then…
  • the database leader server cannot reach its network-mounted disk, so…
  • the database tries to fail over to a follower, but it is unreachable, and then…
  • the application writes timeout.

What is the state of the data after this scenario?

As programmers, we are mostly concerned about the code’s happy path. Systems use transactions to guard against catastrophic scenarios.

What are transactions?

Transactions[1] are blocks of operations (reads and writes), that either succeed or fail, as a whole.

Transactions are defined in the context of an application that wants to modify some data and a system that guards data consistency.

  • The application initiates a transaction, and when done, it commits it.
  • If the system deems the transaction successful, it guarantees that the modifications are safely stored
  • If the system deems the transaction unsuccessful, the transaction is rollbacked; the application can safely retry.

Transactions in relational databases

The concept of the transaction started with the first SQL database, System R; while the mechanics changed, the semantics are stable for the last 40 years.

Transactions provide the following guarantees[2]:

  • Atomicity: The transaction either succeeds or fails; in case of failure, outstanding writes are ignored
  • Consistency: any transaction will bring the database from one valid state to another
  • Isolation: Concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially
  • Durability: once a transaction has been committed, it will remain so


Isolation guards data consistency in the face of concurrency.

Databases try to hide concurrency issues from applications by providing transaction isolation.


By default, isolation entails executing transactions as if they were serially executed. To implement serializability, databases:

  • Only execute in a single core (e.g. Redis)
  • Use 2 phase locking: all database objects (e.g. rows) have a shared lock and an exclusive lock
    • All readers acquire a shared lock
    • A writer needs to acquire the exclusive lock
    • Deadlocks can happen when a transaction \(A\) waits for \(B\) to release their exclusive locks and vice versa
  • Use MVCC and Copy-on-Write data structures

Multi-Version Concurency Control

With MVCC, the database works like Git:

  • Each transaction \(A\) sees the most recent copy of the data (git checkout -b A)
  • When \(A\) commits (git commit -a):
    • If \(A\) touched no object that was updated before by another transaction the database will create a new version (git checkout master; git merge A)
    • If \(A\) changed an object that was updated before, the database with report a conflict
  • The current state is the result of the application of all intermediate transactions on an initial state
  • Occasionally, we need to clean up ignored intermediate states (git gc)

Weaker isolation levels

While serialiazable isolation works it may be slow. Consequently, historically, databases made compromises in how they implement isolation.

  • Dirty reads: A transaction reads data written by a concurrent uncommitted transaction
  • Non Repeatable Reads: A transaction re-reads data it has previously read and finds that data modified
  • Phantom reads: Results to queries change due to other transactions being committed
Isolation DR NRR PR
Read uncommitted X X X
Read committed X X
Repeatable read X

Distributed transactions

Atomic commits

In a distributed database, a transaction spanning multiple nodes must either succeed on all nodes or fail (to maintain atomicity).

Transactions may also span multiple systems; for example, we may try to remove a record from a database and add it to a queue service in an atomic way.

In contrast to the distributed systems consensus problem, all nodes must agree on whether a transaction has been successfully committed.

The most common mechanism used to dead with distributed atomic commits is the two-phase commit (2PC) protocol.

2-phase commits

In 2PC, we find the following roles:

  • A coordinator or transaction manager
  • Participants or cohorts

The 2PC protocol makes the following assumptions:

  • There is stable, reliable storage on the cohorts
  • No participant will crash forever
  • Participants use (indestructible) write-ahead log
  • Any two nodes can communicate with each other

A transaction in 2PC

  1. A client starts a 2PC transaction by acquiring a globally unique transaction id (GTID)
  2. The client attaches the GTID to each transaction it begins with individual nodes
  3. When the client wants to commit, the coordinator sends a PREPARE message to all nodes
  4. If at least one node replies with ‘NO’, the transaction is aborted
  5. If all nodes reply with ‘YES’, the coordinator writes the decision to disk and tries forever to commit individual transactions to the cohort

D: What problems do you see with 2PC?

Problems with 2PC

  • Holding locks in nodes: While a decision is being made, faster nodes must lock objects; consequently, a slow or crashing node will kill the performance of the whole system
  • Exaclty-once semantics: All nodes participating in a distributed transaction must support transactions to guarantee exactly once semantics
  • Co-ordinator failures: Even though the co-ordinator must be able to resist corruption in case of crashes, this is not always the case and semi-aborted transactions affect performance due to locking

Image credits


J. Gray, The transaction concept: Virtues and limitations (invited paper),” in Proceedings of the seventh international conference on very large data bases - volume 7, 1981, pp. 144–154.
T. Haerder and A. Reuter, Principles of transaction-oriented database recovery,” ACM Comput. Surv., vol. 15, no. 4, pp. 287–317, Dec. 1983.
K. Kingsbury, “Jespen: On the perils of network partitions,” 2013. [Online]. Available: https://aphyr.com/tags/jepsen.