A distributed database is a distributed system designed to provide read/write access to data, using the relational or other format.
We will examine 3 topics in distributed databases:
Replication: Keep a copy of the same data on several different nodes.
Partitioning: Split the database into smaller subsets and distributed the partitions to different nodes.
Transactions: Mechanisms to ensure that data is kept consistent in the database
Q: Usually, distributed databases employ both replication and partitioning at the same time. Why?
With replication, we keep identical copies of the data on different nodes.
D: Why do we need to replicate data?
In a replicated system, we have two node roles:
Depending on how replication is configured, we can see the following architectures
The general idea in replicated systems is that when a write occurs in node, this write is distributed to other nodes in either of the two following modes:
Synchronously: Writes need to be confirmed by a configurable number of slaves before the master reports success.
Asynchronously: The master reports success immediately after a write was committed to its own disk; slaves apply the changes in their own pace.
D: What are the advantages/disadvantages of each replication type?
The master ships all write statements to the slaves, e.g. all INSERT
or UPDATE
in tact. However, this is problematic:
UPDATE foo
SET updated_at=NOW()
WHERE id = 10
D: What is the issue with the code above in a replicated context?
Statement based replication is non-deterministic and mostly abandoned.
Most databases write their data to data structures, such as \(B^+\)-trees. However, before actually modifying the data structure, they write the intended change to an append-only write-ahead log (WAL).
WAL-based replication writes all changes to the master WAL also to the slaves. The slaves apply the WAL entries to get a consistent data.
The main problem with WAL-based replication is that it is bound to the implementation of the underlying data structure; if this changes in the master, the slaves stop working.
Postgres and Oracle use WAL-based replication.
The database generates a stream of logical updates for each update to the WAL. Logical updates can be:
MongoDB and MySQL use this replication mechanism.
id
to the state of the master’s replication log at the time the snapshot was createdid
Master
> SHOW MASTER STATUS;
+--------------------+----------+
File | Position |
| +--------------------+----------+
-bin.004252 | 30591477 |
| mariadb+--------------------+----------+
1 row in set (0.00 sec)
Slave
>CHANGE MASTER TO
='10.0.0.7',
MASTER_HOST='replicator',
MASTER_USER=3306,
MASTER_PORT=20,
MASTER_CONNECT_RETRY='mariadb-bin.452',
MASTER_LOG_FILE= 30591477; MASTER_LOG_POS
> SHOW SLAVE STATUS\G
10.0.0.7
Master_Host:
Master_User: replicator3306
Master_Port: -bin.452
Master_Log_File: mariadb34791477
Read_Master_Log_Pos: -bin.000032
Relay_Log_File: relay1332
Relay_Log_Pos:
Slave_IO_Running: Yes Slave_SQL_Running: Yes
Read-after-write: A client must always be able to read their own writes from any part of the system. Practically, the client can:
(non-) Monotonic reads: When reading from multiple replicas concurrently, a stale replica might not return records that the client read from an up to date one.
Causality violations: Violations of the happened-before property.
Async replication is fundamentally a consensus problem.
The biggest problem with multi-leader replication are write conflicts. To demonstrate this, let’s think in terms of git
:
# Clock
# User 1
git clone git://.... # t+1
git add foo.c # t+2
##hack hack hack
git commit -a -m 'Hacked v1' # t+3
git push # t+5
# User 2
git clone git://.... # t+2
git add foo.c # t+5
##hack hack hack
git commit -m 'Hacked new file' # t+6
git push # fails # t+7
git pull # CONFLICT # t+7
If we replace user with master node, we have exactly the same problem
One master per session If session writes do not interfere (e.g., data are only stored per user), this will avoid issues altogether.
Converge to consistent state Apply a last-write-wins policy, ordering writes by timestamp (may loose data) or report conflict to the application and let it resolve it (same as git
or Google docs)
Use version vectors Modelled after version clocks, they encode happens before relationships at an object level.
With partitioning, each host contains a fraction of the whole dataset.
The main reason is scalability:
Partitioning is always combined with replication. The reason is that with partitioning, a node failure will result in irreversible data corruption.
The following 3 strategies are
git
commits) or location specific data (e.g. all records from Europe).On partitioned datasets, clients need to be aware of the partitioning scheme in order to direct queries and writes to the appropriate nodes.
To hide partitioning details from the client, most partitioned systems feature a query router component siting between the client and the partitions.
The query router knows the employed partitioning scheme and directs requests to the appropriate partitions.
What could potentially go wrong?
What is the state of the data after this scenario?
As programmers, we are mostly concerned about the code’s happy path. Systems use transactions to guard against catastrophic scenarios.
Transactions[1] are blocks of operations (reads and writes), that either succeed or fail, as a whole.
Transactions are defined in the context of an application that wants to modify some data and a system that guards data consistency.
The concept of the transaction started with the first SQL database, System R; while the mechanics changed, the semantics are stable for the last 40 years.
Transactions provide the following guarantees[2]:
Isolation guards data consistency in the face of concurrency.
Databases try to hide concurrency issues from applications by providing transaction isolation.
By default, isolation entails executing transactions as if they were serially executed. To implement serializability, databases:
With MVCC, the database works like Git:
git checkout -b A
)git commit -a
):
git checkout master; git merge A
)git gc
)While serialiazable isolation works it may be slow. Consequently, historically, databases made compromises in how they implement isolation.
Isolation | DR | NRR | PR |
---|---|---|---|
Read uncommitted | X | X | X |
Read committed | X | X | |
Repeatable read | X | ||
Serializable |
In a distributed database, a transaction spanning multiple nodes must either succeed on all nodes or fail (to maintain atomicity).
Transactions may also span multiple systems; for example, we may try to remove a record from a database and add it to a queue service in an atomic way.
In contrast to the distributed systems consensus problem, all nodes must agree on whether a transaction has been successfully committed.
The most common mechanism used to dead with distributed atomic commits is the two-phase commit (2PC) protocol.
In 2PC, we find the following roles:
The 2PC protocol makes the following assumptions:
D: What problems do you see with 2PC?
This work is (c) 2017, 2018 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.