A distributed database is a distributed system designed to provide read/write access to data, using the relational or other format.
We will examine 3 topics in distributed databases:
Replication: Keep a copy of the same data on several different nodes.
Partitioning: Split the database into smaller subsets and distributed the partitions to different nodes.
Transactions: Mechanisms to ensure that data is kept consistent in the database
Q: Usually, distributed databases employ both replication and partitioning at the same time. Why?
With replication, we keep identical copies of the data on different nodes.
D: Why do we need to replicate data?
In a replicated system, we have two node roles:
Depending on how replication is configured, we can see the following architectures
The general idea in replicated systems is that when a write occurs in node, this write is distributed to other nodes in either of the two following modes:
Synchronously: Writes need to be confirmed by a configurable number of slaves before the master reports success.
Asynchronously: The master reports success immediately after a write was committed to its own disk; slaves apply the changes in their own pace.
D: What are the advantages/disadvantages of each replication type?
The master ships all write statements to the slaves, e.g. all INSERT
or UPDATE
in tact. However, this is problematic:
UPDATE foo
SET updated_at=NOW()
WHERE id = 10
D: What is the issue with the code above in a replicated context?
Statement based replication is non-deterministic and mostly abandoned.
Most databases write their data to data structures, such as \(B^+\)-trees. However, before actually modifying the data structure, they write the intended change to an append-only write-ahead log (WAL).
WAL-based replication writes all changes to the master WAL also to the slaves. The slaves apply the WAL entries to get a consistent data.
The main problem with WAL-based replication is that it is bound to the implementation of the underlying data structure; if this changes in the master, the slaves stop working.
Postgres and Oracle use WAL-based replication.
The database generates a stream of logical updates for each update to the WAL. Logical updates can be:
MongoDB and MySQL use this replication mechanism.
id
to the state of the master’s replication log at the time the snapshot was createdid
Master
> SHOW MASTER STATUS;
+--------------------+----------+
File | Position |
| +--------------------+----------+
-bin.004252 | 30591477 |
| mariadb+--------------------+----------+
1 row in set (0.00 sec)
Slave
>CHANGE MASTER TO
='10.0.0.7',
MASTER_HOST='replicator',
MASTER_USER=3306,
MASTER_PORT=20,
MASTER_CONNECT_RETRY='mariadb-bin.452',
MASTER_LOG_FILE= 30591477; MASTER_LOG_POS
> SHOW SLAVE STATUS\G
10.0.0.7
Master_Host:
Master_User: replicator3306
Master_Port: -bin.452
Master_Log_File: mariadb34791477
Read_Master_Log_Pos: -bin.000032
Relay_Log_File: relay1332
Relay_Log_Pos:
Slave_IO_Running: Yes Slave_SQL_Running: Yes
Read-after-write: A client must always be able to read their own writes from any part of the system. Practically, the client can:
(non-) Monotonic reads: When reading from multiple replicas concurrently, a stale replica might not return records that the client read from an up to date one.
Causality violations: Violations of the happened-before property.
Async replication is fundamentally a consensus problem.
The biggest problem with multi-leader replication are write conflicts. To demonstrate this, let’s think in terms of git
:
# Clock
# User 1
git clone git://.... # t+1
git add foo.c # t+2
##hack hack hack
git commit -a -m 'Hacked v1' # t+3
git push # t+5
# User 2
git clone git://.... # t+2
git add foo.c # t+5
##hack hack hack
git commit -m 'Hacked new file' # t+6
git push # fails # t+7
git pull # CONFLICT # t+7
If we replace user with master node, we have exactly the same problem
One master per session If session writes do not interfere (e.g., data are only stored per user), this will avoid issues altogether.
Converge to consistent state Apply a last-write-wins policy, ordering writes by timestamp (may loose data) or report conflict to the application and let it resolve it (same as git
or Google docs)
Use version vectors Modelled after version clocks, they encode happens before relationships at an object level.