File systems determine how data is stored and retrieved. A file system keeps track of the following data items:
The primary job of the filesystem is to make sure that the data is always accessible and in tackt. To maintain consistency, most modern filesystems use a log (remember databases!).
Common filesystems are EXT4, NTFS, APFS and ZFS
read
ing from a filesystemread
ing from a file, block
layerA distributed filesystem is a file system which is shared by being simultaneously mounted on multiple servers. Data in a distributed file system is partioned and replicated across the network. Reads and writes can occur on any node.
Q: Windows and MacOSX already have network filesystems. Why do we need extra ones?
Network filesystems, such as CIFS and NFS, are not distributed filesystems, because they rely on a centralized server to maintain consistency. The server becomes a SPOF and a performance bottleneck.
The Google Filesystem paper[1] kicked-off the Big Data revolution. Why did they need it though?
A single file can contain many objects (e.g. Web documents)
Files are divided into fixed size chunks (64MB) with unique identifiers, generated at insertion time. Files are replicated (\(\ge 3\)) across chunk servers.
Chunkservers store chunks on local disk as Linux files
(chunk_handle, byte_range)
The Master maintains a mapping between file names and chunk locations.
No caching of (meta-)data allowed! (Why?)
The master does not keep a persistent record of chunk locations, but instead queries the chunk servers at startup and then is updated by periodic polling.
GFS is a journaled filesystem: all operations are added to a log first, then applied. Periodically, log compaction creates checkpoints. The log is replicated across nodes.
If a node fails…:
Chunkservers use checksums to detect data corruption
GFS uses vector clocks to keep track of stale replicas.
HDFS started at Yahoo as an open source replica of the GFS paper, but since v2.0 it is different system.
GFS | HDFS |
---|---|
Master | NameNode |
Chunkserver | DataNode |
operation log | journal |
chunk | block |
random file writes | append-only |
multiple writer/reader | single writer, multiple readers |
chunk: 64MB data, 32bit checksums | 128MB data, separate metadata file |
The main difference with GFS is that HDFS is a user-space filesystem written in Java.
# List directory
$ hadoop fs ls /
Found 7 items
drwxr-xr-x - gousiosg sg 0 2017-10-04 08:23 /audioscrobbler
-rw-r--r-- 3 gousiosg sg 1215803135 2017-10-04 08:25 /ghtorrent-logs.txt
-rw-r--r-- 3 gousiosg sg 66773425 2017-10-04 08:23 /imdb.csv
-rw-r--r-- 3 gousiosg sg 198376 2017-10-23 12:39 /important-repos.csv
-rw-r--r-- 3 gousiosg sg 611300 2017-10-04 08:24 /odyssey.mb.txt
-rw-r--r-- 3 gousiosg sg 388422973 2017-10-03 15:40 /pullreqs.csv
# Create a new file
$ dd if=/dev/zero of=foobar bs=1M count=100
# Upload file
$ hadoop fs -put foobar /
-rw-r--r-- 3 gousiosg sg 104857600 2017-11-27 15:42 /foobar
HDFS looks like a UNIX filesystem, but does not offer the full set of operations.
For the course, we only need to know how to upload data to HDFS.
This work is (c) 2017, 2018, 2019, 2020 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.