Distributed filesystems and databases are where Big Data lives, either temporarily or permanently. It is therefore important to get a practical feel of how things such as sharding, failover and distributed querying work.

In this assignment, your task is to configure a distributed database cluster, insert some data, perform some queries, make it fail and observe the failover procedures.

For performing the tasks below, you will use MongoDB and your programming language of choice (the recommended one is Python). MongoDB can work in standalone mode, as a normal client-server database similar to MySQL. However, what is really interesting about it is that it supports both replication and sharding (albeit sacrificing ACID properties).

Task 1: Configure a sharded cluster

Setup a MongoDB cluster, with sharding and replication enabled. For that, you will need to read about both replication and sharding in MongoDB. You need to install the latest version of MongoDB in the course VM, you can find instructions here.

# Create a directory for the whole installation
mkdir mongoassignment
cd mongoassignment

# Create data directories for each 
mkdir cfg0 cfg1 cfg2
mkdir a0 a1 a2
mkdir b0 b1 b2

# configuration nodes
mongod --configsvr --replSet conf --port 26050 --logpath log.cfg0 --logappend --dbpath cfg0 --fork
mongod --configsvr --replSet conf --port 26051 --logpath log.cfg1 --logappend --dbpath cfg1 --fork
mongod --configsvr --replSet conf --port 26052 --logpath log.cfg2 --logappend --dbpath cfg2 --fork

# replica set 1
mongod --shardsvr --replSet a --dbpath a0 --logpath log.a0 --port 27000 --logappend --smallfiles --fork
mongod --shardsvr --replSet a --dbpath a1 --logpath log.a1 --port 27001 --logappend --smallfiles --fork
mongod --shardsvr --replSet a --dbpath a2 --logpath log.a2 --port 27002 --logappend --smallfiles --fork

# replica set 2
mongod --shardsvr --replSet b --dbpath b0 --logpath log.b0 --port 28000 --logappend --smallfiles --fork
mongod --shardsvr --replSet b --dbpath b1 --logpath log.b1 --port 28001 --logappend --smallfiles --fork
mongod --shardsvr --replSet b --dbpath b2 --logpath log.b2 --port 28002 --logappend --smallfiles --fork

Then, you will need to configure the cluster

# At the MongoDB console for replica set a, add all replica set nodes
mongo --port 27000
> config = {
      _id: "a",
      version: 1,
      members: [
         { _id: 0, host : "localhost:27000" },
         { _id: 1, host : "localhost:27001" },
         { _id: 2, host : "localhost:27002" }
> rs.initiate(config)

# At the MongoDB console for replica set b, add all replica set nodes
mongo --port 28000
> config = {
      _id: "b",
      version: 1,
      members: [
         { _id: 0, host : "localhost:28000" },
         { _id: 1, host : "localhost:28001" },
         { _id: 2, host : "localhost:28002" }
> rs.initiate(config)

# At the MongoDB console for the configuration servers replica set, add all
# replica set nodes
mongo --port 26050
configsvr> config = {
      _id: "conf",
      version: 1,
      members: [
         { _id: 0, host : "localhost:26050" },
         { _id: 1, host : "localhost:26051" },
         { _id: 2, host : "localhost:26052" }
configsvr> rs.initiate(config)

And finally, you can start the query router, connect to it and setup sharding

# query router (one is enough, more are possible)
mongos --configdb conf/localhost:26050,localhost:26051,localhost:26052 --logappend --logpath log.mongos0 --port 27017

### On a different terminal
> sh.status()
> sh.addShard("a/localhost:27000")
> sh.addShard("b/localhost:28000")
> sh.getBalancerState()

# Create a database and setup sharding on it
> show databases
> use any_db_name
> db.createCollection("any_collection_name")
> show databases
> sh.enableSharding("any_db_name")

# For each collection that must be sharded, specify a sharding key
> sh.shardCollection("any_db_name.any_collection_name", { "any_field_name" : 1 } )

After you do the above, answer to the following questions:

Task 2: Creating data

Write a program that generates 100.000 values in the following format

  id: 123 //a unique integer
  name: "A random name from the list here http://www.gutenberg.org/files/3201/files/NAMES.TXT",
  address : [{ // An array populated with 1-10 elements, randomly
    street: "A random place from the list here http://www.gutenberg.org/files/3201/files/PLACES.TXT",
    number: 12 //A random integer in the range 1 -- 500

The program should also be able to connect to a MongoDB database, using the appropriate MongoDB driver for your language of choice, and write the generated values. Make sure you print a message on the terminal every time you succesfully write an entry in the database.

<space for copying the source code to>

Task 3: Perform queries

Given the database of 100.000 items you created in the step above, run the following queries at the MongoDB command prompt, in both the standalone and the sharded database version. Fill in the table below with your answers:

Description Query Result (standalone) Time (standalone) Result (cluster) Time (cluster)
1. Number of items in collection addresses db.addresses.count() 100000 10ms 100000 5ms
2. Items where the id is bigger than 50.000
3. Names starting with either ‘A’ or ‘K’
4. Unique street names whose address is even
5. Number of people based on their address

Hint: queries in MongoDB return cursors (iterators) which are lazy; you need to retrieve all results in order to evaluate how much time it takes to execute them. You can use a function like the one below at the MongoDB prompt.

mongo> function timeit(query) {
  var a = Date.now(); 
  return (Date.now() - a);
mongo> timeit(db.addresses.find({id: 1}))

Task 4: Making it fail

Delete all the data from your cluster. Start your data generator again. While the data generator works, remove one cluster node. You can kill a MongoDb process; remove a network cable; stop VM; stop a Docker container; the exact process of removing cluster nodes depends on how you setup the cluster.

After you removed the cluster node, you should observe the cluster reconfigure itself. What where the steps that where taken? How long did the process take? What happened on the client side? How many records where in the database after the data generator finishes? Describe the failover process below.

Restart the failed node. What happens?

Hint: Make sure you configure MongoDB to print its logs to a file, so that you can inspect it at a later time.