{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment on Distributed Databases\n", "\n", "Distributed filesystems and databases are where Big Data lives, either\n", "temporarily or permanently. It is therefore important to get a practical\n", "feel for how things such as sharding, failover and distributed querying\n", "work.\n", "\n", "In this assignment, your task is to configure a distributed database cluster,\n", "insert some data, perform some queries, make it fail and observe the failover\n", "procedures.\n", "\n", "For performing the tasks below, you will use \n", "[MongoDB](https://docs.mongodb.com) and your programming language of \n", "choice (the recommended one is Python). MongoDB can work in standalone mode, as \n", "a normal client-server database similar to MySQL. However, what is really \n", "interesting about it is that it supports both _replication_ and _sharding_ \n", "(albeit sacrificing ACID properties).\n", "\n", "## Task 1: Configure a sharded cluster\n", "Set up a MongoDB cluster with sharding\n", "and replication enabled. For that, you will need to read about both\n", "[replication](https://docs.mongodb.com/v3.4/replication/) and \n", "[sharding](https://docs.mongodb.com/v3.4/sharding/) in MongoDB. 
You need to\n", "install the latest version of MongoDB in the course VM; you can find\n", "[instructions here](https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04).\n", "\n", "```bash\n", "# Create a directory for the whole installation\n", "mkdir mongoassignment\n", "cd mongoassignment\n", "\n", "# Create data directories for each node\n", "mkdir cfg0 cfg1 cfg2\n", "mkdir a0 a1 a2\n", "mkdir b0 b1 b2\n", "\n", "# configuration nodes\n", "mongod --configsvr --replSet conf --port 26050 --logpath log.cfg0 --logappend --dbpath cfg0 --fork\n", "mongod --configsvr --replSet conf --port 26051 --logpath log.cfg1 --logappend --dbpath cfg1 --fork\n", "mongod --configsvr --replSet conf --port 26052 --logpath log.cfg2 --logappend --dbpath cfg2 --fork\n", "\n", "# Note: --smallfiles only applies to the MMAPv1 storage engine and was removed in\n", "# MongoDB 4.2; omit it if mongod rejects the option\n", "# replica set 1\n", "mongod --shardsvr --replSet a --dbpath a0 --logpath log.a0 --port 27000 --logappend --smallfiles --fork\n", "mongod --shardsvr --replSet a --dbpath a1 --logpath log.a1 --port 27001 --logappend --smallfiles --fork\n", "mongod --shardsvr --replSet a --dbpath a2 --logpath log.a2 --port 27002 --logappend --smallfiles --fork\n", "\n", "# replica set 2\n", "mongod --shardsvr --replSet b --dbpath b0 --logpath log.b0 --port 28000 --logappend --smallfiles --fork\n", "mongod --shardsvr --replSet b --dbpath b1 --logpath log.b1 --port 28001 --logappend --smallfiles --fork\n", "mongod --shardsvr --replSet b --dbpath b2 --logpath log.b2 --port 28002 --logappend --smallfiles --fork\n", "```\n", "\n", "Then, you will need to configure the cluster:\n", "\n", "```\n", "# At the MongoDB console for replica set a, add all replica set nodes\n", "mongo --port 27000\n", "> config = {\n", " _id: \"a\",\n", " version: 1,\n", " members: [\n", " { _id: 0, host : \"localhost:27000\" },\n", " { _id: 1, host : \"localhost:27001\" },\n", " { _id: 2, host : \"localhost:27002\" }\n", " ]\n", " }\n", "> rs.initiate(config)\n", "a:PRIMARY>\n", "\n", "# At the MongoDB console for replica set b, add all 
replica set nodes\n", "mongo --port 28000\n", "> config = {\n", " _id: \"b\",\n", " version: 1,\n", " members: [\n", " { _id: 0, host : \"localhost:28000\" },\n", " { _id: 1, host : \"localhost:28001\" },\n", " { _id: 2, host : \"localhost:28002\" }\n", " ]\n", " }\n", "> rs.initiate(config)\n", "b:PRIMARY>\n", "\n", "# At the MongoDB console for the configuration servers replica set, add all\n", "# replica set nodes\n", "mongo --port 26050\n", "configsvr> config = {\n", " _id: \"conf\",\n", " version: 1,\n", " members: [\n", " { _id: 0, host : \"localhost:26050\" },\n", " { _id: 1, host : \"localhost:26051\" },\n", " { _id: 2, host : \"localhost:26052\" }\n", " ]\n", " }\n", "configsvr> rs.initiate(config)\n", "conf:PRIMARY>\n", "```\n", "\n", "Finally, you can start the query router, connect to it and set up sharding:\n", "\n", "```bash\n", "# query router (one is enough, more are possible)\n", "mongos --configdb conf/localhost:26050,localhost:26051,localhost:26052 --logappend --logpath log.mongos0 --port 27017\n", "\n", "### On a different terminal\n", "mongo\n", "> sh.status()\n", "> sh.addShard(\"a/localhost:27000\")\n", "> sh.addShard(\"b/localhost:28000\")\n", "> sh.getBalancerState()\n", "\n", "# Create a database and set up sharding on it\n", "> show databases\n", "> use any_db_name\n", "> db.createCollection(\"any_collection_name\")\n", "> show databases\n", "> sh.enableSharding(\"any_db_name\")\n", "\n", "# For each collection that must be sharded, specify a sharding key\n", "> sh.shardCollection(\"any_db_name.any_collection_name\", { \"any_field_name\" : 1 } )\n", "```\n", "\n", "After you do the above, answer the following questions:\n", "\n", "- Given a dataset of [GitHub commits](https://developer.github.com/v3/repos/commits/),\n", "in JSON format, what would be the appropriate key to shard upon? Why?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- If the total dataset is 1.000.000 items and we have a 4-way sharded cluster,\n", "estimate how many items will reside on each node given the sharding key you \n", "chose above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What algorithm does MongoDB use for leader election?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- How can we take an incremental (e.g. every day) backup from the cluster,\n", "without overloading it?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2: Creating data\n", "\n", "Write a program that generates 100.000 values in the following format\n", "\n", "```javascript\n", "{\n", " id: 123, // a unique integer\n", " name: \"A random name from the list here http://www.gutenberg.org/files/3201/files/NAMES.TXT\",\n", " address : [{ // An array populated with 1-10 elements, randomly\n", " street: \"A random place from the list here http://www.gutenberg.org/files/3201/files/PLACES.TXT\",\n", " number: 12 //A random integer in the range 1 -- 500\n", " }] \n", "}\n", "```\n", "\n", "The program should also be able to connect to a MongoDB database, using the \n", "appropriate MongoDB driver for your language of choice, and write\n", "the generated values. 
Make sure you print a message on the terminal every\n", "time you successfully write an entry in the database.\n", "\n", "- Copy the source code for your program here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- On a single MongoDB instance (not the cluster), create a database called\n", "`data` and a collection called `addresses`. Using the program you wrote, \n", "insert the values into the `addresses` collection. How much time does it take?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Create a database called `data` and a sharded collection called \n", "`addresses`. Select the appropriate sharding key and justify your selection. \n", "Insert the generated values into your MongoDB cluster. How much time does it \n", "take? If it takes more time than in the standalone mode, where does the extra \n", "time go? If it takes less, what makes it faster?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3: Perform queries\n", "\n", "Given the database of 100.000 items you created in the step above, run the\n", "following queries at the MongoDB command prompt, in both the standalone and\n", "the sharded database version. \n", "\n", "**Hint**: queries in MongoDB return cursors (iterators) which are lazy; you need to \n", "retrieve all results in order to evaluate how much time it takes to execute\n", "them. 
You can use a function like the one below at the MongoDB prompt.\n", "\n", "```javascript\n", "mongo> function timeit(query) {\n", " var a = Date.now(); \n", " query.forEach(function(){}); \n", " return (Date.now() - a);\n", "}\n", "mongo> timeit(db.addresses.find({id: 1}))\n", "```\n", "Fill in the table below with your answers:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|Description|Query |Result (standalone)|Time (standalone) |Result (cluster)|Time (cluster)|\n", "|-----------|------|-------------------|------------------|----------------|--------------|\n", "|1. Number of items in collection `addresses`| `db.addresses.count()` | 100000| 10ms | 100000| 5ms|\n", "|2. Items where the `id` is bigger than 50.000||||||\n", "|3. Names starting with either 'A' or 'K'||||||\n", "|4. Unique street names whose address `number` is even ||||||\n", "|5. Number of people based on their address ||||||" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4: Making it fail\n", "\n", "Delete all the data from your cluster. Start your data generator again.\n", "While the data generator works, remove one cluster node. You can kill a MongoDB\n", "process; remove a network cable; stop a VM; stop a Docker container; the exact\n", "process of removing cluster nodes depends on how you set up the cluster.\n", "\n", "After you have removed the cluster node, you should observe the cluster reconfigure\n", "itself. What were the steps that were taken? How long did the process take?\n", "What happened on the client side? How many records were in the database after\n", "the data generator finished? Describe the failover process below.\n", "\n", "**Hint**: Make sure you configure MongoDB to print its logs to a file, so\n", "that you can inspect it at a later time." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Restart the failed node. 
What happens?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }