{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Searching\n",
"\n",
"A fundamental area in Computer Science is to develop _algorithmic_ processes for\n",
"efficiently _storing_ and _retrieving_ data with good performance. Nowadays,\n",
"data is fast-produced leading to high volume of data in no time. It is\n",
"anticipated that _Internet of Things_ (IoT) devices will generate [400\n",
"zettabytes](https://goo.gl/xu4NcT) of data in 2018. Performing analytics or\n",
"science on such large datasets poses many problems with performance. Two\n",
"important building blocks in solving such problems are _Sorting_ and _Searching_\n",
"algorithms. From the previous lectures, several classes of data structures were\n",
"introduced such as _lists_, _trees_ and _graphs_ that maps a given dataset into\n",
"the organization of that data structure. On a functional-level, the data is the\n",
"same but _working_ with the data structure such as performing an _insert_ or\n",
"_finding an element_ comes with it's trade-offs to performance. This can be\n",
"important to consider depending on the character of the dataset and it's\n",
"application constraints.\n",
"\n",
"The goal of this assignment is to separate the internals of a data structure\n",
"from its operations to understand how the overall performance varies between a\n",
"sorted and unsorted data structure. The two performance aspects that you will\n",
"investigate is the **time complexity** and **space complexity**. The data\n",
"structure you will work with is a conventional 2D list structure also known as a\n",
"*Matrix*. The concepts and performance measurements discussed in this assignment\n",
"should be applicable for other data structures such as a _graph_.\n",
"\n",
"In this assignment, you will work with an open public dataset from [_Brandweer\n",
"Amsterdam-Amstelland_](https://goo.gl/tE3Ahi) that contains over **125.000**\n",
"fire alarms reported in this region. The dataset is provided in a compressed\n",
"zip-format and the resulting CSV-file will be ~124MB. While there are handy\n",
"libraries such as [_pandas_](http://pandas.pydata.org/) that can read a CSV-file\n",
"into a table, we will only restrict to using core libraries that is part of\n",
"Python. To read a CSV-file to a 2D array, we do the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import csv #NB: don't forget to include the csv library\n",
"\n",
"def load(csv_file):\n",
" \"\"\"\n",
" Loads the data from the file in the provided argument into a 2D array\n",
" \"\"\"\n",
" with open(csv_file) as csvDataFile:\n",
" return [row for row in csv.reader(csvDataFile, delimiter=';')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we inspect the first element of the resulting array, we obtain information\n",
"about the structure of the dataset and this is shown below. Consecutive elements\n",
"of the array follow this format where the first element in the row is the\n",
"*incident_id* and the last element is the *gemeente*\n",
"```\n",
"[['incident_id',\n",
" 'start_tijd',\n",
" 'incident_type',\n",
" 'landelijke_meldingsclassificatie_niveau1',\n",
" 'landelijke_meldingsclassificatie_niveau2',\n",
" 'landelijke_meldingsclassificatie_niveau3',\n",
" 'datum',\n",
" 'jaar',\n",
" 'maand_nr',\n",
" 'maand_naam',\n",
" 'dag_nr',\n",
" 'dag_naam',\n",
" 'week_nr',\n",
" 'kwartaal',\n",
" 'prioriteit',\n",
" 'uur',\n",
" 'dagdeel',\n",
" 'objecttype',\n",
" 'objectfunctie',\n",
" 'buurt',\n",
" 'wijk',\n",
" 'woonplaats',\n",
" 'gemeente'],\n",
" ......\n",
" ]\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ordering the dataset\n",
"We can observe by inspecting the first 5 rows in the 2D array that the data is\n",
"_unordered_, the *incident_id* that represents the unique identification of each\n",
"fire alarm event and the creation time stamp of each event is out-of-sync.\n",
"\n",
"\n",
"**T1 (20 points):** Your first task is to implement the **insertion sort\n",
"algorithm** to sort the dataset according to the *incident_id* and the creation\n",
"date. Notice that the algorithm is slow and will not scale. Therefore only a\n",
"subset of the dataset will be used. From the function definition below, the\n",
"_table_ argument represents the 2D array, _f_ is the compare function and _limit_\n",
"describes the size of the dataset (NB: the offset is always the first element)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def insertionSort(table, f, limit=150):\n",
" \"\"\"\n",
" Takes a data table, take a subset of 150 elements and sorts it\n",
" NB: first value of the array is a schema of the data\n",
" \"\"\"\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `insertionSort` to sort the dataset on incident id and print the first 20 id's. (Use the `cmpf` function to do this)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The compare function _f_ that is provided in the argument list is used for\n",
"comparing two elements and is the piece of information the sort algorithm needs\n",
"to know in which order to sort elements. The _incident_id_ should be sorted as an\n",
"Integer but in the 2D array it is a String. Therefore we need to\n",
"convert to an Integer and also explain the _natural order_ of this type. The \"0\"\n",
"indicates incident id"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def cmpf(rowA , rowB):\n",
" return int(rowA[0]) > int(rowB[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, this has to be done for date comparison "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def cmpfDate(rowA , rowB):\n",
" #import datetime\n",
" #6 = datum\n",
" # 2008/01/03 00:00:00.000000000, remove microsec\n",
" return datetime.datetime.strptime(rowA[6][:-10],'%Y/%m/%d %H:%M:%S') >\n",
" datetime.datetime.strptime(rowB[6][:-10],'%Y/%m/%d %H:%M:%S')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After implementing the algorithm, try out the following _limit_ values:\n",
"\n",
"- limit=50\n",
"- limit=100\n",
"- limit=150\n",
"- limit=200\n",
"- limit=250\n",
"\n",
"Run the algorithm for each of these limit values and measure how long each of them takes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the time increases with the size of the data. An improvement to the\n",
"algorithm is to treat this problem using a *divide and conquer*-approach that\n",
"divides the dataset into sub-lists. This improvement is called the *shell\n",
"algorithm*\n",
"\n",
"\n",
"**T2 (10 points):** Your task is to modify the current implementation of the\n",
"**insertion sort** algorithm to work with *shell sort algorithm*. The modified\n",
"version should be called **gapInsertionSort** and use the arguments specified\n",
"below. You have to solve how the **startposition** and **sublistcount** are\n",
"used to make this work."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def shellSort(table, f, limit):\n",
" subtable = table[1:limit+1]\n",
" sublistcount = len(subtable)//2\n",
" while sublistcount > 0:\n",
" for startposition in range(sublistcount):\n",
" #implement the function call below\n",
" gapInsertionSort(subtable,f,startposition,sublistcount)\n",
" print(\"After increments of size\",sublistcount,\n",
" \"The list is\",table)\n",
"\n",
" sublistcount = sublistcount // 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def gapInsertionSort(table, f, startposition, sublistcount):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use shellSort to sort the first 300 entries in the table on date and print the first 10 incident id's"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This algorithm is faster than insertion sort, but it is still not fast enough to sort all 124,000 records, so we need to have an algorithm with a lower time complexity. There are two other algorithms that have a \"O(n log n)\" time complexity but their space complexity varies. For *merge sort*, the space complexity is O (n) which means a linear growth of the space but for a *quick sort*, the space complexity is constant but the performance could degrade to \"O(n^2)\" depending on the data and selection of the pivot. For instance, if the dataset is ordered or the elements are the same. This is not the case for the dataset that we are using\n",
"\n",
"**T3 (30 points):** Your task is to implement the *quick sort algorithm*, the\n",
"stubs for the algorithm is presented here and the challenge here is choosing the\n",
"pivot such that it can in short time process all the records."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def quickSort(table):\n",
" quickSortHelper(table,0,len(alist)-1)\n",
"\n",
"def quickSortHelper(table,first,last):\n",
" pass\n",
"\n",
"def partition(alist,table,last):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use quick sort algorithm to sort the table on date and print the first 10 incident id's"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**T4 (10 points):** Run the shellSort and the quickSort algorithms on the same limit values as in the previous task and measure the run times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Searching the dataset\n",
"The final part of the assignment is to perform find elements in a data set. We\n",
"will use two search techniques, a simple linear search and binary search.\n",
"\n",
"**T5 (20 points):** Create a linear search algorithm and your task is to find the\n",
"following id's: 93094, 66515, 47279, 122471, 16515"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def lsearch(table,f,fieldIndex,find):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `lsearch` to find the given id's and print their date (If it can't be found print something like: N/A). Also measure how long it takes combined to find them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Hint 1:* You can use the same compare function as in the previous tasks to\n",
"provide for *f*\n",
"\n",
"*Hint 2:* The *fieldIndex=...* represents the number of the column in the table for which you want to search, so if you use an index of 0 you will search for an incident id\n",
"\n",
"We could observe that for two of the search queries, they took considerable\n",
"time, this due to the fact that we don't know where the items are. Many\n",
"algorithms are designed with the idea that the data set is sorted. Therefore, we\n",
"will try the same on a sorted data set. This way we might not need to search the entire table to know that a certain value does not appear in it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def sortedLSearch(sortedTable,f,fieldIndex,find):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again try to search for the same id's and measure how long it takes (don't count the time it takes to sort the table), you will probably notice that the search executed faster. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can further enhance the *search process* by exploiting the properties of\n",
"using a sorted array. A binary search algorithm uses the properties of a sorted\n",
"algorithm. The process is very intuitive and can be illustrated using the\n",
"*De Telefoongids* that is a sorted dataset with name and telephone numbers ordered.\n",
"If you look for someone with the surname \"Hejderup\", you will start at the index\n",
"H and continue the search there.\n",
"\n",
"**T6 (20 points):** Your task is to implement a binary search version with the\n",
"same search queries as in the previous task."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def bsearch(sortedTable,f,fieldIndex,find):\n",
" pass"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Again try to search for the same id's and measure how long it takes (don't count the time it takes to sort the table)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Optional (Bonus task)*\n",
"\n",
"**T7 (10 points):** Last task is to find multiple entries, your task is to modify\n",
"the binary search such that all results for the 'gemeente' of Amsterdam are\n",
"returned. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def multiplebsearch(sortedTable,f,fieldIndex,find):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print all the incident id's of incidents that happened in the 'gemeente' Amsterdam"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}