A fundamental area in Computer Science is to develop *algorithmic* processes for
efficiently *storing* and *retrieving* data with good performance. Nowadays,
data is fast-produced leading to high volume of data in no time. It is
anticipated that *Internet of Things* (IoT) devices will generate 400
zettabytes of data in 2018. Performing analytics or
science on such large datasets poses many problems with performance. Two
important building blocks in solving such problems are *Sorting* and *Searching*
algorithms. From the previous lectures, several classes of data structures were
introduced such as *lists*, *trees* and *graphs* that maps a given dataset into
the organization of that data structure. On a functional-level, the data is the
same but *working* with the data structure such as performing an *insert* or
*finding an element* comes with it's trade-offs to performance. This can be
important to consider depending on the character of the dataset and it's
application constraints.

The goal of this assignment is to separate the internals of a data structure
from its operations to understand how the overall performance varies between a
sorted and unsorted data structure. The two performance aspects that you will
investigate is the **time complexity** and **space complexity**. The data
structure you will work with is a conventional 2D list structure also known as a
*Matrix*. The concepts and performance measurements discussed in this assignment
should be applicable for other data structures such as a *graph*.

In this assignment, you will work with an open public dataset from *Brandweer
Amsterdam-Amstelland* that contains over **125.000**
fire alarms reported in this region. The dataset is provided in a compressed
zip-format and the resulting CSV-file will be ~124MB. While there are handy
libraries such as *pandas* that can read a CSV-file
into a table, we will only restrict to using core libraries that is part of
Python. To read a CSV-file to a 2D array, we do the following:

In [3]:

```
import csv #NB: don't forget to include the csv library
def load(csv_file):
"""
Loads the data from the file in the provided argument into a 2D array
"""
with open(csv_file) as csvDataFile:
return [row for row in csv.reader(csvDataFile, delimiter=';')]
```

If we inspect the first element of the resulting array, we obtain information
about the structure of the dataset and this is shown below. Consecutive elements
of the array follow this format where the first element in the row is the
*incident_id* and the last element is the *gemeente*

```
[['incident_id',
'start_tijd',
'incident_type',
'landelijke_meldingsclassificatie_niveau1',
'landelijke_meldingsclassificatie_niveau2',
'landelijke_meldingsclassificatie_niveau3',
'datum',
'jaar',
'maand_nr',
'maand_naam',
'dag_nr',
'dag_naam',
'week_nr',
'kwartaal',
'prioriteit',
'uur',
'dagdeel',
'objecttype',
'objectfunctie',
'buurt',
'wijk',
'woonplaats',
'gemeente'],
......
]
```

We can observe by inspecting the first 5 rows in the 2D array that the data is
*unordered*, the *incident_id* that represents the unique identification of each
fire alarm event and the creation time stamp of each event is out-of-sync.

**T1 (20 points):** Your first task is to implement the **insertion sort
algorithm** to sort the dataset according to the *incident_id* and the creation
date. Notice that the algorithm is slow and will not scale. Therefore only a
subset of the dataset will be used. From the function definition below, the
*table* argument represents the 2D array, *f* is the compare function and *limit*
describes the size of the dataset (NB: the offset is always the first element)

In [99]:

```
def insertionSort(table, f, limit=150):
"""
Takes a data table, take a subset of 150 elements and sorts it
NB: first value of the array is a schema of the data
"""
subtable = table[1:limit+1]
#implementation
sortedTable = []
for row in subtable:
index = 0
if (len(sortedTable) == 0):
sortedTable.append(row)
for element in sortedTable:
if f(element, row):
sortedTable.insert(index, row)
break
index += 1
if index == len(sortedTable):
sortedTable.append(row)
for i in range(0, len(sortedTable)):
table[i] = sortedTable[i]
```

Use `insertionSort`

to sort the dataset on incident id and print the first 20 id's. (Use the `cmpf`

function to do this)

In [38]:

```
incidents = load('brwaa_2010-2015.csv')
insertionSort(incidents, cmpf)
for i in range(0, 20):
print(incidents[i][0])
```

The compare function *f* that is provided in the argument list is used for
comparing two elements and is the piece of information the sort algorithm needs
to know in which order to sort elements. The _incident*id* should be sorted as an
Integer but in the 2D array it is a String. Therefore we need to
convert to an Integer and also explain the *natural order* of this type. The "0"
indicates incident id

In [14]:

```
def cmpf(rowA , rowB):
return int(rowA[0]) > int(rowB[0])
```

Similarly, this has to be done for date comparison

In [4]:

```
import datetime
def cmpfDate(rowA , rowB):
#6 = datum
# 2008/01/03 00:00:00.000000000, remove microsec
return datetime.datetime.strptime(rowA[6][:-10],'%Y/%m/%d %H:%M:%S') > datetime.datetime.strptime(rowB[6][:-10],'%Y/%m/%d %H:%M:%S')
```

After implementing the algorithm, try out the following *limit* values:

- limit=50
- limit=100
- limit=150
- limit=200
- limit=250

Run the algorithm for each of these limit values and measure how long each of them takes.

In [52]:

```
import time
for i in range(1,6):
incidents = load('brwaa_2010-2015.csv')
start_time = time.time()
insertionSort(incidents, cmpf, i*50)
print("Limit " + str(i*50) + " " + str(1000*(time.time() - start_time)) + " ms")
```

Notice that the time increases with the size of the data. An improvement to the
algorithm is to treat this problem using a *divide and conquer*-approach that
divides the dataset into sub-lists. This improvement is called the *shell
algorithm*

**T2 (10 points):** Your task is to modify the current implementation of the
**insertion sort** algorithm to work with *shell sort algorithm*. The modified
version should be called **gapInsertionSort** and use the arguments specified
below. You have to solve how the **startposition** and **sublistcount** are
used to make this work.

In [33]:

```
def shellSort(table, f, limit):
subtable = table[1:limit+1]
sublistcount = len(subtable)//2
while sublistcount > 0:
for startposition in range(sublistcount):
#implement the function call below
gapInsertionSort(subtable,f,startposition,sublistcount)
sublistcount = sublistcount // 2
return subtable
```

In [114]:

```
def gapInsertionSort(table, f, startposition, sublistcount):
for i in range(startposition+sublistcount,len(table),sublistcount):
currentvalue = table[i]
startposition = i
while startposition>=sublistcount and f(table[startposition-sublistcount],currentvalue):
table[startposition]=table[startposition-sublistcount]
startposition = startposition-sublistcount
table[startposition]=currentvalue
```

Use shellSort to sort the first 300 entries in the table on date and print the first 10 incident id's

In [115]:

```
incidents = load('brwaa_2010-2015.csv')
subtable = shellSort(incidents, cmpfDate, 300)
for i in range(10):
print(subtable[i][0])
```

This algorithm is faster than insertion sort, but it is still not fast enough to sort all 124,000 records, so we need to have an algorithm with a lower time complexity. There are two other algorithms that have a "O(n log n)" time complexity but their space complexity varies. For *merge sort*, the space complexity is O (n) which means a linear growth of the space but for a *quick sort*, the space complexity is constant but the performance could degrade to "O(n^2)" depending on the data and selection of the pivot. For instance, if the dataset is ordered or the elements are the same. This is not the case for the dataset that we are using

**T3 (30 points):** Your task is to implement the *quick sort algorithm*, the
stubs for the algorithm is presented here and the challenge here is choosing the
pivot such that it can in short time process all the records.

In [64]:

```
def quickSort(table,f):
quickSortHelper(table,f,0,len(table)-1)
def quickSortHelper(table,f,first,last):
if first<last:
splitpoint = partition(table,f,first,last)
quickSortHelper(table,f,first,splitpoint-1)
quickSortHelper(table,f,splitpoint+1,last)
def partition(table,f,first,last):
pivotvalue = table[first]
leftmark = first+1
rightmark = last
done = False
while not done:
while leftmark <= rightmark and not f(table[leftmark], pivotvalue):
leftmark = leftmark + 1
while not f(pivotvalue, table[rightmark]) and rightmark >= leftmark:
rightmark = rightmark -1
if rightmark < leftmark:
done = True
else:
temp = table[leftmark]
table[leftmark] = table[rightmark]
table[rightmark] = temp
temp = table[first]
table[first] = table[rightmark]
table[rightmark] = temp
return rightmark
```

Use quick sort algorithm to sort the table on date and print the first 10 incident id's

In [96]:

```
incidents = load('brwaa_2010-2015.csv')
subtable = incidents[1:len(incidents)]
quickSort(subtable, cmpfDate)
print([i[0] for i in subtable][0:10])
```

**T4 (10 points):** Run the shellSort and the quickSort algorithms on the same limit values as in the previous task and measure the run times.

In [117]:

```
import time
incidents = load('brwaa_2010-2015.csv')
r = range(50, 1000, 50)
print("insertionSort:")
for limit in r:
temp_incidents = incidents[1:limit+1]
start_time = time.time()
insertionSort(temp_incidents, cmpf, limit)
print("Limit " + str(limit) + " " + str(int(100000*(time.time() - start_time)) / 100) + " ms")
print("Shellsort:")
for limit in r:
temp_incidents = incidents[1:limit+1]
start_time = time.time()
shellSort(temp_incidents, cmpf, limit)
print("Limit " + str(limit) + " " + str(int(100000*(time.time() - start_time)) / 100) + " ms")
print("quickSort:")
for limit in r:
temp_incidents = incidents[1:limit+1]
start_time = time.time()
quickSort(temp_incidents, cmpf)
print("Limit " + str(limit) + " " + str(int(100000*(time.time() - start_time)) / 100) + " ms")
```

The final part of the assignment is to perform find elements in a data set. We will use two search techniques, a simple linear search and binary search.

**T5 (20 points):** Create a linear search algorithm and your task is to find the
following id's: 93094, 66515, 47279, 122471, 16515

In [135]:

```
def lsearch(table,f,fieldIndex,find):
for row in table[1:len(table)]:
# print(row[fieldIndex], find == int(row[fieldIndex]))
if find == int(row[fieldIndex]):
return row
return None
```

Use `lsearch`

to find the given id's and print their date (If it can't be found print something like: N/A). Also measure how long it takes combined to find them.

In [140]:

```
incidents = load('brwaa_2010-2015.csv')
start_time = time.time()
print(lsearch(incidents,cmpf,0,93094)[0])
print(lsearch(incidents,cmpf,0,66515)[0])
print(lsearch(incidents,cmpf,0,47279))
print(lsearch(incidents,cmpf,0,122471)[0])
print(lsearch(incidents,cmpf,0,16515))
print("lsearch took: ", int(100000*(time.time() - start_time)) / 100, " ms")
```

*Hint 1:* You can use the same compare function as in the previous tasks to
provide for *f*

*Hint 2:* The *fieldIndex=...* represents the number of the column in the table for which you want to search, so if you use an index of 0 you will search for an incident id

We could observe that for two of the search queries, they took considerable time, this due to the fact that we don't know where the items are. Many algorithms are designed with the idea that the data set is sorted. Therefore, we will try the same on a sorted data set. This way we might not need to search the entire table to know that a certain value does not appear in it.

In [157]:

```
def sortedLSearch(sortedTable,f,fieldIndex,find):
for row in sortedTable[1:len(sortedTable)]:
# print(row[fieldIndex])
if not f([find], row):
if find == int(row[fieldIndex]):
return row
else:
return None
```

Again try to search for the same id's and measure how long it takes (don't count the time it takes to sort the table), you will probably notice that the search executed faster.

In [271]:

```
incidents = load('brwaa_2010-2015.csv')
incidents = incidents[1:len(incidents)]
quickSort(incidents,cmpf)
start_time = time.time()
print(sortedLSearch(incidents,cmpf,0,93094)[0])
print(sortedLSearch(incidents,cmpf,0,66515)[0])
print(sortedLSearch(incidents,cmpf,0,47279))
print(sortedLSearch(incidents,cmpf,0,122471)[0])
print(sortedLSearch(incidents,cmpf,0,16515))
print("lsortedLSearch took: ", int(100000*(time.time() - start_time)) / 100, " ms")
```

We can further enhance the *search process* by exploiting the properties of
using a sorted array. A binary search algorithm uses the properties of a sorted
algorithm. The process is very intuitive and can be illustrated using the
*De Telefoongids* that is a sorted dataset with name and telephone numbers ordered.
If you look for someone with the surname "Hejderup", you will start at the index
H and continue the search there.

**T6 (20 points):** Your task is to implement a binary search version with the
same search queries as in the previous task.

In [183]:

```
def bsearch(sortedTable,f,fieldIndex,find):
left = 0
right = len(sortedTable)
current = (left+right)//2
while find != int(sortedTable[current][fieldIndex]):
if current <= left:
return None
if f(sortedTable[current],[find]):
right = current
else:
left = current
current = (left+right)//2
return sortedTable[current]
```

Again try to search for the same id's and measure how long it takes (don't count the time it takes to sort the table)

In [184]:

```
incidents = load('brwaa_2010-2015.csv')
incidents = incidents[1:len(incidents)]
quickSort(incidents,cmpf)
start_time = time.time()
print(bsearch(incidents,cmpf,0,93094)[0])
print(bsearch(incidents,cmpf,0,66515)[0])
print(bsearch(incidents,cmpf,0,47279))
print(bsearch(incidents,cmpf,0,122471)[0])
print(bsearch(incidents,cmpf,0,16515))
print("bsearch took: ", int(100000*(time.time() - start_time)) / 100, " ms")
```

*Optional (Bonus task)*

**T7 (10 points):** Last task is to find multiple entries, your task is to modify
the binary search such that all results for the 'gemeente' of Amsterdam are
returned.

In [265]:

```
def cmpfGemeente(rowA, rowB):
return str(rowA[22]).rstrip() > str(rowB[22]).rstrip()
def multiplebsearch(sortedTable,f,fieldIndex,find):
left = 0
right = len(sortedTable)
current = (left+right)//2
while find != str(sortedTable[current][fieldIndex]).rstrip():
print(find, str(sortedTable[current][fieldIndex]), find == str(sortedTable[current][fieldIndex]))
if current <= left:
return None
if f(sortedTable[current],[None] * fieldIndex + [find]):
right = current
else:
left = current
current = (left+right)//2
found = [sortedTable[current]]
cursor = current - 1
while find == str(sortedTable[cursor][fieldIndex]).rstrip():
found.append(sortedTable[cursor])
cursor -= 1
cursor = current + 1
while find == str(sortedTable[cursor][fieldIndex]).rstrip():
found.append(sortedTable[cursor])
cursor += 1
return found
```

Print all the incident id's of incidents that happened in the 'gemeente' Amsterdam

In [258]:

```
incidentsGemeente = load('brwaa_2010-2015.csv')
incidentsGemeente = incidents[1:len(incidentsGemeente)]
incidentsGemeente = shellSort(incidentsGemeente,cmpfGemeente, len(incidentsGemeente))
```

In [272]:

```
amsterdamIncidents = multiplebsearch(incidentsGemeente,cmpfGemeente,22,"Amsterdam")
for i in amsterdamIncidents[:10]:
print(i[0], i[22])
```