Name:

Student Number:

BSc CS/TI or minor student?:

# Instructions

• Before you start, write down your name and student number on this page. On all the following pages, write your student number at the top of the page.
• When you are asked to provide implementations, you can provide them in either Scala or Python. Your code should be generally correct, minor omissions can be tolerated, as long as the algorithm is appropriate.
• The use of material (book, slides, laptop, etc.) during the exam is not allowed.
• Use the provided paper for notes etc. Don’t write on the exam paper.
• Write clearly! If your writing cannot be deciphered, it will not be considered for grading.

The total number of points is 75. You have 180 minutes: Good Luck!

# Open Questions

Please provide brief answers to the following questions. Be precice!

## Data processing with FP

1. (2 points) Describe what is a relation and what is a Key/Value pair and comment on their differences.
1. Provide the function signatures for the following functions of the type List[A].

• (1 point) foldL:
• (1 point) reduceR:
• (1 point) flatMap:
• (1 point) scan:
• (1 point) groupBy:
2. (2 points) Implement groupBy for List[A] using foldL.

1. (2 points) Implement foldL for List[A] using foldR
1. (2 points) Implement leftJoin(xs:[(K, A)], ys:[(K, B)]) for KV pairs xs and ys where the type K signifies the key. What is the return type?
1. (2 points) How would you rewrite (in any language) the following 2 examples without side-effects?

Example 1, in Scala

var sum = 0
for (i <- 1 to 10) {
sum = sum + i
}

Example 2, in Python

x = {}
for i in [1,2,3,4,5,6]:
key = i % 2
if key in x.keys():
x[key].append(i)
else:
x[key] = [i]
1. (2 points) Describe what monads are and why they are useful. Give an example of a useful monad application.

## Distributed systems/databases/filesystems

1. (2 points) What is Amdahl’s law and what are its implications?
1. (2 points) What is the CAP conjecture and what are its implications?
1. (3 points) What are the 3 typical replication architectures? Describe one benefit and one drawback for each.
1. (2 points) Explain the guarantees given by each “letter” in an ACID compliant database.
A:

C:

I:

D:

1. (6 points) Consider a cluster of 6 machines running HDFS (1 namenode, 5 datanodes). Each node in the cluster has a total of 1TB hard disk space and 2GB of main memory available. The cluster uses a block-size of 64 MB and a replication factor of 3. The master maintains 100 bytes of metadata for each 64MB block. Imagine that we upload a 128GB file.
• (2 points) How much data (in GBs) does each datanode store?
• (4 points) Describe the operations that HDFS will do (server-side) in order to store the file.
1. (10 points) You are tasked to design a system that reliably delivers messages from one client to another (think Whatsapp). The process to deliver a message from client A to client B is the following:

1. A writes outgoing messages to a queue
2. The system retrieves it from the queue and stores it in a database
3. The system notifies B that a message is waiting
4. B connects, retrieves the message and marks it as read (let’s ignore read receipts for now)

The system needs to work 24/7 and be resilient to database and queue failures. You will be using ready-made databases and queues that support replication, sharding and transactions. How would you design it? Discuss the advantages and disadvantages of your design decisions.

## Spark

You are given the following dataset; it contains citation information in the IEEE CS format (lines that end with \ continue to the next line):

S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with spark: \
Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.
H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning spark: \
Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.
B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.
M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.
H. Karau and R. Warren, High performance spark. O’Reilly Media, Inc., 2017.
T. H. Cormen, C. E. Leiserson, Ronald L. Rivest, and C. Stein, Introduction \
to algorithms (3rd ed.). MIT press, 2009.
P. Louridas, Real world algorithms. MIT press, 2017.

The format looks like this:

author1, author2, ... , authorN, title. publisher, year.
1. (2 points) Write code to load the data into an RDD. Then, convert the RDD into a DataFrame.
1. (2 points) Write a query (in SQL or programmatically) to print a list of publisher names along with the number of publications in the dataset. For the given dataset, the output should look like:
O’Reilly Media, Inc., 5
MIT press, 2
1. (2 points) Write a query (in SQL or programmatically) to find the author with the most publications. In our case that should be: M. Zaharia.
1. (2 points) In the 2 queries you have writen above, which operators cause a data shuffle?
1. (2 points) Explain how Spark deals with fault tolerance.
1. (2 points) Explain how Spark schedules jobs on a cluster.

## Streaming

Only answer the following questions if you are a BSc TI student!

1. (3 points) Why do we need streaming windows? What is a range and what is a slice in a streaming window?
1. (3 points) Say that we have 5 different event sources and 20 clients that can consume all of them; briefly describe 3 ways in which we can connect the sources to the clients, and comment on the relative drawbacks of each.

## Unix

1. (2 points) Write a program that given a directory of text files, prints the 10 most common words in those files (across all files)

For the next 2 questions, you are given a CSV file whose contents look like the following:

$cat fileA user_id, account_id, balance 10, 12, 1000 10, 1, 32 12, 122, 5 ...$ cat fileB
account_id, transaction_id, amount
12, 332, 21
122, 21, 20
...
1. (2 points) Write a pipeline to find and report duplicate lines in fileA
1. (2 points) Write a pipeline to list all transactions per user.

# Multiple choice questions

Please select and circle only one answer per question.

1. (1 point) What is the correct function signature for reduce on Spark RDDs?

1. RDD[A].reduce(f: (A,B) -> B)
2. RDD[A].reduce(f: (A,A) -> A)
3. RDD[A].reduce(init: B, seqOp: (A, B) -> A, combOp: (B, B) -> B)
4. RDD[A].reduce(init:A, f: (A,A) -> A)
2. (1 point) Distributed systems:

1. Share memory
2. Share CPUs
3. Share networks
4. None of the above
3. (1 point) Lamport timestamps

1. enable us to establish total causal ordering of events
2. enable us to establish a total order of events
3. given 2 events A and B, enable us to determine which one happend first
4. none of the above
4. (1 point) The serializable transaction isolation level protects transcations against:

1. Dirty reads
2. Non-repeatable reads
3. Phantom reads
4. all of the above
5. (1 point) Which of the following methods is part of the Observer interface, for dealing with push-based data consumption?

1. def subscribe(obs: Observer[A]): Unit
2. def onNext(a: A): Unit
3. def map(f: (A) -> B): [B]
4. def onExit(): Unit
6. (1 point) An transformation in Spark:

1. Schedules a pipeline run
2. Runs a pipeline and reports a result
3. Triggers a data shuffle
4. Does nothing
7. (1 point) Which of the following is a likely computation order of applying reduceR to a list of 10 integers with the ‘+’ operator?

1. (((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)
2. (((1 + 2) + (3 + 4)) + 5) + (((6 + 7) + (8 + 9)) + 0)
3. (1 + (2 + (3 + (4 + (5 + (6 + (7 + (8 + (9 + 0)))))))))
4. (1 + 2) + (3 + 4) + (5 + 6) + (7 + 8) + (9 + 0)
8. (1 point) Collaborative filtering:

1. Allows us to predict a user’s preferences using his/her previous choices
2. Allows us to predict a user’s preferences using other similar user’s preferences
3. Is a technique where swarms of collaborating users inspect and filter content before it reaches its destination
4. Is used by Google to filter out search results
9. (1 point) What is precision in the context of Machine Learning? ($$TP$$ = True Positive, $$FP$$ = False Positive, $$TN$$ = True Negative, $$FN$$ = False Negative)

1. $$\frac{TP}{TP + FP}$$
2. $$\frac{TP}{TP + FN}$$
3. $$\frac{FP}{FP + TN}$$
4. None of the above
10. (1 point) What is Byzantine fault tolerance?

1. Resilience against multiple node failures
2. Resilience against suboptimal voting
3. Tolerating node failures when electing cluster leaders
4. A mechanism used by the Byzantine empire for the good of its people
11. (1 point) What is eventual consistency?

1. At any time, the system is linearisable
2. At any time, concurrent reads from any node return the same values
3. If writes stop, all reads will return the same value after a while
4. If writes stop, a distributed system will become consistent
12. (1 point) Which of the following RDD API calls is a performance killer in Spark?

1. reduceByKey
2. keyBy
3. groupByKey
4. aggregatebyKey
13. (1 point) Copy-on-write is a technique to:

1. Enable efficient immutable data
2. Implement filesystems
3. Share read-only buffers
4. All the above
14. (1 point) What is the correct signature for rightOuterJoin on Spark RDDs?

1. RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
2. RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
3. RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, W))]
4. RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
15. (1 point) A GFS chunkserver/HDFS datanode is responsible to:

1. Store filesystem path information
2. Split the data into partitions
3. Store data partitions
4. Replicate the data onto multiple disks