# Instructions

• Fill in your answers on a separate answer sheet that will be automatically processed. Before you start, write down your name and student number on that sheet. Failure to do any of the above will result in the grade 0!

• Use the provided paper for notes etc. Don’t write notes on the exam paper or the answer sheet.

• Given $$C$$ correct answers and $$W$$ wrong answers, the grade $$G$$ will be determined by the follow formula: $G = \frac{C - \frac{W}{3}}{40} \times 10, G > 0$

• Not answering a question (i.e. leaving it blank) is not considered either a correct on an incorrect answer.

• For each question, there is only one answer that is correct.

You have 150 minutes (2.5 hours): Good Luck!

For some answers that could have been missinterpreted due to language ambiguities, 2 options are marked as accepted

# Exam questions

1. Consensus algorithms (e.g., Paxos or Raft):

1. enable 2 nodes to decide on something
2. work by replicating state machines <—
3. are based on vector clocks
4. enable a distributed system to agree on a common time
2. A GFS chunkserver/HDFS datanode is responsible to:

1. Store filesystem path information
2. Split the data into partitions
3. Store data partitions <—
4. Replicate the data onto multiple disks
3. Which of the following is the computation order of applying reduceL to a list of 10 integers with the ‘+’ operator?

1. (((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) <—
2. (((1 + 2) + (3 + 4)) + 5) + (((6 + 7) + (8 + 9)) + 0)
3. (1 + (2 + (3 + (4 + (5 + (6 + (7 + (8 + (9 + 0)))))))))
4. (1 + 2) + (3 + 4) + (5 + 6) + (7 + 8) + (9 + 0)
4. We need to calculate the average temperature of the last 10 minutes every 30 minutes from a stream of measurements taken every minute. What type of window do we need to use?

1. Sliding window
2. Tumbling window
3. Jumping window <—
4. Session window
5. We have the following datasets:

• A => All the cars currently traveling through an intersection.

• B => All the cars that have been parked in a garage for the last 3 months.

1. Both datasets are unbounded data sets
2. Both datasets are bounded data sets
3. A is an unbounded data set and B is a bounded data set <—
4. A is a bounded data set and B is an unbounded data set
6. What kind of stream will be produced by the following Flink code snippet:

dataStream.map(c => (c.id, 1)).keyBy(x => x._1).sum(1)
1. Stream of the sums per ID <—
2. Stream with a single number (the total sum)
3. Stream of increasing integers
4. Runtime Exception
7. Which of the following best describes synchronous replication?

1. The master reports success after its cache has been flushed.
2. The master reports success after a configurable number of slaves have confirmed writes. <—
3. The master reports success after a write was committed to its own disk.
4. The master reports success after a read operation on its own disk results in the same outcome as a read operation on the disk of any of the slaves.
8. Which of the following is true about Lamport timestamps and vector clocks?

1. Lamport timestamps and vector clocks both maintain total causal orders
2. To use Lamport timestamps or vector clocks, each node in a system needs to maintain a vector of size N, where N is the total number of nodes in the system
3. They both guarantee that if LP/VC(a) < LP/VC(b) for event a and b, then a must happen before b
4. Both guarantee that messages will arrive in order

1. What is the correct function signature for leftOuterJoin on Spark RDDs?

1. RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
2. RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
3. RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, W))]
4. RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, Option[W]))] <—
2. Which of the following function(s) is/are higher order?

• $$A(a: Z, b: X) \rightarrow X$$
• $$B(a:[T], x:T \rightarrow V) \rightarrow [V]$$
• $$C(a: T, f: V) \rightarrow T$$
1. Only A
2. A and B
3. A and C
4. Only B <—
3. Which higher order function does the following signature correspond to? $(xs: [A], f: (A, B) \rightarrow B, acc: B): B$

1. foldL
2. reduceByKey
3. foldR <—
4. zip
4. Which of the following statements about Watermarks in stream processing systems is not correct.

1. The timestamp of the watermark signifies that all events carrying a timestamp up to the watermark timestamp should have arrived.
2. Watermarks allow events that are out of order to be processed as long as the event arrives before the first watermark that has a later event-time than the event itself. <—
3. A watermark can trigger multiple tumbling windows at once.
4. The session gaps for session windows do not depend on what watermarks are used.
5. Given that t is the timestamp an event is processed by a stream processor, event time skew is:

1. t−s, where s is the timestamp of the latest event processed. <—
2. t−s, where s is the timestamp when the last window trigger was fired.
3. t−s, where s is the timestamp of the first event processed.
4. t−s, where s is the timestamp of the actual timestamp of an event.
6. Which of the following is not a replication architecture:

1. Master - Master
3. Master - Slave
4. Eventually consistent <—
7. Which of the following statements is false? In the context of Big Data Processing, ETL pipelines:

1. Usually feed machine learning systems
2. Consume most of the developer/analyst project time
3. It is the job of managers and CEOs to design, code and execute them <—
4. Require special systems due to high volumes of data
8. Which of the following methods is part of the Observer interface, for dealing with push-based data consumption?

1. def subscribe(obs: Observer[A]): Unit
2. def onNext(a: A): Unit <—
3. def map(f: (A) -> B): [B]
4. def onExit(): Unit
9. A transformation in Spark:

1. Schedules a pipeline run <—
2. Runs a pipeline and reports a result
3. Triggers a data shuffle
4. Does nothing <—
10. What is Byzantine fault tolerance?

1. Resilience against multiple node failures
2. Resilience against node failures when electing cluster leaders
3. Resilience against suboptimal voting <—
4. A mechanism used by the Byzantine empire for the good of its people
11. Consider a cluster of 5 machines running HDFS (1 namenode, 4 datanodes). Each node in the cluster has a total of 1TB hard disk space and 128GB of main memory available. The cluster uses a block-size of 64 MB and a replication factor of 3. The master maintains 100 bytes of metadata for each 64MB block. Imagine that we upload a 128GB file. How much data does each datanode store?

1. 32GB
2. 64GB
3. 96GB <—
4. 128GB
12. Which of the following statements is not true? An operating system kernel:

1. Provides a unified interface to hardware devices
2. Is the main library loaded when the computer starts <—
3. Contains and runs device drivers
4. Splits the computer resources to multiple users
13. When multiple senders/receivers are involved, we need external ordering scheme. Which type of order is dependent on “happens before” relationships?

1. Reflexive order
2. Total order
3. Causal order <—
4. Asymptotic order
14. Which of the following statements about microbatching (in streaming systems) is correct?

1. Every event is processed as soon as it arrives
2. The engine buffers events before processing them further <—
3. Microbatching allows element-wise operations to be executed per event
4. Microbatching enables us to process batch data in a streaming fashion
15. In the case of Spark, narrow dependencies:

1. Are expensive, as they require extensive data shuffles
2. Represent element-wise operations <—
3. Allow for recalculating parts of the RDD, if a node dies
4. Trigger computations
16. What is the correct function signature for reduce on Spark RDDs?

1. RDD[A].reduce(f: (A,B) -> B)
2. RDD[A].reduce(f: (A,A) -> A) <—
3. RDD[A].reduce(init: B, seqOp: (A, B) -> A, combOp: (B, B) -> B)
4. RDD[A].reduce(init:A, f: (A,A) -> A)
17. Vector clocks

1. enable us to establish partial causal ordering of events <—
2. require space $$O(n^2)$$ for $$n$$ processes
3. given 2 events A and B, enable us to determine which one happened first <—
4. are a protocol for replicating time machines
18. Choose the correct implementation of the Monad interface in Scala:

1.     trait Monad[M[_]] {
def unit[S](a: S) : M[S]
def flatMap[S, T] (m: M[S], f: S => M[T]) : Monad[T]
}
<—
2.       trait Monad[M[_]] {
def unit[S](a: S) : M[S]
def map[T] (m: M[S], f: S => M[T]) : Monad[T]
}
3.    trait Monad[M[_]] {
def map[T] (m: M[S], f: S => M[T]) : Monad[T]
def reduce[T,B](init: B, f: (B,T) => M[B]): Monad[B]
}
4.    trait Monad[M[_]] {
def map[T] (m: M[S], f: S => M[T]) : Monad[T]
def flatMap[S, T] (m: M[S], f: S => M[T]) : T
}
19. An (in-memory) immutable data structure:

1. Updates values in place when updates are requested by the client
2. Can be safely shared across multiple machines
3. Can be safely shared across multiple threads <—
4. None of the above
20. What is eventual consistency?

1. At any time, the system is linearisable
2. At any time, concurrent reads from any node return the same values
3. If writes stop, all reads will return the same value after a while <—
4. If writes stop, a distributed system will become consistent
21. Immutability enables us to:

1. Modify data with multiple threads
2. Create classes that do not change
4. All the above
22. Which higher order function does the following code snippet correspond to:

  def f[A](xs: List[A], ys: List[B]) : List[(A, B)] = (xs, ys) match {
case (_, Nil) => Nil
case (Nil, _) => Nil
case (a :: xss, b :: yss) => (a, b) :: f(xss, yss)
}
1. foldL
2. reduceR
3. scanL
4. zip <—
23. Given the following value in Scala:

val statement = List(List("some", "questions"), List("are", "difficult"))

which of the following sequences of function calls would converted it to the string: “some questions are difficult”

1. statement.flatMap(x => x).flatten
2. statement.flatten.reduce((a, b) => a + " " + b) <—
3. statement.foldLeft(List[String]())((x : List[String], y : List[String]) => x ::: y)
4. statement.reduceByValue(_ + _)
24. What does Amdahl’s law prescribe?

1. A way to determine how many processors we need
2. A way to compute the theoretical speedup a program get when using multiple CPUs <—
3. A way to compute what portion of a program is parallel
4. A way to determine whether we need to scale up or out
25. Multi-leader systems have issues with write conflicts. Which of the following is the most plausible way of resolving them?

1. Maintain a write lock. Only one client at any moment can acquire it.
2. Use Lamport timestamps: this will help determine the order of incoming writes.
3. Report conflicts to clients and let clients handle them. <—
4. Move conflicting writes to another cluster leader.
26. A Unix pipe (denoted by |) enables us to:

1. Move text from the input of a command to the output of another command
2. Copy data from one location on the hard drive to another
3. Move structured records between two commands
4. Move unstructured data between two commands <—
27. Which of the following statements are true?

1. The C in ACID and CAP mean the same thing
1. Big data engineering is concerned with building pipelines, while big data analytics is concerned with discovering patterns, so we could say that big data analytics is the same as machine learning (ML), which also involves discovering patterns
1. Only a is true
2. Only b is true
3. Both are true
4. Both are false <—
28. Given that file.txt is a two column CSV file, what does the following Unix command do?

  $sed -e 's/^$$.*$$,$$.*$$$/\2 \1/' < file.txt | sort
1. Changes the order of columns in a 2 column file, then sorts by column 1 <—
2. Concatenates columns 1 and 2, then sorts by column 2 <—
3. Removes columns 1 and 2, then sort by whatever remains
4. Appends columns 1 and 2 to file.txt and then sorts it
29. For our new budget-constrained startup, called DelayedGram, we are building an application that serves millions of cat videos to millions of users in parallel. Which architecture would be more suitable?

1. A queue that stores and forwards a user request to a centralized database, which also contains the videos
2. A web server that serves videos using a cloud-based object store (e.g. Amazon S3) <—
3. A stream processor that registers the request and starts streaming video bytes to the user
4. An in-memory cache that holds all videos, served by a web server
30. Why do Spark RDDs contain lineage information?

1. To enable fault tolerance <—
2. To increase parallelism
3. To deal with issues in data partitioning
4. To enable distributed snapshots
31. Which one of the following statements is true, in the context of Unix?

1. awk enables reduce-like operations <—
2. sed enables reduce-like operations
3. xargs enables reduce-like operations
4. ls enables reduce-like operations
32. Which of the following RDD API calls is a performance killer in Spark?

1. reduceByKey
2. keyBy
3. groupByKey <—
4. aggregatebyKey