Fill in your answers on a separate answer sheet that will be automatically processed. Before you start, write your name and student number on that sheet. Failure to do any of the above will result in a grade of 0!
Use the provided paper for notes etc. Don’t write notes on the exam paper or the answer sheet.
Given \(C\) correct answers and \(W\) wrong answers, the grade \(G\) will be determined by the following formula: \[ G = \frac{C - \frac{W}{3}}{40} \times 10, \quad G > 0 \]
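For example, a student with \(C = 30\) correct and \(W = 6\) wrong answers receives \( G = \frac{30 - \frac{6}{3}}{40} \times 10 = 7 \).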
Not answering a question (i.e. leaving it blank) is not considered either a correct or an incorrect answer.
For each question, there is only one answer that is correct.
You have 150 minutes (2.5 hours): Good Luck!
For some answers that could have been misinterpreted due to language ambiguities, two options are marked as accepted.
Consensus algorithms (e.g., Paxos or Raft):
A GFS chunkserver/HDFS datanode is responsible for:
Which of the following is the computation order of applying reduceL to a list of 10 integers with the ‘+’ operator?
We need to calculate the average temperature of the last 10 minutes every 30 minutes from a stream of measurements taken every minute. What type of window do we need to use?
We have the following datasets:
A => All the cars currently traveling through an intersection.
B => All the cars that have been parked in a garage for the last 3 months.
A is an unbounded data set and B is a bounded data set <—
A is a bounded data set and B is an unbounded data set
What kind of stream will be produced by the following Flink code snippet:
dataStream.map(c => (c.id, 1)).keyBy(x => x._1).sum(1)
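For context, a minimal self-contained sketch of how this snippet could be run, assuming Flink's Scala DataStream API; the Car case class, the sample elements, and the job name are hypothetical, and only the final chained line is taken from the question:

import org.apache.flink.streaming.api.scala._

case class Car(id: String)

object RollingCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream: DataStream[Car] =
      env.fromElements(Car("a"), Car("b"), Car("a"))
    // Per key, sum(1) emits an updated running total for every incoming
    // element, i.e. a continuously updated (rolling) aggregate.
    dataStream.map(c => (c.id, 1)).keyBy(x => x._1).sum(1).print()
    env.execute("rolling-count")
  }
}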
Which of the following best describes synchronous replication?
Which of the following is true about Lamport timestamps and vector clocks?
This question only had wrong answers, so it was removed
What is the correct function signature for leftOuterJoin on Spark RDDs?
RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, W))]
RDD[(K,V)].leftOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, Option[W]))] <—
Which of the following function(s) is/are higher order?
Which higher order function does the following signature correspond to? \[ (xs: [A], f: (A, B) \rightarrow B, acc: B): B \]
foldL
reduceByKey
foldR <—
zip
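For reference, a minimal Scala sketch of a function with exactly this signature (the name foldR follows the option above; the example call is illustrative):

def foldR[A, B](xs: List[A], f: (A, B) => B, acc: B): B = xs match {
  // An empty list folds to the accumulator itself.
  case Nil    => acc
  // Otherwise combine the head with the fold of the tail, so the
  // accumulator is built up from the right end of the list.
  case h :: t => f(h, foldR(t, f, acc))
}

// e.g. foldR(List(1, 2, 3), (a: Int, b: Int) => a + b, 0) == 6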
Which of the following statements about Watermarks in stream processing systems is not correct?
Given that t is the timestamp at which an event is processed by a stream processor, event time skew is:
Which of the following is not a replication architecture:
Which of the following statements is false? In the context of Big Data Processing, ETL pipelines:
Which of the following methods is part of the Observer interface, for dealing with push-based data consumption?
def subscribe(obs: Observer[A]): Unit
def onNext(a: A): Unit <—
def map(f: (A) -> B): [B]
def onExit(): Unit
A transformation in Spark:
What is Byzantine fault tolerance?
Consider a cluster of 5 machines running HDFS (1 namenode, 4 datanodes). Each node in the cluster has a total of 1TB hard disk space and 128GB of main memory available. The cluster uses a block-size of 64 MB and a replication factor of 3. The master maintains 100 bytes of metadata for each 64MB block. Imagine that we upload a 128GB file. How much data does each datanode store?
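One way to work the arithmetic (a sketch, assuming block replicas are spread evenly over the 4 datanodes and the namenode stores no block data): the file expands to \(128\,\text{GB} \times 3 = 384\,\text{GB}\) of replicated data, so each datanode stores \( \frac{384\,\text{GB}}{4} = 96\,\text{GB} \).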
Which of the following statements is not true? An operating system kernel:
When multiple senders/receivers are involved, we need an external ordering scheme. Which type of order is dependent on “happens before” relationships?
Which of the following statements about microbatching (in streaming systems) is correct?
In the case of Spark, narrow dependencies:
What is the correct function signature for reduce on Spark RDDs?
RDD[A].reduce(f: (A,B) -> B)
RDD[A].reduce(f: (A,A) -> A) <—
RDD[A].reduce(init: B, seqOp: (A, B) -> A, combOp: (B, B) -> B)
RDD[A].reduce(init:A, f: (A,A) -> A)
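For reference, a hypothetical local usage of the marked signature (the object and app names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ReduceDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("reduce-demo"))
    // reduce combines elements pairwise with a function of type (A, A) => A.
    val total = sc.parallelize(1 to 10).reduce((a, b) => a + b)
    println(total) // 55
    sc.stop()
  }
}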
Vector clocks
Choose the correct implementation of the Monad interface in Scala:
trait Monad[M[_]] {
  def unit[S](a: S) : M[S]
  def flatMap[S, T] (m: M[S], f: S => M[T]) : Monad[T]
}
trait Monad[M[_]] {
  def unit[S](a: S) : M[S]
  def map[T] (m: M[S], f: S => M[T]) : Monad[T]
}
trait Monad[M[_]] {
  def map[T] (m: M[S], f: S => M[T]) : Monad[T]
  def reduce[T,B](init: B, f: (B,T) => M[B]): Monad[B]
}
trait Monad[M[_]] {
  def map[T] (m: M[S], f: S => M[T]) : Monad[T]
  def flatMap[S, T] (m: M[S], f: S => M[T]) : T
}
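For background only (the options above differ from it in small ways), the usual textbook rendering of the interface in Scala has flatMap return M[T]:

trait Monad[M[_]] {
  // unit lifts a plain value into the monad.
  def unit[S](a: S): M[S]
  // flatMap sequences a computation that itself returns a monadic value.
  def flatMap[S, T](m: M[S], f: S => M[T]): M[T]
}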
An (in-memory) immutable data structure:
What is eventual consistency?
Immutability enables us to:
Which higher order function does the following code snippet correspond to:
def f[A, B](xs: List[A], ys: List[B]) : List[(A, B)] = (xs, ys) match {
  case (_, Nil) => Nil
  case (Nil, _) => Nil
  case (a :: xss, b :: yss) => (a, b) :: f(xss, yss)
}
foldL
reduceR
scanL
zip <—
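For intuition: with the snippet above, f(List(1, 2, 3), List("a", "b")) evaluates to List((1, "a"), (2, "b")), pairing elements until the shorter list is exhausted.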
Given the following value in Scala:
val statement = List(List("some", "questions"), List("are", "difficult"))
which of the following sequences of function calls would convert it to the string “some questions are difficult”?
statement.flatMap(x => x).flatten
statement.flatten.reduce((a, b) => a + " " + b) <—
statement.foldLeft(List[String]())((x : List[String], y : List[String]) => x ::: y)
statement.reduceByValue(_ + _)
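For intuition, the marked option evaluates in two steps (a sketch that can be pasted into a Scala REPL):

val statement = List(List("some", "questions"), List("are", "difficult"))
// Step 1: flatten concatenates the inner lists into
//   List("some", "questions", "are", "difficult").
// Step 2: reduce joins neighbouring strings with a space, yielding
//   "some questions are difficult".
statement.flatten.reduce((a, b) => a + " " + b)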
What does Amdahl’s law prescribe?
Multi-leader systems have issues with write conflicts. Which of the following is the most plausible way of resolving them?
A Unix pipe (denoted by |) enables us to:
Which of the following statements are true?
Given that file.txt is a two column CSV file, what does the following Unix command do?
$ sed -e 's/^\(.*\),\(.*\)$/\2 \1/' < file.txt | sort
Swaps the two columns of file.txt and then sorts it
For our new budget-constrained startup, called DelayedGram, we are building an application that serves millions of cat videos to millions of users in parallel. Which architecture would be more suitable?
Why do Spark RDDs contain lineage information?
Which one of the following statements is true, in the context of Unix?
awk enables reduce-like operations <—
sed enables reduce-like operations
xargs enables reduce-like operations
ls enables reduce-like operations
Which of the following RDD API calls is a performance killer in Spark?
reduceByKey
keyBy
groupByKey <—
aggregateByKey