Big data processing
We have covered a lot of practical and engineering topics in this
course.
Most of the work we have done was about programming big data
systems, but we also spent a lot of time understanding how those systems are
engineered.
To successfully finish this course, you must be able to answer the
questions in the following sections without thinking!
1. Big data
- Why is big data important?
- What do the 3Vs of big data mean?
- What is the ETL cycle?
- What is the difference between stream and batch processing?
2. Functional programming
- What is the essence of FP?
- What does \(f(x: A, y: [B]) \rightarrow C\) mean?
- Why is laziness a virtue in BDP?
- What is a monad and what is it used for?
- How can we exploit immutability?
3. Data processing with FP
- What is the difference between element-wise and aggregation
operations?
- What is the function signature for foldL?
- What is the difference between reduceL and reduceR?
- How can we implement map, filter, zip, etc. with reduce?
- How can we implement a join between KV pairs?
- (How) Can we re-write an SQL query with FP primitives?
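As a hint for the reduce and join questions above, here is a minimal sketch in Python. It assumes `functools.reduce` as a stand-in for a left fold. The names `map_r`, `filter_r` and `join_kv` are hypothetical helpers chosen for illustration, not functions from the course material.

```python
from functools import reduce

def map_r(f, xs):
    # Left fold: start from an empty list and append f(x) for each element.
    return reduce(lambda acc, x: acc + [f(x)], xs, [])

def filter_r(pred, xs):
    # Left fold: keep x in the accumulator only when pred(x) holds.
    return reduce(lambda acc, x: acc + [x] if pred(x) else acc, xs, [])

def join_kv(left, right):
    # Inner join of two lists of (key, value) pairs.
    # First group the right-hand values by key, then pair each
    # left-hand value with every right-hand value sharing its key.
    index = {}
    for k, w in right:
        index.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]
```

For example, `join_kv([("a", 1)], [("a", "x"), ("a", "y")])` pairs the single left value with both right values for key `"a"`. Building a new list at every fold step keeps the accumulator immutable but costs quadratic time; a real system would avoid that.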
4. Unix
- What is a pipe(-line)?
- Which map-like operations does Unix support?
- Which reduce-like operations does Unix support?
- How can we:
  - Find all files that contain a pattern?
  - Process data as they come?
  - Compare file contents?
  - Run commands in parallel?
5. Distributed systems
- What is the key difference between distributed and parallel
systems?
- What does Amdahl’s law tell us?
- What are the key problems with distributed systems?
- How do we deal with time being unreliable?
- How do we make decisions in distributed settings?
- How many nodes do we need?
- What is the CAP theorem?
- What are the different consistency models?
- What is causal consistency?
- What is sequential consistency and what is linearisability?
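For the Amdahl's law question above, the standard formula can be checked numerically with a small sketch. The function name `amdahl_speedup` and its parameters are assumptions made for illustration: `p` is the fraction of the work that can be parallelised and `n` is the number of processors.

```python
def amdahl_speedup(p, n):
    # Amdahl's law: the serial fraction (1 - p) is not sped up,
    # while the parallel fraction p is divided across n processors.
    return 1.0 / ((1.0 - p) + p / n)
```

With `p = 0.95`, the speedup can never exceed 20 no matter how large `n` becomes, because the remaining 5% of serial work dominates.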
6. Distributed databases and filesystems
- Why do we need to replicate data?
- What are the most common replication architectures?
- Why do we need to partition datasets?
- What are the most common transaction isolation levels?
- How does HDFS store a file?
7. Spark
- What are Spark RDDs? Why was Spark so revolutionary?
- What is the difference between RDDs and Pair RDDs? Why do we need
both?
- What are the key Spark API calls?
- What are wide and narrow dependencies?
- How does Spark deal with faults?
- What types of partitioning can we employ for distributed systems like
Spark?
- How does Catalyst optimize queries?
8. Stream processing
- When is a problem a data streaming problem?
- Why do we need streaming windows?
- What types of windows do we get with stream processing?
- What is the difference between event, processing and ingestion
time?
- What is the difference between microbatching and stream
processing?
- What is the problem with state in streaming systems?
- How can we disseminate events from producers to consumers?
- How do we take consistent snapshots?
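To make the window questions above concrete, here is a minimal sketch of one window type, a tumbling (fixed-size, non-overlapping) window keyed on event time. It assumes events arrive as `(event_time, value)` pairs with integer timestamps; the function name `tumbling_window_counts` is hypothetical, and a real streaming engine would also have to handle late and out-of-order events.

```python
from collections import defaultdict

def tumbling_window_counts(events, size):
    # Assign each event to the window starting at the largest
    # multiple of `size` that is <= its event time, then count
    # the events that fall into each window.
    counts = defaultdict(int)
    for event_time, _value in events:
        window_start = (event_time // size) * size
        counts[window_start] += 1
    return dict(counts)
```

With a window size of 5, events at times 1 and 4 fall into the window starting at 0, while an event at time 5 opens the next window.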
9. Graph processing
- What is the best way of representing graphs in memory and in a
distributed system?
- How can we traverse a graph stored in an SQL database?
- What is the bulk synchronous parallel model?
- How does Pregel implement the BSP model?