Big data processing
We have covered a lot of practical/engineering topics in this course.
Most of the work we have done was about programming big data systems, but we also spent a lot of time understanding how those systems are engineered.
To successfully finish this course, you must be able to answer the questions in the following sections without thinking!
1. Big data
- Why is big data important?
- What do the 3Vs of big data mean?
- What is the ETL cycle?
- What is the difference between stream and batch processing?
2. Functional programming
- What is the essence of FP?
- What does \(f(x: A, y: [B]) \rightarrow C\) mean?
- Why is laziness a virtue in BDP?
- What is a Monad and what is it used for?
- How can we exploit immutability?
3. Data processing with FP
- What is the difference between reduceL and reduceR?
- How can we implement map, filter, zip etc. with reduce?
- How can we implement a join between KV pairs?
- (How) Can we re-write an SQL query with FP primitives?
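As a self-check for the questions above, here is a minimal Python sketch of left and right folds, with map, filter and a KV join built on top of them. The names reduce_l, reduce_r, map_, filter_ and join are chosen here for illustration and are assumptions, not the course's reference implementation; the join is an inner join that builds a lookup table from the right-hand side.

```python
from functools import reduce


def reduce_l(f, xs, init):
    """Left fold: f(f(f(init, x1), x2), x3)."""
    return reduce(f, xs, init)


def reduce_r(f, xs, init):
    """Right fold: f(x1, f(x2, f(x3, init)))."""
    return reduce(lambda acc, x: f(x, acc), reversed(list(xs)), init)


def map_(f, xs):
    """map expressed as a left fold that appends f(x) to the accumulator."""
    return reduce_l(lambda acc, x: acc + [f(x)], xs, [])


def filter_(p, xs):
    """filter expressed as a left fold that keeps x only when p(x) holds."""
    return reduce_l(lambda acc, x: acc + [x] if p(x) else acc, xs, [])


def join(left, right):
    """Inner join of two lists of (key, value) pairs (hypothetical helper)."""
    # Fold the right side into a dict: key -> list of values.
    index = reduce_l(
        lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], []) + [kv[1]]},
        right,
        {},
    )
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]


# Subtraction is not associative, so the two folds disagree.
left_result = reduce_l(lambda acc, x: acc - x, [1, 2, 3], 0)   # ((0 - 1) - 2) - 3
right_result = reduce_r(lambda x, acc: x - acc, [1, 2, 3], 0)  # 1 - (2 - (3 - 0))
```

Note that the argument order of the combining function differs: the left fold passes the accumulator first, the right fold passes it last. For associative operations with a neutral initial value, such as addition with 0, both folds return the same result.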
4. Distributed systems
- What is the key difference between distributed and parallel systems?
- What does Amdahl’s law tell us?
- What are the key problems with distributed systems?
- How do we deal with time being unreliable?
- How do we make decisions in distributed settings?
- How many nodes do we need?
- What is the CAP theorem?
- What types of guarantees does a linearisable system offer?
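To make the Amdahl’s law question concrete, the sketch below computes the theoretical speedup 1 / ((1 - p) + p / n) for a program in which a fraction p of the work can be parallelised over n workers. The function name amdahl_speedup and its parameter names are assumptions made for this example.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can be parallelised (0..1).
    workers: number of processing units the parallel part is spread over.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 95% parallel work, adding workers helps less and less:
# the speedup can never exceed 1 / 0.05 = 20.
speedups = [amdahl_speedup(0.95, n) for n in (1, 8, 64, 1024)]
```

The serial fraction puts a hard ceiling on the achievable speedup, no matter how many workers are added.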
5. Distributed databases
- Why do we need to replicate data?
- What are the most common replication architectures?
- Why do we need to partition datasets?
- What are the most common transaction isolation levels?
- What does ACID mean?
- What is the difference between linearisability and transaction isolation?
6. Spark
- What are Spark RDDs? Why was Spark so revolutionary?
- What is the difference between RDDs and Pair RDDs? Why do we need both?
- What are the key Spark API calls?
- What are wide and narrow dependencies?
- How does Spark deal with faults?
- What types of partitioning can we employ for distributed systems like Spark?
- How does Catalyst optimize queries?
7. Stream processing
- When is a problem a data streaming problem?
- Why do we need streaming windows?
- What types of windows do we get with stream processing?
- What is the difference between event, processing and ingestion time?
- What is the difference between microbatching and stream processing?
- What is the problem with state in streaming systems?
- How can we disseminate events from producers to consumers?
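For the window questions above, here is a small Python sketch that assigns events to tumbling and sliding windows by event time. It assumes integer timestamps starting at 0 and a bounded list of events held in memory; the function names are hypothetical, and a real streaming system would also have to handle late events and window state.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Group (event_time, value) pairs into non-overlapping windows.

    Each window covers [start, start + size) and is keyed by its start time.
    """
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size
        windows[start].append(value)
    return dict(windows)


def sliding_windows(events, size, slide):
    """Group (event_time, value) pairs into overlapping windows.

    A new window of length `size` starts every `slide` time units, so one
    event can belong to several windows.
    """
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // slide) * slide  # latest window start that is <= ts
        while start > ts - size:       # window [start, start + size) contains ts
            if start >= 0:             # assumption: time begins at 0
                windows[start].append(value)
            start -= slide
    return dict(windows)


events = [(1, "a"), (4, "b"), (5, "c"), (11, "d")]
tumbling = tumbling_windows(events, size=5)
sliding = sliding_windows(events, size=10, slide=5)
```

With these events, each value lands in exactly one tumbling window, while "c" and "d" appear in two sliding windows each because the windows overlap.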