Things to remember about BDP

Big data processing

We have covered a lot of practical/engineering topics in this course.

Most of our work was about programming big data systems, but we also spent a lot of time understanding how those systems are engineered.

To successfully finish this course, you must be able to answer the questions in the following sections without thinking!

1. Big data

Why is big data important?
What do the 3Vs of big data mean?
What is the ETL cycle?
What is the difference between stream and batch processing?

2. Functional programming

What is the essence of FP?
What does \(f(x: A, y: [B]) \rightarrow C\) mean?
Why is laziness a virtue in BDP?
What is a monad and what is it used for?
How can we exploit immutability?

3. Data processing with FP

What is the difference between element-wise and aggregation operations?
What is the function signature for foldL?
What is the difference between reduceL and reduceR?
How can we implement map, filter, zip, etc. with reduce?
How can we implement a join between KV pairs?
(How) Can we re-write an SQL query with FP primitives?
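
As a reminder for the reduce and join questions above, here is a minimal sketch in Python (the language choice and the names map_with_reduce, filter_with_reduce and join_kv are assumptions made for illustration, not the course's reference implementation). It expresses map and filter as left folds over a list, and an inner join over lists of key/value pairs as a nested comprehension.

```python
from functools import reduce

def map_with_reduce(f, xs):
    # Left fold: start from an empty list and append f(x) for each element.
    return reduce(lambda acc, x: acc + [f(x)], xs, [])

def filter_with_reduce(pred, xs):
    # Left fold: keep x in the accumulator only when pred(x) holds.
    return reduce(lambda acc, x: acc + [x] if pred(x) else acc, xs, [])

def join_kv(left, right):
    # Naive inner join of two lists of (key, value) pairs on the key.
    # Quadratic in the input sizes; fine for illustration only.
    return [(k, (v, w)) for (k, v) in left for (k2, w) in right if k == k2]

doubled = map_with_reduce(lambda x: x * 2, [1, 2, 3])
evens = filter_with_reduce(lambda x: x % 2 == 0, [1, 2, 3, 4])
joined = join_kv([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")])
```

Building the result with acc + [x] copies the accumulator at every step, which keeps the fold free of mutation at the cost of quadratic time; an efficient version would use a mutable accumulator or a persistent list.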

4. Unix

What is a pipe(-line)?
Which map-like operations does Unix support?
Which reduce-like operations does Unix support?
How can we:
- Find all files that contain a pattern?
- Process data as they come?
- Compare file contents?
- Run commands in parallel?

5. Distributed systems

What is the key difference between distributed and parallel systems?
What does Amdahl’s law tell us?
What are the key problems with distributed systems?
How do we deal with time being unreliable?
How do we make decisions in distributed settings?
- How many nodes do we need?
What is the CAP theorem?
What types of guarantees does a linearisable system offer?
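
For the Amdahl's law question above, a small sketch may help fix the formula in memory. It assumes the usual statement of the law: if a fraction p of a program can be parallelised over n processors, the speedup is 1 / ((1 - p) + p / n). The function name amdahl_speedup and the example numbers are illustrative assumptions.

```python
def amdahl_speedup(p, n):
    # p: fraction of the work that can run in parallel (0 <= p <= 1)
    # n: number of processors (n >= 1)
    # The serial part (1 - p) is unaffected; the parallel part shrinks to p / n.
    return 1.0 / ((1.0 - p) + p / n)

def amdahl_limit(p):
    # Upper bound on the speedup as n grows without bound (requires p < 1).
    return 1.0 / (1.0 - p)

speedup_8 = amdahl_speedup(0.95, 8)
limit = amdahl_limit(0.95)
```

Even with 95% of the work parallelisable, the speedup can never exceed 1 / 0.05 = 20, no matter how many processors are added: the serial fraction dominates.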

6. Distributed databases and filesystems

Why do we need to replicate data?
What are the most common replication architectures?
Why do we need to partition datasets?
What are the most common transaction isolation levels?
What does ACID mean?
How does HDFS store a file?

7. Spark

What are Spark RDDs? Why was Spark so revolutionary?
What is the difference between RDDs and Pair RDDs? Why do we need both?
What are the key Spark API calls?
What are wide and narrow dependencies?
How does Spark deal with faults?
What types of partitioning can we employ for distributed systems like Spark?
How does Catalyst optimize queries?

8. Stream processing

When is a problem a data streaming problem?
Why do we need streaming windows?
What types of windows do we get with stream processing?
What is the difference between event, processing and ingestion time?
What is the difference between microbatching and stream processing?
What is the problem with state in streaming systems?
How can we disseminate events from producers to consumers?
How do we take consistent snapshots?
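
For the window questions above, the simplest window type can be sketched in a few lines. The sketch below assumes events are (event_time, value) pairs with integer timestamps and groups them into tumbling (fixed-size, non-overlapping) windows; the function name tumbling_windows and the sample events are hypothetical, and a real streaming system would also have to handle late and out-of-order events.

```python
from collections import defaultdict

def tumbling_windows(events, size):
    # events: iterable of (event_time, value) pairs, event_time an integer
    # size: window length in the same time unit as event_time
    windows = defaultdict(list)
    for ts, value in events:
        # Each event belongs to exactly one window, keyed by its start time.
        start = (ts // size) * size
        windows[start].append(value)
    # Return a plain dict ordered by window start.
    return dict(sorted(windows.items()))

events = [(1, "a"), (4, "b"), (5, "c"), (11, "d")]
by_window = tumbling_windows(events, 5)
```

Because the window is derived from the timestamp carried by the event, this groups by event time; using the arrival time at the processing node instead would give processing-time windows.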

Copyright

This work is (c) 2017, 2018, 2019 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.