General description

The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.

Big Data Processing provides an introduction to systems used to process Big Data. The main focus of the course is programming and engineering big data systems. Initially, the course explores general programming primitives that are common across big data systems and touches upon distributed systems. It then examines in detail the implementation of data analysis algorithms in Spark, in the context of batch and graph processing applications, and in Flink, in the context of streaming applications.
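As a rough illustration of the kind of general programming primitives mentioned above (map, filter, reduce), the following minimal sketch counts words in plain Python. It is not taken from the course material: the function name `word_count`, the `min_length` parameter, and the sample input are assumptions made for this example. Systems such as Spark expose similarly named operations on distributed collections.

```python
from functools import reduce


def word_count(lines, min_length=1):
    """Count word occurrences using map / filter / reduce (illustrative only)."""
    # "flatMap" step: split every line into individual words.
    words = [word for line in lines for word in line.split()]
    # filter step: drop words shorter than min_length (hypothetical parameter).
    kept = filter(lambda word: len(word) >= min_length, words)
    # map step: pair every word with an initial count of 1.
    pairs = map(lambda word: (word, 1), kept)

    # reduce step: fold the (word, 1) pairs into a dictionary of totals.
    def add_pair(counts, pair):
        word, n = pair
        updated = dict(counts)  # copy, so the fold does not mutate its input
        updated[word] = updated.get(word, 0) + n
        return updated

    return reduce(add_pair, pairs, {})


# Sample input invented for this sketch.
sample = ["big data", "fast data", "big systems"]
counts = word_count(sample)
```

The same three steps apply whether the collection lives in memory on one machine or is partitioned across a cluster, which is why such primitives carry over between systems.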

The course is also optional for the Minor “Software Design and Application”.

Learning objectives

After the end of the course, all students should be able to:

Course Organization

Assignments

You can find the course assignments linked through this page. All assignments are mandatory.

The assignments are submitted through CPM.

The student groups must submit each assignment before 23:59 on the day of the deadline.

The assignments are automatically graded.

Late submission: All submissions must be handed in on time, with no exceptions. Any late submission will be discarded and graded with 0. In case of provable sickness, please contact the course teacher to arrange a case-specific deadline.

Contents

Week | Date  | Topic | Teacher | Assignment (Deadline)
1    | 14/11 | Course introduction, Big and Fast data, Intro to course PLs | GG |
1    | 15/11 | The Unix programming environment | GG | Unix (28/11)
2    | 21/11 | Programming for Big Data (1) | GG | Functional programming (4/12)
2    | 22/11 | Programming for Big Data (2) | GG |
3    | 28/11 | Distributed Systems | GG |
3    | 29/11 | Distributed Databases, Distributed filesystems | GG |
4    | 5/12  | Spark: RDDs and Pair RDDs | GG | Spark (18/12)
4    | 6/12  | Spark Internals | JR |
5    | 12/12 | Spark SQL, Spark use cases: Synonyms with Word2Vec, Recommending bands, Predicting pull request merges | GG |
5    | 13/12 | Graphs | GG |
6    | 19/12 | Stream processing | GG | Streaming (14/1)
6    | 20/12 | Stream processing systems | GG |
7    | 8/1   | Recap | GG |
7    | 9/1   | No lecture | GG |

Teachers

  • GG: Georgios Gousios
  • JR: Jan Rellermeyer

Assessment

Example exam material

Resit policy

There will be an exam-only resit during Q3/4. You are allowed to transfer your assignment grade as a whole. This means that you will not be able to re-submit individual assignments. Effectively, you can only resit your written exam.


The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found.

Bibliography

[1] I. Robinson, J. Webber, and E. Eifrem, Graph databases: New opportunities for connected data. O’Reilly Media, Inc., 2015.

[2] C. Martella, R. Shaposhnik, D. Logothetis, and S. Harenberg, Practical graph analytics with Apache Giraph. Springer, 2015.

[3] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.

[4] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.

[5] B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.

[6] M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.

[7] H. Karau and R. Warren, High performance Spark. O’Reilly Media, Inc., 2017.

[8] J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.

[9] T. Akidau, S. Chernyak, and R. Lax, Streaming systems: The what, where, when, and how of large-scale data processing. O’Reilly Media, Inc., 2018.