The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.
Big Data Processing provides an introduction to systems used to process Big Data. The main focus of the course is understanding the underpinnings of big data systems, and learning to program and engineer them. Initially, the course explores general programming primitives that span big data systems and touches upon distributed systems. Then, the course examines in detail the implementation of data analysis algorithms in Spark, in the context of batch processing applications, and in Flink, in the context of streaming applications.
After the end of the course, all students should be able to:
5 ECTS: This means that you need to devote at least 140 hours of study to this course.
Online meetings: The course consists of 12 two-hour meetings. Attendance is not required, but strongly encouraged.
Homework: In the homework assignments, you will have to write code or answer open questions. You will always work in pairs.
Groups: Students are responsible for forming pairs and communicating them to the course TAs by registering them in CPM.
Labs: 4 hours per week, designed to help you work together and get support from teaching assistants.
Teaching Assistants: Teaching assistants will be available during lab hours to provide you with feedback on your assignments. Do be active in asking questions, but don’t expect them to provide you with solutions.
|Week|Date|Topic|Lecturer|Assignment (deadline)|
|---|---|---|---|---|
|1|2/9|Course introduction, Big and Fast data, Intro to course PLs|GG||
|1|4/9|The Unix programming environment, Diomidis’s slides|DS|Unix (16/09)|
|2|9/9|Programming for Big Data (1)|GG||
|2|11/9|Programming for Big Data (2)|GG|Functional programming (30/09)|
|5|30/9|Spark: RDDs and Pair RDDs|GG||
|6|7/10|Spark SQL, Spark use cases: Synonyms with Word2Vec, Recommending bands, Predicting pull request merges|GG|Spark (21/10)|
|7|16/10|Stream processing systems|GG|Flink (01/11)|
|8|21/10|Data engineering on the cloud|GG||
|9|28/10|Recap, Answers to recap questions (Quintin van Leersum and Mikhail Epifanov)|||
Portions of this course have been converted to online educational material by other TU Delft teachers. Please take a look at the following EdX MOOCs / ProfEds:
Use them at your discretion to improve your skills.
(TU Delft only): You can find the Collegerama recordings from 2019 here. Please note that the course contents have slightly changed this year, so do not base your exam preparation on the old lectures.
You can find the course assignments on Brightspace and linked through this page. There will be 4 assignments instead of 5 due to circumstances; the assignment about distributed systems has been dropped.
All assignments are mandatory.
For submission, we will use CPM. The course name is CSE2520: Big Data Processing.
The student groups must submit each assignment before 23:59 on the day of the deadline.
Late submission: All submissions must be handed in on time, with no exceptions. Any late submission will be discarded and graded with a 0. In case of provable sickness, please contact the course teacher to arrange a case-specific deadline.
Lab assignments (40%): The grade is calculated as the mean grade of all assignments. There is no minimum grade per individual assignment. If you don’t submit an assignment, or the submission is late, you will get a 0. Each assignment counts for 25% of the lab part. The final lab grade has a minimum of 5.
Written Exam (60%): Closed-book exam, multiple choice. Minimum grade: 5
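The grading rules above can be sketched as a small calculation. This is only an illustrative sketch, not the official grading procedure: the function names are hypothetical, and it assumes that the "minimum of 5" applies as a pass threshold to both the lab grade and the exam grade, and that no rounding is applied.

```python
LAB_WEIGHT = 0.4       # lab assignments count for 40%
EXAM_WEIGHT = 0.6      # written exam counts for 60%
MIN_PART_GRADE = 5.0   # assumed pass threshold for each part


def lab_grade(assignment_grades):
    """Mean over all assignments; enter 0 for a missing or late submission."""
    return sum(assignment_grades) / len(assignment_grades)


def final_grade(assignment_grades, exam_grade):
    """Weighted final grade, or None if either part is below the minimum."""
    lab = lab_grade(assignment_grades)
    if lab < MIN_PART_GRADE or exam_grade < MIN_PART_GRADE:
        return None
    return LAB_WEIGHT * lab + EXAM_WEIGHT * exam_grade


# Example: four assignments, one of them late (graded 0), exam grade 7.
print(final_grade([8, 7, 0, 9], 7.0))
```

In this example the lab grade is (8 + 7 + 0 + 9) / 4 = 6, so a single missed assignment lowers the lab part considerably but does not by itself fail it.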
There will be an exam-only resit during Q2/3. You are allowed to transfer your assignment grade to the resit as a whole. This means that you will not be able to re-submit individual assignments. Effectively, you can only resit your written exam.
The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found. If you were to buy a single book about this course, I would recommend .
M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.
I. Robinson, J. Webber, and E. Eifrem, Graph databases: New opportunities for connected data. Springer, 2015.
C. Martella, R. Shaposhnik, D. Logothetis, and S. Harenberg, Practical graph analytics with Apache Giraph. Springer, 2015.
S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.
H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.
B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.
H. Karau and R. Warren, High performance Spark. O’Reilly Media, Inc., 2017.
J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.
T. Akidau, S. Chernyak, and R. Lax, Streaming systems: The what, where, when, and how of large-scale data processing. O’Reilly Media, Inc., 2018.
This work is (c) 2017, 2018, 2019, 2020 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.