General description

The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.

Big Data Processing provides an introduction to the systems and algorithms used to process Big Data. The main focus of the course is programming and engineering big data systems. Initially, the course explores general programming primitives that are common across big data systems and touches upon distributed data storage systems. It then examines in detail the implementation of data analysis algorithms in Hadoop (Map/Reduce) and Spark, in the context of batch, streaming, and graph processing applications.
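To give a flavour of the Map/Reduce style mentioned above, the following is a minimal single-machine sketch of a word count in plain Python. It is an illustration only, not course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real Hadoop or Spark job would distribute these phases across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the map, shuffle, and reduce steps in sequence on local data."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "fast data is fast"])
print(counts)
```

The point of the split is that `map_phase` and `reduce_phase` only ever see one line or one key at a time, which is what allows a framework to run them in parallel.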

Every week, students complete an assignment consisting mostly of coding exercises. To liven things up, the last assignment includes an optional programming/performance competition, similar in style to the popular Terasort benchmark.

The course is also optional for the Minor “Software Design and Application”. Part of the course is thus dedicated to basic data processing.

Learning objectives

[all students] After the end of the course, all students should be able to:

[BSc students]

  • Describe in which scenarios streaming algorithms are most applicable
  • Apply basic streaming algorithms to practical problems
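As a hedged illustration of what a basic streaming algorithm can look like, the sketch below computes a running mean and variance in a single pass with constant memory (Welford's method). The course page does not say which streaming algorithms are covered, so this choice and the class name `RunningStats` are assumptions made for the example.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method) using O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new item from the stream into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the items seen so far."""
        return self._m2 / self.count if self.count else 0.0


# Items are consumed one at a time and never stored.
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = RunningStats()
for reading in readings:
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Because each item is processed once and then discarded, the same code works whether the input is a short list or an unbounded stream.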

[minor version]

  • Design and apply basic data processing pipelines
  • Understand basic data analysis concepts (such as aggregation, correlation and linear modelling)
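The three data analysis concepts named above can be sketched in a few lines of plain Python. This is a minimal example on made-up numbers, not the course's own material; the helper names (`aggregate_mean`, `pearson`, `fit_line`) are hypothetical, and in practice a library such as Spark or pandas would do this work.

```python
from collections import defaultdict
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def aggregate_mean(rows):
    """Aggregation: average value per group key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


def pearson(xs, ys):
    """Correlation: Pearson coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Linear modelling: least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Illustrative data only.
rows = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0)]
per_group = aggregate_mean(rows)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
r = pearson(xs, ys)
slope, intercept = fit_line(xs, ys)
print(per_group, r, slope, intercept)
```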

Course Organization

Assignments

Assignments are linked from this page. All assignments (except one) are mandatory.

Your submission is a Jupyter notebook that includes the full assignment text, your solutions, and the results of running your solutions on the provided datasets.

You submit your assignments during the Thursday morning lab sessions. You are expected to be at the lab at the timeslot assigned to your group. Timeslots will be announced well in advance.

At submission, you must be able to demonstrate a notebook with your solution running live. The TAs will compare your results with the gold standard and grade your solution on the spot.

Late submission: All submissions must be handed in on time, with no exceptions. In case of documented illness, please contact the course teacher to arrange a case-specific deadline.

Contents

Week Lecture Who? Topic Teacher Lecture Notes Assignment Deadline
13/11 1 All Course introduction, Big and Fast data GG Intro, Big and Fast Data, Intro to course PLs
13/11 2 All Programming for Big Data (1) GG Programming Techniques for Big Data Big Data Processing 29/11/2017
20/11 1 All Programming for Big Data (2) GG Programming Techniques for Big Data
20/11 2 All Distributed systems basics GG Distributed Systems
27/11 1 All Distributed databases GG Distributed Databases Distributed Databases 5/12/2017
27/11 2 All Map/Reduce and Hadoop GG
4/12 1 All Spark: RDDs and Pair RDDs GG Spark introduction
4/12 2 All Spark Internals JR Spark 19/12/2017
11/12 1 All Spark SQL / Data processing GG Spark SQL, Synonyms with Word2Vec
11/12 2 All Data processing with Spark GG Recommending bands, Predicting pull request merges A stats library for Spark Optional Exam day
18/12 1 BSc Stream processing AK
18/12 2 BSc Stream processing systems AK 9/1/2018
18/12 1 Minors Introduction to Data Science (1) GG
18/12 2 Minors Introduction to Data Science (2) GG 9/1/2018
8/1 1 All Big Graphs GG
8/1 2 All Graph processing systems GG 16/1/2018

Teachers

  • GG: Georgios Gousios
  • AK: Asterios Katsifodimos
  • JR: Jan Rellermeyer

Assessment

Resit policy

There will be a resit during Q3/4. You are allowed to transfer your assignment grade as a whole. This means that you will not be able to re-submit individual assignments. Effectively, you can only resit your written exam.

Bibliography

The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found.

[1] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O’Reilly Media, Inc., 2015.

[2] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., 2015.

[3] B. Chambers and M. Zaharia, Spark: The Definitive Guide. O’Reilly Media, Inc., 2017.

[4] M. Kleppmann, Designing Data-Intensive Applications. O’Reilly Media, Inc., 2017.

[5] H. Karau and R. Warren, High Performance Spark. O’Reilly Media, Inc., 2017.

[6] J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.