TI2736-B: Big Data Processing

General description

The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.

Big Data Processing provides an introduction to systems used to process Big Data. The main focus of the course is programming and engineering big data systems. Initially, the course explores general programming primitives that span big data systems and touches upon distributed systems. It then examines in detail the implementation of data analysis algorithms in Spark, in the context of batch processing applications, and in Flink, in the context of streaming applications.

The course is also optional for the Minor “Software Design and Application.”

Learning objectives

By the end of the course, all students should be able to:

Explain the different dimensions of big data problems
Understand why classical algorithms fail on many big data problems
Understand, explain and apply basic data processing operations (filtering, folding, projecting, etc.)
Understand and explain basic techniques (vector clocks, consensus) in distributed systems
Understand and explain basic data management techniques in distributed databases
Understand and explain the major components of the Spark framework
Create Spark-based algorithms for novel (unseen) practical problems
Explain the difference between iterative/non-iterative algorithms
Design iterative algorithms for simple practical problems
Describe in which scenarios streaming algorithms are most applicable
Apply basic streaming algorithms to practical problems, using Flink
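
To make the basic data processing operations named above concrete, here is a minimal sketch in plain Python (one of the two course languages). The `readings` records and all variable names are hypothetical and invented for illustration; the course assignments may phrase these operations differently.

```python
from functools import reduce

# Hypothetical records, invented for illustration: (city, temperature in Celsius)
readings = [("Delft", 12.5), ("Athens", 24.0), ("Delft", 14.0), ("Oslo", 3.5)]

# Filtering: keep only the records that satisfy a predicate.
delft = list(filter(lambda r: r[0] == "Delft", readings))

# Projecting: keep only one field of each record.
temperatures = list(map(lambda r: r[1], delft))

# Folding: combine all values into a single result, starting from an initial value.
total = reduce(lambda acc, t: acc + t, temperatures, 0.0)
```

The same three steps carry over to distributed collections, where each operation is applied to partitions of the data rather than to a single in-memory list.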

Course Organization

5 ECTS: This means that you need to devote at least 140 hours of study to this course.
Lectures: The course consists of 12 two-hour lectures. Attendance is not required, but it is strongly encouraged.
Homework: In the homework assignments, you will have to write code or reply to open questions. You will always work in pairs.
Groups: Students are responsible for forming pairs and communicating them to the course TAs by registering them in CPM.
Labs: 4 hours per week, designed to help you work together and get support from teaching assistants.
Teaching Assistants: Teaching assistants will be available during lab hours to provide you with feedback on your assignments. Do be active in asking questions, but don’t expect them to provide you with solutions.

Week	Date	Topic	Teacher	Assignment (Deadline)
1	14/11	Course introduction, Big and Fast data, Intro to course PLs	GG
1	15/11	The Unix programming environment	GG	Unix (jupyter, solutions)
2	21/11	Programming for Big Data (1)	GG	Functional programming: Scala (jupyter, solutions), Python (jupyter, solutions)
2	22/11	Programming for Big Data (2)	GG
3	28/11	Distributed Systems	JR	More reading: Distributed Systems
3	29/11	Distributed Databases and Filesystems	JR	More reading: Distributed Databases, Distributed filesystems
4	5/12	Spark: RDDs and Pair RDDs	GG	Spark: Scala (jupyter, solutions), Python (jupyter, solutions)
4	6/12	Spark Internals	JR
5	12/12	Spark SQL, Spark use cases: Synonyms with Word2Vec, Recommending bands, Predicting pull request merges	GG
5	13/12	Live Data Processing	GG
6	19/12	Stream processing	GG	Streaming, solutions (14/1) (Note: Optional for minor students)
6	20/12	Stream processing systems	GG
7	8/1	Recap, Answers to recap questions (Quintin van Leersum and Mikhail Epifanov)	GG
7	9/1	No lecture

Teachers

GG: Georgios Gousios
JR: Jan Rellermeyer

TAs

The head TA is Yoshi van den Akker. The TA team is managed by Goshia Migut.

Auke Schaap
Kanav Anand
Chia-Lun Yeh
Caspar Krijgsman
Danny Plenge
Jordi Smit
Lisette Veldkamp
Yoshi van den Akker

Assignments

You can find the course assignments linked through this page.

For BSc students, all assignments are mandatory.
For Minor students, the first 3 assignments are mandatory. You can also try the fourth; your result will count as extra points to your final grade.

For submission, we will use CPM. The course name is TI2736-B: Big Data Processing

You need to sign up to enroll and also declare your pair
To submit, hit the overview button and select the appropriate assignment
All the assignments have deadlines
Feedback and grading are automatic: the results are available on CPM.
Technical support: ask in the Mattermost channel
- If you get no feedback after 1 hour, do ask the TAs.

The student groups must submit each assignment before 23:59 on the day of the deadline.

Late submission: All submissions must be handed in on time, with no exceptions. Any late submission will be discarded and graded with 0. In case of provable illness, please contact the course teacher to arrange a case-specific deadline.

Assessment

Assignments (40%): The grade is calculated as the mean grade of all assignments. No minimum grade. If you don’t submit an assignment, or the submission is late, you will get a 0.
- For BSc students: each assignment counts for 10% of the total grade
- For minor students: each assignment counts for 13% of the total grade. Any points you earn on the optional last assignment count as extra, up to a total of 1 grade point.
Written Exam (60%): Closed-book exam. Minimum grade: 5
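
As an illustration of the weighting above, the following Python sketch computes a weighted grade for a BSc student. It assumes a 0 to 10 grading scale, four assignments, and no rounding, none of which are stated on this page; the function name `final_grade` and the example grades are hypothetical. How a grade is recorded when the exam minimum is not met is also not stated here, so the sketch only reports whether the minimum was reached.

```python
def final_grade(assignment_grades, exam_grade):
    """Return (weighted_grade, exam_minimum_met) under the stated assumptions."""
    # Missing or late assignments are assumed to appear in the list as 0.
    assignments_mean = sum(assignment_grades) / len(assignment_grades)
    # Assignments weigh 40% and the written exam weighs 60%.
    weighted = 0.4 * assignments_mean + 0.6 * exam_grade
    # The written exam has a minimum grade of 5.
    return weighted, exam_grade >= 5


# Hypothetical example: one missed assignment (graded 0) and an exam grade of 6.5.
grade, exam_ok = final_grade([8, 7, 0, 9], exam_grade=6.5)
```

With four assignments, the mean weighted at 40% is equivalent to each assignment counting for 10% of the total grade.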

Example exam material

Model exam, solutions

Resit policy

There will be an exam-only resit during Q3/4. You are allowed to transfer your assignment grade to the resit as a whole. This means that you will not be able to re-submit individual assignments. Effectively, you can only resit your written exam.

Course resources

Lab sessions every Monday morning
- All TAs will be around
- Work together to solve assignments
- Ask questions about the next assignment
- Ask questions about your grades
You are welcome to join the BDP 2018-2019 Mattermost channel
The course VM: Contains all software (Spark, HDFS, Flink) you may need pre-installed
- Find it here
- Provided only for convenience: no technical support
- Default account username/password: bigdata/bigdata

The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found.

Bibliography

[1] J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.

[2] M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.

[3] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.

[4] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.

[5] H. Karau and R. Warren, High performance Spark. O’Reilly Media, Inc., 2017.

[6] B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.

[7] T. Akidau, S. Chernyak, and R. Lax, Streaming systems: The what, where, when, and how of large-scale data processing. O’Reilly, 2018.

[8] C. Martella, R. Shaposhnik, D. Logothetis, and S. Harenberg, Practical graph analytics with Apache Giraph. Springer, 2015.

[9] I. Robinson, J. Webber, and E. Eifrem, Graph databases: New opportunities for connected data. Springer, 2015.

Course information

Georgios Gousios

09 September 2021