General description

The 5 EC minor project lets minor students gain experience working in teams while developing software in the context of the Big Data domain. The project consists of a data analysis or data pipeline task and is carried out in groups of four.

Learning Objectives

Course Organization

The course is self-study based. You will carry out a project with an open-ended goal, and you will need to learn the required libraries, technologies and data sources yourself, along with your peers.

In general, to carry out the project, you are advised to read ahead in the contents of the Big Data Processing course, especially the Spark and stream processing parts.

We will meet for two hours every Tuesday morning, 08:45 – 10:45. You are welcome to ask questions, technical or otherwise, or just to hang out and check what others are doing.

The course is worth 5 EC per person; at 28 hours per EC, a group of 4 represents 4 × 140 = 560 hours. This is a lot of work! Please use this time wisely to learn tools and libraries that will make your subsequent work (during the remainder of your studies) more pleasant!

Project topics

You will need to come up with a (big) data analysis task that bridges your field with the topics you are studying in the minor. Please write a short proposal along the lines of the ones following and send it to the course instructor.

The following topics are indicative and are based on the course instructor’s research interests (software analytics). Only use them if you’re out of ideas of what to do!

Software Aging

Software applications are rapidly being developed to meet the demands of their end-users with new and performant features. This trend is particularly apparent in the app stores for iOS and Android devices, where apps are regularly updated with improvements and performance fixes.

The task of this project is to investigate to what degree outdated versions of a software project can still run today. You will investigate software projects on GitHub, where projects are versioned and old software releases remain available. You will also develop a set of metrics and use sources such as Travis CI to find indicators of whether an old version of a project can still run. You will then be able to answer questions along these lines: “For GitHub project A, there are 16 versions; 6 versions are likely to run today while the other 10 are not.”
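As a starting point, a heuristic like the following could turn mined release indicators into a runnability estimate. All field names and thresholds here are illustrative assumptions, not part of the assignment; in the project you would derive your own metrics from GitHub and Travis CI data.

```python
from datetime import date

def likely_runnable(release, today):
    """Illustrative guess whether an archived release still builds today."""
    age_years = (today - release["released"]).days / 365.25
    if release["last_ci_status"] != "passed":
        return False                 # was already failing when archived
    if not release["deps_pinned"]:
        return age_years < 1.0       # unpinned dependencies rot quickly
    return age_years < 5.0           # pinned dependencies buy more time

# Toy sample standing in for metadata mined from GitHub releases.
releases = [
    {"released": date(2016, 3, 1), "last_ci_status": "passed", "deps_pinned": True},
    {"released": date(2010, 6, 1), "last_ci_status": "passed", "deps_pinned": False},
    {"released": date(2014, 1, 1), "last_ci_status": "failed", "deps_pinned": True},
]
today = date(2017, 9, 1)
runnable = sum(likely_runnable(r, today) for r in releases)
print(f"{runnable} of {len(releases)} versions are likely to run today")
```

A real analysis would replace the hand-picked thresholds with metrics validated against actual build attempts.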

Reliability analysis of APIs

More and more developers use APIs to build client applications. The reliability of these APIs plays a key role in the robustness of such applications. In particular, APIs that can receive invalid external inputs (from users, databases, etc.) can lead applications to execution failures. The goal of the proposed project is to process both the commits and the issues of software projects (e.g. hosted on GitHub) to identify risky APIs that can cause specific types of errors (e.g. MalformedURLException, PatternSyntaxException, NumberFormatException) at runtime. A data set of projects that use these APIs can help in studying and improving the reliability of the blamed APIs.
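One simple way to sketch the mining step: scan commit messages (which you could fetch from GitHub) for mentions of the runtime exceptions above. The sample commits below are made up for illustration; a full pipeline would also process issue texts and link hits back to the APIs involved.

```python
import re

# The exception names come from the project description; the pattern
# and sample data are illustrative.
RISKY_EXCEPTIONS = re.compile(
    r"\b(MalformedURLException|PatternSyntaxException|NumberFormatException)\b"
)

def blamed_exceptions(commit_messages):
    """Count how often each risky exception is mentioned across commits."""
    counts = {}
    for msg in commit_messages:
        for exc in RISKY_EXCEPTIONS.findall(msg):
            counts[exc] = counts.get(exc, 0) + 1
    return counts

commits = [
    "Fix NumberFormatException when parsing the port from user input",
    "Guard against MalformedURLException in the crawler",
    "Refactor logging; no functional change",
    "Catch NumberFormatException thrown by Integer.parseInt",
]
counts = blamed_exceptions(commits)
print(counts)
```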

Streaming Repository Security

The co-location of projects and their deployment configurations on GitHub has made those projects particularly vulnerable to sensitive data leaks. For reasons of convenience, negligence, or simple mistakes, it is quite common for GitHub users to push passwords, database connection strings, cloud provider one-time passwords, environment variables, and private SSH keys to public repositories. Once this information is made public, it is impossible to retract: projects such as GHTorrent and GitHub Archive archive it, while GitHub’s real-time event stream makes it easy for adversaries to attack the exposed systems almost immediately. The aim of the proposed project is to explore this phenomenon.
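The core of such an exploration is a detector run over pushed file contents from the event stream. The patterns below are simplified illustrations of common leak shapes (an AWS-style access key, a private key header, a hard-coded password), not production-grade rules:

```python
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_ssh_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}

def find_leaks(blob):
    """Return the names of all secret patterns matching a file's contents."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(blob)]

# Toy stand-in for a file pushed to a public repository.
pushed = 'DB_HOST = "db.internal"\npassword = "hunter2"\n'
print(find_leaks(pushed))
```

In the project you would feed such a detector from GitHub’s event stream and study how often leaks occur and how quickly they are exploited or removed.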

Extracting facts from large amounts of unstructured texts

Travis CI is the most popular Continuous Integration (CI) service on GitHub. CI inspires confidence in developers because it automatically builds and tests their entire project for every single change they commit. The results of a CI build, however, are hard to interpret, as the output from the build goes into one large, unstructured text file. This means that developers sometimes have to sift through thousands of lines of text just to find out which of their tests failed. Our service TravisTorrent, a “Travis CI build log treasure trove” (CEO of Travis CI GmbH), makes this information readily available in database form and thus easy to query. Currently, however, we can only analyze build logs from Ruby, Java, Python, and Go. The aim of this project is to extend our build log capabilities to other programming languages like C#, PHP, or Swift, using the existing parsers as a template and point of reference.
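In spirit, a log parser turns a raw build log into a few structured fields. The toy parser below assumes a PHPUnit-style summary line; a real parser for this project would follow the existing TravisTorrent parsers as a template and handle many more cases:

```python
import re

# Illustrative pattern for a PHPUnit-style summary line; the exact log
# formats you must handle should be taken from real Travis CI logs.
SUMMARY = re.compile(r"Tests: (\d+), Assertions: (\d+), Failures: (\d+)")

def parse_build_log(log):
    """Extract structured test results from a raw build log, if possible."""
    m = SUMMARY.search(log)
    if m is None:
        return {"analyzable": False}
    tests, assertions, failures = map(int, m.groups())
    return {"analyzable": True, "tests": tests,
            "failures": failures, "ok": failures == 0}

log = """PHPUnit 5.7.0 by Sebastian Bergmann and contributors.
...F..
Tests: 6, Assertions: 14, Failures: 1."""
result = parse_build_log(log)
print(result)
```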


Each group (max 4 people) must submit the following:

The final grade will be based on an intermediate presentation (20%) and a final presentation (30%) of the results, plus an evaluation of the report and source code you submit (50%).

Important dates

The presentations will take place during the Tuesday morning sessions.

Projects of 2017