10 Nov 2018
Teaching Software Analytics

In the TU Delft CS master's program, students have to take at least one seminar course. As opposed to normal courses, where a more traditional teaching method is the norm, in seminar courses the students have to read papers – lots of them. This makes the format ideal for courses that teach students topics at a field's cutting edge. In the previous quarter (Sep - Oct), I taught such a course: Software Analytics. It was the first time I taught both a seminar course and Software Analytics, which also happens to be my field of expertise. I was therefore relatively... Read more

03 Oct 2018
Introducing the FASTEN project

A popular form of software reuse is the use of Open-Source Software (OSS) libraries, hosted on centralized code repositories, such as Maven or NPM. Developers only need to declare dependencies on external libraries, and automated tools make them available to the workspace of the project. In recent years, we have seen package management fail in spectacular ways: In the left-pad incident, a developer broke a significant part of the Internet by just removing a package from NPM. Equifax lost $4 billion because they deemed a security update unnecessary. A Linux kernel developer engaged in a series of litigation actions against... Read more

27 May 2017
Report from ICSE 2017

In the week of May 18-27, I travelled to Argentina to attend MSR and ICSE. For people not familiar with software engineering research, ICSE is the flagship conference of the field and one of the few that receive an A* rating in the CORE conference rankings. All aspiring software engineering researchers aim to publish there, which makes it very competitive (~16% acceptance rate). This year, ICSE took place in the beautiful city of Buenos Aires, a place that reminded me of home (well, Athens) more than any other city I have been to. I really enjoyed ICSE this year. The... Read more

08 Mar 2016
The Issue 32 incident – An update

Many of you are aware of the GHTorrent issue 32. To sum up the discussion in a couple of lines, various developers included in GHTorrent wanted their email removed from it (which I did) and then wanted all emails to be excluded from the dataset (which I refused to do). The reasons behind the requests were privacy and the right to do whatever one wants with their personal data (email in many jurisdictions is considered personal data). What caused the whole thread was that researchers used GHTorrent as a source of emails for research surveys which were sent to... Read more

26 Jun 2015
How do project contributors use pull requests on Github?

With Alberto Bacchelli. Distributed software development projects employ collaboration models and patterns to streamline the process of integrating incoming contributions. Classic forms of code contributions to collaborative projects include change sets sent to development mailing lists or issue tracking systems and direct access to the version control system. More recently, however, a big portion of open source development happens on GitHub. One of the main reasons for this is the fact that contributing to a GitHub project is a relatively pain-free experience. Or is it? In Apr 2014, we ran a survey among contributors (also: integrators) to GitHub projects trying... Read more

02 Apr 2015
How to run a large scale survey

If you know me well, this blog post might seem strange. I have always been a proponent of quantitative methods and big data. Despite this, in April 2014, I ran a survey that got filled in by 1,500 people. One part of the survey analysis will be presented at ICSE 2015 this year, while we submitted the second part to FSE 2015 (still twiddling our thumbs about the results). In the wake of the ICSE 2015 publication, many colleagues asked me how I managed to get so many responses. Here is how I did it. Target an audience: The broader the... Read more

03 Oct 2014
How do project owners use pull requests on Github?

Pull-based development as a distributed development model is a distinct way of collaborating in software development. In this model, the project’s main repository is not shared among potential contributors; instead, contributors fork (clone) the repository and make their changes independent of each other. In the pull-based model, the role of the integrator is crucial. The integrator must act as a guardian for the project’s quality while at the same time keeping several (often, more than ten) contributions "in-flight" by communicating modification requirements to the original contributors. Being a part of a development team, the integrator must facilitate consensus-reaching discussions and... Read more

07 Jul 2014
The computer scientist's guide to speech development

During the last 20 months, I've been having fun with my daughter's (from now on: little λ) efforts to learn to speak. Up to now, the whole process can be split into 4 phases. The random noise phase: This starts at around 4 months. The baby mumbles random noises initially (aaa, usually) and, as the brain develops, more focused two-letter syllables (ma-ma, pa-pa etc). Nothing interesting here, apart from the fact the baby can combine various stimuli (noise, vision etc) with oral expressions (say ma-ma when she hears mummy whispering at night), which computers are not very capable... Read more

29 May 2014
What's new in GHTorrent land?

A lot of people (around 30 at last count) have been using GHTorrent lately as an easy-to-use source for accessing the wealth of data Github has. Portions of the dataset appear in the MSR14 and VISSOFT14 data challenges, while at least 15 papers at this year's MSR and ICSE conferences are based on it. In this blog post, I summarize the long list of changes that happened in GHTorrent land since Sep 2013. Introducing Lean GHTorrent: Obtaining and restoring the full GHTorrent dataset is serious business: one has to download and restore more than 3TB of MongoDB... Read more

27 Mar 2014
The triumph of online collaboration

For a research paper I am working on, we wanted to analyze the top 30 "most collaborative" projects on Github. Defining a quantitative metric of collaboration and sorting projects according to it is not an easy task, as collaboration is in many cases implicit and not recorded, while not all actions of collaboration are equal. As a proxy, we chose to measure the number of people that perform changes that mutate the state of a repository. On Github, we could identify the following: A: Create a commit to a repository B: Perform a code review on an individual commit C:... Read more
