Assignment: Unix (minors)

As we have seen during the lecture, the Unix command line is extremely versatile. This comes at a cost: you need to know which tool is suitable for the job, or build one yourself. For this assignment, you will use a Unix command line (e.g. in the course's VM, on a Mac, or in Ubuntu for Windows) to accomplish the indicated tasks. Not all the tools you will need are available in the default installation of Xubuntu that we use in the course. You will need to search online for the appropriate tools and command-line options (or implement them!).

Note: To develop and run those commands in Jupyter, you need to install the Bash kernel. Alternatively, you can create a file where you paste all programs/pipelines, print it and upload it to the assignment directory.

Shell programming

T (10 points) Write a pipeline that converts (recursively!) a directory structure full of .wav files to .mp3s.

In [ ]:
find . -type f -name "*.wav" -print0 | xargs -0 -P 2 -I {} lame {}
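
A sketch of the same conversion with the output path spelled out explicitly, so that each .mp3 lands next to its source file. The function names mp3_name and convert_tree are hypothetical, and the sketch assumes the lame encoder is installed and called as lame INPUT OUTPUT. It processes one file at a time, so it trades the parallelism of xargs -P for readability.

```shell
# mp3_name FILE: print the output path for FILE, swapping a trailing .wav for .mp3.
mp3_name() {
    printf '%s\n' "${1%.wav}.mp3"
}

# convert_tree DIR: encode every .wav under DIR next to its source file.
# Assumption: file names contain no newline characters.
convert_tree() {
    find "$1" -type f -name '*.wav' | while IFS= read -r wav; do
        # Redirect stdin so the encoder cannot consume the list of file names.
        lame "$wav" "$(mp3_name "$wav")" < /dev/null
    done
}
```

For example, convert_tree . would convert the current directory tree.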

T (10 points) Write a program that, given a directory of text files, prints the 10 most common words in those files (across all files).

In [ ]:
find . -type f -name "*.txt" -print0 | xargs -0 cat | tr -s '[:space:]' '\n' | sort | uniq -c | sort -n | tail -n 10

T (10 points) Write a program that will print all files (recursively!) that were not accessed in the last 30 days.

Hint: Use date to format date strings

In [ ]:
# -atime -30 matches files accessed less than 30 days ago; negate it
find . -type f ! -atime -30
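
The hint about date can be followed literally: compute a cutoff date, then compare each file's access time against it. The sketch below assumes GNU date and GNU find (as on Xubuntu), with a fallback to the BSD/macOS form of date. The function names cutoff_date and stale_files are hypothetical.

```shell
# cutoff_date DAYS: print the calendar date DAYS days ago as YYYY-MM-DD.
# Tries GNU date first (-d), then falls back to the BSD/macOS form (-v).
cutoff_date() {
    date -d "$1 days ago" '+%Y-%m-%d' 2>/dev/null ||
        date -v "-${1}d" '+%Y-%m-%d'
}

# stale_files DIR DAYS: list regular files under DIR whose last access
# time is not newer than the cutoff date (midnight at the start of that day).
stale_files() {
    find "$1" -type f ! -newerat "$(cutoff_date "$2")"
}
```

For example, stale_files . 30 lists the files under the current directory that have not been accessed since the cutoff.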

T (10 points) Write a program that checks whether all the links in this web page work and reports the ones that do not. You can check whether a link works by inspecting the HTTP return header for 404 errors.

In [8]:
curl -s http://ghtorrent.org/downloads.html |
grep -o 'http://[^"]*' |
while read -r url; do
    # A link is broken if the status line of the HEAD response mentions 404
    if curl -s -I "$url" | head -n 1 | grep -q 404; then
        echo "$url does not work"
    fi
done
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2015-08-07.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2015-06-18.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2015-04-01.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2015-01-04.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2014-11-10.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2014-08-18.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2014-04-02.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2014-01-02.tar.gz does not work
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2013-10-12.tar.gz does not work
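
The same check can be split into small reusable steps. This is a sketch rather than a definitive implementation: the function names are hypothetical, it also picks up https links, and it asks curl for the numeric status code through --write-out instead of grepping the header line.

```shell
# extract_urls: read HTML on stdin, print each absolute http(s) URL once.
extract_urls() {
    grep -oE 'https?://[^"<> ]+' | sort -u
}

# http_status URL: print the status code of a HEAD request, following redirects.
# curl prints 000 when no response was received (DNS failure, timeout, ...).
http_status() {
    curl -s -o /dev/null -I -L --max-time 10 -w '%{http_code}' "$1"
}

# broken_links: read URLs on stdin and report those that answer 404.
broken_links() {
    while read -r url; do
        if [ "$(http_status "$url")" = "404" ]; then
            echo "$url does not work"
        fi
    done
}
```

A possible invocation is curl -s http://ghtorrent.org/downloads.html | extract_urls | broken_links.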

T (10 points) Write a program that will create a tar.gz archive out of a directory of source code, omitting all files that are binary (i.e. non-text).

Hint: Use file --mime

In [ ]:
find . -type f |
xargs -I {} file --mime {} |
# Match the charset field, so file names containing "binary" are not dropped
grep -v 'charset=binary' |
cut -f1 -d':' |
tar zcvf ../non-binaries.tar.gz -T -

T (10 points) Implement a case-insensitive spell checker. Given an input file, it should report all words not in the dictionary.

Hint: You can use this dictionary file

In [ ]:
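# Lowercase and sort the dictionary once, so that comm compares like with like
tr '[:upper:]' '[:lower:]' < /usr/share/dict/words | sort -u > dict.lower

cat file |
tr '[:upper:]' '[:lower:]' |
tr -s '[:space:]' '\n' |
tr -d '[:punct:]' |
sort -u |
comm -13 dict.lower -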
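
The spell-checking idea can also be wrapped in a function that takes the dictionary and the input file as arguments. The sketch below uses a hypothetical function name, misspelled, and assumes plain ASCII text. It sorts both word lists in the C locale because comm expects its two inputs to be sorted in the same order.

```shell
# misspelled DICT FILE: print the words of FILE that are missing from DICT,
# ignoring case. Runs in a subshell so the locale change stays local.
misspelled() (
    export LC_ALL=C
    dict_sorted=$(mktemp)
    tr '[:upper:]' '[:lower:]' < "$1" | sort -u > "$dict_sorted"
    tr '[:upper:]' '[:lower:]' < "$2" |
        tr -cs '[:alpha:]' '\n' |   # every run of non-letters becomes one newline
        grep -v '^$' |              # drop the empty line a leading non-letter leaves
        sort -u |
        comm -13 "$dict_sorted" -   # keep lines unique to the second input
    rm -f "$dict_sorted"
)
```

For example, misspelled /usr/share/dict/words file prints one unknown word per line.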

Data processing

For the assignments in this section, we use the same pullreqs.csv and interesting.csv files that we used in the Spark assignment.

Write pipelines to calculate answers to the following questions (they may look familiar :-)):

T (10 points) Count the number of WARNing messages

In [1]:
grep ^WARN ../datasets/ghtorrent-logs.txt| wc -l
  132158

T (10 points) How many repositories were processed in total? Use the api_client lines only.

In [2]:
grep api_client.rb ../datasets/ghtorrent-logs.txt| 
egrep -v "DEBUG|WARN|ERROR"| 
grep -o "https://[^,?']*"| 
cut -f5,6 -d'/'|
grep '/' |
sort |
uniq |
wc
   68541   68541 1652441

T (10 points) Which client made the most HTTP requests?

In [3]:
grep api_client.rb ../datasets/ghtorrent-logs.txt|
cut -f3 -d ',' |
cut -f2 -d ' '|
sort |uniq -c |sort -n |
tail -n 5
27631 ghtorrent-42
30774 ghtorrent-40
31401 ghtorrent-20
100906 ghtorrent-21
135978 ghtorrent-13

T (10 points) What is the most active repository?

In [4]:
grep api_client.rb ../datasets/ghtorrent-logs.txt| 
egrep -v "DEBUG|WARN|ERROR"| 
grep -o "https://[^,?']*"|
cut -f5,6 -d'/'|
grep '/' |
sort |
uniq -c |
sort -n |
tail -n 5
1059 ssbattousai/Cuda36
1107 kubernetes/kubernetes
2295 obophenotype/human-phenotype-ontology
2571 shuhongwu/hockeyapp
4084 mithro/chromium-infra

T (10 points) Which access keys are failing most often?

In [5]:
grep api_client.rb ../datasets/ghtorrent-logs.txt |
grep WARN|
sed -e 's/^.*Access: \([^ ,]*\).*$/\1/'|
grep -v "Unauthorised request"|
sort |uniq -c |sort -n|
tail -n 5
 368 2776f3ba0a5
 371 c1240f63b5b
1134 9115020fb01
1340 46f11b5791b
79623 ac6168f8776

T (10 points) Which of the interesting repositories has the most failed API calls?

In [6]:
grep api_client.rb ../datasets/ghtorrent-logs.txt|
grep ^WARN|
grep -o "https://[^,?']*"|
cut -f5,6 -d'/'|
sort |uniq -c  > failed-with-counts

cat ../datasets/important-repos.csv |cut -f2 -d','|cut -f5,6 -d'/'|sort > important

join -1 2 failed-with-counts important |sort -n -k 2 | tail -n 5
yangshuying/leetcode 2
zooppa/administrate-field-carrierwave 2
lorch1010/PanelProject 3
wireapp/wire-ios 3
asmagin/sitecore-foundation-codegeneration-composition 5
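
The join step relies on two details: both inputs must be sorted on the join field, and uniq -c puts the count in the first field, which is why the repository name is field 2 of the first file. A minimal sketch of the pattern as a function; the name count_for_repos and its arguments are hypothetical.

```shell
# count_for_repos EVENTS REPOS: EVENTS has one repository name per line (with
# repeats), REPOS has one name per line. Prints "name count" for every name in
# REPOS that occurs in EVENTS, least frequent first.
count_for_repos() (
    export LC_ALL=C                      # sort and join must agree on ordering
    counts=$(mktemp); wanted=$(mktemp)
    sort "$1" | uniq -c > "$counts"      # lines look like "   2 a/x"
    sort -u "$2" > "$wanted"
    join -1 2 "$counts" "$wanted" | sort -n -k 2
    rm -f "$counts" "$wanted"
)
```

Names that appear in only one of the two files are silently dropped, which is the default behaviour of join.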