Assignment: Unix

Wouter Zorgdrager and Georgios Gousios

As we have seen during the lecture, the Unix command line is extremely versatile. This comes at a cost: you need to know which tool is suitable for the job, or build it yourself. For this assignment, you will use a Unix command line (e.g. in the course's VM, on a Mac, or in Ubuntu for Windows) to accomplish the indicated tasks. Not all the tools you will need are readily available in the default installation of Xubuntu that we are using in the course. You will need to search online for the appropriate tools and command line options (or implement them yourself!).

Note: To develop and run those commands in Jupyter, you need to install the Bash kernel. Alternatively, you can create a file where you paste all programs/pipelines, print it and upload it to the assignment directory.

There are 120 points to collect. To get a 10, you need 110.

Shell programming

Q1 (10 points) Write a pipeline that converts (recursively!) a directory structure full of .wav files to .mp3s.

In [ ]:
find . -type f -name '*.wav' -print0 | xargs -0 -P 2 -I {} lame {}
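
As a dry run of the find/xargs pattern, the sketch below builds a throwaway directory tree and prints the lame commands that would be executed instead of running them. The directory and file names are made up for illustration. It uses -print0 together with xargs -0 so that file names containing spaces survive intact.

```shell
# Build a small throwaway tree with (empty) .wav files; all names are hypothetical
workdir=$(mktemp -d)
mkdir -p "$workdir/album one/disc2"
touch "$workdir/intro.wav" \
      "$workdir/album one/track 1.wav" \
      "$workdir/album one/disc2/outro.wav" \
      "$workdir/notes.txt"

# Dry run: print one 'lame' command per .wav file instead of executing it
planned=$(cd "$workdir" && find . -type f -name '*.wav' -print0 |
  xargs -0 -I {} echo lame {} | sort)
echo "$planned"

rm -rf "$workdir"
```

Once the printed commands look right, dropping the echo runs the actual conversion.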

Q2 (10 points) Write a program that, given a directory structure of text files, prints the 10 most common words in those files (across all files).

The output should look like this (the example shows the 3 most common items):

5 most_common_word
2 less_common_word
1 least_common_word

In [ ]:
find . -type f -name "*.txt" -print0 | xargs -0 cat | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn | head -n 10
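
To see the counting steps on a small input, here is a minimal sketch that runs the same kind of pipeline over two made-up text files in a temporary directory. The file names and sentences are assumptions chosen so that the result is easy to check by hand. The steps are: fold case, drop punctuation, put one word per line, count, and sort with the highest count first.

```shell
# Sample input in a throwaway directory; file names and text are made up
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
printf 'The cat and the dog.\n' > "$workdir/a.txt"
printf 'A dog, the DOG! The end.\n' > "$workdir/sub/b.txt"

# Keep only the 2 most common words for this small example
top_words=$(find "$workdir" -type f -name '*.txt' -print0 |
  xargs -0 cat |
  tr '[:upper:]' '[:lower:]' |
  tr -d '[:punct:]' |
  tr -s '[:space:]' '\n' |
  sort | uniq -c |
  sort -rn |
  head -n 2)
echo "$top_words"

rm -rf "$workdir"
```

In this sample "the" occurs four times and "dog" three times, so those two words are reported, most frequent first.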

Q3 (10 points) Given this dataset (which you will also be using later on), write a program that obtains the star count, watcher count and number of forks for the first 10 repositories. The output must look like the following:

url,stars,watchers,forks
https://api.github.com/repos/8d8d/Think.Admin,0,0,0
https://api.github.com/repos/971638267/RetrofitAndRxjavaforRecyclerview,12,12,3
https://api.github.com/repos/9alsacelost/greenDAO,0,0,0
https://api.github.com/repos/a1265137718/ZoomHeader,0,0,0
https://api.github.com/repos/12345678/NonExistingRepo,,,

Note: Use "repositories.csv" as filename, otherwise automatic grading will fail.

Warning: The GitHub API requires authentication; unauthenticated access is limited to 60 requests per hour. Make sure you set up an OAuth key and use it for the requests. After the assignment deadline, you can remove it. If you fail to submit a solution including a key, the grading will fail.

In [ ]:
echo url,stars,watchers,forks
head -n 11 repositories.csv |
tail -n 10 |
while read -r repoline; do
    url=$(echo "$repoline" | cut -f2 -d',')
    # GitHub no longer accepts tokens in the query string; send the token in a header instead
    token="ADD_YOUR_OAUTH_TOKEN"
    counts=$(curl -s -H "Authorization: token $token" "$url" | jq -r '[.stargazers_count, .watchers_count, .forks_count] | @csv')
    echo "$url,$counts"
done
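
The loop depends on two small steps that can be tried offline: skipping the header row and pulling the URL out of each comma-separated line. The sketch below shows only those steps and makes no API calls. The sample rows and the column layout (id,url,owner_id,name) are assumptions made for illustration, not the real dataset.

```shell
# Hypothetical sample in an assumed layout: id,url,owner_id,name
sample=$(mktemp)
cat > "$sample" <<'EOF'
id,url,owner_id,name
1,https://api.github.com/repos/alice/one,10,one
2,https://api.github.com/repos/bob/two,11,two
3,https://api.github.com/repos/carol/three,12,three
EOF

# Skip the header (tail -n +2), keep the first 2 data rows,
# and let read split each line on commas
urls=$(tail -n +2 "$sample" | head -n 2 |
  while IFS=, read -r id url rest; do
    echo "$url"
  done)
echo "$urls"

rm -f "$sample"
```

Setting IFS only for the read command keeps the comma splitting local to that line, so the rest of the script is unaffected.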

Q4 (10 points) Write a program that downloads all JPEG pictures in a web page. JPEG pictures are identified by the extensions *.jpg and *.jpeg. Your program must accept the URL to process as an argument. For example:

$ jpegdl https://nu.nl
[...]
$ ls
zi7xo0pa6h6x_std320.jpg
hsnxkcyat176_std320.jpg
[...]

In [ ]:
#!/usr/bin/env bash

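# Fetch the page, isolate the <img> tags and download every JPEG they reference
curl -s "$1" |
grep -oE '<img[^>]*>' |
grep -oiE 'src="[^"]+\.jpe?g"' |
cut -f2 -d'"' |
xargs -P 8 -n 1 wget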
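
The extraction step can be checked without downloading anything. The sketch below applies it to a made-up HTML fragment (the example.com URLs are placeholders). Picking the src attribute explicitly avoids depending on the order of attributes inside the tag.

```shell
# Made-up HTML fragment standing in for a downloaded page
html='<p><a href="/about.html">About</a>
<img class="thumb" src="https://example.com/a.jpg" alt="a">
<img src="https://example.com/b.jpeg"><img src="https://example.com/c.png">'

# Isolate each <img ...> tag, then keep only src values ending in .jpg or .jpeg
jpeg_urls=$(printf '%s\n' "$html" |
  grep -oE '<img[^>]*>' |
  grep -oE 'src="[^"]+\.jpe?g"' |
  cut -d '"' -f 2)
echo "$jpeg_urls"
```

Only the .jpg and .jpeg URLs are printed; the .png image and the link in the anchor tag are ignored. Relative image paths would still need to be resolved against the page URL before wget can fetch them.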

Q5 (10 points) All Unix systems have a dictionary file residing under /usr/share/dict/words or /usr/dict/words. Use it to implement a (rudimentary) spellchecker. Your spellchecker should read a file named foo.txt and print a list of all the words in the document that are not in the dictionary. An example usage session can be seen below.

$ cat foo.txt
I am a nicelly formated sentence,
but I contain errors.

$ cat foo.txt | spellchecker
nicelly
formated

In [ ]:
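cat foo.txt |
tr '[:upper:]' '[:lower:]' |
tr ' ' '\n' |
tr -d ',.' |
sort |
uniq |
# comm needs both inputs sorted the same way, so lowercase and re-sort the dictionary too
comm -13 <(tr '[:upper:]' '[:lower:]' < /usr/share/dict/words | sort -u) -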
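
Because dictionary files differ between systems, the idea is easier to verify with a tiny stand-in dictionary. The sketch below is a minimal version under that assumption: the dictionary contents are made up, and the document is the example text from the question. comm -13 prints the lines that occur only in its second input, which here are the words missing from the dictionary.

```shell
# Tiny stand-in dictionary and document; the dictionary is made up for this sketch
dict=$(mktemp)
doc=$(mktemp)
printf '%s\n' a am but contain errors i sentence | sort > "$dict"
printf 'I am a nicelly formated sentence,\nbut I contain errors.\n' > "$doc"

# Lowercase, split into one word per line, de-duplicate, then keep only
# the words that appear in the document but not in the dictionary
misspelled=$(tr '[:upper:]' '[:lower:]' < "$doc" |
  tr -cs '[:alpha:]' '\n' |
  sort -u |
  comm -13 "$dict" -)
echo "$misspelled"

rm -f "$dict" "$doc"
```

Both inputs are sorted with the same sort command, which comm requires. The misspelled words therefore come out in sorted order rather than in document order.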

Q6 (10 points) Given this repository at commit 05681455d905586f940e0e00e, find the sizes of all versions of all test files (assume that all test files are under src/test/java). The output must look like the following:

blob_id blob_path size_in_bytes

for example:

da14c3975e src/test/java/nl/tudelft/jpacman/npc/ghost/NavigationTest.java 165
1fbe0d836d src/test/java/nl/tudelft/jpacman/sprite/SpriteTest.java 108
3bde982975 src/test/java/nl/tudelft/jpacman/sprite/SpriteTest.java 108
6286792e43 src/test/java/nl/tudelft/jpacman/sprite/SpriteTest.java 110

A blob in Git represents a file version.

In [ ]:
# Clone the repo
git clone https://github.com/SERG-Delft/jpacman-framework
cd jpacman-framework
git checkout -b tests 05681455d905586f940e

# Get a conveniently formatted git log
git log --pretty="%h,%t,%ae" |
while read logline; do
  # Get a Git tree object
  tree=$(echo $logline | cut -f2 -d ',')
  # Get a recursive listing of all files and filter test ones
  git ls-tree --abbrev=10 -r $tree | grep "src/test/java" | tr '\t' ' ' |  cut -f3,4 -d ' '
done |
# Remove duplicate entries
sort |
uniq |
while read -r testfile; do
  # Get the blob id of each file version
  blob=$(echo "$testfile" | cut -f1 -d' ')
  # Calculate its size in bytes
  size=$(git cat-file -s "$blob")
  echo "$testfile $size"
done |
sort -k 2
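
One way to obtain a blob's size in bytes is git cat-file -s. The sketch below tries this on a throwaway repository with two versions of a single file, so it needs no network access. The repository, the file name ATest.java, its contents and the demo committer identity are all made up for illustration.

```shell
# Throwaway repository with two versions of one hypothetical test file
repo=$(mktemp -d)
git init -q "$repo"
mkdir -p "$repo/src/test/java"
printf 'class A {}\n' > "$repo/src/test/java/ATest.java"
git -C "$repo" add .
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m v1
printf 'class A { int x; }\n' > "$repo/src/test/java/ATest.java"
git -C "$repo" add .
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m v2

# For every commit, list the blobs under src/test/java, de-duplicate them,
# and ask Git for each blob's size in bytes
sizes=$(git -C "$repo" log --pretty='%T' |
  while read -r tree; do
    git -C "$repo" ls-tree -r "$tree" -- src/test/java
  done |
  awk '{print $3, $4}' | sort -u |
  while read -r blob path; do
    echo "$blob $path $(git -C "$repo" cat-file -s "$blob")"
  done | sort -k 2)
echo "$sizes"

rm -rf "$repo"
```

The two versions are 11 and 19 bytes long, so the output has two lines for the same path with different blob ids and sizes.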

Data processing

For the assignments in this section, we use the following dataset containing repositories.

Write pipelines to calculate answers to the following questions:

Note: Use "repositories.csv" as filename, otherwise automatic grading will fail.

Q7 (10 points) Count the number of repositories written in Java.

In [ ]:
cut -d ',' -f 5 repositories.csv | 
grep -w "Java" | 
wc -l

Q8 (10 points) How many repositories were forked and written in PHP?

In [ ]:
cut -d ',' -f 5,7 repositories.csv |
grep -w "PHP" |
grep '[0-9]' |
wc -l

Q9 (10 points) Which owner_id owns most repositories?

In [ ]:
cut -d ',' -f 3 repositories.csv |
sort |
uniq -c |
sort -rn |
head -1 |
sed 's/^ *[0-9]* //'
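
The counting idiom (sort, uniq -c, then sort by count) can be checked on a few made-up values. In the sketch below the owner ids are invented. Sorting with -rn compares the counts as numbers, so the result does not rely on how uniq -c pads its output.

```shell
# Made-up owner_id column: owner 7 appears 10 times, owner 42 twice, owner 3 once
owner_ids=$( { for i in 1 2 3 4 5 6 7 8 9 10; do echo 7; done; echo 42; echo 42; echo 3; } )

# Count occurrences, sort numerically by count (highest first), keep the winner
top_owner=$(printf '%s\n' "$owner_ids" |
  sort | uniq -c |
  sort -rn |
  head -n 1 |
  awk '{print $2}')
echo "$top_owner"
```

Here awk prints the second field, which strips the count in the same way as the sed expression.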

Q10 (10 points) Which repositories were created between 01-01-2017 and 24-03-2017 (both inclusive)? Print the names.

In [ ]:
cut -d ',' -f 4,6 repositories.csv |
grep -E "2017-(0[1-2]-([0-3][0-9])|03-(0[1-9]|1[0-9]|2[0-4]))" |
cut -d ',' -f 1
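
An alternative to encoding the date range in a regular expression is to compare the dates as strings, which works because dates written as YYYY-MM-DD sort in chronological order. The sketch below assumes that format and uses made-up name,created_at rows. The real dataset may differ.

```shell
# Hypothetical name,created_at pairs in ISO format (YYYY-MM-DD hh:mm:ss)
rows='alpha,2016-12-31 23:59:59
beta,2017-01-01 00:00:00
gamma,2017-03-24 18:30:00
delta,2017-03-25 00:00:01'

# Compare only the date part (first 10 characters) against both bounds
in_range=$(printf '%s\n' "$rows" |
  awk -F ',' '{ d = substr($2, 1, 10) } d >= "2017-01-01" && d <= "2017-03-24" { print $1 }')
echo "$in_range"
```

Both boundary days are included, so beta and gamma are printed while alpha and delta fall outside the range.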

Q11 (10 points) Print the 10 most used programming languages, sorted by popularity.

In [ ]:
cut -d ',' -f 5 repositories.csv |
grep -v 'NULL' |
sort |
uniq -c |
sort -rn |
head -10 |
sed 's/^ *[0-9]* //'

Q12 (10 points) Print the usernames of the owners whose repositories have been deleted.

Hint: Use the url field.

In [ ]:
cut -d ',' -f 2,8 repositories.csv |
grep ',1' |
grep -o 'https://api\.github\.com/repos/[A-Za-z0-9-]*' |
sed 's|https://api\.github\.com/repos/||'