As we have seen during the lecture, the Unix command line is extremely versatile. This comes at a cost: you need to know which tool is suitable for the job, or build it yourself. For this assignment, you will have to use a Unix command line (e.g. in the course's VM, on a Mac, or in Ubuntu for Windows) to accomplish the indicated tasks. Not all the tools you will need are readily available in the default installation of Xubuntu that we are using in the course. You need to search online for (or implement!) the appropriate tools and command line options.
Note: To develop and run those commands in Jupyter, you need to install the Bash kernel. Alternatively, you can create a file where you paste all programs/pipelines, print it and upload it to the assignment directory.
There are 120 points to collect. To get a 10, you need 110.
Q1 (10 points) Write a pipeline that converts (recursively!) a directory structure full of .wav files to .mp3s.
Q2 (10 points) Write a program that, given a directory structure of text files, prints the 10 most common words in those files (across all files).
The output should look like this (the example shows only the 3 most common words):
5 most_common_word
2 less_common_word
1 least_common_word
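One possible shape for such a pipeline is the classic sort | uniq -c | sort idiom. The sketch below is only an illustration, not a reference solution: the function name top_words is made up, and it assumes that a "word" is a run of letters and that counting is case-insensitive, neither of which the assignment states.

```shell
# top_words DIR [N]: print the N (default 10) most frequent words
# found in all regular files under DIR, as "count word" lines.
top_words() {
    find "$1" -type f -exec cat {} + |
        tr -cs '[:alpha:]' '\n' |       # one word per line; anything that is not a letter separates words
        tr '[:upper:]' '[:lower:]' |    # assumption: "Apple" and "apple" are the same word
        grep -v '^$' |                  # drop the empty line a leading separator can produce
        sort | uniq -c |                # count identical words
        sort -k1,1nr -k2 |              # highest count first, ties broken alphabetically
        head -n "${2:-10}" |
        sed 's/^ *//'                   # strip the padding uniq -c puts before the count
}
```

Calling top_words some_directory 3 would print three lines in the format shown above.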
Q3 (10 points) Given this dataset (which you will also be using later on), write a program that obtains the star count, watcher count and number of forks for the first 10 repositories from the GitHub API. The output must look like the following:
url,stars,watchers,forks
https://api.github.com/repos/8d8d/Think.Admin,0,0,0
https://api.github.com/repos/971638267/RetrofitAndRxjavaforRecyclerview,12,12,3
https://api.github.com/repos/9alsacelost/greenDAO,0,0,0
https://api.github.com/repos/a1265137718/ZoomHeader,0,0,0
https://api.github.com/repos/12345678/NonExistingRepo,,,
Note: Make sure you use this path for the CSV file: repositories.csv, otherwise auto-grading will fail.
Warning: Unauthenticated requests to the GitHub API are limited to 60 per hour. Make sure you set up an OAuth key and use it for your requests. After the assignment deadline, you can remove it. If you fail to submit a solution including a key, the grading will fail.
Q4 (10 points) Write a program that downloads all JPEG pictures in a web page. JPEG pictures are identified by the extensions *.jpg and *.jpeg.
Your program must accept the URL to process as an argument. For example:
$ jpegdl https://www.nu.nl
[...]
$ ls
zi7xo0pa6h6x_std320.jpg
hsnxkcyat176_std320.jpg
[...]
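As a rough sketch of how such a program could be split up, the functions below separate finding the picture URLs from downloading them. The helper name extract_jpeg_urls is made up for illustration. The sketch assumes pictures are referenced through double-quoted src="..." attributes holding absolute URLs; relative URLs, single quotes and URLs with query strings are not handled.

```shell
# extract_jpeg_urls: read HTML on stdin and print the distinct URLs
# found in src="..." attributes that end in .jpg or .jpeg.
extract_jpeg_urls() {
    grep -oiE 'src="[^"]+\.jpe?g"' |             # one match per line, case-insensitive
        sed -E 's/^[Ss][Rr][Cc]="//; s/"$//' |   # keep only the URL itself
        sort -u
}

# jpegdl URL: download every JPEG found in the page at URL
# into the current directory (assumes curl is installed).
jpegdl() {
    curl -sL "$1" | extract_jpeg_urls | while read -r img; do
        curl -sLO "$img"
    done
}
```

Keeping the extraction step separate makes it possible to try it on a saved HTML file before anything is downloaded.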
Q5 (10 points) Most Unix systems have a dictionary file residing under /usr/share/dict/words or /usr/dict/words. Use it to implement a (rudimentary) spellchecker. Your spellchecker should read a file named foo.txt and print a list of all the words in the document that are not in the dictionary.
An example usage session can be seen below.
$ cat foo.txt
I am a nicelly formated sentence,
but I contain errors.
$ cat foo.txt | spellchecker
nicelly
formated
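One way to approach this is to split the text into one word per line and filter out every line that appears in the dictionary. The sketch below is a minimal illustration under a few assumptions: the function name spellcheck and its optional dictionary argument are made up, words are taken to be runs of letters, and matching is case-insensitive.

```shell
# spellcheck [DICT]: read text on stdin and print each word that is
# not in DICT (default /usr/share/dict/words), once, in order of first appearance.
spellcheck() {
    tr -cs '[:alpha:]' '\n' |       # one word per line
        grep -v '^$' |              # drop empty lines
        awk '!seen[$0]++' |         # report each word only once, keeping the original order
        grep -vixFf "${1:-/usr/share/dict/words}"
        # -F: dictionary lines are fixed strings, -x: match whole lines only,
        # -i: ignore case, -v: keep the lines that do NOT match
}
```

For example, cat foo.txt | spellcheck checks foo.txt against the system dictionary, and passing a file name as the first argument selects a different word list.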
Q6 (10 points) Given this repository at commit 05681455d905586f940e0e00e, find the sizes for all versions of all test files (assume that all test files are under src/test/java). The output must look like the following:
blob_id blob_path size_in_bytes
for example:
da14c3975e src/test/java/nl/tudelft/jpacman/npc/ghost/NavigationTest.java 165
1fbe0d836d src/test/java/nl/tudelft/jpacman/sprite/SpriteTest.java 108
3bde982975 src/test/java/nl/tudelft/jpacman/sprite/SpriteTest.java 108
6286792e43 src/test/java/nl/tudelft/jpacman/sprite/SpriteTest.java 110
A blob in Git represents a file version.
For the assignments in this section, we use the following dataset containing repositories.
Write pipelines to calculate answers to the following questions:
Note: Use "repositories.csv" as filename, otherwise automatic grading will fail.
Q7 (10 points) Count the number of repositories written in Java.
Q8 (10 points) How many repositories were forked and written in PHP?
Q9 (10 points) Which owner_id owns the most repositories?
Q10 (10 points) Which repositories were created between 01-01-2017 and 24-03-2017 (both inclusive)? Print the names.
Q11 (10 points) Print the 10 most used programming languages sorted by popularity.
Q12 (10 points) Print the usernames of the owners whose repositories have been deleted.
Hint: Use the url field.
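Several of the dataset questions above come down to selecting one CSV column and counting its values. The following sketch shows that general pattern only. The helper names col_index and count_by are made up, the column name language in the usage note is a guess at the layout of repositories.csv, and the approach assumes that no field contains a quoted comma.

```shell
# col_index FILE NAME: print the 1-based position of column NAME
# in the header row of FILE.
col_index() {
    head -n 1 "$1" | tr ',' '\n' | grep -nx "$2" | cut -d: -f1
}

# count_by FILE NAME: print "count value" for every distinct value
# of column NAME, most frequent first.
count_by() {
    tail -n +2 "$1" |                          # skip the header row
        cut -d, -f"$(col_index "$1" "$2")" |   # keep only the requested column
        sort | uniq -c |
        sort -k1,1nr -k2 |                     # highest count first, ties broken by value
        sed 's/^ *//'                          # strip the padding uniq -c puts before the count
}
```

With such helpers, count_by repositories.csv language | head -n 10 would list the ten most frequent values, provided the file really has a column with that name. Check the header row of the actual file first.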