{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment: Unix (minors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have seen during the lecture, the Unix command line is extremely\n", "versatile. This comes at a cost: you need to know which tool is suitable for the\n", "job, or build one yourself. For this assignment, you will have to use a Unix command line \n", "(e.g. in the course's VM, on a Mac, or in Ubuntu for Windows) \n", "to accomplish the indicated tasks. Not all tools you\n", "will need are readily available in the default installation of Xubuntu that we \n", "are using in the course. You need to search online for (or implement!) \n", "the appropriate tools and command line options.\n", "\n", "_Note_: To develop and run those commands in Jupyter, you need to install the\n", "[Bash kernel](https://github.com/takluyver/bash_kernel). Alternatively, you\n", "can create a file where you paste all programs/pipelines, print it and upload\n", "it to the assignment directory.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shell programming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Write a pipeline that converts (recursively!) a directory\n", "structure full of `.wav` files to `.mp3`s." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Write a program that, given a directory of text files,\n", "prints the 10 most common words in those files (across all files)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Write a program that will print all files (recursively!) 
\n", "that were not accessed in the last 30 days.\n", "\n", "_Hint_: Use `date` to format date strings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Write a program that checks whether all the\n", "links in [this web page](http://ghtorrent.org/downloads.html) work and reports\n", "the ones that do not. You can check whether a link works by inspecting the HTTP\n", "response headers for 404 errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Write a program that will create a `tar.gz` archive out\n", "of a directory of source code, omitting all files that are binary (i.e. non-text).\n", "\n", "_Hint_: Use `file --mime`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Implement a case-insensitive spell checker. 
Given an input file,\n", "it should report all words not in the dictionary.\n", "\n", "_Hint_: You can use [this](https://raw.githubusercontent.com/dwyl/english-words/master/words.txt) dictionary file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the assignments in this section, we use the \n", "same [log file](https://drive.google.com/file/d/0B9Rx0uhucsroYWJxdEpPd2JYcjg/view?usp=sharing)\n", "and [interesting.csv](https://drive.google.com/open?id=0B9Rx0uhucsroRHNVTFpzMV9OUGs)\n", "files that we used in the Spark assignment.\n", "\n", "Write pipelines to calculate answers to the following questions (they may look familiar :-)):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Count the number of WARNing messages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** How many repositories were processed in total? Use the `api_client`\n", "lines only." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Which client made the most HTTP requests?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** What is the most active repository?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Which access keys are failing most often?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**T (10 points)** Which of the _interesting_ repositories has the most failed\n", "API calls?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 2 }