{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Graphs\n",
"\n",
"Graphs are one of the most common data structures in computer science;\n",
"graph-based modeling of problems is at the heart of many systems we use\n",
"every day, such as routing in Google Maps or recommending friends on Facebook.\n",
"\n",
"[Social Network Analysis](https://en.wikipedia.org/wiki/Social_network_analysis)\n",
"(SNA) is a branch of sociology that explores social structures through the use\n",
"of analytical tools, such as graphs. In this assignment, you will implement\n",
"basic social network analysis algorithms on a graph extracted from GitHub.\n",
"\n",
"You can find the graph data [at this link](https://drive.google.com/file/d/0B9Rx0uhucsroT1dzYjlfVU9nWUU/view?usp=sharing). The data looks like\n",
"this (CSV format):\n",
"\n",
"```\n",
"follower,followed\n",
"1570,9236\n",
"9236,1570\n",
"13256,9236\n",
"9236,13256\n",
"13256,1570\n",
"1570,13256\n",
"```\n",
"\n",
"If we take the first line, it means that user `1570` follows user `9236`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the graph\n",
"\n",
"**T (10 points):** Write a function that takes as input a file name and\n",
"loads the data into an adjacency list representation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def load(graph_file='github.csv'):\n",
" \"\"\"\n",
" Loads the data from the file in the provided argument into an in-memory\n",
" graph (as an adjacency list)\n",
" \"\"\"\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Test your implementation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From now on, you must use the graph returned by `load` in all the assignments\n",
"below.\n",
"\n",
"## Basic graph metrics\n",
"\n",
"**T (10 points):** Who are the 10 most connected users?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def in_degree(graph, node_id):\n",
" pass\n",
"\n",
"def most_connected(graph, n = 10):\n",
" \"\"\"Returns the ids of the top-n most connected users\"\"\"\n",
" pass\n",
" \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Hint_: it helps if you first define a method called `in_degree` that calculates\n",
"the number of incoming edges in to a node.\n",
"\n",
"**T (10 points):** What is the mean and what is the median number of connections?"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing shortest paths\n",
"\n",
"Shortest paths are the basis for many network measures. You will need to implement Dijkstra algorithm.\n",
"\n",
"**T (20 points):** Write a function that computes the shortest paths\n",
"between all node pairs in the graph. \n",
"\n",
"_Hint_: Choose the appropriate shortest path algorithm for undirected graph\n",
"with no edge weights. A pair of nodes $(n_1, n_2)$ is, for our purposes,\n",
"equivalent to the pair $(n_2, n_1)$.\n",
"\n",
"_Hint_: How to find all unique node pairs? Given that you create a non-duplicate\n",
"list of all your nodes, you can use Python's `itertools.combinations` function \n",
"like so:\n",
"```{python}\n",
"from itertools import combinations\n",
"a = [1,2,3,4,5]\n",
"pairs = list(combinations(a, 2))\n",
"print pairs\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def shortest_path(graph, source):\n",
" pass\n",
"\n",
"def shortest_paths(graph):\n",
" \"\"\"\n",
" Computes the shortest paths between any pair of nodes in the graph\n",
" \n",
" @return A dictionary whose keys are node pairs and values are sequences\n",
" indicating the shortest path between the node pair.\n",
" \"\"\"\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the first 10 paths here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ranking important users\n",
"\n",
"One of the primary uses of SNA is to identify important/influencial nodes.\n",
"A typical metric we use to quantify the importance of a node is centrality.\n",
"Several [centrality measures](https://en.wikipedia.org/wiki/Centrality) \n",
"exist; for our purposes it is enough to calculate the **Betweeness Centrality**\n",
"of each node. The pseudocode to calculate it is given below.\n",
"\n",
"To compute the betweenness of a node $n$\n",
"\n",
"1. For each pair of nodes $(v1, v2)$, compute the shortest paths between them\n",
"2. For each pair of nodes $(v1, v2)$ determine the fraction of shortest paths\n",
"that include $n$\n",
"3. Sum this fraction over all pairs of vertices $(v1, v2)$\n",
"\n",
"**T (30 points):** Write a function that computes the Betweenness centrality for\n",
"all nodes in the provided network"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def betweenness(graph):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**T (10 points):** Use the function above to rank the nodes (users) in\n",
"terms of importance and print 10 most important users with their importance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}