Assignment on Spark¶

Gosia Migut and Georgios Gousios¶

In this assignment, we will use Spark to have a look at the Movie Lens dataset containing user generated ratings for movies. The dataset comes in 3 files:

ratings.dat contains the ratings in the following format: UserID::MovieID::Rating::Timestamp
users.dat contains demographic information about the users: UserID::Gender::Age::Occupation::Zip-code
movies.dat contains meta information about the movies: MovieID::Title::Genres

Refer to the README for the detailed description of the data.

Note: when using the files use the filepath data/[file].dat, otherwise automatic grading will fail.

Grade: This assignment consists of 105 points. You need to collect them all to get a 10! All cells that are graded include the expected answer. Your task is to write the code that comes up with the expected result. The automated grading process will be run on a different dataset.

Loading and parsing the file¶

Q1 (5 points): Download the ratings file, parse it and load it in an RDD named ratings.

def parse_file(element):
    return element.split('::', 4)

# load data to RDD and use parse_ratings function to parse it.
ratings = sc.textFile("data/ratings.dat")
ratings = ratings.map(parse_file).cache()
ratings

PythonRDD[2] at RDD at PythonRDD.scala:48

Q2 (5 points): How many lines does the ratings RDD contain?

ratings.count()

1000209

Basic filtering and counting¶

Q3 (5 points): Count how many times the rating '1' has been given.

ratings.filter(lambda x: x[2] == '1').count()

56174

Q4 (5 points): Count how many unique movies have been rated.

ratings.groupBy(lambda x: x[1]).count()

3706

Q5 (5 points): Which user gave most ratings? Return the userID and number of ratings.

ratings.groupBy(lambda x: x[0]).map(lambda x: (x[0], x[1].__len__())).max(key=lambda x: x[1])

(u'4169', 2314)

Q6 (5 points): Which user gave most '5' ratings? Return the userID and number of ratings.

ratings.filter(lambda x: x[2] == '5')\
.groupBy(lambda x: x[0])\
.map(lambda x: (x[0], x[1].__len__()))\
.max(key=lambda x: x[1])

(u'4277', 571)

Q7 (5 points): Which movie was rated most times? Return the movieID and number of ratings.

ratings.groupBy(lambda x: x[1]).map(lambda x: (x[0], x[1].__len__())).max(key=lambda x: x[1])

(u'2858', 3428)

Joining¶

Now we will look at two additional files from the Movie Lens dataset.

Q8 (5 points): Read the movies and users files into RDDs. How many records are there in each RDD?

#load movies dataset to RDD, parse and cache it.
movies = sc.textFile("data/movies.dat")
movies = movies.map(parse_file).cache()

#how many records are in movies RDD's?
movies.count()

3883

#load users dataset to RDD, parse and cache it.
users = sc.textFile("data/users.dat")
users = users.map(parse_file).cache()

#how many records are in users RDD's?
users.count()

6040

As you probably have noticed there are more movies in the movies dataset than rated movies.

Q9 (5 points): How many of the movies are a comedy?

movies.filter(lambda x: x[2].count('Comedy')!=0).count()

1200

Q10 (10 points): Which comedy has the most ratings? Return the title and the number of rankings. Answer this question by joining two datasets.

#group ratings by unique movie
#join with the filtered comedies
#count ratings per comedy
ratings.groupBy(lambda x: x[1]).keyBy(lambda x: x[0]) \
.join(movies.filter(lambda x: x[2].count('Comedy')!=0).keyBy(lambda x: x[0])) \
.map(lambda (key, (k,v)): (key, v[1], k[1].__len__())) \
.max(key=lambda x: x[2])

(u'2858', u'American Beauty (1999)', 3428)

Q11 (10 points): For users under 18 years old (category 1), what is the frequency of each star rating? Return a list/array with the rating and the number of times it appears, e.g. Array((4,16), (1,3), (3,9), (5,62), (2,2))

users.filter(lambda x: x[2]=='1')\
.keyBy(lambda x: x[0])\
.join(ratings.keyBy(lambda x: x[0]))\
.map(lambda (key, (k, v)): v)\
.keyBy(lambda x: x[2])\
.map(lambda x: (x[0], 1))\
.reduceByKey(lambda a, b: a+b)\
.collect()

[(u'1', 2238), (u'5', 6802), (u'4', 8808), (u'3', 6380), (u'2', 2983)]

Indexing¶

As you have noticed, typical operations on RDDs require grouping on a specific part of each record and then calculating specific counts given the groups. While this operation can be achieved with the groupBy family of functions, it is often useful to create a structure called an inverted index. An inverted index creates an 1..n mapping from the record part to all occurencies of the record in the dataset. For example, if the dataset looks like the following:

col1,col2,col3
A,1,foo
B,1,bar
C,2,foo
D,3,baz
E,1,foobar

an inverted index on col2 would look like

1 -> [(A,1,foo), (B,1,bar), (E,1,foobar)]
2 -> [(C,2,foo)]
3 -> [(D,3,baz)]

Inverted indexes enable us to quickly access precalculated partitions of the dataset. Let's compute an inverted index on the rating field of `ratings-student.dat.

Q12 (5 points): Compute the number of unique users that rated the movies with movie_IDs 2858, 356 and 2329.

ratings.filter(lambda x: x[1] in ["2858", "356", "2329"])\
.groupBy(lambda x: x[0]).count()

4213

Measure the time (in seconds) it takes to make this computation.

import time
start_time_1 = time.time()
ratings.filter(lambda x: x[1] in ["2858", "356", "2329"]).groupBy(lambda x: x[0]).count()
print(time.time() - start_time_1)

0.906870126724

Q13 (5 points): Create an inverted index on ratings, field movie_ID. Print the first item.

idx = ratings.groupBy(lambda x: x[1])
list(list(idx.lookup('1'))[0])

[[u'1', u'1', u'5', u'978824268'], [u'6', u'1', u'4', u'978237008'], [u'8', u'1', u'4', u'978233496'], [u'9', u'1', u'5', u'978225952'], [u'10', u'1', u'5', u'978226474'], [u'18', u'1', u'4', u'978154768'], [u'19', u'1', u'5', u'978555994'], [u'21', u'1', u'3', u'978139347'], [u'23', u'1', u'4', u'978463614'], [u'26', u'1', u'3', u'978130703'], [u'28', u'1', u'3', u'978985309'], [u'34', u'1', u'5', u'978102970'], [u'36', u'1', u'5', u'978061285'], [u'38', u'1', u'5', u'978046225'], [u'44', u'1', u'5', u'978019369'], [u'45', u'1', u'4', u'977990044'], [u'48', u'1', u'4', u'977975909'], [u'49', u'1', u'5', u'977972501'], [u'51', u'1', u'5', u'977947828'], [u'56', u'1', u'5', u'977938855'], [u'60', u'1', u'4', u'977931983'], [u'65', u'1', u'5', u'991368774'], [u'68', u'1', u'3', u'991376026'], [u'73', u'1', u'3', u'977867812'], [u'75', u'1', u'5', u'977851099'], [u'76', u'1', u'5', u'977847069'], [u'78', u'1', u'4', u'978570648'], [u'80', u'1', u'3', u'977786904'], [u'90', u'1', u'3', u'993872933'], [u'92', u'1', u'4', u'977646817'], [u'96', u'1', u'4', u'980563195'], [u'99', u'1', u'3', u'982873678'], [u'109', u'1', u'4', u'977519208'], [u'112', u'1', u'5', u'977508486'], [u'114', u'1', u'3', u'977506130'], [u'117', u'1', u'3', u'977498304'], [u'118', u'1', u'3', u'977501973'], [u'119', u'1', u'5', u'977513481'], [u'121', u'1', u'5', u'977459322'], [u'123', u'1', u'3', u'978046755'], [u'131', u'1', u'3', u'977453922'], [u'132', u'1', u'5', u'977427737'], [u'134', u'1', u'4', u'977845495'], [u'136', u'1', u'2', u'977530792'], [u'139', u'1', u'4', u'977360234'], [u'142', u'1', u'4', u'977352868'], [u'146', u'1', u'5', u'977348756'], [u'147', u'1', u'4', u'977336480'], [u'148', u'1', u'5', u'977335193'], [u'149', u'1', u'3', u'978190837'], [u'150', u'1', u'4', u'978161692'], [u'151', u'1', u'2', u'992133251'], [u'152', u'1', u'4', u'977268921'], [u'156', u'1', u'5', u'977249672'], [u'157', u'1', u'5', u'977249416'], [u'162', u'1', u'5', u'982104496'], [u'163', u'1', u'4', u'977213935'], [u'168', u'1', u'4', u'977177003'], [u'169', u'1', u'5', u'977199102'], [u'173', u'1', u'4', u'995304562'], [u'175', u'1', u'4', u'977113154'], [u'181', u'1', u'2', u'977088743'], [u'182', u'1', u'5', u'977085326'], [u'184', u'1', u'5', u'1038970374'], [u'186', u'1', u'5', u'977059438'], [u'187', u'1', u'4', u'977232655'], [u'190', u'1', u'3', u'977027054'], [u'193', u'1', u'4', u'1025569964'], [u'194', u'1', u'4', u'977014303'], [u'195', u'1', u'5', u'986048425'], [u'198', u'1', u'5', u'976980054'], [u'202', u'1', u'4', u'976939841'], [u'204', u'1', u'4', u'978050004'], [u'213', u'1', u'5', u'976900243'], [u'214', u'1', u'4', u'976900595'], [u'215', u'1', u'4', u'979174987'], [u'220', u'1', u'5', u'976836351'], [u'223', u'1', u'5', u'976832460'], [u'224', u'1', u'3', u'976831787'], [u'225', u'1', u'4', u'977257251'], [u'230', u'1', u'3', u'976820070'], [u'231', u'1', u'5', u'1040445432'], [u'232', u'1', u'4', u'976812596'], [u'236', u'1', u'4', u'976765293'], [u'237', u'1', u'4', u'976761905'], [u'239', u'1', u'5', u'993359328'], [u'243', u'1', u'4', u'976749399'], [u'246', u'1', u'5', u'976728579'], [u'255', u'1', u'4', u'976683184'], [u'258', u'1', u'5', u'976671645'], [u'263', u'1', u'4', u'976651989'], [u'264', u'1', u'4', u'976714119'], [u'271', u'1', u'4', u'976636109'], [u'272', u'1', u'5', u'976695356'], [u'273', u'1', u'5', u'976774952'], [u'284', u'1', u'5', u'976570517'], [u'293', u'1', u'2', u'976561599'], [u'294', u'1', u'3', u'976542545'], [u'299', u'1', u'3', u'976659877'], [u'300', u'1', u'5', u'976506060'], [u'301', u'1', u'5', u'976505707'], [u'302', u'1', u'4', u'976505024'], [u'306', u'1', u'3', u'977365910'], [u'307', u'1', u'5', u'976486678'], [u'308', u'1', u'4', u'995833809'], [u'310', u'1', u'4', u'976482056'], [u'314', u'1', u'4', u'976751685'], [u'321', u'1', u'2', u'976417163'], [u'325', u'1', u'5', u'976402092'], [u'326', u'1', u'3', u'977605721'], [u'329', u'1', u'3', u'976393427'], [u'333', u'1', u'5', u'996356395'], [u'337', u'1', u'4', u'976349046'], [u'338', u'1', u'4', u'976347032'], [u'340', u'1', u'4', u'976342108'], [u'343', u'1', u'5', u'1003251499'], [u'346', u'1', u'5', u'976334600'], [u'350', u'1', u'5', u'976326454'], [u'351', u'1', u'4', u'976677602'], [u'352', u'1', u'4', u'976330185'], [u'355', u'1', u'4', u'976321278'], [u'366', u'1', u'3', u'976311000'], [u'368', u'1', u'3', u'976311812'], [u'369', u'1', u'5', u'976309628'], [u'372', u'1', u'4', u'976548746'], [u'376', u'1', u'4', u'980620359'], [u'378', u'1', u'4', u'976302158'], [u'380', u'1', u'4', u'976316203'], [u'385', u'1', u'5', u'976300183'], [u'386', u'1', u'2', u'976300599'], [u'389', u'1', u'5', u'976298977'], [u'392', u'1', u'4', u'976297032'], [u'395', u'1', u'3', u'976299298'], [u'402', u'1', u'4', u'976295369'], [u'403', u'1', u'4', u'976304372'], [u'411', u'1', u'5', u'997066018'], [u'412', u'1', u'4', u'976294055'], [u'413', u'1', u'5', u'976287901'], [u'417', u'1', u'4', u'976285594'], [u'418', u'1', u'4', u'976285897'], [u'420', u'1', u'4', u'983206889'], [u'424', u'1', u'4', u'976306643'], [u'425', u'1', u'3', u'976282933'], [u'428', u'1', u'4', u'976260419'], [u'429', u'1', u'5', u'976258377'], [u'434', u'1', u'5', u'976248571'], [u'437', u'1', u'4', u'976277994'], [u'438', u'1', u'5', u'976245352'], [u'444', u'1', u'4', u'976240730'], [u'451', u'1', u'5', u'976415081'], [u'453', u'1', u'3', u'976232941'], [u'456', u'1', u'5', u'976299858'], [u'462', u'1', u'3', u'976230643'], [u'463', u'1', u'5', u'976227510'], [u'467', u'1', u'4', u'976226815'], [u'474', u'1', u'5', u'976237654'], [u'478', u'1', u'5', u'976941051'], [u'479', u'1', u'5', u'976218440'], [u'482', u'1', u'2', u'1002083483'], [u'490', u'1', u'4', u'976215478'], [u'491', u'1', u'2', u'976214774'], [u'495', u'1', u'3', u'976215194'], [u'496', u'1', u'5', u'976505066'], [u'499', u'1', u'3', u'976215021'], [u'509', u'1', u'4', u'976206020'], [u'516', u'1', u'3', u'976279416'], [u'519', u'1', u'4', u'979336270'], [u'523', u'1', u'5', u'976193525'], [u'524', u'1', u'5', u'976167828'], [u'528', u'1', u'5', u'976245400'], [u'529', u'1', u'5', u'977449223'], [u'531', u'1', u'5', u'978643765'], [u'533', u'1', u'4', u'976295615'], [u'536', u'1', u'5', u'976138606'], [u'539', u'1', u'4', u'976140441'], [u'541', u'1', u'5', u'976124683'], [u'543', u'1', u'5', u'976156300'], [u'546', u'1', u'4', u'976068943'], [u'549', u'1', u'5', u'976070010'], [u'550', u'1', u'5', u'976066733'], [u'555', u'1', u'5', u'976055820'], [u'556', u'1', u'4', u'976055729'], [u'563', u'1', u'5', u'976044797'], [u'566', u'1', u'3', u'976039064'], [u'574', u'1', u'4', u'975991241'], [u'575', u'1', u'4', u'975989029'], [u'576', u'1', u'4', u'975981105'], [u'577', u'1', u'5', u'975979958'], [u'579', u'1', u'4', u'975977259'], [u'583', u'1', u'5', u'975957743'], [u'585', u'1', u'4', u'975949065'], [u'588', u'1', u'3', u'985731996'], [u'590', u'1', u'5', u'975913115'], [u'595', u'1', u'4', u'975896378'], [u'601', u'1', u'3', u'975880589'], [u'605', u'1', u'5', u'975869440'], [u'606', u'1', u'5', u'975869266'], [u'610', u'1', u'5', u'975861277'], [u'611', u'1', u'4', u'977155870'], [u'613', u'1', u'5', u'975812101'], [u'615', u'1', u'4', u'975806717'], [u'620', u'1', u'4', u'975800651'], [u'623', u'1', u'5', u'975793228'], [u'624', u'1', u'3', u'975787681'], [u'626', u'1', u'5', u'975819155'], [u'628', u'1', u'4', u'975781783'], [u'629', u'1', u'4', u'975784232'], [u'630', u'1', u'4', u'976207250'], [u'631', u'1', u'4', u'975778929'], [u'634', u'1', u'5', u'975774591'], [u'635', u'1', u'5', u'979142907'], [u'637', u'1', u'5', u'975739154'], [u'641', u'1', u'5', u'1019443119'], [u'645', u'1', u'5', u'975715505'], [u'646', u'1', u'5', u'975782835'], [u'653', u'1', u'3', u'975700204'], [u'656', u'1', u'4', u'975740103'], [u'660', u'1', u'3', u'975690271'], [u'662', u'1', u'3', u'975661584'], [u'664', u'1', u'4', u'975992648'], [u'669', u'1', u'5', u'975631430'], [u'676', u'1', u'5', u'975649112'], [u'678', u'1', u'5', u'975608685'], [u'681', u'1', u'4', u'975603446'], [u'685', u'1', u'3', u'975602509'], [u'690', u'1', u'5', u'975584052'], [u'691', u'1', u'4', u'1010907849'], [u'692', u'1', u'5', u'975779953'], [u'696', u'1', u'4', u'975563075'], [u'697', u'1', u'4', u'975560617'], [u'698', u'1', u'3', u'1015123819'], [u'699', u'1', u'4', u'975557923'], [u'701', u'1', u'3', u'979094230'], [u'702', u'1', u'4', u'975551677'], [u'705', u'1', u'4', u'975547527'], [u'707', u'1', u'5', u'975542819'], [u'710', u'1', u'5', u'975540360'], [u'712', u'1', u'4', u'975539400'], [u'714', u'1', u'5', u'975535074'], [u'715', u'1', u'4', u'975600607'], [u'716', u'1', u'5', u'976500370'], [u'718', u'1', u'5', u'975528326'], [u'719', u'1', u'4', u'975528288'], [u'721', u'1', u'3', u'975695147'], [u'722', u'1', u'5', u'975524460'], [u'726', u'1', u'5', u'975519140'], [u'729', u'1', u'5', u'975512206'], [u'731', u'1', u'5', u'975528946'], [u'733', u'1', u'5', u'975481380'], [u'735', u'1', u'4', u'975646419'], [u'736', u'1', u'3', u'1045710680'], [u'737', u'1', u'3', u'976058569'], [u'739', u'1', u'5', u'975475262'], [u'743', u'1', u'3', u'975625123'], [u'746', u'1', u'4', u'975469261'], [u'749', u'1', u'5', u'975477144'], [u'752', u'1', u'5', u'975610699'], [u'753', u'1', u'5', u'975458133'], [u'754', u'1', u'4', u'975538937'], [u'755', u'1', u'4', u'997812961'], [u'757', u'1', u'3', u'975454506'], [u'761', u'1', u'4', u'975451381'], [u'765', u'1', u'5', u'975451744'], [u'770', u'1', u'4', u'975443050'], [u'774', u'1', u'3', u'975438800'], [u'776', u'1', u'5', u'975605122'], [u'777', u'1', u'3', u'975438109'], [u'780', u'1', u'4', u'975435937'], [u'781', u'1', u'3', u'975434906'], [u'783', u'1', u'4', u'975701689'], [u'788', u'1', u'4', u'975428905'], [u'791', u'1', u'5', u'981400121'], [u'795', u'1', u'4', u'975417180'], [u'798', u'1', u'3', u'975409257'], [u'801', u'1', u'5', u'975401059'], [u'802', u'1', u'5', u'975406133'], [u'808', u'1', u'4', u'975431123'], [u'809', u'1', u'3', u'975395592']

Q14 (5 points): Compute the number of unique users that rated the movies with movie_IDs 2858, 356 and 2329 using the index

idx.filter(lambda x: x[0] in ["2858", "356", "2329"])\
.map(lambda x: x[1])\
.flatMap(lambda x: list(x))\
.groupBy(lambda x: x[0]).count()

4213

Measure the time (in seconds) it takes to compute the same result using the index.

# Measure the time of this computation.
start_time_1 = time.time()
idx.filter(lambda x: x[0] in ["2858", "356", "2329"])\
.map(lambda x: x[1])\
.flatMap(lambda x: list(x))\
.groupBy(lambda x: x[0]).count()
print(time.time() - start_time_1)

1.16708803177

You should have noticed difference in performance. Is the indexed version faster? If yes, why? If not, why not? Discuss this with your partner.

Dataframes¶

Q15 (5 points): Create a data frame from the ratings RDD and count the number of lines in it. Also register the data frame as an SQL table

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = ratings.toDF()
sqlContext.registerDataFrameAsTable(df, "table")
df.count()

1000209

Q16 (5 points): Provide the statistical summary of the column containing ratings (use Spark function that returns a table with count, mean, stddev, min, max).

Hint: To select the correct column you might first want to print the datatypes and names of each of the columns.

df.describe("_3").show()

+-------+------------------+
|summary|                _3|
+-------+------------------+
|  count|           1000209|
|   mean| 3.581564453029317|
| stddev|1.1171018453732544|
|    min|                 1|
|    max|                 5|
+-------+------------------+

Q17 (5 points): Count how many times the rating '1' has been given, by filtering it from the ratings DataFrame. Measure the execution time and compare with the execution time of the same query using RDD. Think for yourself when it would be usefull to use DataFrames and when not.

# Count number of ratings "1"
df.filter(df._3=='1').count()

56174

Q18 (5 points): Count how many times the rating '1' has been given, using an SQL query.

sqlContext.sql("SELECT count(*) FROM table WHERE _3 ==1").show()

+--------+
|count(1)|
+--------+
|   56174|
+--------+

Q19 (5 points): Which user gave most '5' ratings? Return the userID and number of ratings, using an SQL query.

sqlContext.sql("select _1, count(*) as num_ratings from table where _3 = 5 group by _1 order by num_ratings desc").show(1)

+----+-----------+
|  _1|num_ratings|
+----+-----------+
|4277|        571|
+----+-----------+
only showing top 1 row