Netflix prize was an open competition for the best collaborative filtering algorithm, which started in 2006. BellKor’s Pragmatic Chaos team from AT&T Labs won the prize back in 2009. This Spark application will use Spark’s 2.4 built-in ALS algorithm to create a recommendation model for the data set from the competition.
Preprocessing1.scala will create a four-column dataframe (movieId, userId, rating, date) from the original data and write it to the intermediate folder. Preprocessing2.scala will use data from the intermediate step and write it in a snappy compressed parquet format. Exploring.scala prints some sample data and general info about used data. ProbeParser.scala and QualifyingParser.scala are scripts for transforming original logs to Spark-friendly csv tabular structure. Training.scala will fit ALS model to the training set using k-fold validation and a hyper-parameter grid with 64 different values. Predicting.scala loads trained ALS model, calculates predicted values on the data from the original probe.txt and evaluates model’s RMSE.
Trained ALS model is located under /src/main/resources/model. It’s RMSE is 0.8904. I withheld preprocessed data from the intermediate steps in accordance with Netflix’s official instruction not to redistribute the data. Training data set has slightly over 100 million instances. Probe data set has about 1.4 million records, while qualifying data set has about 2.8 million records. Uncompressed training data set is 2.6 GB.
Next steps would be to use model on qualifying data set, to enrich data with additional features, like genre, or to use ensemble methods.
Netflix prize - Wiki Netflix Prize - Slides
Dataset Netflix does not provide access to the original data set, probably due to the legal issues. Nonetheless, it can be downloaded from the archived UCI ML repository or Academic torrents. Notice that there are much more votes in the last four years, and that the ratings are slightly skewed to the right.
Netflix Prize Data Set - UCI Netflix Prize Data Set - AA