A quick visual guide to recommender systems (user-based, item-based, and matrix factorization) and the code behind building an Apache Spark MatrixFactorizationModel with ALS.
22. Data Preparation
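# Split the ratings roughly 70/30 into training and test sets (7856 is the random seed)
train, test = ratings.randomSplit([0.7, 0.3], 7856)
train.count()  # 70,005
test.count()   # 29,995
# Cache both RDDs because they are reused in the following steps
train.cache()
test.cache()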
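The counts above are close to, but not exactly, 70,000 and 30,000 because the split is random per row rather than a fixed cut. A minimal plain-Python sketch of that idea follows. The helper `random_split` and the stand-in `rows` list are hypothetical, and this is not Spark's sampler, so its counts will not match the ones on the slide.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to that split's weight (hypothetical helper, not Spark's code)."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, bound in enumerate(bounds):
            # The last split catches any floating-point leftover
            if x < bound or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(100000))  # stand-in for 100,000 rating rows
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=7856)
print(len(train_rows), len(test_rows))
```

Reusing the same seed reproduces the same split, which is why the slide passes one explicitly.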
23. Modeling
from pyspark.mllib.recommendation import ALS
rank = 5            # Number of latent factors to learn
numIterations = 10  # Number of ALS iterations to run
# Create the model on the training data
model = ALS.train(train, rank, numIterations)
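To make "latent factors" and "times to repeat process" concrete, here is a small plain-Python sketch of alternating least squares. It is a simplified illustration, not Spark's implementation: the function names, the regularization value `lam`, and the toy ratings are all assumptions made for this example.

```python
import random
from math import sqrt

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def update(ratings_by_key, fixed, rank, lam):
    """Ridge-regression update of one side while the other side is held fixed."""
    out = {}
    for key, rated in ratings_by_key.items():
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in rated:
            f = fixed[other]
            for i in range(rank):
                b[i] += r * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        out[key] = solve(a, b)
    return out

def als(ratings, rank=2, iterations=10, lam=0.1, seed=0):
    """Alternate between solving for user factors and item factors."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update(by_user, items, rank, lam)
        items = update(by_item, users, rank, lam)
    return users, items

def rmse(ratings, users, items):
    """Root mean squared error of the dot-product predictions."""
    errs = [(r - sum(x * y for x, y in zip(users[u], items[i]))) ** 2
            for u, i, r in ratings]
    return sqrt(sum(errs) / len(errs))

# Toy (user, item, rating) triples, invented for illustration
toy = [(1, 10, 5.0), (1, 20, 1.0), (2, 10, 4.0),
       (2, 30, 2.0), (3, 20, 1.0), (3, 30, 5.0)]
users, items = als(toy, rank=2, iterations=10)
print(rmse(toy, users, items))
```

Each pass fixes one side and solves a small least-squares problem for every row of the other, so the regularized squared error never increases from one iteration to the next.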
25. Modeling / Evaluation
# For Product X, Find N Users to Sell To
model.recommendUsers(242,100)
# For User Y Find N Products to Promote
model.recommendProducts(196,10)
#Predict Single Product for Single User
model.predict(196, 242)
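In a matrix factorization model, a predicted rating is the dot product of a user's factor vector and a product's factor vector, and a top-N recommendation is a ranking of those scores. A minimal sketch in plain Python, with made-up rank-2 factors (the dictionaries and helper names below are hypothetical, not values from the trained model):

```python
def predict(user_features, product_features, user, product):
    """Predicted rating = dot product of the two latent-factor vectors."""
    return sum(u * p for u, p in zip(user_features[user], product_features[product]))

def recommend_products(user_features, product_features, user, n):
    """Score every product for one user and keep the n highest."""
    scored = [(predict(user_features, product_features, user, product), product)
              for product in product_features]
    scored.sort(reverse=True)
    return [(product, score) for score, product in scored[:n]]

# Invented factors for illustration only
user_features = {196: [1.0, 0.5]}
product_features = {242: [3.0, 1.0], 302: [2.0, 2.0], 51: [1.0, 0.0]}

print(predict(user_features, product_features, 196, 242))        # 3.5
print(recommend_products(user_features, product_features, 196, 2))
# [(242, 3.5), (302, 3.0)]
```

Recommending users for a product works the same way with the roles of the two dictionaries swapped.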
26. Modeling / Evaluation
# Predict Multi Users and Multi Products
# Pre-Processing
pred_input = train.map(lambda x:(x[0],x[1]))
# Lots of Predictions
pred = model.predictAll(pred_input)
# Each input element is a (user, product) pair, e.g. (196, 242)
# Returns Rating(user, product, rating) objects, e.g.
# Rating(user=894, product=1560, rating=3.845)
27. Evaluation
User  Item  Actual  Pred
196   242   3.0     3.91
186   302   3.0     3.29
22    377   1.0     1.09
244   51    2.0     3.66
298   474   4.0     4.11
Training RMSE: 0.763
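As a worked example of the metric, the sketch below computes RMSE over just the five sample rows shown in the table. Because it uses only these five rows, the result differs from the 0.763 reported for the full training set. The names `sample_rows` and `rmse` are introduced here for illustration.

```python
from math import sqrt

# (user, item, actual, predicted) rows copied from the table
sample_rows = [
    (196, 242, 3.0, 3.91),
    (186, 302, 3.0, 3.29),
    (22, 377, 1.0, 1.09),
    (244, 51, 2.0, 3.66),
    (298, 474, 4.0, 4.11),
]

def rmse(rows):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(actual - pred) ** 2 for _, _, actual, pred in rows]
    return sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = rmse(sample_rows)
print(round(sample_rmse, 3))  # 0.859 for these five rows only
```

The single large miss (user 244, item 51) dominates the result, since squaring penalizes big errors more heavily than small ones.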
28. Evaluation
# Organize the data to make (user, product) the key
true_reorg = train.map(lambda x:((x[0],x[1]), x[2]))
pred_reorg = pred.map(lambda x:((x[0],x[1]), x[2]))
#Do the actual join
true_pred = true_reorg.join(pred_reorg)
from math import sqrt
MSE = true_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
RMSE = sqrt(MSE)
#Results in 0.7629908117414474
# Example element of true_pred: ((582, 1014), (4.0, 3.397))
# Example element of true_reorg: ((196, 242), 3.0)
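The join step pairs each actual rating with its prediction by matching on the (user, product) key. A plain-Python sketch of that inner-join behavior is below, using ordinary lists in place of RDDs. The helper `inner_join` is hypothetical and assumes each key appears at most once per side. The last row on each side is deliberately unmatched to show what gets dropped.

```python
def inner_join(left, right):
    """Keep only keys present on both sides, pairing their values
    as (key, (left_value, right_value)). Assumes unique keys per side."""
    right_by_key = dict(right)
    return [(key, (value, right_by_key[key]))
            for key, value in left if key in right_by_key]

# ((user, product), rating) pairs; values taken from the table of sample rows
true_reorg = [((196, 242), 3.0), ((186, 302), 3.0), ((22, 377), 1.0)]
pred_reorg = [((196, 242), 3.91), ((186, 302), 3.29), ((244, 51), 3.66)]

true_pred = inner_join(true_reorg, pred_reorg)
print(true_pred)
# [((196, 242), (3.0, 3.91)), ((186, 302), (3.0, 3.29))]
```

Each joined element has the shape ((user, product), (actual, predicted)), which is why the MSE lambda reads `r[1][0]` and `r[1][1]`.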
32. RECAP
rank = 5; numIterations = 10;
#Create the model on the training data
model = ALS.train(train, rank, numIterations)
# Lots of Predictions
pred = model.predictAll(pred_input)
#Examine Model Features
model.productFeatures()
# Save your model!
model.save(sc,"../out/ml-model")