© 2014 MapR Technologies
Parallel and Iterative Processing for Machine Learning Recommendations with Spark
Agenda
• Collaborative Filtering with Spark
• Model training
• Alternating Least Squares
• The code
Collaborative Filtering with Spark
• Recommend items
– (filtering)
• Based on user preference data
– (collaborative)
Train a Model to Make Predictions
[Diagram: Training Data → Algorithm → Model; New Data → Model → Predictions]
Ted and Carol like Movies B and C
Bob likes Movie B. What might he like?
Bob likes Movie B: predict Movie C
Alternating Least Squares
• approximates the sparse user-item rating matrix
– as the product of two dense matrices, the User and Item factor matrices
– tries to learn the hidden features of each user and item
– the algorithm alternately fixes one factor matrix and solves for the other
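The bullets above can be written out as a small derivation. This is a sketch under assumptions, not the exact formulation MLlib uses: the symbols (R, x_u, y_i, the rank k, the set Ω of observed ratings) and the regularization weight λ are notation introduced here, and the library's regularization scaling may differ.

```latex
% Approximate the sparse m x n rating matrix R by two dense rank-k factor matrices
R \approx X Y^{\top}, \qquad X \in \mathbb{R}^{m \times k}, \quad Y \in \mathbb{R}^{n \times k}

% Least-squares objective over the observed ratings \Omega
\min_{X, Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)

% Step 1: fix Y and solve for each user factor (\Omega_u = items rated by user u)
x_u = \Big( \sum_{i \in \Omega_u} y_i y_i^{\top} + \lambda I \Big)^{-1} \sum_{i \in \Omega_u} r_{ui} \, y_i

% Step 2: fix X and solve for each item factor (\Omega_i = users who rated item i)
y_i = \Big( \sum_{u \in \Omega_i} x_u x_u^{\top} + \lambda I \Big)^{-1} \sum_{u \in \Omega_i} r_{ui} \, x_u
```

With one factor matrix held fixed, each row of the other has an independent closed-form least-squares solution, which is why the updates can run in parallel across users or items.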
ML Cross Validation Process
[Diagram: Data is split into a Training Set and a Test Set; the Training Set feeds Model Training/Building; the Test Set is used to Test Model Predictions; the train/test loop repeats]
Typical MapReduce Workflows
[Diagram: Input to Job 1 → Job 1 (Maps, Reduces) → SequenceFile (output from Job 1) → Job 2 (Maps, Reduces) → SequenceFile (output from Job 2) → ... → input to last job → Last Job (Maps, Reduces) → SequenceFile (output from last job); all intermediate files are stored in HDFS]
Iteration is slow because each job writes its output to disk and the next job reads it back
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• read-only collection of elements
• operated on in parallel
• partitions can be cached in memory
Ratings Data
Parse Input
import org.apache.spark.mllib.recommendation.Rating

// parse input UserID::MovieID::Rating
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// create an RDD of Rating objects (ratingText is the RDD of input lines)
val ratingsRDD = ratingText.map(parseRating).cache()
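The parser on the slide throws on any malformed line. A more defensive variant could look like the sketch below. It is an illustration only: the local `Rating` case class is a stand-in for the MLlib class so the snippet runs without Spark, and `RatingParser` and `parseRatingOpt` are hypothetical names that do not appear in the original deck.

```scala
// Local stand-in for org.apache.spark.mllib.recommendation.Rating (assumption),
// so this sketch runs without Spark on the classpath.
final case class Rating(user: Int, product: Int, rating: Double)

object RatingParser {
  // Parses "UserID::MovieID::Rating" (extra trailing fields are ignored).
  // Returns None instead of throwing when the line is malformed.
  def parseRatingOpt(line: String): Option[Rating] =
    line.split("::") match {
      case Array(user, movie, rating, _*) =>
        scala.util.Try(
          Rating(user.trim.toInt, movie.trim.toInt, rating.trim.toDouble)
        ).toOption
      case _ => None
    }
}
```

On an RDD of lines, a parser of this shape could be applied with `flatMap` rather than `map`, so malformed lines are skipped instead of failing the job.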
Build Model
[Diagram: Data is split into a Training Set and a Test Set; the Training Set is used to Build Model]
Split the ratings RDD into a training data RDD (80%) and a test data RDD (20%)
Build a user-product matrix model
Create Model
import org.apache.spark.mllib.recommendation.ALS

// Randomly split ratings RDD into training data RDD (80%)
// and test data RDD (20%)
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()
// build an ALS user product matrix model with rank=20, iterations=10
val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
Get predictions
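// get predicted ratings to compare to test ratings
// build (user, product) pairs from the test ratings
val testUserProductRDD = testRatingsRDD.map(r => (r.user, r.product))
// call model.predict with test UserID, MovieID input data
val predictionsForTestRDD = model.predict(testUserProductRDD)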
[Diagram: (User, Movie) pairs from the Test Data → Model → Predicted Ratings]
Compare predictions to Tests
Join predicted ratings to test ratings in order to compare
((user, product), test rating) joined with ((user, product), predicted rating)
gives ((user, product), (test rating, predicted rating))
Each element is a (Key, Value) pair keyed by (user, product)
Test Model
// prepare predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// prepare test for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// join the test with predictions
val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD
  .join(predictionsKeyedByUserProductRDD)
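To make the shape of the joined data concrete, the following sketch reproduces the inner join with plain Scala collections instead of RDDs. It only mirrors the semantics of a pair-RDD join on a small in-memory sample; `JoinSketch` and `joinByKey` are hypothetical names, and the sample ratings used with it are made up.

```scala
object JoinSketch {
  type Key = (Int, Int) // (user, product)

  // Inner join of two keyed sequences: only keys present on both sides survive,
  // and each match yields (key, (testRating, predictedRating)).
  def joinByKey(
      test: Seq[(Key, Double)],
      predicted: Seq[(Key, Double)]
  ): Seq[(Key, (Double, Double))] = {
    val predictedByKey = predicted.groupBy(_._1)
    for {
      (key, testRating) <- test
      (_, predRating)   <- predictedByKey.getOrElse(key, Seq.empty)
    } yield (key, (testRating, predRating))
  }
}
```

For example, joining `Seq(((1, 10), 1.0), ((2, 20), 4.0))` with `Seq(((1, 10), 4.5), ((3, 30), 2.0))` keeps only the key `(1, 10)`, because it is the only (user, product) pair present on both sides.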
Compare predictions to Tests
Find false positives: joined pairs ((user, product), (test rating, predicted rating))
where test rating <= 1 and predicted rating >= 4
Test Model
val falsePositives = testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) =>
    ratingT <= 1 && ratingP >= 4
}
falsePositives.take(2)
// Array[((Int, Int), (Double, Double))] = Array(
//   ((3842,2858),(1.0,4.106488210964762)),
//   ((6031,3194),(1.0,4.790778049100913)))
Test Model Mean Absolute Error
// Evaluate the model using Mean Absolute Error (MAE) between
// test and predictions
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) =>
    val err = testRating - predRating
    Math.abs(err)
}.mean()
// meanAbsoluteError: Double = 0.7244940545944053
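The metric computed above is the average of |test rating - predicted rating| over all joined pairs. The sketch below checks that arithmetic on a small in-memory sample, without Spark. `MaeSketch` is a hypothetical name, and the sample pairs used with it are made up rather than taken from the ratings data set.

```scala
object MaeSketch {
  // MAE = (1 / n) * sum over pairs of |testRating - predRating|
  def meanAbsoluteError(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (test, predicted) pair")
    val absoluteErrors = pairs.map { case (testRating, predRating) =>
      math.abs(testRating - predRating)
    }
    absoluteErrors.sum / pairs.size
  }
}
```

For the pairs (1.0, 4.0), (5.0, 4.0), and (3.0, 3.0), the absolute errors are 3, 1, and 0, so the MAE is 4/3. A lower value means predictions sit closer to the held-out test ratings on average.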
Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
– Movie Recommendations with Collaborative Filtering
– Spark Streaming
Machine Learning Blog
• https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
Spark on MapR
• Certified Spark Distribution
• Fully supported and packaged by MapR in partnership with Databricks
– mapr-spark package with Spark, Shark, Spark Streaming today
– Spark-python, GraphX and MLlib soon
• YARN integration
– Spark can then allocate resources from the cluster when needed
References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
Q&A
Engage with us!
@mapr, maprtech, mapr-technologies

Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Dernier (20)

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Parallel and Iterative Processing for Machine Learning Recommendations with Spark

  • 1. Parallel and Iterative Processing for Machine Learning Recommendations with Spark (© 2014 MapR Technologies)
  • 2. Agenda: Collaborative Filtering with Spark; Model training; Alternating Least Squares; The code
  • 3. Collaborative Filtering with Spark: recommend items (filtering) based on user preferences data (collaborative)
  • 4. Train a Model to Make Predictions. [Diagram labels: Training Data, Algorithm, Model; New Data, Model, Predictions] Ted and Carol like Movie B and C. Bob likes Movie B; what might he like? Bob likes Movie B, so predict C.
  • 5. Alternating Least Squares: approximates the sparse user-item rating matrix as the product of two dense matrices, the User and Item factor matrices; tries to learn the hidden features of each user and item; the algorithm alternately fixes one factor matrix and solves for the other
  • 6. ML Cross Validation Process. [Diagram labels: Data, Training Set, Test Set, Model Training/Building, Test Model Predictions, Train/Test loop]
  • 7. Typical MapReduce Workflows. [Diagram labels: Input to Job 1; Job 1, Job 2, Last Job (each with Maps and Reduces); SequenceFile; Output from Job 1, Output from Job 2, Input to last job, Output from last job; HDFS] Iteration is slow because it writes/reads data to disk.
  • 8. Resilient Distributed Datasets (RDD). Spark revolves around RDDs: read-only collection of elements; operated on in parallel; partitions cached in memory
  • 9. Ratings Data
  • 10. Parse Input
        import org.apache.spark.mllib.recommendation.Rating
        // parse input UserID::MovieID::Rating
        def parseRating(str: String): Rating = {
          val fields = str.split("::")
          Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
        }
        // create an RDD of Rating objects (ratingText is the RDD of input lines)
        val ratingsRDD = ratingText.map(parseRating).cache()
  • 11. Build Model. [Diagram labels: Data, Training Set, Test Set, Build Model] Split the ratings RDD into a training data RDD (80%) and a test data RDD (20%), then build a user product matrix model.
  • 12. Create Model
        import org.apache.spark.mllib.recommendation.ALS
        // Randomly split ratings RDD into training data RDD (80%) and test data RDD (20%)
        val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
        val trainingRatingsRDD = splits(0).cache()
        val testRatingsRDD = splits(1).cache()
        // build an ALS user product matrix model with rank=20, iterations=10
        val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
  • 13. Get predictions. [Diagram labels: User, Movie Test Data; Model; Predicted Ratings]
        // (user, product) pairs from the test set; this definition is not on the slide and is assumed here
        val testUserProductRDD = testRatingsRDD.map {
          case Rating(user, product, rating) => (user, product)
        }
        // get predicted ratings to compare to test ratings
        // call model.predict with test UserId, MovieId input data
        val predictionsForTestRDD = model.predict(testUserProductRDD)
  • 14. Compare predictions to Tests. Join predicted ratings to test ratings in order to compare: ((user, product), test rating) joined with ((user, product), predicted rating) gives ((user, product), (test rating, predicted rating)). Each of these is a (Key, Value) pair.
  • 15. Test Model
        // prepare predictions for comparison
        val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
          case Rating(user, product, rating) => ((user, product), rating)
        }
        // prepare test for comparison
        val testKeyedByUserProductRDD = testRatingsRDD.map {
          case Rating(user, product, rating) => ((user, product), rating)
        }
        // join the test with predictions
        val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD
          .join(predictionsKeyedByUserProductRDD)
  • 16. Compare predictions to Tests. Find false positives: where test rating <= 1 and predicted rating >= 4, in the (Key, Value) pairs ((user, product), (test rating, predicted rating)).
  • 17. Test Model
        val falsePositives = testAndPredictionsJoinedRDD.filter {
          case ((user, product), (ratingT, ratingP)) => ratingT <= 1 && ratingP >= 4
        }
        falsePositives.take(2)
        // output: Array[((Int, Int), (Double, Double))] = ((3842,2858),(1.0,4.106488210964762)), ((6031,3194),(1.0,4.790778049100913))
  • 18. Test Model: Mean Absolute Error
        // Evaluate the model using Mean Absolute Error (MAE) between test and predictions
        val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
          case ((user, product), (testRating, predRating)) =>
            val err = testRating - predRating
            Math.abs(err)
        }.mean()
        // output: meanAbsoluteError: Double = 0.7244940545944053
  • 19. Soon to Come: Spark On Demand Training (https://www.mapr.com/services/mapr-academy/); Blogs and Tutorials: Movie Recommendations with Collaborative Filtering, Spark Streaming
  • 20. Machine Learning Blog: https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
  • 21. Spark on MapR: Certified Spark Distribution; fully supported and packaged by MapR in partnership with Databricks (mapr-spark package with Spark, Shark, Spark Streaming today; Spark-python, GraphX and MLLib soon); YARN integration (Spark can then allocate resources from cluster when needed)
  • 22. References: Spark web site: http://spark.apache.org/; https://databricks.com/; Spark on MapR: http://www.mapr.com/products/apache-spark; Spark SQL and DataFrame Guide; Apache Spark vs. MapReduce, Whiteboard Walkthrough; Learning Spark, O'Reilly Book; Apache Spark
  • 23. Q&A. Engage with us! @mapr, maprtech, mapr-technologies
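
The parse, join, false-positive and MAE steps from slides 10 to 18 can be tried without a cluster. What follows is a minimal sketch, assuming plain Scala collections in place of RDDs. The local Rating case class, the helper names (keyByUserProduct, joinRatings, falsePositives, meanAbsoluteError) and the sample data are hypothetical and do not come from the slides. It is written in the REPL/script style the slides use.

```scala
// Local stand-in for the slide pipeline: plain Scala collections replace RDDs.
// This Rating mirrors the fields of MLlib's Rating but is defined locally (assumption).
case class Rating(user: Int, product: Int, rating: Double)

// Parse one "UserID::MovieID::Rating" line; any extra trailing fields are ignored.
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}

// Key ratings by (user, product), as the slides do before the join.
def keyByUserProduct(rs: Seq[Rating]): Map[(Int, Int), Double] =
  rs.map { case Rating(user, product, rating) => ((user, product), rating) }.toMap

// Inner join on shared keys: ((user, product), (test rating, predicted rating)).
def joinRatings(test: Map[(Int, Int), Double],
                pred: Map[(Int, Int), Double]): Map[(Int, Int), (Double, Double)] =
  test.collect { case (k, t) if pred.contains(k) => (k, (t, pred(k))) }

// False positives: test rating <= 1 but predicted rating >= 4.
def falsePositives(joined: Map[(Int, Int), (Double, Double)]): Map[(Int, Int), (Double, Double)] =
  joined.filter { case (_, (ratingT, ratingP)) => ratingT <= 1 && ratingP >= 4 }

// Mean Absolute Error between test and predicted ratings.
def meanAbsoluteError(joined: Map[(Int, Int), (Double, Double)]): Double =
  joined.values.map { case (t, p) => math.abs(t - p) }.sum / joined.size

// Hypothetical sample data, made up for illustration only.
val testRatings = Seq("1::10::1.0", "1::20::5.0", "2::10::3.0").map(parseRating)
val predictions = Seq(Rating(1, 10, 4.5), Rating(1, 20, 4.0), Rating(2, 10, 3.5))

val joined = joinRatings(keyByUserProduct(testRatings), keyByUserProduct(predictions))
val fps = falsePositives(joined)
val mae = meanAbsoluteError(joined)
```

With this sample, only the pair (1, 10) counts as a false positive, because its test rating is 1.0 while its predicted rating is 4.5. On a real cluster the same logic is expressed with the RDD map, join, filter and mean calls shown in the slides.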

Editor's Notes

  1. Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part). The collaborative filtering approach is based on similarity; the basic idea is people who liked similar items in the past will like similar items in the future. In the example shown, Ted likes movies A, B, and C. Carol likes movies B and C. Bob likes movie B. To recommend a movie to Bob, we calculate that users who liked B also liked C, so C is a possible recommendation for Bob. Of course, this is a tiny example. In real situations, we would have much more data to work with.
  2. The goal of a collaborative filtering algorithm is to take preferences data from users and to create a model which can be used for recommendations or predictions. Ted likes movies A, B, and C. Carol likes movies B and C. We take this data and run it through an algorithm to build a model. Then, when we have new data such as "Bob likes movie B", we use the model to predict that C is a possible recommendation for Bob.
  3. ALS approximates the sparse user-item rating matrix of size U×I as the product of two dense factor matrices of rank K: the User and Item factor matrices, of size U×K and I×K. The factor matrices are also called latent feature models. They represent hidden features which the algorithm tries to discover. One matrix tries to describe the latent or hidden features of each user, and one tries to describe latent properties of each movie. ALS is an iterative algorithm. In each iteration, the algorithm alternately fixes one factor matrix and solves for the other, and this process continues until it converges. This alternation between which matrix to optimize is where the "alternating" in the name comes from.
  4. A typical machine learning workflow is shown. We will perform the following steps: load the sample data; parse the data into the input format for the ALS algorithm; split the data into two parts, one for building the model and one for testing the model; run the ALS algorithm to build/train a user product matrix model; make predictions with the training data and observe the results; test the model with the test data.
  5. Spark is especially useful for parallel processing of distributed data with iterative algorithms. Spark tries to keep things in memory, whereas MapReduce involves more reading and writing from disk. As shown in the image below, for each MapReduce Job, data is read from an HDFS file for a mapper, written to and from a SequenceFile in between, and then written to an output file from a reducer. When a chain of multiple jobs is needed, Spark can execute much faster by keeping data in memory.
  6. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
  7. An RDD is simply a distributed collection of elements. You can think of the distributed collection as kind of like an array or list in your single-machine program, except that it's spread out across multiple nodes in the cluster. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them. So Spark gives you APIs and functions that let you do something on the whole collection in parallel using all the nodes.
  8. We use the org.apache.spark.mllib.recommendation.Rating class for parsing the ratings.dat file. Later we will use the Rating class as input for the ALS run method. Then we use the map transformation on ratingText, which will apply the parseRating function to each element in ratingText and return a new RDD of Rating objects. We cache the ratings data, since we will use this data to build the matrix model.
  9. Next we split the data into two parts, one for building the model and one for testing the model. Then we run the ALS algorithm to build/train a user product matrix model.
  10. Next we split the data into two parts, one for building the model and one for testing the model. Then we run the ALS algorithm to build/train a user product matrix model.
  11. Next we get predicted movie ratings for the test data by calling model.predict with the test user ID, movie ID input data.
  12. Next we will compare the test user ID, movie ID ratings to the predicted ratings for the same test user ID, movie ID pairs.
  13. Here we create ((user ID, movie ID), rating) key-value pairs for joining, in order to compare the test ratings to the predicted ratings.
  14. Next we will compare the test user ID, movie ID ratings to the predicted ratings for the same test user ID, movie ID pairs.
  15. Here we compare test ratings and predicted ratings by filtering on ratings where the test rating is <= 1 and the predicted rating is >= 4.
  16. We register the DataFrame as a table. Registering it as a table allows us to use it in subsequent SQL statements. Now we can inspect the data.
  17. https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
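
The alternation described in note 3 can be illustrated on a toy problem. What follows is a minimal sketch, assuming rank 1 (K = 1), so that each per-user or per-item solve reduces to a one-line regularized least-squares formula. It uses plain Scala collections rather than Spark, and it is not how MLlib implements ALS. The ratings, the regularization value lambda, the iteration count and the helper names (solve, squaredError) are all hypothetical.

```scala
// Rank-1 ALS on a tiny ratings set, using plain Scala collections (no Spark).
// Observed ratings as (user, item, rating); the data is made up for illustration.
val ratings = Seq((0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 1, 3.0), (2, 1, 1.0))
val numUsers = 3
val numItems = 2
val lambda = 0.01 // regularization strength, an assumed value

// Solve the one-dimensional regularized least-squares problem for each entry of
// one factor while the other factor is held fixed.
// obs holds (index being solved, index held fixed, rating).
def solve(obs: Seq[(Int, Int, Double)], fixed: Array[Double], n: Int): Array[Double] = {
  val byIndex = obs.groupBy(_._1)
  Array.tabulate(n) { i =>
    val rows = byIndex.getOrElse(i, Seq.empty)
    val num = rows.map { case (_, j, r) => r * fixed(j) }.sum
    val den = rows.map { case (_, j, _) => fixed(j) * fixed(j) }.sum + lambda
    num / den
  }
}

// Sum of squared errors over the observed ratings only.
def squaredError(users: Array[Double], items: Array[Double]): Double =
  ratings.map { case (u, i, r) => math.pow(r - users(u) * items(i), 2) }.sum

var userFactors = Array.fill(numUsers)(1.0)
var itemFactors = Array.fill(numItems)(1.0)
val initialError = squaredError(userFactors, itemFactors)

for (_ <- 1 to 10) {
  // Fix the item factors and solve for the user factors...
  userFactors = solve(ratings, itemFactors, numUsers)
  // ...then fix the user factors and solve for the item factors.
  itemFactors = solve(ratings.map { case (u, i, r) => (i, u, r) }, userFactors, numItems)
}
val finalError = squaredError(userFactors, itemFactors)
```

Each half-step minimizes the regularized squared error exactly for one factor, so that objective never increases from one iteration to the next. With rank K greater than 1, the per-user and per-item solves become small K×K linear systems instead of a single division.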