2. Table of Contents
1. What is Recommendation?
2. Different Recommendation Strategies
3. Introduction to Hadoop/Mahout
4. Building Recommendation Engine with
Hadoop/Mahout
5. How to use Mahout
6. Q&A
4. Definition of
Recommendation Engine
"A recommendation system provides
information or items that are likely to be of
interest to a user, in an automated fashion"
- Alpa Jain from Twitter
"Serve the right item to users in an
automated fashion to optimize long-term
business objectives"
- Deepak Agarwal from Yahoo
5. Examples
• Related Product (Amazon)
• Movie Recommendation (Netflix)
• News Contents (Yahoo)
• Online Dating (eHarmony)
• Search Autocomplete (Google)
• Connection Recommendation (LinkedIn)
• Song Recommendation (Pandora)
• Walmart – (Physical) Store Layout
6. Why Recommendation?
• A way for users to find content of interest
(from large selections) with less effort.
o A natural route to personalization!
o Serendipity factor
• For companies, a good way to introduce
new and unknown content
8. Item based
recommendation (1)
1. Content-based Item Recommendation.
o Using item metadata, compute similarity between items.
  i. Description, price, category and so on
  ii. Normalize these into a feature vector (numeric values)
    - You can think of it as a point in N-dimensional space.
  iii. Compute the distances between vectors.
    - Euclidean Distance Score
    - Cosine Similarity Score
    - Pearson Correlation Score
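The three scores listed above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function names, the 1/(1+d) mapping from Euclidean distance to a score, and the three sample items (normalized price plus two category flags) are assumptions made for the example.

```python
import math

def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1]; 1.0 means identical vectors."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def cosine_similarity(a, b):
    """Cosine of the angle between the two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Pearson correlation of the two vectors (0.0 if either has no variance)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# Hypothetical items encoded as [normalized price, is_dress, is_shoe].
item_a = [0.2, 1.0, 0.0]
item_b = [0.3, 1.0, 0.0]
item_c = [0.9, 0.0, 1.0]

for name, score in (("euclidean", euclidean_similarity),
                    ("cosine", cosine_similarity),
                    ("pearson", pearson_correlation)):
    print(name, round(score(item_a, item_b), 3), round(score(item_a, item_c), 3))
```

With these sample vectors, all three scores rate item_b (same category, similar price) closer to item_a than item_c.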
9. Item based
recommendation (2)
2. Collaborative Filtering.
o Leverage users’ collective intelligence
Similar users tend to like similar items
Amazon’s product recommendation is a very
good and famous example
o Will look at this in more detail
10. User based
recommendation
• First group users into different clusters
o Represent users as feature vectors
Information about users:
• geo-location, gender, age, …
Items users liked or rated
o K-nearest neighbors (KNN) is used a lot
• From each cluster, find representative items
o Some kind of graph traversal
o Highest rated items
o Most liked items
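A rough sketch of the user-based approach, assuming a nearest-neighbor grouping and "highest rated items" as the way to pick representatives. The user feature vectors, ratings, and function names are hypothetical, and skipping items the user has already rated is an added assumption.

```python
import math
from collections import defaultdict

def nearest_neighbors(target, users, k):
    """Return the k user ids whose feature vectors are closest to the target's."""
    others = [u for u in users if u != target]
    return sorted(others, key=lambda u: math.dist(users[target], users[u]))[:k]

def recommend_from_neighbors(target, users, ratings, k=2, top_n=3):
    """Rank items by the neighbors' average rating, skipping items the target rated."""
    seen = set(ratings.get(target, {}))
    totals = defaultdict(float)
    counts = defaultdict(int)
    for u in nearest_neighbors(target, users, k):
        for item, rating in ratings.get(u, {}).items():
            if item not in seen:
                totals[item] += rating
                counts[item] += 1
    average = {item: totals[item] / counts[item] for item in totals}
    return sorted(average, key=lambda item: (-average[item], item))[:top_n]

# Hypothetical normalized user features, e.g. [age, location bucket].
users = {"u1": [0.10, 0.20], "u2": [0.15, 0.25], "u3": [0.90, 0.80], "u4": [0.20, 0.10]}
ratings = {
    "u1": {"A": 5},
    "u2": {"A": 4, "B": 5, "C": 3},
    "u3": {"D": 5},
    "u4": {"B": 3, "C": 4},
}

print(recommend_from_neighbors("u1", users, ratings))
```

Here u2 and u4 are the two nearest neighbors of u1, so the recommendations come from their ratings and u3's item D never appears.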
11. Challenges of
Recommendation Engine
• Cold Start
o For new users and/or items, there is no information to
leverage.
• Sparse Data
o Item reviews or purchases are not very common.
• Scalability Issue
o The bigger the data gets, the more computation is
needed.
13. What is Hadoop?
• An open source distributed computation and
storage platform modeled after the Google File
System and the MapReduce framework
• A good fit for large scale offline batch
processing, but not for realtime processing
• Widely used in many companies
14. What is Mahout?
• An open source machine learning library
written in Java, with two modes of use:
1. Standalone
2. MapReduce
o Supports large scale batch offline processing.
• Covers the following
o Recommendation/Collaborative Filtering.
o Classification: Supervised Learning.
o Clustering: Unsupervised Learning.
16. Typical Architecture
1. Data Collection: web server logs, MySQL tables, ... (explicit feedback and implicit feedback)
2. Input Data Pre-processing (ETL, Filtering, …)
3. Recommendation Data Building (Mahout)
4. Output Data Post-processing (Re-ordering)
5. Load Final Data To Serving Layer
6. Recommendation Serving Layer: MySQL, NoSQL, Solr/ElasticSearch, ...
(Steps 2 to 4 run on Hadoop.)
17. Use Case:
Polyvore – Item Page
Item in question
Content Based
Recommendation
Collaborative Filtering
19. People who liked this
also like ...
• This is based on "Collaborative Filtering"
• Construct co-occurrence matrix or Item
similarity matrix – S[NxN]
o Increment S[i,j] and S[j,i] if item i and item j are liked
by the same user
o Repeat this for all users for their liked items
• For item k, take the items that co-occur with it
most often (the largest entries in column k or
row k) as recommendations.
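The construction above can be sketched as follows. This is a small in-memory illustration, not the MapReduce implementation; the function names and sample likes are hypothetical, and S is stored as a nested dictionary instead of a dense NxN array.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(likes_by_user):
    """Build S, where S[i][j] counts the users who liked both item i and item j."""
    S = defaultdict(lambda: defaultdict(int))
    for items in likes_by_user.values():
        # Each unordered pair of distinct items liked by this user counts once.
        for i, j in combinations(sorted(set(items)), 2):
            S[i][j] += 1
            S[j][i] += 1
    return S

def most_cooccurring(S, k, top_n=3):
    """Return the items that co-occur most often with item k (row k of S)."""
    row = S.get(k, {})
    return sorted(row, key=lambda j: (-row[j], j))[:top_n]

# Hypothetical likes: user -> liked items.
likes_by_user = {
    "alice": ["dress", "shoes", "bag"],
    "bob": ["dress", "shoes"],
    "carol": ["shoes", "hat"],
}

S = build_cooccurrence(likes_by_user)
print(most_cooccurring(S, "dress"))
```

Because S is symmetric, reading row k or column k gives the same result, which is why the slide mentions either.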
20. Personalized
Recommendation
• This is based on "Collaborative Filtering"
• Extension of the previous topic
• Computation-wise, it is a matrix multiplication
a. First, build a similarity matrix (S) for items
b. Next, build a preference vector (P) for the user
c. Next, multiply the matrix from (a) by the vector from (b):
R = S x P
d. Lastly, sort the elements of the resulting vector R by score
21. Polyvore Example
• Assumption:
o N items and M users. Users can only like (no rating)
• Create the item similarity matrix S (NxN)
o This is also used for the recommendations on the item page
• Create the user preference vector P (Nx1)
o Set P(i) = 1 for every item i liked by the user in question, 0 otherwise
• Multiply S by P
o Sort the elements of the result by score
o This gives the personalized item recommendations
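As a worked sketch of R = S x P for a single user, assuming a tiny hand-written co-occurrence matrix S over four hypothetical items. The function name is illustrative, and dropping items the user already likes from the result is an added assumption, not something the slides specify.

```python
def recommend(S, items, liked, top_n=3):
    """Compute R = S x P for one user and return the top-scoring unseen items."""
    n = len(items)
    index = {item: pos for pos, item in enumerate(items)}

    # P is the like-only preference vector: 1 for liked items, 0 otherwise.
    P = [0] * n
    for item in liked:
        P[index[item]] = 1

    # R[i] is the dot product of row i of S with P.
    R = [sum(S[i][j] * P[j] for j in range(n)) for i in range(n)]

    # Rank items the user has not liked yet, highest score first.
    candidates = [i for i in range(n) if P[i] == 0 and R[i] > 0]
    candidates.sort(key=lambda i: (-R[i], items[i]))
    return [(items[i], R[i]) for i in candidates[:top_n]]

items = ["dress", "shoes", "bag", "hat"]
# Hypothetical symmetric co-occurrence counts; rows and columns follow `items`.
S = [
    [0, 2, 1, 0],
    [2, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]

print(recommend(S, items, liked=["dress"]))
print(recommend(S, items, liked=["dress", "hat"]))
```

Liking "hat" in addition to "dress" raises the score of "shoes" from 2 to 3, since "shoes" co-occurs with both liked items.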
22. How to use Mahout?
• ItemSimilarityJob class
o Main class to compute the co-occurrence matrix.
• RecommenderJob class
o Main class to generate personalized
recommendations.
hadoop jar mahout-core-0.8-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input/user-item-rating.txt \
  -Dmapred.output.dir=output \
  --usersFile input/users.txt \
  --booleanData \
  --similarityClassname SIMILARITY_COOCCURRENCE \
  --minPrefsPerUser 2 \
  --maxPrefsPerUser 50000
This runs a total of 10 MapReduce jobs to generate the final recommendations
for users
23. How to use Mahout?
(Cont'd)
• Input File: user-item-rating.txt
o userID,itemID[,rating] per line.
• How to compute similarity between items
o The --similarityClassname parameter selects the measure:
CooccurrenceCountSimilarity
LogLikelihoodSimilarity
TanimotoCoefficientSimilarity
CityBlockSimilarity
CosineSimilarity
PearsonCorrelationSimilarity
EuclideanDistanceSimilarity
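To make the choice of measure concrete, here is a small sketch that parses userID,itemID[,rating] lines and compares a raw co-occurrence count with the Tanimoto coefficient. These are plain re-implementations of the textbook definitions for illustration, not Mahout's classes; the function names and the sample IDs are hypothetical.

```python
from collections import defaultdict

def users_by_item(lines):
    """Parse 'userID,itemID[,rating]' lines into {itemID: set of userIDs}."""
    liked_by = defaultdict(set)
    for line in lines:
        user_id, item_id, *_rating = line.strip().split(",")
        liked_by[item_id].add(user_id)
    return liked_by

def cooccurrence_count(users_a, users_b):
    """Number of users who liked both items."""
    return len(users_a & users_b)

def tanimoto_coefficient(users_a, users_b):
    """Shared users divided by all users who liked either item."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0

# Hypothetical data: item 101 liked by users 1-3, item 102 by users 2-5,
# and a very popular item 103 liked by users 1-10.
lines = (
    [f"{u},101" for u in (1, 2, 3)]
    + [f"{u},102" for u in (2, 3, 4, 5)]
    + [f"{u},103" for u in range(1, 11)]
)
liked_by = users_by_item(lines)

for other in ("102", "103"):
    print(other,
          cooccurrence_count(liked_by["101"], liked_by[other]),
          tanimoto_coefficient(liked_by["101"], liked_by[other]))
```

For item 101, the raw count favors the popular item 103 (3 shared users against 2), while Tanimoto favors item 102 (0.4 against 0.3) because it normalizes by how many users liked either item.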
24. How to use Mahout?
(Cont'd)
• Final Output
o UserID [(ItemID,Score),(ItemID,Score),......
o ...
• Load this from HDFS to a serving layer
o Relational Database
o Search Engine
o NoSQL
25. Lessons
• Need to understand business domain
o This takes time and effort
• Garbage In Garbage Out
o Filtering is very important
• Start with simple approach
o And then improve gradually
• Having an automated pipeline is very important
o It makes more experiments possible with less effort
o Remember you will have to do lots of experiments
o But the pipeline is hard to build and takes time
26. Next stage of
recommendation?
• Need realtime & scalable
recommendation technology.
• Recommendation As A Service.
• www.myrrix.com
This is effective when you have a lot more users than items.
2% of users provide feedback
Make captions more visible and also Likes button on the far left.
Access log case: lots of robot traffic. What is the business case for Polyvore? Where is your traffic coming from? What are users' intentions? Sizes of the user and item sets. Seasonality.