Building a Large-Scale Recommendation Engine

Building Recommendation Engine
Keeyong Han, Jan 2013

Table of Contents
1. What is Recommendation?
2. Different Recommendation Strategies
3. Introduction of Hadoop/Mahout
4. Building Recommendation Engine with Hadoop/Mahout
5. How to use Mahout
6. Q&A

Definition of Recommendation Engine
"A recommendation system provides
information or items that are likely to be of
interest to a user, in an automated fashion"
- Alpa Jain from Twitter

"Serve the right item to users in an
automated fashion to optimize long-term
business objectives"
- Deepak Agarwal from Yahoo

Examples
• Related Product (Amazon)
• Movie Recommendation (Netflix)
• News Contents (Yahoo)
• Online Dating (eHarmony)
• Search Autocomplete (Google)
• Connection Recommendation (LinkedIn)
• Song Recommendation (Pandora)
• (Physical) Store Layout (Walmart)

Why Recommendation?
• A way for users to find content of interest
(from large selections) with less effort.
o A natural path to personalization!
o Serendipity factor

• For companies, a good way to introduce
new and unknown content

Different Recommendation Strategies
Item vs. User

Item-based recommendation (1)
1. Content-based Item Recommendation.
o Using metadata from each item, compute the similarity between items.
i. Start from attributes such as description, price, category and so on
ii. Normalize these into a feature vector (numeric values)
a. You can think of it as a point in N-dimensional space.
iii. Compute the distances between vectors.
a. Euclidean Distance Score
b. Cosine Similarity Score
c. Pearson Correlation Score
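
The three scores listed above can be sketched in a few lines of plain Python. This is an illustrative sketch only, assuming items have already been normalized into equal-length numeric feature vectors. The function names and the 1 / (1 + distance) mapping for the Euclidean score are choices made for this example, not something the slides specify.

```python
import math

def euclidean_score(a, b):
    """Turn Euclidean distance into a score in (0, 1]; 1.0 means identical vectors."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)

def cosine_score(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pearson_score(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_score(centered_a, centered_b)

# Hypothetical item vectors: [normalized price, is_shoes, is_bags]
item_a = [0.2, 1.0, 0.0]
item_b = [0.3, 1.0, 0.0]
print(euclidean_score(item_a, item_b))
print(cosine_score(item_a, item_b))
print(pearson_score(item_a, item_b))
```

Which score works best depends on how the features were normalized, so it is usually worth trying more than one.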

Item-based recommendation (2)
2. Collaborative Filtering.
o Leverage users’ collective intelligence
▪ Similar users tend to like similar items
▪ Amazon’s product recommendation is a well-known example
o We will look at this in more detail later

User-based recommendation
• First group users into different clusters
o Represent users as feature vectors
▪ Information about users: geo-location, gender, age, …
▪ Items users liked or rated
o K-nearest neighbors (KNN) is commonly used
• From each cluster, find representative items
o Some kind of graph traversal
o Highest rated items
o Most liked items
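
A minimal sketch of the user-based idea, under simplifying assumptions: each user is represented only by the set of items they liked, the k nearest neighbors stand in for the user's cluster, and "most liked items" is the ranking rule. The Jaccard overlap measure, the function names, and the sample data are all hypothetical choices for illustration.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest_users(target, likes, k):
    """Return the k users whose liked items overlap most with the target's."""
    scored = [(jaccard(likes[target], items), user)
              for user, items in likes.items() if user != target]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [user for _, user in scored[:k]]

def most_liked_items(target, likes, k, top_n):
    """Rank items liked by the neighbors that the target has not liked yet."""
    counts = Counter()
    for user in nearest_users(target, likes, k):
        counts.update(likes[user] - likes[target])
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]

likes = {
    "alice": {"boots", "scarf", "coat"},
    "bob": {"boots", "scarf", "hat"},
    "carol": {"boots", "coat", "hat", "gloves"},
    "dave": {"laptop", "mouse"},
}
print(nearest_users("alice", likes, 2))        # ['bob', 'carol']
print(most_liked_items("alice", likes, 2, 2))  # ['hat', 'gloves']
```

A real system would likely add user attributes such as location or age to the feature vector, as the slide suggests.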

Challenges of Recommendation
• Cold Start
o For new users and/or items, there is no information to
leverage.
• Sparse Data
o Reviews and purchases are relatively rare, so most user-item pairs have no data.
• Scalability
o The bigger the data gets, the more computation is
needed.

What is Hadoop?
• An open source distributed computation and
storage platform modeled after Google File System
and the MapReduce framework
• A good fit for large-scale offline batch
processing, but not for realtime processing
• Widely used in many companies

What is Mahout?
• An open source machine learning library
written in Java, which runs in two modes:
1. Standalone
2. MapReduce
o Supports large-scale offline batch processing.

• Covers the following
o Recommendation/Collaborative Filtering.
o Classification: Supervised Learning.
o Clustering: Unsupervised Learning.

Building Recommendation Engine
with Hadoop/Mahout

Typical Architecture
1. Data Collection: web server logs, MySQL tables, ... (explicit feedback and implicit feedback)
2. Input Data Pre-processing (ETL, Filtering, …)
3. Recommendation Data Building (Mahout)
4. Output Data Post-processing (Re-ordering)
5. Load Final Data To Serving Layer
6. Recommendation Serving Layer: MySQL, NoSQL, Solr/ElasticSearch, ...
The batch processing stages run on Hadoop.

Use Case:
Polyvore – Item Page

Item in question

Content Based
Recommendation

Collaborative Filtering

Use Case:
Polyvore – Home Page

Personalized Recommendation

People who liked this also like ...
• This is based on "Collaborative Filtering"
• Construct a co-occurrence matrix, or item
similarity matrix, S [NxN]
o Increment S[i,j] and S[j,i] if item i and item j are liked
by the same user
o Repeat this for all users and all of their liked items
• For item k, pick the items that co-occur with it most often
(from column k or row k) as
recommendations.
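
The construction above can be sketched in plain Python as follows. This is a small in-memory illustration, not how Mahout implements it at scale; the function names and sample data are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(likes_by_user):
    """Build S as a nested dict: S[i][j] = number of users who liked both i and j."""
    s = defaultdict(lambda: defaultdict(int))
    for items in likes_by_user.values():
        # Every unordered pair of items liked by the same user co-occurs once.
        for i, j in combinations(sorted(set(items)), 2):
            s[i][j] += 1
            s[j][i] += 1
    return s

def also_liked(s, item, top_n):
    """Read row `item` of S and return the items that co-occur with it most often."""
    row = s.get(item, {})
    ranked = sorted(row.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:top_n]]

likes_by_user = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["B", "C", "D"],
}
s = build_cooccurrence(likes_by_user)
print(also_liked(s, "A", 2))  # ['B', 'C']
```

Because S is symmetric, reading row k or column k gives the same answer, which is why the slide mentions either.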

Personalized Recommendation
• This is based on "Collaborative Filtering"
• Extension of the previous topic
• Computation-wise, it is a matrix multiplication
a. First, build a similarity matrix (S) for items
b. Next, build a preference vector (P) for the user
c. Next, multiply the matrix from a by the vector from b
▪ R = S x P
d. Lastly, sort the elements of the resulting vector R
Polyvore Example
• Assumption:
o N items and M users. Users can only like items (no ratings)
• Create the item similarity matrix S (NxN)
o This will be used for recommendations on the Item page
• Create the user preference vector P (Nx1)
o Set P(i) to 1 for each item i liked by the user in question
• Multiply S by P
o Sort the result elements by score
o This will be the personalized item recommendation
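
The multiply-and-sort steps can be sketched in plain Python with a small made-up 4-item matrix. Two things here are assumptions added for the example rather than part of the slides: items are identified by their index 0..N-1, and items the user already liked (or that score zero) are dropped from the result.

```python
def recommend(similarity, liked, top_n):
    """Compute R = S x P for one user and return the top-scoring (item, score) pairs."""
    n = len(similarity)
    # P: 1 for every item the user liked, 0 otherwise.
    p = [1 if i in liked else 0 for i in range(n)]
    # R[i] = sum over j of S[i][j] * P[j]
    r = [sum(similarity[i][j] * p[j] for j in range(n)) for i in range(n)]
    # Assumption for this sketch: skip items already liked and items with no score.
    candidates = [(r[i], i) for i in range(n) if i not in liked and r[i] > 0]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in candidates[:top_n]]

# Hypothetical co-occurrence matrix for 4 items (symmetric, zero diagonal).
S = [
    [0, 2, 1, 0],
    [2, 0, 2, 1],
    [1, 2, 0, 1],
    [0, 1, 1, 0],
]
print(recommend(S, {0, 1}, 2))  # [(2, 3), (3, 1)]
```

With boolean likes, R[i] is simply the sum of the similarities between item i and everything the user liked.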

How to use Mahout?
• ItemSimilarityJob class
o Main class to compute the co-occurrence matrix.
• RecommenderJob class
o Main class to generate personalized
recommendations.

hadoop jar mahout-core-0.8-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input/user-item-rating.txt -Dmapred.output.dir=output \
  --usersFile input/users.txt --booleanData \
  --similarityClassname SIMILARITY_COOCCURRENCE \
  --minPrefsPerUser 2 --maxPrefsPerUser 50000

This runs a total of 10 MapReduce jobs to generate the final recommendations for
users.

How to use Mahout? (Cont'd)
• Input File: user-item-rating.txt
o One userID,itemID[,rating] record per line.
• How to compute similarity between items
o The --similarityClassname parameter selects the similarity measure:
▪ CooccurrenceCountSimilarity
▪ LogLikelihoodSimilarity
▪ TanimotoCoefficientSimilarity
▪ CityBlockSimilarity
▪ CosineSimilarity
▪ PearsonCorrelationSimilarity
▪ EuclideanDistanceSimilarity
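
To give a feel for what two of these measures compute on like-only (boolean) data, here is a plain Python sketch that reads lines in the userID,itemID[,rating] format and compares items by the users who liked them. It illustrates the underlying formulas only and is not Mahout's implementation; the function names and sample lines are hypothetical.

```python
from collections import defaultdict

def parse_preferences(lines):
    """Parse 'userID,itemID[,rating]' lines into {itemID: set of userIDs}."""
    users_by_item = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(",")
        user_id, item_id = fields[0], fields[1]
        # The optional rating (fields[2]) is ignored: data is treated as boolean.
        users_by_item[item_id].add(user_id)
    return users_by_item

def cooccurrence_count(users_a, users_b):
    """Number of users who liked both items."""
    return len(users_a & users_b)

def tanimoto(users_a, users_b):
    """Users who liked both items divided by users who liked either item."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0

sample = ["1,101,5.0", "1,102", "2,101", "2,102,3.0", "3,101", "3,103"]
prefs = parse_preferences(sample)
print(cooccurrence_count(prefs["101"], prefs["102"]))  # 2
print(tanimoto(prefs["101"], prefs["102"]))
```

Raw co-occurrence counts favor popular items, while Tanimoto normalizes by how many users liked either item, which is one reason to experiment with the parameter.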

How to use Mahout? (Cont'd)
• Final Output
o UserID [(ItemID,Score),(ItemID,Score),...]
o ...

• Load this from HDFS to a serving layer
o Relational Database
o Search Engine
o NoSQL
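
A hedged sketch of the loading step, using an in-memory SQLite database as a stand-in for the relational serving layer. It assumes the output files have been copied out of HDFS and that each text line looks like userID<TAB>[itemID:score,itemID:score,...] with numeric IDs. The slide shows the format only schematically, so check a sample of your own output before relying on this parser. The table and function names are hypothetical.

```python
import sqlite3

def parse_line(line):
    """Parse one assumed output line into (user_id, item_id, score) rows."""
    user_id, _, payload = line.strip().partition("\t")
    rows = []
    for pair in payload.strip("[]").split(","):
        if not pair:
            continue
        item_id, _, score = pair.partition(":")
        rows.append((int(user_id), int(item_id), float(score)))
    return rows

def load_recommendations(lines, conn):
    """Insert all parsed rows into a recommendations table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations "
        "(user_id INTEGER, item_id INTEGER, score REAL)"
    )
    rows = [row for line in lines if line.strip() for row in parse_line(line)]
    conn.executemany("INSERT INTO recommendations VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
sample_output = ["1\t[102:4.5,103:3.0]", "2\t[101:2.0]"]
print(load_recommendations(sample_output, conn))  # 3
print(conn.execute(
    "SELECT item_id, score FROM recommendations "
    "WHERE user_id = 1 ORDER BY score DESC"
).fetchall())
```

The same parsed rows could instead be written to a search engine or a NoSQL store, depending on how the serving layer looks up recommendations.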

Lessons
• Need to understand the business domain
o This takes time and effort

• Garbage In, Garbage Out
o Filtering is very important
• Start with a simple approach
o And then improve gradually

• Having an automated pipeline is very important
o It makes running more experiments with less effort possible
o Remember you will have to do lots of experiments
o But it is hard and takes time to build

Next stage of recommendation?
• Need realtime & scalable recommendation technology.
• Recommendation As A Service.
• www.myrrix.com
