Offline Evaluation of Recommender Systems
All pain and no gain?
Mark Levy
Mendeley
About me
Some things I built
Something I'm building
What is a good recommendation?
One that increases the usefulness of your product in the long run [1]

[1] WARNING: hard to measure directly
What is a good recommendation?
● One that increased your bottom line:
– User bought item after it was recommended
– User clicked ad after it was shown
– User didn't skip track when it was played
– User added document to library...
– User connected with contact...
Why was it good?
● Maybe it was
– Relevant
– Novel
– Familiar
– Serendipitous
– Well explained
● Note: some of these are mutually incompatible
What is a bad recommendation?
(you know one when you see one)
What is a bad recommendation?
● Maybe it was
– Not relevant
– Too obscure
– Too familiar
– I already have it
– I already know that I don't like it
– Badly explained
What's the cost of getting it wrong?
● Depends on your product and your users
– Lost revenue
– Less engaged user
– Angry user
– Amused user
– Confused user
– User defects to a rival product
Hypotheses
Good offline metrics express product goals

Most (really) bad recommendations can be caught by business logic
Issues
● Real business goals concern long-term user behaviour, e.g. Netflix:
“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”
● Usually have to settle for a short-term surrogate
● Only some user behaviour is visible
● Same constraints when collecting training data
Least bad solution?
● “Back to the future” aka historical log analysis
● Decide which logged event(s) indicate success
● Be honest about “success”
● Usually care most about precision @ small k
● Recall will discriminate once this plateaus
● Expect to have to do online testing too
Making metrics meaningful
● Building a test framework + data is hard
● Be sure to get best value from your work
● Don't use straw man baselines
● Be realistic – leave the ivory tower
● Make test setups and baselines reproducible
Making metrics meaningful
● Old skool k-NN systems are better than you think
– Input numbers from mining logs
– Temporal “modelling” (e.g. fake users)
– Data pruning (scalability, popularity bias, quality)
– Preprocessing (tf-idf, log/sqrt, …)
– Hand crafted similarity metric
– Hand crafted aggregation formula
– Postprocessing (popularity matching)
– Diversification
– Attention profile
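As an illustration of how simple those ingredients are, here is a toy item-item cosine similarity sketch with a sqrt preprocessing transform and known-item masking (illustrative data only, not the production pipeline described above):

```python
import numpy as np

def cosine_item_similarity(R):
    """Column-wise cosine similarity of a user x item interaction matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for unseen items
    Rn = R / norms
    return Rn.T @ Rn

# toy user x item counts, sqrt-transformed to damp heavy users
R = np.sqrt(np.array([[3.0, 0.0, 1.0],
                      [2.0, 1.0, 0.0],
                      [0.0, 4.0, 1.0]]))
S = cosine_item_similarity(R)

# recommend for user 0: aggregate similarity to consumed items, mask known items
consumed = R[0] > 0
scores = S @ consumed
scores[consumed] = -np.inf
print(int(np.argmax(scores)))
```

Every stage of the hand-crafted pipeline — the transform, the similarity metric, the aggregation, the masking — is a swappable line of code, which is exactly why tuned k-NN baselines are hard to beat.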
Making metrics meaningful
● Measure preference honestly
● Predicted items may not be “correct” just because they were consumed once
● Try to capture value
– Earlier recommendation may be better
– Don't need a recommender to suggest items by same artist/author
● Don't neglect side data
– At least use it for evaluation / sanity checking
Making metrics meaningful
● Public data isn't enough for reproducibility or fair comparison
● Need to document preprocessing
● Better: release your preparation/evaluation code too
What's the cost of poor evaluation?
Poor offline evaluation can lead to years of misdirected research
Ex 1: Reduce playlist skips
● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”
● Use audio similarity measure to compute transition distance, then travelling salesman
● Metric: sum of transition distances (lower is better)
● 6 months work to develop solution
Ex 1: Reduce playlist skips
● Result: users skipped more often
● Why?
Ex 1: Reduce playlist skips
● Result: users skipped more often
● When a user skipped a track they didn't like they were played something else just like it
● Better metric: average position of skipped tracks (based on logs, lower down is better)
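The better metric is cheap to compute from play logs. A sketch, assuming a hypothetical log format where each session is a list of (position, skipped) pairs:

```python
def mean_skip_position(sessions):
    """Average playlist position of skipped tracks across logged sessions.
    A larger value is better: skips have been pushed further down the list.
    Each session is a list of (position, skipped) pairs from the play logs."""
    positions = [pos for session in sessions
                 for pos, skipped in session if skipped]
    return sum(positions) / len(positions) if positions else float("nan")

# hypothetical logs: ordering B defers the user's skips later in the playlist
ordering_a = [[(0, True), (1, False), (2, True)]]
ordering_b = [[(0, False), (1, False), (2, True)]]
print(mean_skip_position(ordering_a))  # 1.0
print(mean_skip_position(ordering_b))  # 2.0
```

Unlike the sum of transition distances, this directly measures the behaviour the product cares about: how long users stay happy before reaching a track they reject.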
Ex 2: Recommend movies
● Use a corpus of star ratings to improve movie recommendations
● Learn to predict ratings for un-rated movies
● Metric: average RMSE of predictions for a hidden test set (lower is better)
● 2+ years work to develop new algorithms
Ex 2: Recommend movies
● Result: “best” solutions were never deployed
● Why?
Ex 2: Recommend movies
● Result: “best” solutions were never deployed
● User behaviour correlates with rank not RMSE
● Side datasets an order of magnitude more valuable than algorithm improvements
● Explicit ratings are the exception not the rule
● RMSE still haunts research labs
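The rank/RMSE mismatch is easy to demonstrate with toy numbers (hypothetical, purely for illustration): a model with far better RMSE can still order a pair of movies the wrong way round, and it is the ordering that drives what gets recommended.

```python
import math

def rmse(truth, pred):
    """Root mean squared error of rating predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

truth = [4.0, 3.5]    # the user actually prefers the first movie
model_a = [5.0, 1.0]  # poor RMSE, but ranks the movies correctly
model_b = [3.7, 3.8]  # excellent RMSE, but would recommend the wrong movie

print(rmse(truth, model_a), model_a[0] > model_a[1])
print(rmse(truth, model_b), model_b[0] > model_b[1])
```

Here model B beats model A on the contest metric while losing on the only comparison a top-N recommender ever makes.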
Can contests help?
● Good:
– Great for consistent evaluation
● Not so good:
– Privacy concerns mean obfuscated data
– No guarantee that metrics are meaningful
– No guarantee that train/test framework is valid
– Small datasets can become overexposed
Ex 3: Yahoo! Music KDD Cup
● Largest music rating dataset ever released
● Realistic “loved songs” classification task
● Data fully obfuscated due to recent lawsuits
Ex 3: Yahoo! Music KDD Cup
● Result: researchers hated it
● Why?
Ex 3: Yahoo! Music KDD Cup
● Result: researchers hated it
● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings
Ex 4: Million Song Challenge
● Large music dataset with rich metadata
● Anonymized listening histories
● Simple item recommendation task
● Reasonable MAP@500 metric
● Aimed to solve shortcomings of KDD Cup
● Only obfuscation was removal of timestamps
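MAP@k, the family the challenge's MAP@500 belongs to, averages precision over the ranks where relevant items appear, then averages over users. A minimal sketch (illustrative names; k=500 in the actual contest):

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k: mean of precision@i over the ranks i where a relevant item appears."""
    score, hits = 0.0, 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k):
    """Mean AP@k over all test users."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# one user: hits at ranks 1 and 3 out of two relevant items
print(average_precision_at_k(["a", "b", "c"], {"a", "c"}, 3))
```

The metric itself is reasonable; as the next slides show, the problem lay in how the test listening events were chosen, not in how they were scored.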
Ex 4: Million Song Challenge
● Result: winning entry didn't use side data
● Why?
Ex 4: Million Song Challenge
● Result: winning entry didn't use side data
● No timestamps so test tracks chosen at random
● So “people who listen to A also listen to B”
● Traditional item similarity solves this well
● More honesty about “success” might have shown that contest data was flawed
Ex 5: Yelp RecSys Challenge
● Small business review dataset with side data
● Realistic mix of input data types
● Rating prediction task
● Informal procedure to create train/test sets
Ex 5: Yelp RecSys Challenge
● Result: baseline algorithms high up leaderboard
● Why?
Ex 5: Yelp RecSys Challenge
● Result: baseline algorithms high up leaderboard
● Train/test split was corrupt
● Competition organisers moved fast to fix this
● But left only one week before deadline
Ex 6: MIREX Audio Chord Estimation
● Small dataset of audio tracks
● Task to label with predicted chord symbols
● Human labelled data hard to come by
● Contest hosted by premier forum in field
● Evaluate frame-level prediction accuracy
● Historical glass ceiling around 80%
Ex 6: MIREX Audio Chord Estimation
● Result: 2011 winner ftw
● Why?
Ex 6: MIREX Audio Chord Estimation
● Result: 2011 winner ftw
● Spoof entry relying on known test set
● Protest against inadequate test data
● Other research showed weak generalisation of winning algorithms from same contest
● Next year results dropped significantly
So why evaluate offline at all?
● Building a test framework ensures clear goals
● Avoid wishful thinking if your data is too thin
● Be efficient with precious online testing
– Cut down huge parameter space
– Don't alienate users
● Need to publish
● Pursuing science as well as profit
Online evaluation is tricky too
● No off the shelf solution for services
● Many statistical gotchas
● Same mismatch between short-term and long-term success criteria
● Results open to interpretation by management
● Can make incremental improvements look good when radical innovation is needed
Ex 7: Article Recommendations
● Recommender for related research articles
● Massive download logs available
● Framework developed based on co-downloads
● Aim to improve on existing search solution
● Management “keen for it to work”
● Several weeks of live A/B testing available
● No offline evaluation
Ex 7: Article Recommendations
● Result: worse than similar title search
● Why?
Ex 7: Article Recommendations
● Result: worse than similar title search
● Inadequate business rules, e.g. often suggesting other articles from the same publication
● Users identified only by organisational IP range, so value of “big data” very limited
● Establishing an offline evaluation protocol would have shown these problems in advance
Isn't there software for that?
Rules of the game:
– Model fit metrics (e.g. validation loss) don't count
– Need a transparent “audit trail” of data to support genuine reproducibility
– Just using public datasets doesn't ensure this
Isn't there software for that?
Wish list for reproducible evaluation:
– Integrate with recommender implementations
– Handle data formats and preprocessing
– Handle splitting, cross-validation, side datasets
– Save everything to file
– Work from file inputs so not tied to one framework
– Generate meaningful metrics
– Well documented and easy to use
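The “save everything to file” items on the wish list can be sketched in a few lines. A hypothetical helper (not part of any of the tools discussed here): a seeded split whose provenance is written out alongside the data, so anyone can audit or recreate the exact test set.

```python
import json
import random

def make_split(interactions, test_fraction=0.2, seed=42, out_prefix="split"):
    """Deterministic train/test split, written to file with its own
    provenance (seed and fraction) so the evaluation is reproducible."""
    rng = random.Random(seed)              # fixed seed => identical split every run
    shuffled = interactions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    for name, part in [("train", train), ("test", test)]:
        with open(f"{out_prefix}.{name}.json", "w") as f:
            json.dump({"seed": seed, "fraction": test_fraction, "data": part}, f)
    return train, test

# toy (user, item) interactions: 10 users x 5 items
train, test = make_split([[u, i] for u in range(10) for i in range(5)])
print(len(train), len(test))  # 40 10
```

The point is the audit trail, not the splitting logic: because the seed and fraction live in the output files, “which data was this number computed on?” always has a checkable answer.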
Isn't there software for that?
Current offerings:
● GraphChi/GraphLab
● Mahout
● LensKit
● MyMediaLite
Isn't there software for that?
Current offerings:
● GraphChi/GraphLab
– Model validation loss, doesn't count
● Mahout
– Only rating prediction accuracy, doesn't count
● LensKit
– Too hard to understand, won't use
Isn't there software for that?
Current offerings:
● MyMediaLite
– Reports meaningful metrics
– Handles cross-validation
– Data splitting not transparent
– No support for pre-processing
– No built in support for standalone evaluation
– API is capable but current utils don't meet wishlist
Eating your own dog food
● Built a small framework around a new algorithm
● https://github.com/mendeley/mrec
– Reports meaningful metrics
– Handles cross-validation
– Supports simple pre-processing
– Writes everything to file for reproducibility
– Provides API and utility scripts
– Runs standalone evaluations
– Readable Python code
Eating your own dog food
● Some lessons learned:
– Usable frameworks are hard to write
– Tradeoff between clarity and scalability
– Should generate explicit validation sets
● Please contribute!
● Or use as inspiration to improve existing tools
Where next?
● Shift evaluation online:
– Contests based around online evaluation
– Realistic but not reproducible
– Could some run continuously?
● Recommender Systems as a commodity:
– Software and services reaching maturity now
– Business users can tune/evaluate themselves
– Is there a way to report results?
Where next?
● Support alternative query paradigms:
– More like this, less like that
– Metrics for dynamic/online recommenders
● Support recommendation with side data:
– LibFM, GenSGD, WARP research @google, …
– Open datasets?
Thanks for listening
mark.levy@mendeley.com
@gamboviol
https://github.com/gamboviol
https://github.com/mendeley/mrec

Contenu connexe

Tendances

An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...PyData
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesAlan Said
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain
 
Recommender Systems - A Review and Recent Research Trends
Recommender Systems  -  A Review and Recent Research TrendsRecommender Systems  -  A Review and Recent Research Trends
Recommender Systems - A Review and Recent Research TrendsSujoy Bag
 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineeringalessio_ferrari
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In IndustryXavier Amatriain
 
Case Study Research in Software Engineering
Case Study Research in Software EngineeringCase Study Research in Software Engineering
Case Study Research in Software Engineeringalessio_ferrari
 
Aiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionAiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionDeepak Agarwal
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Ahmed Magdy Ezzeldin, MSc.
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studiesalessio_ferrari
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...alessio_ferrari
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overviewalessio_ferrari
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialXavier Amatriain
 
Recommender system a-introduction
Recommender system a-introductionRecommender system a-introduction
Recommender system a-introductionzh3f
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFMLiangjie Hong
 
Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...Eduardo Castillejo Gil
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques sun9413
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsLifeng (Aaron) Han
 

Tendances (20)

An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System Challenges
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
Recommender Systems - A Review and Recent Research Trends
Recommender Systems  -  A Review and Recent Research TrendsRecommender Systems  -  A Review and Recent Research Trends
Recommender Systems - A Review and Recent Research Trends
 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineering
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In Industry
 
Case Study Research in Software Engineering
Case Study Research in Software EngineeringCase Study Research in Software Engineering
Case Study Research in Software Engineering
 
Aiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionAiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversion
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studies
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
 
Recommender system a-introduction
Recommender system a-introductionRecommender system a-introduction
Recommender system a-introduction
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFM
 
Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methods
 

En vedette

Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахromovpa
 
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation AlgorithmsContrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation AlgorithmsMarco Rossetti
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmMark Levy
 
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware RecommendationYONG ZHENG
 
Spatially Aware Recommendation System
Spatially Aware Recommendation SystemSpatially Aware Recommendation System
Spatially Aware Recommendation SystemVeer Chandra
 
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Aravind Sesagiri Raamkumar
 
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...Dr. Haxel Consult
 
Recommendation system
Recommendation systemRecommendation system
Recommendation systemRishabh Mehta
 
Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems: Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems: Vincent Chu
 
Impersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of HadoopImpersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of HadoopKostiantyn Kudriavtsev
 
Trust and Recommender Systems
Trust and  Recommender SystemsTrust and  Recommender Systems
Trust and Recommender Systemszhayefei
 
e-learning 3.0 and AI
e-learning 3.0 and AIe-learning 3.0 and AI
e-learning 3.0 and AINeil Rubens
 
Profile injection attack detection in recommender system
Profile injection attack detection in recommender systemProfile injection attack detection in recommender system
Profile injection attack detection in recommender systemASHISH PANNU
 
Crowd sourcing for tempo estimation
Crowd sourcing for tempo estimationCrowd sourcing for tempo estimation
Crowd sourcing for tempo estimationMark Levy
 
Recommender Systems in E-Commerce
Recommender Systems in E-CommerceRecommender Systems in E-Commerce
Recommender Systems in E-CommerceRoger Chen
 
Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)Neil Rubens
 
Thesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, DitaThesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, DitaAcel Carl David O, Dolindo
 
Computerized payroll system
Computerized payroll systemComputerized payroll system
Computerized payroll systemFrancis Genavia
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

En vedette (20)

Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системах
 
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation AlgorithmsContrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
 
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
 
Spatially Aware Recommendation System
Spatially Aware Recommendation SystemSpatially Aware Recommendation System
Spatially Aware Recommendation System
 
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
 
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
 
Recommendation system
Recommendation systemRecommendation system
Recommendation system
 
Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems: Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems:
 
Recommender systems
Recommender systemsRecommender systems
Recommender systems
 
Impersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of HadoopImpersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of Hadoop
 
Trust and Recommender Systems
Trust and  Recommender SystemsTrust and  Recommender Systems
Trust and Recommender Systems
 
e-learning 3.0 and AI
e-learning 3.0 and AIe-learning 3.0 and AI
e-learning 3.0 and AI
 
Profile injection attack detection in recommender system
Profile injection attack detection in recommender systemProfile injection attack detection in recommender system
Profile injection attack detection in recommender system
 
Crowd sourcing for tempo estimation
Crowd sourcing for tempo estimationCrowd sourcing for tempo estimation
Crowd sourcing for tempo estimation
 
Recommender Systems in E-Commerce
Recommender Systems in E-CommerceRecommender Systems in E-Commerce
Recommender Systems in E-Commerce
 
Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)
 
Thesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, DitaThesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, Dita
 
Computerized payroll system
Computerized payroll systemComputerized payroll system
Computerized payroll system
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Similaire à Offline evaluation of recommender systems: all pain and no gain?

presentation.pdf
presentation.pdfpresentation.pdf
presentation.pdfcaa28steve
 
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...Lauren Cormack
 
Applied Data Science for E-Commerce
Applied Data Science for E-CommerceApplied Data Science for E-Commerce
Applied Data Science for E-CommerceArul Bharathi
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Nondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsNondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsTomer Gabel
 
Testing & Optimization - A Deeper Look
Testing & Optimization - A Deeper LookTesting & Optimization - A Deeper Look
Testing & Optimization - A Deeper LookCaleb Whitmore
 
Data Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsData Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsVivastream
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisSwiss Big Data User Group
 
Game analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesGame analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesChristian Beckers
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyKris Jack
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Demise of test scripts rise of test ideas
Demise of test scripts rise of test ideasDemise of test scripts rise of test ideas
Demise of test scripts rise of test ideasRichard Robinson
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyMaya Hristakeva
 
UK GIAF: Winter 2015
UK GIAF: Winter 2015UK GIAF: Winter 2015
UK GIAF: Winter 2015deltaDNA
 
Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22Matthias Schuurmans
 
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleQcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleXavier Amatriain
 
The five essential steps to building a data product
The five essential steps to building a data productThe five essential steps to building a data product
The five essential steps to building a data productBirst
 
Legis pactum building high performance teams
Legis pactum   building high performance teamsLegis pactum   building high performance teams
Legis pactum building high performance teamsMiguel Pinto
 



Offline evaluation of recommender systems: all pain and no gain?

  • 1. Offline Evaluation of Recommender Systems All pain and no gain? Mark Levy Mendeley
  • 6. What is a good recommendation?
  • 7. What is a good recommendation? One that increases the usefulness of your product in the long run1 1. WARNING: hard to measure directly
  • 8. What is a good recommendation?
    ● One that increased your bottom line:
      – User bought item after it was recommended
      – User clicked ad after it was shown
      – User didn't skip track when it was played
      – User added document to library...
      – User connected with contact...
  • 9. Why was it good?
  • 10. Why was it good?
    ● Maybe it was
      – Relevant
      – Novel
      – Familiar
      – Serendipitous
      – Well explained
    ● Note: some of these are mutually incompatible
  • 11. What is a bad recommendation?
  • 12. What is a bad recommendation? (you know one when you see one)
  • 13. What is a bad recommendation?
  • 14. What is a bad recommendation?
  • 15. What is a bad recommendation?
  • 16. What is a bad recommendation?
    ● Maybe it was
      – Not relevant
      – Too obscure
      – Too familiar
      – I already have it
      – I already know that I don't like it
      – Badly explained
  • 17. What's the cost of getting it wrong?
    ● Depends on your product and your users
      – Lost revenue
      – Less engaged user
      – Angry user
      – Amused user
      – Confused user
      – User defects to a rival product
  • 18. Hypotheses
    Good offline metrics express product goals
    Most (really) bad recommendations can be caught by business logic
  • 19. Issues
    ● Real business goals concern long-term user behaviour, e.g. Netflix: “we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”
    ● Usually have to settle for short-term surrogate
    ● Only some user behaviour is visible
    ● Same constraints when collecting training data
  • 20. Least bad solution?
    ● “Back to the future” aka historical log analysis
    ● Decide which logged event(s) indicate success
    ● Be honest about “success”
    ● Usually care most about precision @ small k
    ● Recall will discriminate once this plateaus
    ● Expect to have to do online testing too
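The emphasis on precision at small k can be made concrete. A minimal sketch of the two metrics (the `recommended` list and `relevant` held-out set below are invented toy data, not from the talk):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations found in the held-out 'success' events."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of held-out items recovered within the top-k recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Toy example: 2 of the top 3 recommendations were later consumed
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(precision_at_k(recommended, relevant, 3))  # 2 hits / 3 shown
print(recall_at_k(recommended, relevant, 3))     # 2 recovered / 3 held out
```

At small k the two numbers coincide here by accident; in practice precision@k discriminates first, and recall only starts to separate systems once precision plateaus, as the slide notes.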
  • 21. Making metrics meaningful
    ● Building a test framework + data is hard
    ● Be sure to get best value from your work
    ● Don't use straw man baselines
    ● Be realistic – leave the ivory tower
    ● Make test setups and baselines reproducible
  • 22. Making metrics meaningful
    ● Old skool k-NN systems are better than you think
      – Input numbers from mining logs
      – Temporal “modelling” (e.g. fake users)
      – Data pruning (scalability, popularity bias, quality)
      – Preprocessing (tf-idf, log/sqrt, …)
      – Hand crafted similarity metric
      – Hand crafted aggregation formula
      – Postprocessing (popularity matching)
      – Diversification
      – Attention profile
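The core of such an old-school pipeline might look like this: a log transform over play counts followed by a hand-crafted cosine item similarity. The data, the `log1p` weighting, and the helper names are all illustrative assumptions, not details from the talk:

```python
import math
from collections import defaultdict

# Toy user -> {item: play count} data mined from logs (illustrative only)
plays = {
    "u1": {"A": 10, "B": 2},
    "u2": {"A": 5, "C": 1},
    "u3": {"B": 3, "C": 4},
}

# Preprocessing: a log transform dampens heavy listeners, one of many choices
weights = {u: {i: math.log1p(c) for i, c in items.items()} for u, items in plays.items()}

# Invert to sparse item vectors indexed by user
item_vecs = defaultdict(dict)
for u, items in weights.items():
    for i, w in items.items():
        item_vecs[i][u] = w

def cosine(a, b):
    """Hand-crafted similarity: cosine over sparse user-weight vectors."""
    dot = sum(a[u] * b[u] for u in a if u in b)
    if not dot:
        return 0.0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def most_similar(item, k=2):
    scores = {j: cosine(item_vecs[item], vec) for j, vec in item_vecs.items() if j != item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(most_similar("A"))  # B shares the heaviest listener with A on this toy data
```

Every step here (weighting, similarity, aggregation) is a tunable hand-crafted choice, which is exactly why such baselines are stronger than they look.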
  • 23. Making metrics meaningful
    ● Measure preference honestly
    ● Predicted items may not be “correct” just because they were consumed once
    ● Try to capture value
      – Earlier recommendation may be better
      – Don't need a recommender to suggest items by same artist/author
    ● Don't neglect side data
      – At least use it for evaluation / sanity checking
  • 24. Making metrics meaningful
    ● Public data isn't enough for reproducibility or fair comparison
    ● Need to document preprocessing
    ● Better: release your preparation/evaluation code too
  • 25. What's the cost of poor evaluation?
  • 26. What's the cost of poor evaluation? Poor offline evaluation can lead to years of misdirected research
  • 27. Ex 1: Reduce playlist skips
    ● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”
    ● Use audio similarity measure to compute transition distance, then travelling salesman
    ● Metric: sum of transition distances (lower is better)
    ● 6 months work to develop solution
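The travelling-salesman step could be approximated with a greedy nearest-neighbour tour over pairwise transition distances. This sketch uses a made-up distance matrix; the real system derived distances from audio similarity:

```python
def greedy_order(tracks, dist):
    """Greedy nearest-neighbour approximation to a travelling-salesman ordering:
    always hop to the closest not-yet-played track."""
    order = [tracks[0]]
    remaining = set(tracks[1:])
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda t: dist[(last, t)])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Illustrative symmetric transition distances between four tracks
d = {}
pairs = {("t1", "t2"): 1.0, ("t1", "t3"): 5.0, ("t1", "t4"): 4.0,
         ("t2", "t3"): 2.0, ("t2", "t4"): 6.0, ("t3", "t4"): 1.5}
for (a, b), v in pairs.items():
    d[(a, b)] = d[(b, a)] = v

print(greedy_order(["t1", "t2", "t3", "t4"], d))
```

Note this optimises exactly the offline metric on the slide (sum of transition distances), which is the point of the example: the optimisation worked, and users still skipped more.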
  • 28. Ex 1: Reduce playlist skips
    ● Result: users skipped more often
    ● Why?
  • 29. Ex 1: Reduce playlist skips
    ● Result: users skipped more often
    ● When a user skipped a track they didn't like they were played something else just like it
    ● Better metric: average position of skipped tracks (based on logs, lower down is better)
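The improved metric is cheap to compute from logs. A sketch, assuming a hypothetical log format of (playlist position, was-skipped) events per session:

```python
def mean_skip_position(sessions):
    """Average 0-based playlist position at which skips occurred.
    Larger values mean skips happen further down the playlist, i.e. better."""
    positions = [pos for session in sessions for pos, skipped in session if skipped]
    return sum(positions) / len(positions) if positions else float("nan")

# Illustrative log fragments: two sessions with skips at positions 1 and 5
logs = [
    [(0, False), (1, True), (2, False)],
    [(4, False), (5, True)],
]
print(mean_skip_position(logs))  # (1 + 5) / 2 = 3.0
```

Unlike summed transition distance, this metric is defined directly on the logged user behaviour the product actually cares about.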
  • 30. Ex 2: Recommend movies
    ● Use a corpus of star ratings to improve movie recommendations
    ● Learn to predict ratings for un-rated movies
    ● Metric: average RMSE of predictions for a hidden test set (lower is better)
    ● 2+ years work to develop new algorithms
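For reference, the metric in question is straightforward. A minimal sketch with invented predicted and actual star ratings:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error of rating predictions over a hidden test set."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
```

Its simplicity is part of its appeal, and, per the next slides, part of the problem: it rewards accurate scores everywhere rather than good orderings at the top of the list.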
  • 31. Ex 2: Recommend movies
    ● Result: “best” solutions were never deployed
    ● Why?
  • 32. Ex 2: Recommend movies
    ● Result: “best” solutions were never deployed
    ● User behaviour correlates with rank not RMSE
    ● Side datasets an order of magnitude more valuable than algorithm improvements
    ● Explicit ratings are the exception not the rule
    ● RMSE still haunts research labs
  • 33. Can contests help?
    ● Good:
      – Great for consistent evaluation
    ● Not so good:
      – Privacy concerns mean obfuscated data
      – No guarantee that metrics are meaningful
      – No guarantee that train/test framework is valid
      – Small datasets can become overexposed
  • 34. Ex 3: Yahoo! Music KDD Cup
    ● Largest music rating dataset ever released
    ● Realistic “loved songs” classification task
    ● Data fully obfuscated due to recent lawsuits
  • 35. Ex 3: Yahoo! Music KDD Cup
    ● Result: researchers hated it
    ● Why?
  • 36. Ex 3: Yahoo! Music KDD Cup
    ● Result: researchers hated it
    ● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings
  • 37. Ex 4: Million Song Challenge
    ● Large music dataset with rich metadata
    ● Anonymized listening histories
    ● Simple item recommendation task
    ● Reasonable MAP@500 metric
    ● Aimed to solve shortcomings of KDD Cup
    ● Only obfuscation was removal of timestamps
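A MAP@k metric like the challenge's MAP@500 can be sketched as follows. The normalisation by min(|relevant|, k) is one common convention and an assumption here, as is all the toy data:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k: average of precision@i over the ranks i where a relevant item appears,
    normalised by the best achievable number of hits within k."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_relevant, k):
    """Mean of AP@k over all test users."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

# Two toy users: one with hits at ranks 1 and 3, one with a hit at rank 3
recs = [["a", "b", "c"], ["x", "y", "z"]]
rels = [{"a", "c"}, {"z"}]
print(map_at_k(recs, rels, 3))
```

Because AP rewards relevant items ranked early, MAP@k is at least aligned with rank, which is more than could be said of RMSE in the previous example.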
  • 38. Ex 4: Million Song Challenge
    ● Result: winning entry didn't use side data
    ● Why?
  • 39. Ex 4: Million Song Challenge
    ● Result: winning entry didn't use side data
    ● No timestamps so test tracks chosen at random
    ● So “people who listen to A also listen to B”
    ● Traditional item similarity solves this well
    ● More honesty about “success” might have shown that contest data was flawed
  • 40. Ex 5: Yelp RecSys Challenge
    ● Small business review dataset with side data
    ● Realistic mix of input data types
    ● Rating prediction task
    ● Informal procedure to create train/test sets
  • 41. Ex 5: Yelp RecSys Challenge
    ● Result: baseline algorithms high up leaderboard
    ● Why?
  • 42. Ex 5: Yelp RecSys Challenge
    ● Result: baseline algorithms high up leaderboard
    ● Train/test split was corrupt
    ● Competition organisers moved fast to fix this
    ● But left only one week before deadline
  • 43. Ex 6: MIREX Audio Chord Estimation
    ● Small dataset of audio tracks
    ● Task to label with predicted chord symbols
    ● Human labelled data hard to come by
    ● Contest hosted by premier forum in field
    ● Evaluate frame-level prediction accuracy
    ● Historical glass ceiling around 80%
  • 44. Ex 6: MIREX Audio Chord Estimation
    ● Result: 2011 winner ftw
    ● Why?
  • 45. Ex 6: MIREX Audio Chord Estimation
    ● Result: 2011 winner ftw
    ● Spoof entry relying on known test set
    ● Protest against inadequate test data
    ● Other research showed weak generalisation of winning algorithms from same contest
    ● Next year results dropped significantly
  • 46. So why evaluate offline at all?
    ● Building test framework ensures clear goals
    ● Avoid wishful thinking if your data is too thin
    ● Be efficient with precious online testing
      – Cut down huge parameter space
      – Don't alienate users
    ● Need to publish
    ● Pursuing science as well as profit
  • 47. Online evaluation is tricky too
    ● No off the shelf solution for services
    ● Many statistical gotchas
    ● Same mismatch between short-term and long-term success criteria
    ● Results open to interpretation by management
    ● Can make incremental improvements look good when radical innovation is needed
  • 48. Ex 7: Article Recommendations
    ● Recommender for related research articles
    ● Massive download logs available
    ● Framework developed based on co-downloads
    ● Aim to improve on existing search solution
    ● Management “keen for it to work”
    ● Several weeks of live A/B testing available
    ● No offline evaluation
  • 49. Ex 7: Article Recommendations
    ● Result: worse than similar title search
    ● Why?
  • 50. Ex 7: Article Recommendations
    ● Result: worse than similar title search
    ● Inadequate business rules, e.g. often suggesting other articles from same publication
    ● Users identified only by organisational IP range so value of “big data” very limited
    ● Establishing an offline evaluation protocol would have shown these in advance
  • 51. Isn't there software for that?
    Rules of the game:
      – Model fit metrics (e.g. validation loss) don't count
      – Need a transparent “audit trail” of data to support genuine reproducibility
      – Just using public datasets doesn't ensure this
  • 52. Isn't there software for that?
    Wish list for reproducible evaluation:
      – Integrate with recommender implementations
      – Handle data formats and preprocessing
      – Handle splitting, cross-validation, side datasets
      – Save everything to file
      – Work from file inputs so not tied to one framework
      – Generate meaningful metrics
      – Well documented and easy to use
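The splitting and save-everything-to-file items on this wish list can be sketched in a few lines. The function name, file format, and toy interaction data below are invented for illustration:

```python
import json
import random

def reproducible_split(interactions, test_fraction=0.2, seed=42, out_prefix="split"):
    """Shuffle with a fixed seed, split, and write the seed, parameters and the
    exact train/test sets to file, so the split can be audited and re-used
    from any framework that reads the file."""
    rng = random.Random(seed)
    data = sorted(interactions)  # canonical order before shuffling, for determinism
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    train, test = data[:cut], data[cut:]
    with open(f"{out_prefix}.json", "w") as f:
        json.dump({"seed": seed, "test_fraction": test_fraction,
                   "train": train, "test": test}, f)
    return train, test

# Toy (user, item) interactions
events = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "C"), ("u3", "B")]
train, test = reproducible_split(events)
print(len(train), len(test))  # 4 1
```

The written file is the audit trail: anyone can regenerate the metrics from it without re-running, or even knowing, the original splitting code.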
  • 53. Isn't there software for that?
    Current offerings:
    ● GraphChi/GraphLab
    ● Mahout
    ● LensKit
    ● MyMediaLite
  • 54. Isn't there software for that?
    Current offerings:
    ● GraphChi/GraphLab
      – Model validation loss, doesn't count
    ● Mahout
      – Only rating prediction accuracy, doesn't count
    ● LensKit
      – Too hard to understand, won't use
  • 55. Isn't there software for that?
    Current offerings:
    ● MyMediaLite
      – Reports meaningful metrics
      – Handles cross-validation
      – Data splitting not transparent
      – No support for pre-processing
      – No built in support for standalone evaluation
      – API is capable but current utils don't meet wishlist
  • 56. Eating your own dog food
    ● Built a small framework around new algorithm
    ● https://github.com/mendeley/mrec
      – Reports meaningful metrics
      – Handles cross-validation
      – Supports simple pre-processing
      – Writes everything to file for reproducibility
      – Provides API and utility scripts
      – Runs standalone evaluations
      – Readable Python code
  • 57. Eating your own dog food
    ● Some lessons learned
      – Usable frameworks are hard to write
      – Tradeoff between clarity and scalability
      – Should generate explicit validation sets
    ● Please contribute!
    ● Or use as inspiration to improve existing tools
  • 58. Where next?
    ● Shift evaluation online:
      – Contests based around online evaluation
      – Realistic but not reproducible
      – Could some run continuously?
    ● Recommender Systems as a commodity:
      – Software and services reaching maturity now
      – Business users can tune/evaluate themselves
      – Is there a way to report results?
  • 59. Where next?
    ● Support alternative query paradigms:
      – More like this, less like that
      – Metrics for dynamic/online recommenders
    ● Support recommendation with side data:
      – LibFM, GenSGD, WARP research @google, …
      – Open datasets?