Offline Evaluation of Recommender Systems
All pain and no gain?
Mark Levy
Mendeley
About me
Some things I built
Something I'm building
What is a good recommendation?
One that increases the usefulness of your product in the long run [1]

[1] WARNING: hard to measure directly
What is a good recommendation?
● One that increased your bottom line:
– User bought item after it was recommended
– User clicked ad after it was shown
– User didn't skip track when it was played
– User added document to library...
– User connected with contact...
Why was it good?
● Maybe it was
– Relevant
– Novel
– Familiar
– Serendipitous
– Well explained
● Note: some of these are mutually incompatible
What is a bad recommendation?
(you know one when you see one)
What is a bad recommendation?
● Maybe it was
– Not relevant
– Too obscure
– Too familiar
– I already have it
– I already know that I don't like it
– Badly explained
What's the cost of getting it wrong?
● Depends on your product and your users
– Lost revenue
– Less engaged user
– Angry user
– Amused user
– Confused user
– User defects to a rival product
Hypotheses
Good offline metrics express product goals

Most (really) bad recommendations can be caught by business logic
Issues
● Real business goals concern long-term user behaviour, e.g. Netflix:
“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”
● Usually have to settle for a short-term surrogate
● Only some user behaviour is visible
● Same constraints when collecting training data
Least bad solution?
● “Back to the future” aka historical log analysis
● Decide which logged event(s) indicate success
● Be honest about “success”
● Usually care most about precision @ small k
● Recall will discriminate once this plateaus
● Expect to have to do online testing too
Making metrics meaningful
● Building a test framework + data is hard
● Be sure to get best value from your work
● Don't use straw man baselines
● Be realistic – leave the ivory tower
● Make test setups and baselines reproducible
Making metrics meaningful
● Old skool k-NN systems are better than you think
– Input numbers from mining logs
– Temporal “modelling” (e.g. fake users)
– Data pruning (scalability, popularity bias, quality)
– Preprocessing (tf-idf, log/sqrt, …)
– Hand crafted similarity metric
– Hand crafted aggregation formula
– Postprocessing (popularity matching)
– Diversification
– Attention profile
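As an illustration of how simple those ingredients are, here is a toy item-item cosine similarity sketch with a sqrt preprocessing transform and known-item masking (illustrative data only, not the production pipeline described above):

```python
import numpy as np

def cosine_item_similarity(R):
    """Column-wise cosine similarity of a user x item interaction matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for unseen items
    Rn = R / norms
    return Rn.T @ Rn

# toy user x item counts, sqrt-transformed to damp heavy users
R = np.sqrt(np.array([[3.0, 0.0, 1.0],
                      [2.0, 1.0, 0.0],
                      [0.0, 4.0, 1.0]]))
S = cosine_item_similarity(R)

# recommend for user 0: aggregate similarity to consumed items, mask known items
consumed = R[0] > 0
scores = S @ consumed
scores[consumed] = -np.inf
print(int(np.argmax(scores)))
```

Every stage of the hand-crafted pipeline — the transform, the similarity metric, the aggregation, the masking — is a swappable line of code, which is exactly why tuned k-NN baselines are hard to beat.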
Making metrics meaningful
● Measure preference honestly
● Predicted items may not be “correct” just because they were consumed once
● Try to capture value
– Earlier recommendation may be better
– Don't need a recommender to suggest items by same artist/author
● Don't neglect side data
– At least use it for evaluation / sanity checking
Making metrics meaningful
● Public data isn't enough for reproducibility or fair comparison
● Need to document preprocessing
● Better: release your preparation/evaluation code too
What's the cost of poor evaluation?
Poor offline evaluation can lead to years of misdirected research
Ex 1: Reduce playlist skips
● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”
● Use audio similarity measure to compute transition distance, then travelling salesman
● Metric: sum of transition distances (lower is better)
● 6 months work to develop solution
Ex 1: Reduce playlist skips
● Result: users skipped more often
● Why?
Ex 1: Reduce playlist skips
● Result: users skipped more often
● When a user skipped a track they didn't like they were played something else just like it
● Better metric: average position of skipped tracks (based on logs, lower down is better)
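The better metric is cheap to compute from play logs. A sketch, assuming a hypothetical log format where each session is a list of (position, skipped) pairs:

```python
def mean_skip_position(sessions):
    """Average playlist position of skipped tracks across logged sessions.
    A larger value is better: skips have been pushed further down the list.
    Each session is a list of (position, skipped) pairs from the play logs."""
    positions = [pos for session in sessions
                 for pos, skipped in session if skipped]
    return sum(positions) / len(positions) if positions else float("nan")

# hypothetical logs: ordering B defers the user's skips later in the playlist
ordering_a = [[(0, True), (1, False), (2, True)]]
ordering_b = [[(0, False), (1, False), (2, True)]]
print(mean_skip_position(ordering_a))  # 1.0
print(mean_skip_position(ordering_b))  # 2.0
```

Unlike the sum of transition distances, this directly measures the behaviour the product cares about: how long users stay happy before reaching a track they reject.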
Ex 2: Recommend movies
● Use a corpus of star ratings to improve movie recommendations
● Learn to predict ratings for un-rated movies
● Metric: average RMSE of predictions for a hidden test set (lower is better)
● 2+ years work to develop new algorithms
Ex 2: Recommend movies
● Result: “best” solutions were never deployed
● Why?
Ex 2: Recommend movies
● Result: “best” solutions were never deployed
● User behaviour correlates with rank not RMSE
● Side datasets an order of magnitude more valuable than algorithm improvements
● Explicit ratings are the exception not the rule
● RMSE still haunts research labs
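The rank/RMSE mismatch is easy to demonstrate with toy numbers (hypothetical, purely for illustration): a model with far better RMSE can still order a pair of movies the wrong way round, and it is the ordering that drives what gets recommended.

```python
import math

def rmse(truth, pred):
    """Root mean squared error of rating predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

truth = [4.0, 3.5]    # the user actually prefers the first movie
model_a = [5.0, 1.0]  # poor RMSE, but ranks the movies correctly
model_b = [3.7, 3.8]  # excellent RMSE, but would recommend the wrong movie

print(rmse(truth, model_a), model_a[0] > model_a[1])
print(rmse(truth, model_b), model_b[0] > model_b[1])
```

Here model B beats model A on the contest metric while losing on the only comparison a top-N recommender ever makes.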
Can contests help?
● Good:
– Great for consistent evaluation
● Not so good:
– Privacy concerns mean obfuscated data
– No guarantee that metrics are meaningful
– No guarantee that train/test framework is valid
– Small datasets can become overexposed
Ex 3: Yahoo! Music KDD Cup
● Largest music rating dataset ever released
● Realistic “loved songs” classification task
● Data fully obfuscated due to recent lawsuits
Ex 3: Yahoo! Music KDD Cup
● Result: researchers hated it
● Why?
Ex 3: Yahoo! Music KDD Cup
● Result: researchers hated it
● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings
Ex 4: Million Song Challenge
● Large music dataset with rich metadata
● Anonymized listening histories
● Simple item recommendation task
● Reasonable MAP@500 metric
● Aimed to solve shortcomings of KDD Cup
● Only obfuscation was removal of timestamps
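MAP@k, the family the challenge's MAP@500 belongs to, averages precision over the ranks where relevant items appear, then averages over users. A minimal sketch (illustrative names; k=500 in the actual contest):

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k: mean of precision@i over the ranks i where a relevant item appears."""
    score, hits = 0.0, 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k):
    """Mean AP@k over all test users."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# one user: hits at ranks 1 and 3 out of two relevant items
print(average_precision_at_k(["a", "b", "c"], {"a", "c"}, 3))
```

The metric itself is reasonable; as the next slides show, the problem lay in how the test listening events were chosen, not in how they were scored.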
Ex 4: Million Song Challenge
● Result: winning entry didn't use side data
● Why?
Ex 4: Million Song Challenge
● Result: winning entry didn't use side data
● No timestamps so test tracks chosen at random
● So “people who listen to A also listen to B”
● Traditional item similarity solves this well
● More honesty about “success” might have shown that contest data was flawed
Ex 5: Yelp RecSys Challenge
● Small business review dataset with side data
● Realistic mix of input data types
● Rating prediction task
● Informal procedure to create train/test sets
Ex 5: Yelp RecSys Challenge
● Result: baseline algorithms high up leaderboard
● Why?
Ex 5: Yelp RecSys Challenge
● Result: baseline algorithms high up leaderboard
● Train/test split was corrupt
● Competition organisers moved fast to fix this
● But left only one week before deadline
Ex 6: MIREX Audio Chord Estimation
● Small dataset of audio tracks
● Task to label with predicted chord symbols
● Human labelled data hard to come by
● Contest hosted by premier forum in field
● Evaluate frame-level prediction accuracy
● Historical glass ceiling around 80%
Ex 6: MIREX Audio Chord Estimation
● Result: 2011 winner ftw
● Why?
Ex 6: MIREX Audio Chord Estimation
● Result: 2011 winner ftw
● Spoof entry relying on known test set
● Protest against inadequate test data
● Other research showed weak generalisation of winning algorithms from same contest
● Next year results dropped significantly
So why evaluate offline at all?
● Building a test framework ensures clear goals
● Avoid wishful thinking if your data is too thin
● Be efficient with precious online testing
– Cut down huge parameter space
– Don't alienate users
● Need to publish
● Pursuing science as well as profit
Online evaluation is tricky too
● No off the shelf solution for services
● Many statistical gotchas
● Same mismatch between short-term and long-term success criteria
● Results open to interpretation by management
● Can make incremental improvements look good when radical innovation is needed
Ex 7: Article Recommendations
● Recommender for related research articles
● Massive download logs available
● Framework developed based on co-downloads
● Aim to improve on existing search solution
● Management “keen for it to work”
● Several weeks of live A/B testing available
● No offline evaluation
Ex 7: Article Recommendations
● Result: worse than similar title search
● Why?
Ex 7: Article Recommendations
● Result: worse than similar title search
● Inadequate business rules, e.g. often suggesting other articles from the same publication
● Users identified only by organisational IP range, so value of “big data” very limited
● Establishing an offline evaluation protocol would have shown these problems in advance
Isn't there software for that?
Rules of the game:
– Model fit metrics (e.g. validation loss) don't count
– Need a transparent “audit trail” of data to support genuine reproducibility
– Just using public datasets doesn't ensure this
Isn't there software for that?
Wish list for reproducible evaluation:
– Integrate with recommender implementations
– Handle data formats and preprocessing
– Handle splitting, cross-validation, side datasets
– Save everything to file
– Work from file inputs so not tied to one framework
– Generate meaningful metrics
– Well documented and easy to use
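The “save everything to file” items on the wish list can be sketched in a few lines. A hypothetical helper (not part of any of the tools discussed here): a seeded split whose provenance is written out alongside the data, so anyone can audit or recreate the exact test set.

```python
import json
import random

def make_split(interactions, test_fraction=0.2, seed=42, out_prefix="split"):
    """Deterministic train/test split, written to file with its own
    provenance (seed and fraction) so the evaluation is reproducible."""
    rng = random.Random(seed)              # fixed seed => identical split every run
    shuffled = interactions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    for name, part in [("train", train), ("test", test)]:
        with open(f"{out_prefix}.{name}.json", "w") as f:
            json.dump({"seed": seed, "fraction": test_fraction, "data": part}, f)
    return train, test

# toy (user, item) interactions: 10 users x 5 items
train, test = make_split([[u, i] for u in range(10) for i in range(5)])
print(len(train), len(test))  # 40 10
```

The point is the audit trail, not the splitting logic: because the seed and fraction live in the output files, “which data was this number computed on?” always has a checkable answer.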
Isn't there software for that?
Current offerings:
● GraphChi/GraphLab
● Mahout
● LensKit
● MyMediaLite
Isn't there software for that?
Current offerings:
● GraphChi/GraphLab
– Model validation loss, doesn't count
● Mahout
– Only rating prediction accuracy, doesn't count
● LensKit
– Too hard to understand, won't use
Isn't there software for that?
Current offerings:
● MyMediaLite
– Reports meaningful metrics
– Handles cross-validation
– Data splitting not transparent
– No support for pre-processing
– No built in support for standalone evaluation
– API is capable but current utils don't meet wishlist
Eating your own dog food
● Built a small framework around a new algorithm
● https://github.com/mendeley/mrec
– Reports meaningful metrics
– Handles cross-validation
– Supports simple pre-processing
– Writes everything to file for reproducibility
– Provides API and utility scripts
– Runs standalone evaluations
– Readable Python code
Eating your own dog food
● Some lessons learned:
– Usable frameworks are hard to write
– Tradeoff between clarity and scalability
– Should generate explicit validation sets
● Please contribute!
● Or use as inspiration to improve existing tools
Where next?
● Shift evaluation online:
– Contests based around online evaluation
– Realistic but not reproducible
– Could some run continuously?
● Recommender Systems as a commodity:
– Software and services reaching maturity now
– Business users can tune/evaluate themselves
– Is there a way to report results?
Where next?
● Support alternative query paradigms:
– More like this, less like that
– Metrics for dynamic/online recommenders
● Support recommendation with side data:
– LibFM, GenSGD, WARP research @google, …
– Open datasets?
Thanks for listening
mark.levy@mendeley.com
@gamboviol
https://github.com/gamboviol
https://github.com/mendeley/mrec

Contenu connexe

Tendances

An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...PyData
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesAlan Said
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain
 
Recommender Systems - A Review and Recent Research Trends
Recommender Systems  -  A Review and Recent Research TrendsRecommender Systems  -  A Review and Recent Research Trends
Recommender Systems - A Review and Recent Research TrendsSujoy Bag
 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineeringalessio_ferrari
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In IndustryXavier Amatriain
 
Case Study Research in Software Engineering
Case Study Research in Software EngineeringCase Study Research in Software Engineering
Case Study Research in Software Engineeringalessio_ferrari
 
Aiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionAiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionDeepak Agarwal
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Ahmed Magdy Ezzeldin, MSc.
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studiesalessio_ferrari
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...alessio_ferrari
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overviewalessio_ferrari
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialXavier Amatriain
 
Recommender system a-introduction
Recommender system a-introductionRecommender system a-introduction
Recommender system a-introductionzh3f
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFMLiangjie Hong
 
Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...Eduardo Castillejo Gil
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques sun9413
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsLifeng (Aaron) Han
 

Tendances (20)

An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System Challenges
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
Recommender Systems - A Review and Recent Research Trends
Recommender Systems  -  A Review and Recent Research TrendsRecommender Systems  -  A Review and Recent Research Trends
Recommender Systems - A Review and Recent Research Trends
 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineering
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In Industry
 
Case Study Research in Software Engineering
Case Study Research in Software EngineeringCase Study Research in Software Engineering
Case Study Research in Software Engineering
 
Aiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversionAiinpractice2017deepaklongversion
Aiinpractice2017deepaklongversion
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studies
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
 
Recommender system a-introduction
Recommender system a-introductionRecommender system a-introduction
Recommender system a-introduction
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFM
 
Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...Alleviating cold-user start problem with users' social network data in recomm...
Alleviating cold-user start problem with users' social network data in recomm...
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methods
 

En vedette

Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахromovpa
 
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation AlgorithmsContrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation AlgorithmsMarco Rossetti
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmMark Levy
 
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware RecommendationYONG ZHENG
 
Spatially Aware Recommendation System
Spatially Aware Recommendation SystemSpatially Aware Recommendation System
Spatially Aware Recommendation SystemVeer Chandra
 
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Aravind Sesagiri Raamkumar
 
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...Dr. Haxel Consult
 
Recommendation system
Recommendation systemRecommendation system
Recommendation systemRishabh Mehta
 
Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems: Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems: Vincent Chu
 
Impersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of HadoopImpersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of HadoopKostiantyn Kudriavtsev
 
Trust and Recommender Systems
Trust and  Recommender SystemsTrust and  Recommender Systems
Trust and Recommender Systemszhayefei
 
e-learning 3.0 and AI
e-learning 3.0 and AIe-learning 3.0 and AI
e-learning 3.0 and AINeil Rubens
 
Profile injection attack detection in recommender system
Profile injection attack detection in recommender systemProfile injection attack detection in recommender system
Profile injection attack detection in recommender systemASHISH PANNU
 
Crowd sourcing for tempo estimation
Crowd sourcing for tempo estimationCrowd sourcing for tempo estimation
Crowd sourcing for tempo estimationMark Levy
 
Recommender Systems in E-Commerce
Recommender Systems in E-CommerceRecommender Systems in E-Commerce
Recommender Systems in E-CommerceRoger Chen
 
Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)Neil Rubens
 
Thesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, DitaThesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, DitaAcel Carl David O, Dolindo
 
Computerized payroll system
Computerized payroll systemComputerized payroll system
Computerized payroll systemFrancis Genavia
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

En vedette (20)

Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системах
 
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation AlgorithmsContrasting Offline and Online Results when Evaluating Recommendation Algorithms
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
 
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
[Decisions2013@RecSys]The Role of Emotions in Context-aware Recommendation
 
Spatially Aware Recommendation System
Spatially Aware Recommendation SystemSpatially Aware Recommendation System
Spatially Aware Recommendation System
 
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
 
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
II-SDV 2014 Recommender Systems for Analysis Applications (Roger Bradford - A...
 
Recommendation system
Recommendation systemRecommendation system
Recommendation system
 
Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems: Toward the Next Generation of Recommender Systems:
Toward the Next Generation of Recommender Systems:
 
Recommender systems
Recommender systemsRecommender systems
Recommender systems
 
Impersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of HadoopImpersonal Recommendation system on top of Hadoop
Impersonal Recommendation system on top of Hadoop
 
Trust and Recommender Systems
Trust and  Recommender SystemsTrust and  Recommender Systems
Trust and Recommender Systems
 
e-learning 3.0 and AI
e-learning 3.0 and AIe-learning 3.0 and AI
e-learning 3.0 and AI
 
Profile injection attack detection in recommender system
Profile injection attack detection in recommender systemProfile injection attack detection in recommender system
Profile injection attack detection in recommender system
 
Crowd sourcing for tempo estimation
Crowd sourcing for tempo estimationCrowd sourcing for tempo estimation
Crowd sourcing for tempo estimation
 
Recommender Systems in E-Commerce
Recommender Systems in E-CommerceRecommender Systems in E-Commerce
Recommender Systems in E-Commerce
 
Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)Recommender Systems and Active Learning (for Startups)
Recommender Systems and Active Learning (for Startups)
 
Thesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, DitaThesis about Computerized Payroll System for Barangay Hall, Dita
Thesis about Computerized Payroll System for Barangay Hall, Dita
 
Computerized payroll system
Computerized payroll systemComputerized payroll system
Computerized payroll system
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Similaire à Offline evaluation of recommender systems: all pain and no gain?

presentation.pdf
presentation.pdfpresentation.pdf
presentation.pdfcaa28steve
 
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...Lauren Cormack
 
Applied Data Science for E-Commerce
Applied Data Science for E-CommerceApplied Data Science for E-Commerce
Applied Data Science for E-CommerceArul Bharathi
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Nondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsNondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsTomer Gabel
 
Testing & Optimization - A Deeper Look
Testing & Optimization - A Deeper LookTesting & Optimization - A Deeper Look
Testing & Optimization - A Deeper LookCaleb Whitmore
 
Data Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsData Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsVivastream
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisSwiss Big Data User Group
 
Game analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesGame analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesChristian Beckers
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyKris Jack
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Demise of test scripts rise of test ideas
Demise of test scripts rise of test ideasDemise of test scripts rise of test ideas
Demise of test scripts rise of test ideasRichard Robinson
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyMaya Hristakeva
 
UK GIAF: Winter 2015
UK GIAF: Winter 2015UK GIAF: Winter 2015
UK GIAF: Winter 2015deltaDNA
 
Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22Matthias Schuurmans
 
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleQcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleXavier Amatriain
 
The five essential steps to building a data product
The five essential steps to building a data productThe five essential steps to building a data product
The five essential steps to building a data productBirst
 
Legis pactum building high performance teams
Legis pactum   building high performance teamsLegis pactum   building high performance teams
Legis pactum building high performance teamsMiguel Pinto
 



Offline evaluation of recommender systems: all pain and no gain?

  • 1. Offline Evaluation of Recommender Systems All pain and no gain? Mark Levy Mendeley
  • 6. What is a good recommendation?
  • 7. What is a good recommendation? One that increases the usefulness of your product in the long run1 1. WARNING: hard to measure directly
  • 8. What is a good recommendation?
    ● One that increased your bottom line:
      – User bought item after it was recommended
      – User clicked ad after it was shown
      – User didn't skip track when it was played
      – User added document to library...
      – User connected with contact...
  • 9. Why was it good?
  • 10. Why was it good?
    ● Maybe it was
      – Relevant
      – Novel
      – Familiar
      – Serendipitous
      – Well explained
    ● Note: some of these are mutually incompatible
  • 11. What is a bad recommendation?
  • 12. What is a bad recommendation? (you know one when you see one)
  • 13. What is a bad recommendation?
  • 14. What is a bad recommendation?
  • 15. What is a bad recommendation?
  • 16. What is a bad recommendation?
    ● Maybe it was
      – Not relevant
      – Too obscure
      – Too familiar
      – I already have it
      – I already know that I don't like it
      – Badly explained
  • 17. What's the cost of getting it wrong?
    ● Depends on your product and your users
      – Lost revenue
      – Less engaged user
      – Angry user
      – Amused user
      – Confused user
      – User defects to a rival product
  • 18. Hypotheses
    Good offline metrics express product goals
    Most (really) bad recommendations can be caught by business logic
  • 19. Issues
    ● Real business goals concern long-term user behaviour, e.g. Netflix: “we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”
    ● Usually have to settle for short-term surrogate
    ● Only some user behaviour is visible
    ● Same constraints when collecting training data
  • 20. Least bad solution?
    ● “Back to the future” aka historical log analysis
    ● Decide which logged event(s) indicate success
    ● Be honest about “success”
    ● Usually care most about precision @ small k
    ● Recall will discriminate once this plateaus
    ● Expect to have to do online testing too
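The emphasis on precision at small k can be made concrete. A minimal sketch of the two metrics (the `recommended` list and `relevant` held-out set below are invented toy data, not from the talk):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations found in the held-out 'success' events."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of held-out items recovered within the top-k recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Toy example: 2 of the top 3 recommendations were later consumed
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(precision_at_k(recommended, relevant, 3))  # 2 hits / 3 shown
print(recall_at_k(recommended, relevant, 3))     # 2 recovered / 3 held out
```

At small k the two numbers coincide here by accident; in practice precision@k discriminates first, and recall only starts to separate systems once precision plateaus, as the slide notes.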
  • 21. Making metrics meaningful
    ● Building a test framework + data is hard
    ● Be sure to get best value from your work
    ● Don't use straw man baselines
    ● Be realistic – leave the ivory tower
    ● Make test setups and baselines reproducible
  • 22. Making metrics meaningful
    ● Old skool k-NN systems are better than you think
      – Input numbers from mining logs
      – Temporal “modelling” (e.g. fake users)
      – Data pruning (scalability, popularity bias, quality)
      – Preprocessing (tf-idf, log/sqrt, …)
      – Hand crafted similarity metric
      – Hand crafted aggregation formula
      – Postprocessing (popularity matching)
      – Diversification
      – Attention profile
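The core of such an old-school pipeline might look like this: a log transform over play counts followed by a hand-crafted cosine item similarity. The data, the `log1p` weighting, and the helper names are all illustrative assumptions, not details from the talk:

```python
import math
from collections import defaultdict

# Toy user -> {item: play count} data mined from logs (illustrative only)
plays = {
    "u1": {"A": 10, "B": 2},
    "u2": {"A": 5, "C": 1},
    "u3": {"B": 3, "C": 4},
}

# Preprocessing: a log transform dampens heavy listeners, one of many choices
weights = {u: {i: math.log1p(c) for i, c in items.items()} for u, items in plays.items()}

# Invert to sparse item vectors indexed by user
item_vecs = defaultdict(dict)
for u, items in weights.items():
    for i, w in items.items():
        item_vecs[i][u] = w

def cosine(a, b):
    """Hand-crafted similarity: cosine over sparse user-weight vectors."""
    dot = sum(a[u] * b[u] for u in a if u in b)
    if not dot:
        return 0.0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def most_similar(item, k=2):
    scores = {j: cosine(item_vecs[item], vec) for j, vec in item_vecs.items() if j != item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(most_similar("A"))  # B shares the heaviest listener with A on this toy data
```

Every step here (weighting, similarity, aggregation) is a tunable hand-crafted choice, which is exactly why such baselines are stronger than they look.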
  • 23. Making metrics meaningful
    ● Measure preference honestly
    ● Predicted items may not be “correct” just because they were consumed once
    ● Try to capture value
      – Earlier recommendation may be better
      – Don't need a recommender to suggest items by same artist/author
    ● Don't neglect side data
      – At least use it for evaluation / sanity checking
  • 24. Making metrics meaningful
    ● Public data isn't enough for reproducibility or fair comparison
    ● Need to document preprocessing
    ● Better: release your preparation/evaluation code too
  • 25. What's the cost of poor evaluation?
  • 26. What's the cost of poor evaluation? Poor offline evaluation can lead to years of misdirected research
  • 27. Ex 1: Reduce playlist skips
    ● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”
    ● Use audio similarity measure to compute transition distance, then travelling salesman
    ● Metric: sum of transition distances (lower is better)
    ● 6 months work to develop solution
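The travelling-salesman step could be approximated with a greedy nearest-neighbour tour over pairwise transition distances. This sketch uses a made-up distance matrix; the real system derived distances from audio similarity:

```python
def greedy_order(tracks, dist):
    """Greedy nearest-neighbour approximation to a travelling-salesman ordering:
    always hop to the closest not-yet-played track."""
    order = [tracks[0]]
    remaining = set(tracks[1:])
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda t: dist[(last, t)])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Illustrative symmetric transition distances between four tracks
d = {}
pairs = {("t1", "t2"): 1.0, ("t1", "t3"): 5.0, ("t1", "t4"): 4.0,
         ("t2", "t3"): 2.0, ("t2", "t4"): 6.0, ("t3", "t4"): 1.5}
for (a, b), v in pairs.items():
    d[(a, b)] = d[(b, a)] = v

print(greedy_order(["t1", "t2", "t3", "t4"], d))
```

Note this optimises exactly the offline metric on the slide (sum of transition distances), which is the point of the example: the optimisation worked, and users still skipped more.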
  • 28. Ex 1: Reduce playlist skips
    ● Result: users skipped more often
    ● Why?
  • 29. Ex 1: Reduce playlist skips
    ● Result: users skipped more often
    ● When a user skipped a track they didn't like they were played something else just like it
    ● Better metric: average position of skipped tracks (based on logs, lower down is better)
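The improved metric is cheap to compute from logs. A sketch, assuming a hypothetical log format of (playlist position, was-skipped) events per session:

```python
def mean_skip_position(sessions):
    """Average 0-based playlist position at which skips occurred.
    Larger values mean skips happen further down the playlist, i.e. better."""
    positions = [pos for session in sessions for pos, skipped in session if skipped]
    return sum(positions) / len(positions) if positions else float("nan")

# Illustrative log fragments: two sessions with skips at positions 1 and 5
logs = [
    [(0, False), (1, True), (2, False)],
    [(4, False), (5, True)],
]
print(mean_skip_position(logs))  # (1 + 5) / 2 = 3.0
```

Unlike summed transition distance, this metric is defined directly on the logged user behaviour the product actually cares about.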
  • 30. Ex 2: Recommend movies
    ● Use a corpus of star ratings to improve movie recommendations
    ● Learn to predict ratings for un-rated movies
    ● Metric: average RMSE of predictions for a hidden test set (lower is better)
    ● 2+ years work to develop new algorithms
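For reference, the metric in question is straightforward. A minimal sketch with invented predicted and actual star ratings:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error of rating predictions over a hidden test set."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
```

Its simplicity is part of its appeal, and, per the next slides, part of the problem: it rewards accurate scores everywhere rather than good orderings at the top of the list.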
  • 31. Ex 2: Recommend movies
    ● Result: “best” solutions were never deployed
    ● Why?
  • 32. Ex 2: Recommend movies
    ● Result: “best” solutions were never deployed
    ● User behaviour correlates with rank not RMSE
    ● Side datasets an order of magnitude more valuable than algorithm improvements
    ● Explicit ratings are the exception not the rule
    ● RMSE still haunts research labs
  • 33. Can contests help?
    ● Good:
      – Great for consistent evaluation
    ● Not so good:
      – Privacy concerns mean obfuscated data
      – No guarantee that metrics are meaningful
      – No guarantee that train/test framework is valid
      – Small datasets can become overexposed
  • 34. Ex 3: Yahoo! Music KDD Cup
    ● Largest music rating dataset ever released
    ● Realistic “loved songs” classification task
    ● Data fully obfuscated due to recent lawsuits
  • 35. Ex 3: Yahoo! Music KDD Cup
    ● Result: researchers hated it
    ● Why?
  • 36. Ex 3: Yahoo! Music KDD Cup
    ● Result: researchers hated it
    ● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings
  • 37. Ex 4: Million Song Challenge
    ● Large music dataset with rich metadata
    ● Anonymized listening histories
    ● Simple item recommendation task
    ● Reasonable MAP@500 metric
    ● Aimed to solve shortcomings of KDD Cup
    ● Only obfuscation was removal of timestamps
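A MAP@k metric like the challenge's MAP@500 can be sketched as follows. The normalisation by min(|relevant|, k) is one common convention and an assumption here, as is all the toy data:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k: average of precision@i over the ranks i where a relevant item appears,
    normalised by the best achievable number of hits within k."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_relevant, k):
    """Mean of AP@k over all test users."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

# Two toy users: one with hits at ranks 1 and 3, one with a hit at rank 3
recs = [["a", "b", "c"], ["x", "y", "z"]]
rels = [{"a", "c"}, {"z"}]
print(map_at_k(recs, rels, 3))
```

Because AP rewards relevant items ranked early, MAP@k is at least aligned with rank, which is more than could be said of RMSE in the previous example.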
  • 38. Ex 4: Million Song Challenge
    ● Result: winning entry didn't use side data
    ● Why?
  • 39. Ex 4: Million Song Challenge
    ● Result: winning entry didn't use side data
    ● No timestamps so test tracks chosen at random
    ● So “people who listen to A also listen to B”
    ● Traditional item similarity solves this well
    ● More honesty about “success” might have shown that contest data was flawed
  • 40. Ex 5: Yelp RecSys Challenge
    ● Small business review dataset with side data
    ● Realistic mix of input data types
    ● Rating prediction task
    ● Informal procedure to create train/test sets
  • 41. Ex 5: Yelp RecSys Challenge
    ● Result: baseline algorithms high up leaderboard
    ● Why?
  • 42. Ex 5: Yelp RecSys Challenge
    ● Result: baseline algorithms high up leaderboard
    ● Train/test split was corrupt
    ● Competition organisers moved fast to fix this
    ● But left only one week before deadline
  • 43. Ex 6: MIREX Audio Chord Estimation
    ● Small dataset of audio tracks
    ● Task to label with predicted chord symbols
    ● Human labelled data hard to come by
    ● Contest hosted by premier forum in field
    ● Evaluate frame-level prediction accuracy
    ● Historical glass ceiling around 80%
  • 44. Ex 6: MIREX Audio Chord Estimation
    ● Result: 2011 winner ftw
    ● Why?
  • 45. Ex 6: MIREX Audio Chord Estimation
    ● Result: 2011 winner ftw
    ● Spoof entry relying on known test set
    ● Protest against inadequate test data
    ● Other research showed weak generalisation of winning algorithms from same contest
    ● Next year results dropped significantly
  • 46. So why evaluate offline at all?
    ● Building test framework ensures clear goals
    ● Avoid wishful thinking if your data is too thin
    ● Be efficient with precious online testing
      – Cut down huge parameter space
      – Don't alienate users
    ● Need to publish
    ● Pursuing science as well as profit
  • 47. Online evaluation is tricky too
    ● No off the shelf solution for services
    ● Many statistical gotchas
    ● Same mismatch between short-term and long-term success criteria
    ● Results open to interpretation by management
    ● Can make incremental improvements look good when radical innovation is needed
  • 48. Ex 7: Article Recommendations
    ● Recommender for related research articles
    ● Massive download logs available
    ● Framework developed based on co-downloads
    ● Aim to improve on existing search solution
    ● Management “keen for it to work”
    ● Several weeks of live A/B testing available
    ● No offline evaluation
  • 49. Ex 7: Article Recommendations
    ● Result: worse than similar title search
    ● Why?
  • 50. Ex 7: Article Recommendations
    ● Result: worse than similar title search
    ● Inadequate business rules, e.g. often suggesting other articles from same publication
    ● Users identified only by organisational IP range so value of “big data” very limited
    ● Establishing an offline evaluation protocol would have shown these in advance
  • 51. Isn't there software for that?
    Rules of the game:
      – Model fit metrics (e.g. validation loss) don't count
      – Need a transparent “audit trail” of data to support genuine reproducibility
      – Just using public datasets doesn't ensure this
  • 52. Isn't there software for that?
    Wish list for reproducible evaluation:
      – Integrate with recommender implementations
      – Handle data formats and preprocessing
      – Handle splitting, cross-validation, side datasets
      – Save everything to file
      – Work from file inputs so not tied to one framework
      – Generate meaningful metrics
      – Well documented and easy to use
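The splitting and save-everything-to-file items on this wish list can be sketched in a few lines. The function name, file format, and toy interaction data below are invented for illustration:

```python
import json
import random

def reproducible_split(interactions, test_fraction=0.2, seed=42, out_prefix="split"):
    """Shuffle with a fixed seed, split, and write the seed, parameters and the
    exact train/test sets to file, so the split can be audited and re-used
    from any framework that reads the file."""
    rng = random.Random(seed)
    data = sorted(interactions)  # canonical order before shuffling, for determinism
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    train, test = data[:cut], data[cut:]
    with open(f"{out_prefix}.json", "w") as f:
        json.dump({"seed": seed, "test_fraction": test_fraction,
                   "train": train, "test": test}, f)
    return train, test

# Toy (user, item) interactions
events = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "C"), ("u3", "B")]
train, test = reproducible_split(events)
print(len(train), len(test))  # 4 1
```

The written file is the audit trail: anyone can regenerate the metrics from it without re-running, or even knowing, the original splitting code.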
  • 53. Isn't there software for that?
    Current offerings:
    ● GraphChi/GraphLab
    ● Mahout
    ● LensKit
    ● MyMediaLite
  • 54. Isn't there software for that?
    Current offerings:
    ● GraphChi/GraphLab
      – Model validation loss, doesn't count
    ● Mahout
      – Only rating prediction accuracy, doesn't count
    ● LensKit
      – Too hard to understand, won't use
  • 55. Isn't there software for that?
    Current offerings:
    ● MyMediaLite
      – Reports meaningful metrics
      – Handles cross-validation
      – Data splitting not transparent
      – No support for pre-processing
      – No built in support for standalone evaluation
      – API is capable but current utils don't meet wishlist
  • 56. Eating your own dog food
    ● Built a small framework around new algorithm
    ● https://github.com/mendeley/mrec
      – Reports meaningful metrics
      – Handles cross-validation
      – Supports simple pre-processing
      – Writes everything to file for reproducibility
      – Provides API and utility scripts
      – Runs standalone evaluations
      – Readable Python code
  • 57. Eating your own dog food
    ● Some lessons learned
      – Usable frameworks are hard to write
      – Tradeoff between clarity and scalability
      – Should generate explicit validation sets
    ● Please contribute!
    ● Or use as inspiration to improve existing tools
  • 58. Where next?
    ● Shift evaluation online:
      – Contests based around online evaluation
      – Realistic but not reproducible
      – Could some run continuously?
    ● Recommender Systems as a commodity:
      – Software and services reaching maturity now
      – Business users can tune/evaluate themselves
      – Is there a way to report results?
  • 59. Where next?
    ● Support alternative query paradigms:
      – More like this, less like that
      – Metrics for dynamic/online recommenders
    ● Support recommendation with side data:
      – LibFM, GenSGD, WARP research @google, …
      – Open datasets?