Most evaluations of novel algorithmic contributions assess their accuracy in predicting withheld data in an offline evaluation scenario. However, doubts have been raised about whether standard offline evaluation practices are appropriate for selecting the best algorithm for field deployment. The goal of this work is therefore to compare the offline and the online evaluation methodology with the same study participants, i.e., a within-users experimental design. This paper presents empirical evidence that the ranking of algorithms based on offline accuracy measurements clearly contradicts the results of an online study with the same set of users. Thus, the external validity of the most commonly applied evaluation methodology is not guaranteed.
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
RecSys, Boston, September 17, 2016
Marco Rossetti
Trainline Ltd., London
(previously University of Milano-Bicocca)
Fabio Stella
Department of Informatics, Systems and Communication
University of Milano-Bicocca
Markus Zanker
Faculty of Computer Science
Free University of Bozen-Bolzano
Research Goal
• Given the dominance of offline evaluation, reflecting on its validity becomes important.
• Said and Bellogin (RecSys 2014) identified serious problems with internal validity (results are not reproducible across different open-source frameworks).
• Diverging results from offline and online evaluations have also been reported, putting question marks over external validity (e.g., Cremonesi et al. 2012; Beel et al. 2013; Garcin et al. 2014; Ekstrand et al. 2014; Maksai et al. 2015).
• Proposition:
• Compare the performance of algorithms in an offline experiment with an online evaluation.
• Use a within-users experimental design, so that differences can be tested on paired samples.
Research Questions
1. Does the relative ranking of algorithms based on offline accuracy measurements predict their relative ranking according to an accuracy measurement in a user-centric evaluation?
2. Does the relative ranking of algorithms based on offline measurements of predictive accuracy for long-tail items produce comparable results to a user-centric evaluation?
3. Do offline accuracy measurements allow one to predict the utility of recommendations in a user-centric evaluation?
Study Design
• Phase 1: Collected likes on MovieLens (ML) movies from 241 users, with on average 137 ratings per user.
• Phase 2: The same users evaluated 4 algorithms with 5 recommendations each (on average 17.4 + 2 recommendations per user); 122 users returned, 100 remained after data cleaning.
Offline and Online Evaluations
• Data: ML1M; the offline evaluation uses all-but-1 validation on the training data, the online evaluation uses the users' answers.
• Algorithms:
• POP: Popularity
• MF80: Matrix Factorization with 80 factors
• MF400: Matrix Factorization with 400 factors
• I2I: Item-to-Item K-Nearest Neighbors
• Metrics: precision on all items and precision on long-tail items (measured both offline and online); useful recommendations (online only).
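A minimal sketch of the offline protocol above, in Python: all-but-1 validation withholds one liked item per user and checks whether it appears among the user's top-5 recommendations. The function names and the recommender interface are hypothetical, and the study's exact metric definition may differ in detail.

```python
import random
from collections import defaultdict

def all_but_one_split(likes, seed=42):
    """Withhold one liked item per user as the test item (all-but-1
    validation); the remaining likes form the training set."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in likes:
        by_user[user].append(item)
    train, test = [], {}
    for user, items in by_user.items():
        held_out = rng.choice(items)
        test[user] = held_out
        train.extend((user, it) for it in items if it != held_out)
    return train, test

def hit_rate_at_k(recommend, test, k=5):
    """Share of users whose withheld item shows up in their top-k list.
    `recommend(user, k)` stands for any trained recommender
    (POP, MF80, MF400, or I2I)."""
    hits = sum(1 for user, held_out in test.items()
               if held_out in recommend(user, k))
    return hits / len(test)
```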
Precision All Items
[Figure: partial-order diagrams of pairwise significance tests for offline and online precision on all items; differences significant at p = 0.05, one online comparison only at p = 0.1.]

Algorithm   Offline   Online
I2I         0.438     0.546
MF80        0.504     0.598
MF400       0.454     0.604
POP         0.340     0.516

Precision on all items, offline vs. online: MF80 ranks best offline, MF400 best online.
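Because every participant evaluated all four algorithms, the per-user scores form paired samples. The deck does not name the statistical test behind the p-values; a Wilcoxon signed-rank test is one standard choice for such within-users comparisons. A sketch with synthetic per-user precision scores:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic per-user precision scores, aligned so that index i refers
# to the same participant for both algorithms (within-users pairing).
rng = np.random.default_rng(0)
prec_a = rng.uniform(0.2, 1.0, size=100)                          # e.g. MF80
prec_b = np.clip(prec_a + rng.normal(0.05, 0.1, size=100), 0, 1)  # e.g. MF400

# Paired test on the per-user differences; with independent user
# groups an unpaired test would be needed instead.
stat, p_value = wilcoxon(prec_b, prec_a)
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```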
Precision on Long Tail Items
[Figure: pairwise significance diagram, identical for offline and online, for precision on long-tail items; all differences significant at p = 0.05.]

Algorithm   Offline   Online
I2I         0.280     0.356
MF80        0.018     0.054
MF400       0.360     0.628
POP         0.000     0.000

Precision on long-tail items: the offline and online rankings agree (MF400 > I2I > MF80 > POP).
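Long-tail precision scores only recommendations of unpopular items, which explains POP's zero score. A sketch assuming the long tail is everything outside a most-popular head; the head cutoff here is an illustrative assumption, not the study's definition:

```python
from collections import Counter

def long_tail_items(train_likes, head_fraction=0.1):
    """Items outside the most-popular head. `head_fraction` is an
    assumed cutoff, not necessarily the one used in the study."""
    counts = Counter(item for _, item in train_likes)
    ranked = [item for item, _ in counts.most_common()]
    head_size = int(len(ranked) * head_fraction)
    return set(ranked[head_size:])

def long_tail_precision(recommended, relevant, tail):
    """Precision restricted to recommended items in the long tail;
    returns 0.0 when nothing from the tail was recommended (as for POP)."""
    tail_recs = [item for item in recommended if item in tail]
    if not tail_recs:
        return 0.0
    return sum(1 for item in tail_recs if item in relevant) / len(tail_recs)
```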
Useful Recommendations
[Figure: pairwise significance diagram for useful recommendations (online only); differences significant at p = 0.05.]

Algorithm   Online
I2I         0.126
MF80        0.082
MF400       0.116
POP         0.026

Useful recommendations: I2I and MF400 rank highest, followed by MF80, with POP last.
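A recommendation counts as useful when it is both relevant and novel, i.e., the user likes it and did not already know it, which can only be measured online by asking the user. A sketch, with a hypothetical encoding of the questionnaire answers:

```python
def useful_fraction(answers):
    """Fraction of recommendations judged both relevant and novel.
    `answers` holds one (relevant, already_known) pair per recommendation,
    a hypothetical encoding of the online questionnaire responses."""
    useful = sum(1 for relevant, known in answers if relevant and not known)
    return useful / len(answers)

# Example: five recommendations, two of them liked and new -> 0.4
print(useful_fraction([(True, False), (True, True), (False, False),
                       (True, False), (False, True)]))
```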
Conclusions
• Compared different algorithms online and offline based on a within-users experimental design.
• The algorithm performing best according to a traditional offline accuracy measurement performed significantly worse when it comes to useful (i.e., relevant and novel) recommendations measured online.
• Academia and industry should keep investigating this topic in order to find the best possible way to validate offline evaluations.