RecSys Boston, Sept 17, 2016
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
Marco Rossetti
Trainline Ltd., London
(previously University of Milano-Bicocca)
Fabio Stella
Department of Informatics, Systems and Communication
University of Milano-Bicocca
Markus Zanker
Faculty of Computer Science
Free University of Bozen-Bolzano
Research Goal
• Given the dominance of offline evaluation, reflecting on its validity becomes important.
• Said and Bellogin (RecSys 2014) identified serious problems with internal validity: results were not reproducible across different open-source frameworks.
• Discrepancies between offline and online evaluation results have also been reported, calling external validity into question (e.g., Cremonesi et al. 2012, Beel et al. 2013, Garcin et al. 2014, Ekstrand et al. 2014, Maksai et al. 2015).
• Proposition:
• Compare the results of an offline experiment with those of an online evaluation.
• Use a within-users experimental design, which allows testing for differences in paired samples.
Research Questions
1. Does the relative ranking of algorithms based on offline accuracy
measurements predict the relative ranking according to an accuracy
measurement in a user-centric evaluation?
2. Does the relative ranking of algorithms based on offline measurements of predictive accuracy for long-tail items produce results comparable to a user-centric evaluation?
3. Do offline accuracy measurements make it possible to predict the utility of recommendations in a user-centric evaluation?
Study Design
Step 1
• Collected likes on MovieLens (ML) movies from 241 users
• On average 137 ratings per user
Step 2
• The same users evaluated 4 algorithms, with 5 recommendations each
• On average 17.4 + 2 recommendations
• 122 users returned, 100 after cleaning
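Four algorithms with five recommendations each yield at most 20 items per user, while the slide reports about 17.4 on average. One plausible reading (an assumption, not stated on the slide) is that overlapping lists were merged so each movie was judged only once and the judgment credited to every algorithm that proposed it. A minimal sketch of such pooling, with a hypothetical helper name and made-up item ids:

```python
def pool_recommendations(lists_by_algorithm):
    """Merge per-algorithm top-k lists into one deduplicated list for a user.

    Returns the pooled items in first-seen order and, for each item, the set
    of algorithms that recommended it, so that a single user judgment can be
    credited to every algorithm that proposed the item.
    """
    pooled = {}  # dicts keep insertion order, giving first-seen order
    for algorithm, items in lists_by_algorithm.items():
        for item in items:
            pooled.setdefault(item, set()).add(algorithm)
    return list(pooled), pooled


# Hypothetical top-5 lists for one user.
recs = {
    "POP": [1, 2, 3, 4, 5],
    "MF80": [2, 6, 7, 8, 9],
    "MF400": [10, 11, 12, 13, 14],
    "I2I": [6, 15, 16, 17, 18],
}
items, credits = pool_recommendations(recs)
print(len(items))  # 18: items 2 and 6 were each proposed by two algorithms
```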
Offline and Online Evaluations
Algorithms (trained on MovieLens 1M, ML1M):
• POP: Popularity
• MF80: Matrix Factorization with 80 factors
• MF400: Matrix Factorization with 400 factors
• I2I: Item-to-Item K-Nearest Neighbors
Offline evaluation: ML1M, all-but-1 validation
Online evaluation: users' answers
Metrics:
• Precision on all items (offline and online)
• Precision on long-tail items (offline and online)
• Useful recommendations (online only)
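The slide names all-but-1 validation on ML1M for the offline side but does not show how precision was derived from it. A minimal sketch of the general technique, assuming each user's rated items are stored as a set: hold out one item per user, train on the rest, and score a top-k list with precision@k. Both helper names are hypothetical, and the sketch is not expected to reproduce the reported values.

```python
import random


def all_but_one_split(items_by_user, seed=0):
    """Hold out one random item per user; the remaining items form the training data.

    items_by_user maps a user id to the set of item ids that user rated.
    Users with fewer than two items are skipped (nothing would be left to train on).
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in items_by_user.items():
        if len(items) < 2:
            continue
        held_out = rng.choice(sorted(items))  # sorted() makes the draw reproducible
        train[user] = set(items) - {held_out}
        test[user] = {held_out}
    return train, test


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in recommended[:k] if item in relevant) / k


train, test = all_but_one_split({"u1": {10, 20, 30}, "u2": {40}})
print(train, test)  # u2 is skipped: a single item cannot be split
print(precision_at_k([1, 2, 3, 4, 5], {2, 5, 9}, k=5))  # 0.4
```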
Precision All Items
[Figure: pairwise significance graphs of the algorithms for "Offline precision all items" (edges marked p = 0.05) and "Online precision all items" (edges marked p = 0.05, p = 0.05, p = 0.1)]
Algorithm  Offline  Online
I2I        0.438    0.546
MF80       0.504    0.598
MF400      0.454    0.604
POP        0.340    0.516
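Research question 1 asks whether the offline ranking predicts the online one. As an illustration only, the sketch below ranks the algorithms by the average precision values in the table above and compares the two orderings with Kendall's tau; the choice of Kendall's tau is an assumption, not something the slides use, and it ignores which differences are statistically significant.

```python
from itertools import combinations

# Precision on all items, copied from the table above.
offline = {"I2I": 0.438, "MF80": 0.504, "MF400": 0.454, "POP": 0.340}
online = {"I2I": 0.546, "MF80": 0.598, "MF400": 0.604, "POP": 0.516}


def ranking(scores):
    """Algorithms sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(x, y):
    """Kendall's tau-a between two score dicts over the same keys (no ties assumed)."""
    concordant = discordant = 0
    for a, b in combinations(sorted(x), 2):
        sign = (x[a] - x[b]) * (y[a] - y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


print(ranking(offline))  # ['MF80', 'MF400', 'I2I', 'POP']
print(ranking(online))   # ['MF400', 'MF80', 'I2I', 'POP']
print(round(kendall_tau(offline, online), 3))  # 0.667
```

The only discordant pair is MF80 versus MF400 at the top of the ranking; the remaining five pairs are ordered the same way offline and online.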
Precision on Long Tail Items
[Figure: pairwise significance graph of MF80, MF400, POP and I2I for precision on long-tail items, identical offline and online; all six edges marked p = 0.05]
Algorithm  Offline  Online
I2I        0.280    0.356
MF80       0.018    0.054
MF400      0.360    0.628
POP        0.000    0.000
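The slides do not define the long tail. A common convention, assumed here, puts the most popular items that together account for a given share of all ratings into a "short head" and treats every other item as long tail; long-tail precision then counts only top-k slots holding items that are both relevant and in the tail. Under this reading a pure popularity recommender fills its list with head items and scores 0, which is consistent with the POP row, although the slides do not confirm the definition. The helper names and the `head_share` parameter are hypothetical.

```python
def long_tail_items(rating_counts, head_share=0.33):
    """Return the set of long-tail items.

    Assumption (not from the slides): the most popular items that together
    account for `head_share` of all ratings form the short head; every other
    item is long tail.
    """
    total = sum(rating_counts.values())
    covered = 0
    head = set()
    for item, count in sorted(rating_counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= head_share * total:
            break
        head.add(item)
        covered += count
    return set(rating_counts) - head


def long_tail_precision_at_k(recommended, relevant, tail, k):
    """Share of the top-k slots filled by items that are relevant AND long tail."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in recommended[:k] if item in relevant and item in tail) / k


counts = {"a": 50, "b": 30, "c": 10, "d": 6, "e": 4}  # hypothetical rating counts
tail = long_tail_items(counts, head_share=0.6)  # head = {"a", "b"}
print(long_tail_precision_at_k(["a", "c", "d", "x", "e"], {"a", "c", "e"}, tail, k=5))  # 0.4
```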
Useful Recommendations
[Figure: pairwise significance graph of MF400, I2I, POP and MF80 for useful recommendations; five edges marked p = 0.05]
Algorithm  Online
I2I        0.126
MF80       0.082
MF400      0.116
POP        0.026
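The conclusions define a useful recommendation as one that is both relevant and novel. The slides do not show the questionnaire, so the sketch below assumes each user marks which recommended movies they like and which they already knew; a recommendation counts as useful when it is liked and not previously known, and the per-user rates are averaged over users. The function names and data are hypothetical.

```python
def useful_rate(recommended, liked, already_known):
    """Share of recommended items that are relevant (liked) and novel (not known before)."""
    if not recommended:
        return 0.0
    useful = [item for item in recommended if item in liked and item not in already_known]
    return len(useful) / len(recommended)


def mean_useful_rate(judgments):
    """Average per-user useful rate; judgments holds one (recommended, liked, known) triple per user."""
    if not judgments:
        return 0.0
    return sum(useful_rate(*j) for j in judgments) / len(judgments)


# Hypothetical answers from two users about one algorithm's five recommendations.
judgments = [
    (["m1", "m2", "m3", "m4", "m5"], {"m1", "m2", "m3"}, {"m1", "m2"}),  # only m3 is useful
    (["m6", "m7", "m8", "m9", "m10"], {"m6", "m9"}, set()),  # m6 and m9 are useful
]
print(mean_useful_rate(judgments))
```

Because novelty excludes items the user already knows, an algorithm can score well on plain precision while scoring low here, which is the kind of gap the conclusions describe.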
Conclusions
• Different algorithms were compared online and offline using a within-users experimental design.
• The algorithm that performed best according to a traditional offline accuracy measurement was significantly worse when it came to useful (i.e., relevant and novel) recommendations measured online.
• Academia and industry should keep investigating this topic in
order to find the best possible way to validate offline
evaluations.
Thank you!
Marco Rossetti
Trainline Ltd., London
@ross85
