Most evaluations of novel algorithmic contributions assess their accuracy in predicting withheld data in an offline evaluation scenario. However, doubts have been raised about whether standard offline evaluation practices are appropriate for selecting the best algorithm for field deployment. The goal of this work is therefore to compare the offline and the online evaluation methodology with the same study participants, i.e., a within-users experimental design. This paper presents empirical evidence that the ranking of algorithms based on offline accuracy measurements clearly contradicts the results of an online study with the same set of users. Thus, the external validity of the most commonly applied evaluation methodology is not guaranteed.
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
RecSys, Boston, September 17, 2016
Marco Rossetti
Trainline Ltd., London
(previously University of Milano-Bicocca)
Fabio Stella
Department of Informatics, Systems and Communication
University of Milano-Bicocca
Markus Zanker
Faculty of Computer Science
Free University of Bozen-Bolzano
Research Goal
• Given the dominance of offline evaluation, reflecting on its validity becomes important.
• Said and Bellogin (RecSys 2014) identified serious problems with internal validity (results are not reproducible across different open-source frameworks).
• Diverging results from offline and online evaluations have also been reported, putting question marks over external validity (e.g., Cremonesi et al. 2012; Beel et al. 2013; Garcin et al. 2014; Ekstrand et al. 2014; Maksai et al. 2015).
• Proposition:
• Compare the performance of algorithms in an offline experiment with an online evaluation.
• Use a within-users experimental design, so that differences can be tested on paired samples.
Research Questions
1. Does the relative ranking of algorithms based on offline accuracy measurements predict their relative ranking according to an accuracy measurement in a user-centric evaluation?
2. Does the relative ranking of algorithms based on offline measurements of predictive accuracy for long-tail items produce comparable results to a user-centric evaluation?
3. Do offline accuracy measurements allow one to predict the utility of recommendations in a user-centric evaluation?
Study Design
• Phase 1: Collected likes on MovieLens (ML) movies from 241 users, with on average 137 ratings per user.
• Phase 2: The same users evaluated 4 algorithms with 5 recommendations each (on average 17.4 + 2 recommendations per user); 122 users returned, 100 remained after data cleaning.
Offline and Online Evaluations
• Data: ML1M; the offline evaluation uses all-but-1 validation on the training data, the online evaluation uses the users' answers.
• Algorithms:
• POP: Popularity
• MF80: Matrix Factorization with 80 factors
• MF400: Matrix Factorization with 400 factors
• I2I: Item-to-Item K-Nearest Neighbors
• Metrics: precision on all items and precision on long-tail items (measured both offline and online); useful recommendations (online only).
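A minimal sketch of the offline protocol above, in Python: all-but-1 validation withholds one liked item per user and checks whether it appears among the user's top-5 recommendations. The function names and the recommender interface are hypothetical, and the study's exact metric definition may differ in detail.

```python
import random
from collections import defaultdict

def all_but_one_split(likes, seed=42):
    """Withhold one liked item per user as the test item (all-but-1
    validation); the remaining likes form the training set."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in likes:
        by_user[user].append(item)
    train, test = [], {}
    for user, items in by_user.items():
        held_out = rng.choice(items)
        test[user] = held_out
        train.extend((user, it) for it in items if it != held_out)
    return train, test

def hit_rate_at_k(recommend, test, k=5):
    """Share of users whose withheld item shows up in their top-k list.
    `recommend(user, k)` stands for any trained recommender
    (POP, MF80, MF400, or I2I)."""
    hits = sum(1 for user, held_out in test.items()
               if held_out in recommend(user, k))
    return hits / len(test)
```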
Precision All Items
[Figure: partial-order diagrams of pairwise significance tests for offline and online precision on all items; differences significant at p = 0.05, one online comparison only at p = 0.1.]

Algorithm   Offline   Online
I2I         0.438     0.546
MF80        0.504     0.598
MF400       0.454     0.604
POP         0.340     0.516

Precision on all items, offline vs. online: MF80 ranks best offline, MF400 best online.
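Because every participant evaluated all four algorithms, the per-user scores form paired samples. The deck does not name the statistical test behind the p-values; a Wilcoxon signed-rank test is one standard choice for such within-users comparisons. A sketch with synthetic per-user precision scores:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic per-user precision scores, aligned so that index i refers
# to the same participant for both algorithms (within-users pairing).
rng = np.random.default_rng(0)
prec_a = rng.uniform(0.2, 1.0, size=100)                          # e.g. MF80
prec_b = np.clip(prec_a + rng.normal(0.05, 0.1, size=100), 0, 1)  # e.g. MF400

# Paired test on the per-user differences; with independent user
# groups an unpaired test would be needed instead.
stat, p_value = wilcoxon(prec_b, prec_a)
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```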
Precision on Long Tail Items
[Figure: pairwise significance diagram, identical for offline and online, for precision on long-tail items; all differences significant at p = 0.05.]

Algorithm   Offline   Online
I2I         0.280     0.356
MF80        0.018     0.054
MF400       0.360     0.628
POP         0.000     0.000

Precision on long-tail items: the offline and online rankings agree (MF400 > I2I > MF80 > POP).
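Long-tail precision scores only recommendations of unpopular items, which explains POP's zero score. A sketch assuming the long tail is everything outside a most-popular head; the head cutoff here is an illustrative assumption, not the study's definition:

```python
from collections import Counter

def long_tail_items(train_likes, head_fraction=0.1):
    """Items outside the most-popular head. `head_fraction` is an
    assumed cutoff, not necessarily the one used in the study."""
    counts = Counter(item for _, item in train_likes)
    ranked = [item for item, _ in counts.most_common()]
    head_size = int(len(ranked) * head_fraction)
    return set(ranked[head_size:])

def long_tail_precision(recommended, relevant, tail):
    """Precision restricted to recommended items in the long tail;
    returns 0.0 when nothing from the tail was recommended (as for POP)."""
    tail_recs = [item for item in recommended if item in tail]
    if not tail_recs:
        return 0.0
    return sum(1 for item in tail_recs if item in relevant) / len(tail_recs)
```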
Useful Recommendations
[Figure: pairwise significance diagram for useful recommendations (online only); differences significant at p = 0.05.]

Algorithm   Online
I2I         0.126
MF80        0.082
MF400       0.116
POP         0.026

Useful recommendations: I2I and MF400 rank highest, followed by MF80, with POP last.
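A recommendation counts as useful when it is both relevant and novel, i.e., the user likes it and did not already know it, which can only be measured online by asking the user. A sketch, with a hypothetical encoding of the questionnaire answers:

```python
def useful_fraction(answers):
    """Fraction of recommendations judged both relevant and novel.
    `answers` holds one (relevant, already_known) pair per recommendation,
    a hypothetical encoding of the online questionnaire responses."""
    useful = sum(1 for relevant, known in answers if relevant and not known)
    return useful / len(answers)

# Example: five recommendations, two of them liked and new -> 0.4
print(useful_fraction([(True, False), (True, True), (False, False),
                       (True, False), (False, True)]))
```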
Conclusions
• Compared different algorithms online and offline based on a within-users experimental design.
• The algorithm performing best according to a traditional offline accuracy measurement performed significantly worse when it comes to useful (i.e., relevant and novel) recommendations measured online.
• Academia and industry should keep investigating this topic in order to find the best possible way to validate offline evaluations.