SIGIR 2016 presentation slide for paper: Xin Qian, Jimmy Lin, and Adam Roegiest. Interleaved Evaluation for Retrospective Summarization and Prospective Notification on Document Streams. Proceedings of the 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), pages 175-184, July 2016, Pisa, Italy.
2. Volunteering at the Rio 2016 Summer Olympics
Source: http://www.phdcomics.com/comics/archive_print.php?comicid=1414
• Retrospective summarization (past → now): Tweet Timeline Generation, TREC 2014
• Prospective notification (now → future): Push Notification Scenario, TREC 2015
More than Microblogs…
• RSS feeds, social media posts, medical record updates
3. Evaluation Methodology: A/B Test
• Controlled experiments in A/B tests
• Random noise between buckets
• Low question bandwidth
4. Evaluation Methodology: A/B Test vs. Interleaving
• Within-subject design in interleaving
• Better sensitivity
• Faster experiment progress
5. Interleaved Evaluation Methodology
• How exactly do we interleave the output of two systems into one single output?
• Balanced interleaving [1]
• Team-Draft interleaving [2] (sketched after the source list below)
• Probabilistic interleaving [3]
• How do we assign credit to each system in response to user interactions with the interleaved results?
• Aggregate over user clicks [1, 2, 3]
• More sophisticated click aggregation strategies [4, 5, 6]
Source: [1] T. Joachims. Optimizing search engines using clickthrough data. KDD, 2002.
[2] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? CIKM, 2008.
[3] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. CIKM, 2011.
[4] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. TOIS, 2012.
[5] F. Radlinski and N. Craswell. Comparing the sensitivity of information retrieval metrics. SIGIR, 2010.
[6] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning more powerful test statistics for click-based retrieval evaluation. SIGIR, 2010.
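To make the mechanics concrete, here is a minimal Python sketch of Team-Draft interleaving [2] with the simple click-count credit assignment of [1, 2, 3]; function and variable names are illustrative, not taken from the paper.

import random

def team_draft_interleave(run_a, run_b, k=10, rng=random):
    # Team-Draft interleaving: each round, a coin flip decides which system
    # drafts first; each system then contributes its highest-ranked document
    # not already shown, and we record which team supplied it.
    interleaved, team, seen = [], {}, set()

    def next_unseen(run):
        return next((doc for doc in run if doc not in seen), None)

    while len(interleaved) < k:
        order = [("A", run_a), ("B", run_b)]
        rng.shuffle(order)
        added = False
        for name, run in order:
            doc = next_unseen(run)
            if doc is not None and len(interleaved) < k:
                interleaved.append(doc)
                team[doc] = name
                seen.add(doc)
                added = True
        if not added:  # both runs exhausted
            break
    return interleaved, team

def click_credit(team, clicked_docs):
    # Simple credit assignment: count clicks on each team's contributions.
    credit = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            credit[team[doc]] += 1
    return credit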
32. Simulation-based Meta-evaluation: Dataset
• Grounded in the Microblog track at TREC
• TREC 2015 real-time filtering task, push notification scenario
• 37 runs from 14 groups
• Semantic cluster annotations (relevant tweets grouped into clusters of near-duplicates)
• Normalized cumulative gain (nCG): nCG(S) = (1/Z) Σ_{t ∈ S} G(t), where Z is the maximum possible gain
Example: semantically equivalent updates about the Wisconsin Sikh temple shooting, grouped into semantic clusters:
• 6:28pm At least seven killed in shooting at Sikh temple in Wisconsin
• 6:28pm seven Wisconsin shooting at Sikh temple
• 6:10pm >= 7 killed in shooting @ Sikh temple in Wisconsin
• 6:10pm 4 were shot inside the Sikh Temple of Wisconsin and 3 outside, including a gunman killed by police
• Four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police around 6:10pm
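Under this annotation, gain naturally accrues per semantic cluster rather than per tweet: once one tweet from a cluster has been delivered, near-duplicates like those above add nothing. A minimal Python sketch of the nCG computation under that assumption; the lookup tables cluster_of and cluster_gain are illustrative names, not the paper's.

def ncg(pushed_tweets, cluster_of, cluster_gain, max_gain):
    # Sum the gain of each semantic cluster the first time one of its
    # tweets is delivered; redundant tweets from a covered cluster earn
    # nothing. Normalize by the maximum possible gain Z.
    covered, total = set(), 0.0
    for tweet in pushed_tweets:
        cluster = cluster_of.get(tweet)  # None means the tweet is not relevant
        if cluster is not None and cluster not in covered:
            covered.add(cluster)
            total += cluster_gain[cluster]
    return total / max_gain if max_gain > 0 else 0.0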
33. Simulation-based Meta-evaluation: Experiment Setup
• Apply the temporal interleaving strategy (see the sketch below)
• Assume a user with the semantic cluster annotation
• Compare the difference in assigned credit against the difference in nCG
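The temporal interleaving strategy itself is the paper's contribution; as a hedged sketch, one natural reading merges the two timestamped push streams into a single timeline while remembering which system produced each update. Details such as deduplication may differ in the paper.

def temporal_interleave(stream_a, stream_b):
    # Merge two push streams of (timestamp, text) pairs into one timeline,
    # tagging each update with the system that produced it.
    tagged = [(ts, text, "A") for ts, text in stream_a] + \
             [(ts, text, "B") for ts, text in stream_b]
    return sorted(tagged, key=lambda update: update[0])

The meta-evaluation then asks, for each pair of runs, whether the system preferred by assigned credit is also the system preferred by nCG.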
39. Assessor Effort: Output Length
• Solution: flip a biased coin and keep each item with probability p (sketched below)
• No extra work
• Still reasonable
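A minimal sketch of the biased-coin filter, where p is a tunable keep probability:

import random

def subsample(interleaved, p=0.5, rng=random):
    # Keep each interleaved item independently with probability p,
    # shortening what the assessor has to read; the only cost is the
    # coin flips themselves.
    return [item for item in interleaved if rng.random() < p]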
40. Assessor Effort: Explicit Judgments
• Implicit judgments
• Web search: click models [1, 2], eye-tracking studies [3, 4], etc. [5, 6]
• Multimedia elements, different types of clicks, …
Source: [1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. SIGIR, 2006.
[2] O. Chapelle and Y. Zhang. A Dynamic Bayesian Network click model for web search ranking. WWW, 2009.
[3] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in WWW search. SIGIR, 2004.
[4] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS, 25(2):1-27, 2007.
[5] D. Kelly. Understanding implicit feedback and document preference: A naturalistic user study. SIGIR Forum, 38(1):77-77, 2004.
[6] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18-28, 2003.
42. Assessor Effort: Explicit Judgments
• Solution: pay attention with probability r (sketched below)
• Good prediction accuracy with limited user interactions
• No extra work
• Still reasonable
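A minimal sketch of this interaction model; the is_relevant predicate is a hypothetical stand-in for the cluster-based judgment:

import random

def simulate_judgments(interleaved, is_relevant, r=0.5, rng=random):
    # The simulated user examines each item only with probability r;
    # unexamined items yield no judgment and therefore no credit.
    judgments = []
    for item in interleaved:
        if rng.random() < r:  # the user pays attention to this item
            judgments.append((item, is_relevant(item)))
    return judgments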
43. Assessor Effort: Combining Both
• Randomly discarding system output + limited user interactions
• Accuracy and verbosity tradeoff curves (swept in the sketch below)
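Combining the two knobs, the tradeoff curves can be traced by sweeping both probabilities. A sketch reusing the subsample and simulate_judgments helpers above; the grid values are illustrative, not the paper's.

def tradeoff_sweep(interleaved, is_relevant,
                   ps=(0.25, 0.5, 0.75, 1.0), rs=(0.25, 0.5, 0.75, 1.0)):
    # For each (p, r) pair, record how much the user reads (verbosity)
    # and collect the judgments that would drive credit assignment.
    points = []
    for p in ps:
        for r in rs:
            shortened = subsample(interleaved, p)
            judgments = simulate_judgments(shortened, is_relevant, r)
            points.append((p, r, len(shortened), judgments))
    return points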
44. Summary
• A novel interleaved evaluation methodology
• A temporal interleaving strategy
• A heuristic credit assignment method
• A user interaction model with explicit judgments
• A simulation-based meta-evaluation
• Analysis of assessor effort
• Output length
• Explicit judgments