SIGIR 2016 presentation slide for paper: Xin Qian, Jimmy Lin, and Adam Roegiest. Interleaved Evaluation for Retrospective Summarization and Prospective Notification on Document Streams. Proceedings of the 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), pages 175-184, July 2016, Pisa, Italy.
2. Volunteering at the Rio 2016 Summer Olympics
Source: http://www.phdcomics.com/comics/archive_print.php?comicid=1414
• Retrospective summarization (past → now): Tweet Timeline Generation, TREC 2014
• Prospective notification (now → future): Push Notification Scenario, TREC 2015
More than Microblogs…
• RSS feeds, social media posts, medical record updates
3. Evaluation Methodology: A/B Test
• Controlled experiments in A/B tests
• Random noise between buckets
• Low question bandwidth
4. Evaluation Methodology: A/B Test vs. Interleaving
• Within-subject design in interleaving
• Better sensitivity
• Faster experiment progress
5. Interleaved Evaluation Methodology
• How exactly do we interleave the output of two systems into one single output?
• Balanced interleaving [1]
• Team-Draft interleaving [2] (sketched after the source list below)
• Probabilistic interleaving [3]
• How do we assign credit to each system in response to user interactions with the interleaved results?
• Aggregate over user clicks [1, 2, 3]
• More sophisticated click aggregation strategies [4, 5, 6]
Source: [1] T. Joachims. Optimizing search engines using clickthrough data. KDD, 2002.
[2] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? CIKM, 2008.
[3] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. CIKM, 2011.
[4] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. TOIS, 2012.
[5] F. Radlinski and N. Craswell. Comparing the sensitivity of information retrieval metrics. SIGIR, 2010.
[6] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning more powerful test statistics for click-based retrieval evaluation. SIGIR, 2010.
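To make the mechanics concrete, here is a minimal Python sketch of Team-Draft interleaving [2] with the simple click-count credit assignment of [1, 2, 3]; function and variable names are illustrative, not taken from the paper.

import random

def team_draft_interleave(run_a, run_b, k=10, rng=random):
    # Team-Draft interleaving: each round, a coin flip decides which system
    # drafts first; each system then contributes its highest-ranked document
    # not already shown, and we record which team supplied it.
    interleaved, team, seen = [], {}, set()

    def next_unseen(run):
        return next((doc for doc in run if doc not in seen), None)

    while len(interleaved) < k:
        order = [("A", run_a), ("B", run_b)]
        rng.shuffle(order)
        added = False
        for name, run in order:
            doc = next_unseen(run)
            if doc is not None and len(interleaved) < k:
                interleaved.append(doc)
                team[doc] = name
                seen.add(doc)
                added = True
        if not added:  # both runs exhausted
            break
    return interleaved, team

def click_credit(team, clicked_docs):
    # Simple credit assignment: count clicks on each team's contributions.
    credit = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            credit[team[doc]] += 1
    return credit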
32. Simulation-based Meta-evaluation: Dataset
• Grounded in the Microblog track at TREC
• TREC 2015 real-time filtering task, push notification scenario
• 37 runs from 14 groups
• Semantic cluster annotations (relevant tweets grouped into clusters of near-duplicates)
• Normalized cumulative gain (nCG): nCG(S) = (1/Z) Σ_{t ∈ S} G(t), where Z is the maximum possible gain
Example: semantically equivalent updates about the Wisconsin Sikh temple shooting, grouped into semantic clusters:
• 6:28pm At least seven killed in shooting at Sikh temple in Wisconsin
• 6:28pm seven Wisconsin shooting at Sikh temple
• 6:10pm >= 7 killed in shooting @ Sikh temple in Wisconsin
• 6:10pm 4 were shot inside the Sikh Temple of Wisconsin and 3 outside, including a gunman killed by police
• Four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police around 6:10pm
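Under this annotation, gain naturally accrues per semantic cluster rather than per tweet: once one tweet from a cluster has been delivered, near-duplicates like those above add nothing. A minimal Python sketch of the nCG computation under that assumption; the lookup tables cluster_of and cluster_gain are illustrative names, not the paper's.

def ncg(pushed_tweets, cluster_of, cluster_gain, max_gain):
    # Sum the gain of each semantic cluster the first time one of its
    # tweets is delivered; redundant tweets from a covered cluster earn
    # nothing. Normalize by the maximum possible gain Z.
    covered, total = set(), 0.0
    for tweet in pushed_tweets:
        cluster = cluster_of.get(tweet)  # None means the tweet is not relevant
        if cluster is not None and cluster not in covered:
            covered.add(cluster)
            total += cluster_gain[cluster]
    return total / max_gain if max_gain > 0 else 0.0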
33. Simulation-based Meta-evaluation: Experiment Setup
• Apply the temporal interleaving strategy (see the sketch below)
• Assume a user with the semantic cluster annotation
• Compare the difference in assigned credit against the difference in nCG
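The temporal interleaving strategy itself is the paper's contribution; as a hedged sketch, one natural reading merges the two timestamped push streams into a single timeline while remembering which system produced each update. Details such as deduplication may differ in the paper.

def temporal_interleave(stream_a, stream_b):
    # Merge two push streams of (timestamp, text) pairs into one timeline,
    # tagging each update with the system that produced it.
    tagged = [(ts, text, "A") for ts, text in stream_a] + \
             [(ts, text, "B") for ts, text in stream_b]
    return sorted(tagged, key=lambda update: update[0])

The meta-evaluation then asks, for each pair of runs, whether the system preferred by assigned credit is also the system preferred by nCG.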
39. Assessor Effort: Output Length
• Solution: flip a biased coin and keep each item with probability p (sketched below)
• No extra work
• Still reasonable
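A minimal sketch of the biased-coin filter, where p is a tunable keep probability:

import random

def subsample(interleaved, p=0.5, rng=random):
    # Keep each interleaved item independently with probability p,
    # shortening what the assessor has to read; the only cost is the
    # coin flips themselves.
    return [item for item in interleaved if rng.random() < p]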
40. Assessor Effort: Explicit Judgments
• Implicit judgments
• Web search: click models [1, 2], eye-tracking studies [3, 4], etc. [5, 6]
• Multimedia elements, different types of clicks, …
Source: [1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. SIGIR, 2006.
[2] O. Chapelle and Y. Zhang. A Dynamic Bayesian Network click model for web search ranking. WWW, 2009.
[3] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in WWW search. SIGIR, 2004.
[4] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS, 25(2):1-27, 2007.
[5] D. Kelly. Understanding implicit feedback and document preference: A naturalistic user study. SIGIR Forum, 38(1):77-77, 2004.
[6] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18-28, 2003.
42. Assessor Effort: Explicit Judgments
• Solution: pay attention with probability r (sketched below)
• Good prediction accuracy with limited user interactions
• No extra work
• Still reasonable
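A minimal sketch of this interaction model; the is_relevant predicate is a hypothetical stand-in for the cluster-based judgment:

import random

def simulate_judgments(interleaved, is_relevant, r=0.5, rng=random):
    # The simulated user examines each item only with probability r;
    # unexamined items yield no judgment and therefore no credit.
    judgments = []
    for item in interleaved:
        if rng.random() < r:  # the user pays attention to this item
            judgments.append((item, is_relevant(item)))
    return judgments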
43. Assessor Effort: Combining Both
• Randomly discarding system output + limited user interactions
• Accuracy and verbosity tradeoff curves (swept in the sketch below)
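Combining the two knobs, the tradeoff curves can be traced by sweeping both probabilities. A sketch reusing the subsample and simulate_judgments helpers above; the grid values are illustrative, not the paper's.

def tradeoff_sweep(interleaved, is_relevant,
                   ps=(0.25, 0.5, 0.75, 1.0), rs=(0.25, 0.5, 0.75, 1.0)):
    # For each (p, r) pair, record how much the user reads (verbosity)
    # and collect the judgments that would drive credit assignment.
    points = []
    for p in ps:
        for r in rs:
            shortened = subsample(interleaved, p)
            judgments = simulate_judgments(shortened, is_relevant, r)
            points.append((p, r, len(shortened), judgments))
    return points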
44. Summary
• A novel interleaved evaluation methodology
• A temporal interleaving strategy
• A heuristic credit assignment method
• A user interaction model with explicit judgments
• A simulation-based meta-evaluation
• Analysis of assessor effort
• Output length
• Explicit judgments