From Search to Predictions in Tagged Information Spaces

1
From Search to Predictions in Tagged
Information Spaces
Christoph Trattner
Know-Center
ctrattner@know-center.at
@ Graz University of Technology, Austria
Christoph Trattner, 30.10.2014 – Yahoo! Labs, Barcelona

2
Before I start this presentation, I will talk a bit about
myself and my background…

3
Where do I come from (Austria)?

4
Graz

5
Academic Background?
• Studied Computer Science at Graz University of Technology & University of Pittsburgh
• Have worked since 2009 as a scientific researcher at the KMI & IICM (BSc 2008, MSc 2009)
• My PhD thesis was on Search & Navigation in Social Tagging Systems (defended 2012)
• Since Feb. 2013 @ Know-Center
• Leading the Social Computing Area
• At TUG:
  • WebScience
  • Semantic Technologies

6
My team
2 Post-Docs, 5 Pre-Docs (2 more to join soon)
2 MSc students
2 BSc students
DI. Dieter Theiler
DI. Dominik Kowald
Dr. Peter Kraker
Dr. Elisabeth Lex
Mag. Sebastian Dennerlein
Mag. Matthias Rella
DI. Emanuel Lacic
DI. Ilire Hasani

7
Thanks to my Collaborators

8
What is my group doing?
… we research novel methods and tools that exploit
social data to generate greater value for
individuals, communities, companies and society as a
whole.
Our competences:
• Network & Web Science
• Science 2.0
• Predictive Modeling
• Social Network Analysis
• Information Quality Assessment
• User Modeling
• Machine Learning and Data Mining
• Collaborative Systems
Our Services:
• Social Analytics: Hub, Expert, Community, Influencer, Information Flow and Trend (Event) Detection, etc.
• Information Quality Assessment
• Social & Location-based Recommender Systems
• Customer Segmentation
• Social Systems Design

9
Some industry partners...

10
Current projects
BlancNoir - “Towards a Big Data recommender engine for offline
and online marketplaces”
I2F - “Towards a Social Media and Online Marketing Manager
Seminar”
Automation-X - “Towards a scalable Graph-based Visual search
solution”
Styria - “Towards a scalable crowd-based hierarchical cluster
labeling approach for willhaben.at”
TripRebel - “Towards an engaging hybrid hotel recommender
solution for triprebel.com”
CDS - “Towards a scalable Entity & Graph-based Visual search
solution for cds.at”
Exthex - “Towards an efficient viral social media marketing
campaign in Facebook and Twitter”

11
The Projects
Project 1: Mendeley – UK Startup (recently acquired by Elsevier):
Interested in the problem of hierarchical concept-based search in
tagged information spaces.
Project 2: Tallinn University – Interested in the problem of
recommending tags and items in tagged information spaces.

12
Ok, let’s start….

13
Project 1
Mendeley – UK Startup (recently acquired by Elsevier):
Interested in the problem of hierarchical concept-based
search.

14
Research Question 1:
What kind of meta-data is more useful for search in
information systems - tags or keywords?
Externals involved:
• Mendeley, London, UK
Helic, D., Körner, C., Granitzer, M., Strohmaier, M. and Trattner, C. 2012. Navigational Efficiency of Broad vs.
Narrow Folksonomies. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media (HT
2012), ACM, New York, NY, USA, pp. 63-72.

15
Mendeley

16
[Screenshot: Mendeley Desktop showing Tags and Keywords]

17
Task
What is the best way to extract hierarchies from tagged
information spaces? What is more useful for navigation:
keyword or tag hierarchies?

18
Different types of hierarchy induction
algorithms
Helic, D., Strohmaier, M., Trattner, C., Muhr, M. and Lerman, K.: Pragmatic Evaluation of Folksonomies, In
Proceedings of the 20th international conference on World Wide Web (WWW 2011), ACM, New York, NY, USA,
417-426, 2011.

19
Issue (!!!)
...no literature on what type of hierarchy is best suited
for searching...
D. J. Watts, P. S. Dodds, and M. E. J. Newman. Identity and
search in social networks. Science, 296:1302–1305, 2002.
J. M. Kleinberg. Navigation in a small world. Nature,
406(6798):845, August 2000.

20
Stanley Milgram (1933-1984)
• A social psychologist
• Yale and Harvard University
• Study on the Small World Problem, beyond well defined communities and relations (such as actors, scientists, …)
• “An Experimental Study of the Small World Problem”

21
Set Up
• Target person: a Boston stockbroker
• Three starting populations:
  • 100 “Nebraska stockholders”
  • 96 “Nebraska random”
  • 100 “Boston random”
[Figure: the three starting populations (Nebraska stockholders, Nebraska random, Boston random) and the target, a Boston stockbroker]

22
Results
• How many of the starters would be able to establish contact with the target?
  • 64 out of 296 reached the target
• How many intermediaries would be required to link starters with the target?
  • Well, that depends: the overall mean is 5.2 links
  • Through hometown: 6.1 links
  • Through business: 4.6 links
  • Boston group faster than Nebraska groups
  • Nebraska stockholders not faster than Nebraska random
• What form would the distribution of chain lengths take?

23
Hierarchical decentralized searcher
Information
Network
Hierarchy

24
Results

25
Validation
• We compared simulations with human click trails of the online game The Wiki Game (http://thewikigame.com/)
• Contains 1,500,000 click trails of more than 500,000 users with (start; target) information.

26
Hierarchy Creation (1)
Two types of hierarchies were evaluated
1.) The first type is based on our previous work
• Categorial Concepts:
  • Tags from Delicious
  • Category labels from Wikipedia
Delicious Tag Dataset: 440,000 tags, 580,000 articles and 3,400,000 tag assignments
Wikipedia Category Label Dataset: 2,300,000 category labels, 4,500,000 articles, 30,000,000 category label assignments
[Figure: Similarity Graph, Latent Hierarchical Taxonomy]

27
Hierarchy Creation (2)
2.) The second type is based on the work of [Muchnik et al. 2007]
Simple idea: the algorithm iterates through all links in the network and decides whether each link is of a hierarchical type; if so, the link remains in the network, otherwise it is removed.
Dataset: directed link network of the English Wikipedia from February 2012. All in all, the dataset includes around 10,000,000 articles and around 250,000,000 links.
Muchnik, L., Itzhack, R., Solomon, S. and Louzoun, Y.: Self-emergence of knowledge trees: Extraction
of the Wikipedia hierarchies, Physical Review E 76, 016106 (2007)
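To make the link-filtering idea concrete, here is a minimal sketch. The decision rule is an assumption for illustration only (a link counts as hierarchical if it points to an article with strictly higher in-degree); it is a stand-in, not the criterion from Muchnik et al. The function name `filter_hierarchical_links` and the toy links are hypothetical.

```python
from collections import Counter


def filter_hierarchical_links(links):
    """Keep only the links judged 'hierarchical' by a placeholder in-degree rule.

    links: list of (source_article, target_article) pairs of a directed network.
    """
    in_degree = Counter(dst for _, dst in links)

    def is_hierarchical(src, dst):
        # Assumed rule (NOT the published criterion): the link points "upwards"
        # to a more general article, approximated by a strictly higher in-degree.
        return in_degree[dst] > in_degree[src]

    # One pass over all links: hierarchical links stay, all others are removed.
    return [(src, dst) for src, dst in links if is_hierarchical(src, dst)]


toy_links = [
    ("Graz", "Austria"), ("Vienna", "Austria"), ("Austria", "Graz"),
    ("Austria", "Europe"), ("Graz", "Europe"), ("Vienna", "Europe"),
]
kept = filter_hierarchical_links(toy_links)
```

In this toy network only the link from "Austria" back to "Graz" is dropped, since it points from a more linked-to article to a less linked-to one.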

28
Validation
Human Searchers

29
...ok, let's come back to the Mendeley “problem”...

30
Are keyword hierarchies better for search
than social tag hierarchies?
[Figure: Tags vs. Keywords]
Results: Our Greedy Navigator (= Simulator) needs on average 1 click
more with keywords to reach the target node than with tags
Results: With simulations we find that tag-based hierarchies are more efficient
for navigation than keywords
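A greedy navigator of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the simulator used in the study: the hierarchy is given as a hypothetical child-to-parent map, the searcher sees only the links of the current node, and it always clicks the unvisited neighbor that is closest to the target in the hierarchy.

```python
from collections import deque


def tree_distances(parent, target):
    """Breadth-first distances from the target over the (undirected) hierarchy."""
    adj = {}
    for child, par in parent.items():
        adj.setdefault(child, set())
        if par is not None:
            adj[child].add(par)
            adj.setdefault(par, set()).add(child)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def greedy_navigate(network, parent, start, target, max_clicks=20):
    """Return the click path from start to target, or None if the search fails."""
    dist = tree_distances(parent, target)
    path, visited, current = [start], {start}, start
    while current != target and len(path) - 1 < max_clicks:
        candidates = [n for n in network.get(current, ())
                      if n not in visited and n in dist]
        if not candidates:
            return None  # dead end: no unvisited neighbor known to the hierarchy
        # Greedy step: pick the neighbor with the smallest hierarchy distance.
        current = min(candidates, key=lambda n: (dist[n], n))
        visited.add(current)
        path.append(current)
    return path if current == target else None


# Toy hierarchy (child -> parent) and toy link network, both made up.
parent = {"root": None, "science": "root", "art": "root",
          "physics": "science", "biology": "science", "painting": "art"}
network = {"painting": ["art", "physics"], "art": ["root", "painting"],
           "physics": ["biology", "painting"], "biology": ["physics"],
           "root": ["art", "science"], "science": ["root"]}
path = greedy_navigate(network, parent, "painting", "biology")
```

The number of clicks is `len(path) - 1`; averaging it over many (start; target) pairs gives the kind of figure that can be compared between tag and keyword hierarchies.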

31
...ok, let's move on to some prediction stuff

32
Project 2
Tallinn University – Interested in the problem of
recommending items and tags to users in social
tagging systems.

33
Research Question 2:
To what extent is human cognition theory applicable to
the problem of predicting tags and items to users?
Externals involved:
• PUC - Chile, UFCG – Brazil

34
Motivation
• Social tags help you to classify Web content better [Zubiaga 2012]
• They help people to navigate large knowledge repositories better [Helic et al. 2012]
• They help people to search for information faster [Trattner et al. 2012]
However, there is an issue with social tags…
People are typically too lazy to apply social tags (!!)
Zubiaga, A. (2012). Harnessing Folksonomies for Resource Classification. arXiv preprint arXiv:1204.6521.
Trattner, C., Lin, Y. L., Parra, D., Yue, Z., Real, W., & Brusilovsky, P. (2012, June). Evaluating tag-based information
access in image collections. In Proceedings of the 23rd ACM conference on Hypertext and social media (pp. 113-
122). ACM.
Helic, D., Körner, C., Granitzer, M., Strohmaier, M., & Trattner, C. (2012, June). Navigational efficiency of broad vs.
narrow folksonomies. In Proceedings of the 23rd ACM conference on Hypertext and social media (pp. 63-72). ACM.

35
Motivation
To overcome that issue, some smart people started to invent mechanisms that
should help the user in applying tags, known as social tag recommender
systems, based on:
• Collaborative Filtering
  • User-based and item-based CF [Marinho et al. 2008]
• Matrix Factorization
  • FM, PITF [Rendle et al. 2010, 2011, 2012]
• Graph Structures
  • Adapted PageRank and FolkRank [Hotho et al. 2006]
• Topic Models
  • Latent Dirichlet Allocation (LDA) [Krestel et al. 2009, 2010, 2011]

36
Why do we need cognitive models?
First answer: We do not like data-driven approaches…
Me: OK
Second answer: We can understand things better…
…why is something happening and how…

37
MINERVA2

38
Approach
• Based on a model of human cognition (derived from MINERVA2 [Kruschke et al., 1992])

39
Evaluation
• Wikipedia
• p-core pruning (p = 14)
• To measure the performance of our approach, we split our dataset into two subsets: 80% for training and 20% for testing
• Precision, Recall, F1-score, MRR, MAP
• As baseline algorithm we have chosen Latent Dirichlet Allocation (LDA) [Krestel et al. 2009]
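As a reminder of what these metrics measure, here is a minimal per-user sketch. It assumes a ranked list of recommended tags without duplicates and a set of tags the user actually applied in the test set. The function name `evaluate_ranking` is hypothetical, and the average-precision normalization (dividing by min(number of relevant tags, k)) is one common convention, not necessarily the one used in the evaluation. MRR and MAP are then the means of the reciprocal rank and the average precision over all test users.

```python
def evaluate_ranking(recommended, relevant, k=10):
    """Precision, recall, F1, reciprocal rank and average precision at cutoff k."""
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = [tag in relevant for tag in top_k]
    n_hits = sum(hits)

    precision = n_hits / len(top_k) if top_k else 0.0
    recall = n_hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)

    reciprocal_rank = 0.0
    ap_sum, seen = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            if reciprocal_rank == 0.0:
                reciprocal_rank = 1.0 / rank  # rank of the first correct tag
            seen += 1
            ap_sum += seen / rank  # precision at each rank with a hit
    # Assumed normalization: best achievable number of hits within the cutoff.
    average_precision = ap_sum / min(len(relevant), k) if relevant else 0.0

    return {"precision": precision, "recall": recall, "f1": f1,
            "rr": reciprocal_rank, "ap": average_precision}


# Made-up example: 4 recommended tags, 3 tags actually used by the user.
scores = evaluate_ranking(["python", "web", "java", "ml"],
                          {"web", "ml", "search"}, k=4)
```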

40
Results
Results:
3Layers reaches higher accuracy
estimates than the pure LDA
approach.

41
ACT-R

42
ACT-R

43
Interestingly, when looking into the literature on tagging
systems, temporal processes are typically modeled
with an exponential function...
D. Yin, L. Hong, and B. D. Davison. Exploiting session-like behaviors in tag prediction. In
Proceedings of the 20th international conference companion on World wide web, pages
167–168. ACM, 2011.
L. Zhang, J. Tang, and M. Zhang. Integrating temporal usage pattern into personalized tag
prediction. In Web Technologies and Applications, pages 354–365. Springer, 2012.

44
Empirical Analysis: BibSonomy (1)
• Linear distribution with log-scale on Y-axis → exponential function
• Linear distribution with log-scale on X- and Y-axes → power function

45
Empirical Analysis: BibSonomy (2)
• Exponential distribution: R² = 31%
• Power distribution: R² = 89%
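The comparison of the two decay models can be sketched as two straight-line fits: an exponential function is linear when only the Y-axis is log-scaled, a power function is linear when both axes are log-scaled. The sketch below computes R² for both in plain Python. It is an illustration under assumptions: the function names are hypothetical, the data is synthetic rather than BibSonomy, and the exact fitting procedure behind the reported R² values is not stated on the slides.

```python
import math


def r_squared_of_line(xs, ys):
    """R² of an ordinary least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For a simple linear fit, R² equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


def compare_decay_models(times, counts):
    """times: time since last tag use (> 0); counts: re-use counts (> 0)."""
    log_t = [math.log(t) for t in times]
    log_c = [math.log(c) for c in counts]
    return {
        "exponential": r_squared_of_line(times, log_c),  # log-scale on Y only
        "power": r_squared_of_line(log_t, log_c),        # log-scale on X and Y
    }


# Synthetic data that follows a power law exactly (not real measurements).
times = [1, 2, 4, 8, 16, 32]
counts = [100 * t ** -1.5 for t in times]
fit = compare_decay_models(times, counts)
```

On this synthetic input the power fit is essentially perfect and the exponential fit is worse, which is the same direction as the result reported for BibSonomy.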

46
Results:
The decay factor is better modeled as a
power function than as an exponential function
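A power-function decay is what the base-level learning (BLL) equation of ACT-R uses: the activation of a memory item is the logarithm of the summed, power-decayed recencies of its past uses, B = ln(Σ_j (t_now − t_j)^(−d)). Below is a minimal sketch for ranking a user's tags by this activation. The function names, the toy timestamps and the decay value d = 0.5 (the conventional ACT-R default) are assumptions for illustration, not the TagRec implementation.

```python
import math


def base_level_activation(usage_timestamps, now, d=0.5):
    """BLL activation: ln of the sum of power-decayed recencies of past uses."""
    total = sum((now - t) ** (-d) for t in usage_timestamps if t < now)
    return math.log(total) if total > 0 else float("-inf")


def rank_tags_by_bll(tag_usages, now, d=0.5, k=5):
    """tag_usages: {tag: [timestamps at which the user applied the tag]}."""
    scores = {tag: base_level_activation(ts, now, d)
              for tag, ts in tag_usages.items()}
    # Highest activation first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda tag: (-scores[tag], tag))[:k]


# Toy history: "java" was used often but long ago, "web" once but recently.
usages = {"python": [1, 2, 99], "java": [1, 2, 3], "web": [98]}
ranking = rank_tags_by_bll(usages, now=100)
```

In the toy history the recently used tags rank above the frequently but long-ago used one, which is the effect of combining frequency and recency in a single score.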

47
Experiment 1: Predicting re-use of tags

48
Results: Predicting re-use of tags
[Figure legend: BLLAC, BLL, MPU, GIRP]

49
Results: Recall / Precision
Results:
BLLAC performs fairly well in
predicting the re-use of tags

50
Experiment 2: Recommending Tags

51
Results: Recall-Precision plots
• The time-dependent approaches outperform the state-of-the-art
• BLL+MPr reaches the highest level of accuracy
[Figure: Recall-Precision plots, CiteULike]

52
Results: Recall / Precision
Results: BLL approaches outperform current state-of-the-art tag recommender approaches.

53
...how about runtime?

54
Results: Runtime
BLL+C needs only around 1s to generate tag-recommendations
for 5,500 users in BibSonomy

55
Results: Runtime

56
...predicting (re-ranking) items with ACT-R

57
Our Approach
= CIRTT → 2 main steps
First step:
– User-based Collaborative Filtering (CF) to get candidate items of similar users
Second step:
– Item-based CF to rank these candidate items using the BLL equation to integrate tag and time information:
[Equation: shown as an image on the slide]
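The two steps can be illustrated with a rough sketch. Everything below is an assumption made for illustration and should not be read as the actual CIRTT formula: user similarity is Jaccard overlap of bookmarked items, item similarity is Jaccard overlap of tag sets, and the time component is a BLL-style power decay (d = 0.5) applied to the age of the user's own bookmarks. All names and data are hypothetical.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend_items(user, user_items, item_tags, now, d=0.5,
                    n_neighbors=20, k=20):
    """user_items: {user: {item: bookmark_timestamp}}; item_tags: {item: tags}."""
    own = user_items[user]

    # Step 1: user-based CF, candidate items come from the most similar users.
    sims = {u: jaccard(own, items) for u, items in user_items.items() if u != user}
    neighbors = sorted((u for u in sims if sims[u] > 0),
                       key=lambda u: (-sims[u], u))[:n_neighbors]
    candidates = {i for u in neighbors for i in user_items[u] if i not in own}

    # Step 2: item-based CF, rank candidates by tag similarity to the user's own
    # items, where each own item is weighted by a power decay of its age.
    def score(candidate):
        return sum(
            jaccard(item_tags.get(candidate, ()), item_tags.get(item, ()))
            * (now - ts) ** (-d)
            for item, ts in own.items() if ts < now
        )

    return sorted(candidates, key=lambda i: (-score(i), i))[:k]


# Toy data: alice bookmarked p1 long ago and p2 recently.
user_items = {
    "alice": {"p1": 10, "p2": 90},
    "bob": {"p1": 5, "p3": 50, "p4": 60},
    "carol": {"p9": 1, "p5": 2},
}
item_tags = {"p1": {"ml"}, "p2": {"search", "tags"}, "p3": {"ml"},
             "p4": {"tags", "search"}, "p5": {"search"}}
recs = recommend_items("alice", user_items, item_tags, now=100)
```

Here the candidate that resembles the recently bookmarked item is ranked above the one that resembles the old bookmark, which is the intended effect of the time component.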

58
How does it perform?
3 freely available folksonomy datasets
– BibSonomy (~ 340,000 tag assignments)
– CiteULike (~ 100,000 tag assignments)
– MovieLens (~ 100,000 tag assignments)
Original datasets (no p-core pruning), Doerfel et al. (2013)
80/20 split (for each user, the 20% most recent bookmarks/posts in the test set, the rest in the training set)
IR metrics: nDCG@20, MAP@20, Recall@20, Diversity and User Coverage
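A per-user, time-based 80/20 split of this kind can be sketched as follows. The function name, the tuple layout and the rounding rule (rounding down the number of test posts, so users with fewer than five posts contribute no test data) are assumptions; the slides do not state how such cases were handled.

```python
from collections import defaultdict


def split_by_recency(posts, test_fraction=0.2):
    """posts: iterable of (user, item, timestamp) tuples.

    Returns (train, test), where each user's most recent posts go to test.
    """
    by_user = defaultdict(list)
    for user, item, ts in posts:
        by_user[user].append((ts, item))

    train, test = [], []
    for user, entries in by_user.items():
        entries.sort()  # oldest first
        n_test = int(len(entries) * test_fraction)  # assumed: round down
        cut = len(entries) - n_test
        train += [(user, item, ts) for ts, item in entries[:cut]]
        test += [(user, item, ts) for ts, item in entries[cut:]]
    return train, test


# Toy posts: u1 has five bookmarks, u2 only two.
posts = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u1", "d", 4),
         ("u1", "e", 5), ("u2", "x", 1), ("u2", "y", 2)]
train, test = split_by_recency(posts)
```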

59
Baseline Methods
• Most Popular (MP)
• User-based Collaborative Filtering (CF)
• Two alternative approaches based on tag and time
information
– Zheng et al. (2011) → exponential function
– Huang et al. (2014) → linear function
(remember: our CIRTT uses a power function)

60
Results: nDCG plots
CIRTT reaches the highest level of accuracy

61
Results: Recall plots
CIRTT reaches the highest level of accuracy

62
Results
Results: CIRTT works quite well compared to the current state-of-the-art in tag-based item recommender systems

63
What are we...
...currently working on...

64
MINERVA2 + ACT-R

65
Time in Semantic vs. Lexical Memory

66
Topical vs. Lexical shift in time
Topics
Tags
Results:
Topical shift in time is less
pronounced than lexical shift

67
Results: Recall / Precision

68
Describer vs. Categorizer
M. Strohmaier, C. Koerner, and R. Kern. Understanding why users tag: A survey of tagging motivation
literature and results from an empirical study. Journal of Web Semantics, 17:1–11, 2012.

69
Results: Categorizer vs. Describer

70
... ok, that's basically it

71
Code and Framework
https://github.com/learning-layers/TagRec/

72
Thank you!
Christoph Trattner
Email: trattner.christoph@gmail.com
Web: christophtrattner.info
Twitter: @ctrattner
Sponsors:
