A Graph-based Clustering Scheme for Identifying
                                       Related Tags in Folksonomies
                        Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali




                                                                    Bilbao, Spain
                                                                   30 Aug – 3 Sep




CERTH   ITI   AUTH
overview

• tag clustering / intro

• existing solutions - limitations

• hybrid graph clustering (HGC)
    – core set detection
    – (μ,ε)-space exploration
    – core set expansion

• evaluation

• conclusions




                           Symeon Papadopoulos (CERTH-ITI, AUTH)   2
tag clustering
• starting point:
   folksonomy, i.e. annotation scheme produced by the set of users,
      resources, tags of a social tagging system, e.g. delicious, flickr,
      BibSonomy (Mika, 2005)

• observation I:
   folksonomies → a direct encoding of the views of users on how content
      items should be organized through a flexible annotation scheme

• observation II:
   tags used to describe the same resources → tags related to each other
      (meaningful semantic association)

       Mika, P.: Ontologies are us: A unified model of social networks and semantics. ISWC 2005, LNCS 3729, 522-536,
       Springer-Verlag (2005)
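
Observation II can be illustrated with a minimal sketch that turns tag assignments into a weighted tag co-occurrence graph. The function name `build_tag_graph`, the triple layout `(user, resource, tag)` and the toy data are assumptions made for illustration; the slides do not specify how the tag graphs were actually built.

```python
from collections import defaultdict
from itertools import combinations


def build_tag_graph(assignments):
    """Return {tag: {neighbor_tag: weight}} from (user, resource, tag) triples.

    Two tags are linked when they annotate the same resource; the weight
    counts the resources they share (an assumed weighting scheme).
    """
    tags_per_resource = defaultdict(set)
    for _user, resource, tag in assignments:
        tags_per_resource[resource].add(tag)

    graph = defaultdict(dict)
    for tags in tags_per_resource.values():
        for a, b in combinations(sorted(tags), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return dict(graph)


# Hypothetical toy data, not taken from the evaluated datasets.
assignments = [
    ("u1", "photo1", "beach"), ("u1", "photo1", "sea"),
    ("u2", "photo1", "sea"), ("u2", "photo2", "beach"),
    ("u2", "photo2", "sea"), ("u3", "photo3", "python"),
]
tag_graph = build_tag_graph(assignments)
print(tag_graph)
```

Tags that never co-occur with another tag (here "python") do not appear in the graph at all.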


why is tag clustering useful?
• information exploration and navigation (Begelman et al., 2006;
  Simpson, 2008)
• automatic content annotation (Brooks, 2006)
• user profiling (Gemmell, 2008)
• content clustering (Giannakidou, 2008)
• tag sense disambiguation (Au Yeung, 2009)
Begelman, G., Keller, P., Smadja, F.: Automated Tag Clustering: Improving search and exploration in the tag space. Online article:
http://www.pui.ch/phred/automated_tag_clustering (2006)
Simpson, E.: Clustering Tags in Enterprise and Web Folksonomies. Technical Report HPL-2008-18 (2008)
Au Yeung, C. M., Gibbins, N., Shadbolt, N.: Contextualising Tags in Collaborative Tagging Systems. Proceedings of 20th ACM
Conference on Hypertext and Hypermedia, pages 251-260, Turin, Italy, 29 June - 1 July, ACM (2009)
Giannakidou, E., Koutsonikola, V. A., Vakali, A., Kompatsiaris, Y.: Co-Clustering Tags and Social Data Sources. Proceedings of WAIM
2008: 9th International Conference on Web-Age Information Management. IEEE, 317-324 (2008)
Brooks, C. H., Montanez, N.: Improved annotation of the blogosphere via autotagging and hierarchical clustering.
Proceedings of WWW '06: 15th international Conference on World Wide Web. ACM, New York, NY, 625-632 (2006)
Gemmell, J., Shepitsen A., Mobasher B., Burke, R.: Personalizing Navigation in Folksonomies Using Hierarchical Tag
Clustering. Data Warehousing and Knowledge Discovery 5182, 196-205 (2008)

existing solutions (i) :: conventional clustering

• conventional clustering schemes
    represent tags in some feature space and employ
    standard clustering method, e.g.:
      • k-means (Giannakidou et al., 2008)
      • hierarchical agglomerative clustering (HAC)
             (Brooks et al., 2006; Gemmell et al., 2008)

     Giannakidou, E., Koutsonikola, V. A., Vakali, A., Kompatsiaris, Y.: Co-Clustering Tags and Social Data Sources. Proceedings
     of WAIM 2008: 9th International Conference on Web-Age Information Management. IEEE, 317-324 (2008)
     Brooks, C. H., Montanez, N.: Improved annotation of the blogosphere via autotagging and hierarchical clustering.
     Proceedings of WWW '06: 15th international Conference on World Wide Web. ACM, New York, NY, 625-632 (2006)
     Gemmell, J., Shepitsen A., Mobasher B., Burke, R.: Personalizing Navigation in Folksonomies Using Hierarchical Tag
     Clustering. Data Warehousing and Knowledge Discovery 5182, 196-205 (2008)

existing solutions (i) :: conventional clustering

• problems with conventional clustering

• needs number of clusters to be defined: very hard to even
  estimate it in large-scale tagging systems

• not easily scalable:
   – k-means (Lloyd’s):           O(I · C · n · D)
   – HAC:                         O(n² · log n)
     n: number of tags, I: number of iterations, C: number of clusters, D:
     number of dimensions
     HAC is hardly applicable since it requires n² memory for storing the
     dissimilarity matrix
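
To give a feel for the n² memory requirement, the following sketch computes the size of a dense dissimilarity matrix for a few tag counts. The 8 bytes per entry (double precision) and the chosen values of n are assumptions for illustration only.

```python
def dissimilarity_matrix_bytes(n_tags, bytes_per_entry=8):
    """Bytes needed for a dense n x n matrix (8-byte floats assumed)."""
    return n_tags * n_tags * bytes_per_entry


def condensed_matrix_bytes(n_tags, bytes_per_entry=8):
    """Bytes when only the n*(n-1)/2 distinct pairs are stored."""
    return n_tags * (n_tags - 1) // 2 * bytes_per_entry


for n in (10_000, 100_000, 1_000_000):
    dense_gb = dissimilarity_matrix_bytes(n) / 1e9
    condensed_gb = condensed_matrix_bytes(n) / 1e9
    print(f"n={n:>9,}: dense {dense_gb:,.1f} GB, condensed {condensed_gb:,.1f} GB")
```

Under these assumptions, 10,000 tags need 0.8 GB for the dense matrix and 100,000 tags need 80 GB; storing only the distinct pairs roughly halves this but does not change the quadratic growth.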


existing solutions (ii) :: community detection

• use of community detection methods on tag graphs (derived
  from folksonomies) to find groups of tags that are more
  densely connected to each other than to the rest of the graph
• community detection methods (Begelman et al., 2006; Simpson,
  2008; Au Yeung et al., 2009) largely address the shortcomings
  of conventional clustering schemes
   – efficient: complexity O(n · log n)
   – do not require the number of communities to be provided as input
     (typically use modularity maximization)


       Begelman, G., Keller, P., Smadja, F.: Automated Tag Clustering: Improving search and exploration in the tag space.
       Online article: http://www.pui.ch/phred/automated_tag_clustering (2006)
       Simpson, E.: Clustering Tags in Enterprise and Web Folksonomies. Technical Report HPL-2008-18 (2008)

       Au Yeung, C. M., Gibbins, N., Shadbolt, N.: Contextualising Tags in Collaborative Tagging Systems. Proceedings of
       20th ACM Conference on Hypertext and Hypermedia, pages 251-260, Turin, Italy, 29 June - 1 July, ACM (2009)
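
As background for the modularity maximization mentioned above, here is a minimal sketch of the standard Newman-Girvan modularity score for a given partition of an undirected, unweighted graph. It only scores a partition; it is not the maximization procedure used by any of the cited methods, and the toy graph is hypothetical.

```python
def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition.

    Q = sum over communities c of  m_c / m - (d_c / (2 m))^2,
    where m is the number of edges, m_c the number of edges inside c
    and d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    internal = {}  # edges with both endpoints in the same community
    degree = {}    # summed degree per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}
q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than putting every node into one community, which is the kind of signal a modularity-maximizing method follows.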


existing solutions (ii) :: community detection

• existing community detection schemes also suffer
  from problems
   – modularity maximization typically leads to a highly skewed
     cluster size distribution (Simpson, 2008):
       few gigantic clusters and numerous small ones →
       gigantic clusters (containing as many as half of the
       objects) are not useful for IR
   – not possible to leave noisy objects out of cluster structure
   – not possible to have overlap among clusters (which is
     useful in the context of tag clustering)
       Simpson, E.: Clustering Tags in Enterprise and Web Folksonomies. Technical Report HPL-2008-18 (2008)


hybrid graph clustering
• our solution builds on a structure-connected community
  detection approach (Xu et al., 2007) that relies on the
  concepts of structural similarity and (μ,ε)-cores:
   – nodes on the graph are structurally similar when they have many
     neighbors in common
   – a (μ,ε)-core is a node that has at least μ neighboring nodes with which
     it has structural similarity at least ε

• extended in two ways:
   – parameter space exploration → removes the need for setting
     parameters
   – core community expansion → permits overlap among communities

       Xu, X., Yuruk, N., Feng, Z., Schweiger, T. A.: SCAN: A Structural Clustering Algorithm for Networks. Proceedings of KDD
       '07: 13th international Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, 824-833 (2007)


hybrid graph clustering

• hybrid scheme:
  – (μ,ε)-core identification and structure connected
    cluster extraction (original approach)
  – (μ,ε)-parameter space exploration
     → makes scheme completely parameter-free
  – cluster expansion
     → increases coverage, permits overlap among clusters




structure connected cluster extraction
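• structural similarity between nodes u, w on a graph G = (V, E), as defined by Xu et al. (2007):
     σ(u, w) = |Γ(u) ∩ Γ(w)| / √(|Γ(u)| · |Γ(w)|),  where Γ(u) = {w ∈ V | (u, w) ∈ E} ∪ {u}
• ε-neighborhood:  N_ε(u) = {w ∈ Γ(u) | σ(u, w) ≥ ε}
• (μ,ε)-core:  CORE_{μ,ε}(u) ⇔ |N_ε(u)| ≥ μ
• direct structure reachability of w with respect to core u:
     DirREACH_{μ,ε}(u, w) ⇔ CORE_{μ,ε}(u) ∧ w ∈ N_ε(u)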

• cluster extraction (Xu et al., 2007):
   starting from a (μ,ε)-core node grow the cluster to contain all nodes that
      are directly structure reachable to it or reachable through a chain of
      nodes that are directly structure reachable to each other
       Xu, X., Yuruk, N., Feng, Z., Schweiger, T. A.: SCAN: A Structural Clustering Algorithm for Networks. Proceedings of KDD
       '07: 13th international Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, 824-833 (2007)
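
The extraction step can be sketched as follows, assuming the SCAN definitions of Xu et al. (2007): similarity is the size of the shared closed neighborhood divided by the geometric mean of the two closed-neighborhood sizes, and a node counts itself in its own ε-neighborhood. All function names and the toy graph are hypothetical; this is an illustration, not the authors' implementation.

```python
import math
from collections import deque


def to_adjacency(edges):
    """Build {node: set(neighbors)} from an undirected edge list."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def structural_similarity(adj, u, w):
    gu, gw = adj[u] | {u}, adj[w] | {w}  # closed neighborhoods
    return len(gu & gw) / math.sqrt(len(gu) * len(gw))


def eps_neighborhood(adj, u, eps):
    return {w for w in adj[u] | {u} if structural_similarity(adj, u, w) >= eps}


def is_core(adj, u, mu, eps):
    return len(eps_neighborhood(adj, u, eps)) >= mu


def extract_clusters(adj, mu, eps):
    """Grow one cluster per unassigned core by following direct reachability."""
    assigned, clusters = set(), []
    for seed in adj:
        if seed in assigned or not is_core(adj, seed, mu, eps):
            continue
        members, queue = {seed}, deque([seed])
        assigned.add(seed)
        while queue:
            u = queue.popleft()
            if not is_core(adj, u, mu, eps):
                continue  # border nodes join the cluster but do not grow it
            for w in eps_neighborhood(adj, u, eps):
                if w not in assigned:
                    assigned.add(w)
                    members.add(w)
                    queue.append(w)
        clusters.append(members)
    return clusters


# Hypothetical toy graph: two 4-cliques bridged by node "x".
clique1 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
clique2 = [("e", "f"), ("e", "g"), ("e", "h"), ("f", "g"), ("f", "h"), ("g", "h")]
adj = to_adjacency(clique1 + clique2 + [("d", "x"), ("x", "e")])
clusters = extract_clusters(adj, mu=3, eps=0.7)
print(clusters)
```

With μ = 3 and ε = 0.7 the two cliques come out as clusters, while the bridging node "x" is neither a core nor reachable from one and is therefore left out of the cluster structure.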

structure connected cluster extraction




• edge labels denote structural similarity values between nodes
• blue nodes are (μ, ε)-cores for μ = 5 and ε = 0.65
• gray nodes are directly structure reachable from (μ, ε)-cores
      • the rest of nodes are left out of the cluster structure

parameter space exploration

• original approach needs parameter setting that is
  troublesome for complex datasets
• parameter interpretation:
   – μ: a high value for μ will lead to fewer and larger clusters,
     since only nodes with degree of at least μ will be considered
     to be cores
   – ε: a high value for ε will make the cluster extraction
     process stricter, i.e. fewer nodes will be assigned to clusters
• in fact, a single (μ,ε) parameter pair is unlikely to
  discover all interesting clusters


parameter space exploration
• search for clusters at multiple parameter pairs
• identify the highest quality clusters (high μ, high ε), then
  proceed to less pronounced clusters
• exclude nodes that have already been assigned to a cluster
  from being re-assigned → makes process faster
• log-sampling along μ axis for faster exploration
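
A minimal sketch of such an exploration loop is given below. The halving schedule for μ, the linear grid for ε, the bounds, and the `extract` callable (any function that returns a list of node sets for a graph and a (μ,ε) pair) are all assumptions; the slides only state that μ is log-sampled and that strict parameter pairs are tried first.

```python
def log_spaced_mu(mu_max, mu_min=2):
    """Halve mu from mu_max down to mu_min (log sampling along the mu axis)."""
    values, mu = [], mu_max
    while mu > mu_min:
        values.append(mu)
        mu //= 2
    values.append(mu_min)
    return values


def linear_eps(eps_max=0.9, eps_min=0.1, steps=9):
    """Evenly spaced eps values from strict (high) to loose (low)."""
    width = (eps_max - eps_min) / (steps - 1)
    return [round(eps_max - i * width, 3) for i in range(steps)]


def explore(adj, extract, mu_values, eps_values):
    """Run `extract` over all (mu, eps) pairs, strictest first.

    Nodes assigned at an earlier pair are removed from the graph handed to
    later pairs, so they cannot be re-assigned.
    """
    assigned, found = set(), []
    for mu in mu_values:
        for eps in eps_values:
            remaining = {u: nbrs - assigned
                         for u, nbrs in adj.items() if u not in assigned}
            for cluster in extract(remaining, mu, eps):
                found.append((mu, eps, set(cluster)))
                assigned |= set(cluster)
    return found


print(log_spaced_mu(32))  # mu values tried, strictest first
print(linear_eps())       # eps values tried, strictest first
```

Removing already assigned nodes before each new pair keeps later passes small, at the cost of those passes no longer seeing the full graph.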



cluster expansion
• the original structure connected approach may be too strict and thus
  leave too many nodes out of the clustering structure
• an expansion process attempts to mitigate this weakness
• for each extracted core cluster, a local expansion process is conducted
  that attaches neighboring nodes
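• the expansion is based on a simple greedy maximization of a local cluster
  density measure called subgraph modularity (Luo et al., 2006):
     M(S) = ind(S) / outd(S)
  where ind(S) is the number of edges inside subgraph S and outd(S) is the
  number of edges connecting S to the rest of the graph
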
• nodes with very high degree (in the top 10% of the
  degree distribution) are not considered in this process in order to make
  the expansion process more efficient


          Luo, F., Wang, J. Z., Promislow, E.: Exploring Local Community Structures in Large Networks. Proceedings of the
          2006 IEEE/WIC/ACM international Conference on Web Intelligence. IEEE Computer Society, 233-239 (2006)
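
The greedy expansion can be sketched as follows, assuming subgraph modularity is the ratio of edges inside the subgraph to edges leaving it, as in Luo et al. (2006). The function names, the tie-breaking order and the toy graph are hypothetical, and the toy graph is not the one shown on the slides.

```python
def subgraph_modularity(adj, members):
    """Edges inside `members` divided by edges leaving `members`."""
    ind = sum(1 for u in members for w in adj[u] if w in members) // 2
    outd = sum(1 for u in members for w in adj[u] if w not in members)
    return float("inf") if outd == 0 else ind / outd


def high_degree_nodes(adj, fraction=0.10):
    """Nodes in the top `fraction` of the degree distribution."""
    k = int(len(adj) * fraction)
    ranked = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    return set(ranked[:k])


def expand(adj, core, excluded=frozenset()):
    """Repeatedly attach the neighbor that raises M(S) the most."""
    members = set(core)
    while True:
        frontier = {w for u in members for w in adj[u]} - members - set(excluded)
        best, best_score = None, subgraph_modularity(adj, members)
        for w in sorted(frontier):
            score = subgraph_modularity(adj, members | {w})
            if score > best_score:
                best, best_score = w, score
        if best is None:
            return members
        members.add(best)


# Hypothetical toy graph: triangle a-b-c, node d tied to a and b, hub x.
adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "x"},
    "d": {"a", "b"}, "x": {"c", "y", "z"}, "y": {"x"}, "z": {"x"},
}
expanded = expand(adj, {"a", "b", "c"}, excluded=high_degree_nodes(adj))
print(expanded, subgraph_modularity(adj, expanded))
```

In this toy case attaching "d" raises M(S) from 1.0 to 5.0, whereas attaching the hub "x" would not improve it, so the expansion stops after one step.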

cluster expansion




    (a) before attaching node 11                 (b) after attaching node 11
            M(S) = 1.429                                  M(S) = 2.4




evaluation :: overview
goal: compare the quality of tag clusters produced by our method (HGC) with
   that of clusters produced by state-of-the-art methods, namely:
   (a) modularity-maximization method by Clauset et al., 2004 (CNM)
   (b) original structure connected graph clustering by Xu et al., 2007 (SCAN)

two kinds of evaluation:

• direct small-scale evaluation
       subjective assessment of the produced tag clusters by eyeballing to see
       whether tags belonging to the same cluster are related

• indirect large-scale evaluation
       evaluate how useful the produced cluster structure is for some IR task, namely
       tag recommendation → if tag clusters are good, performance of tag
       recommendation based on them will be good as well



evaluation :: datasets

• three different folksonomy datasets of various sizes:




• resulting tag graphs (large component); reported statistics include
  average degree and average clustering coefficient
direct evaluation (i)
examples of unrelated tags placed in the same gigantic community by CNM




direct evaluation (ii)
          examples of interesting HGC communities




indirect evaluation :: setup (i)

• process
  – simple tag recommender based on tag clusters:
      • input tag
      • find containing community
      • recommend most frequent tags of the same community
    naïve technique, but fair for comparing the effectiveness of the used
    tag cluster structure
  – the three competing tag cluster structures (CNM, SCAN, HGC) were
    used by the recommender
  – historic tagging data were used as ground truth
      • for each user one tag was used as input and the rest were considered as
        the “correct” output
      • very frequent tags (top 5%) were left out of this process in order not to
        allow trivial (very generic) recommendations to mask the actual results
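
The recommender described above can be sketched in a few lines. The function name, the `top_k` cut-off, the alphabetical tie-breaking and the handling of tags that belong to several overlapping clusters (their clusters are merged) are assumptions; the slides only specify the three steps input tag, containing community, most frequent tags.

```python
def recommend(tag, clusters, tag_frequency, top_k=5):
    """Recommend the most frequent tags that share a cluster with `tag`."""
    candidates = set()
    for cluster in clusters:
        if tag in cluster:
            candidates |= cluster  # overlapping clusters are merged (assumption)
    candidates.discard(tag)
    ranked = sorted(candidates, key=lambda t: (-tag_frequency.get(t, 0), t))
    return ranked[:top_k]


# Hypothetical toy clusters and tag counts.
clusters = [{"beach", "sea", "sand"}, {"python", "code"}]
tag_frequency = {"beach": 10, "sea": 7, "sand": 3, "python": 8, "code": 5}

print(recommend("beach", clusters, tag_frequency))
print(recommend("unknown", clusters, tag_frequency))
```

A tag that was left out of the cluster structure yields no recommendations, which is one way a stricter clustering can trade recall for precision.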
indirect evaluation :: setup (ii)

• measures
  – RTP: number of correct recommendations per
    recommender instance
  – UTP: number of unique correct recommendations
  – P: precision, i.e. ratio of correct recommendations over
    total recommendations per recommender instance
  – R: recall, i.e. ratio of correct recommendations of a
    recommender instance over all correct tags according to
    ground truth
  – F-measure
  – P@1, P@5: Precision in the top-1/top-5 recommendations
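
For a single recommender instance, these measures can be computed as in the sketch below. It assumes the F-measure is the balanced F1 score and that P@k divides by k even when fewer than k tags were recommended; neither choice is stated on the slides, and the example tags are hypothetical.

```python
def evaluate(recommended, relevant):
    """Return (correct, precision, recall, f1) for one recommender instance."""
    relevant = set(relevant)
    correct = sum(1 for t in recommended if t in relevant)
    precision = correct / len(recommended) if recommended else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return correct, precision, recall, f1


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k slots filled with correct tags."""
    relevant = set(relevant)
    return sum(1 for t in recommended[:k] if t in relevant) / k


# Hypothetical instance: four recommended tags, three ground-truth tags.
recommended = ["sea", "sand", "sun", "car"]
relevant = {"sea", "sun", "surf"}

correct, precision, recall, f1 = evaluate(recommended, relevant)
print(correct, precision, recall, f1)
print(precision_at_k(recommended, relevant, 1),
      precision_at_k(recommended, relevant, 5))
```

Averaging these values over all recommender instances gives dataset-level figures.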


indirect evaluation :: results




•   for SCAN, we used the (μ,ε)-pair that yielded the highest F-measure
•   both SCAN and HGC perform considerably better than CNM
•   HGC results in more unique correct recommendations and higher recall
•   the cluster expansion step was responsible for the largest increase in recall and
    corresponding drop in precision

          conclusion: given the task and the evaluation setup, we would prefer HGC,
          since: (a) it is parameter free, (b) it leads to more correct recommendations

conclusions
contributions:
•   efficient tag clustering scheme that addresses several shortcomings of previous
    approaches
     –   no need for setting the number of clusters
     –   no gigantic communities
     –   noisy tags left out of cluster structure
     –   possibility for overlap among communities

caveats:
•   despite being efficient compared to conventional clustering schemes, the method
    is still much slower than the original SCAN (Xu et al., 2007)
•   the fact that previously assigned nodes are not taken into account when a new
    (μ,ε) pair is explored distorts the actual clustering results

           future work:
           •   investigate means of making parameter exploration more efficient
           •   evaluate the value of permitting overlap among communities


questions




       Symeon Papadopoulos (CERTH-ITI, AUTH)   30

Contenu connexe

Similaire à A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies

Bs31267274
Bs31267274Bs31267274
Bs31267274IJMER
 
Data Clustering Using Swarm Intelligence Algorithms An Overview
Data Clustering Using  Swarm Intelligence Algorithms  An OverviewData Clustering Using  Swarm Intelligence Algorithms  An Overview
Data Clustering Using Swarm Intelligence Algorithms An OverviewAboul Ella Hassanien
 
HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...
HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...
HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...Albert Orriols-Puig
 
Cataloguing of learning objects using social tagging
Cataloguing of learning objects using social taggingCataloguing of learning objects using social tagging
Cataloguing of learning objects using social taggingLuciana Zaina
 
Learning Relations from Social Tagging Data
Learning Relations from Social Tagging DataLearning Relations from Social Tagging Data
Learning Relations from Social Tagging DataHang Dong
 
Social Event Detection using Multimodal Clustering and Integrating Supervisor...
Social Event Detection using Multimodal Clustering and Integrating Supervisor...Social Event Detection using Multimodal Clustering and Integrating Supervisor...
Social Event Detection using Multimodal Clustering and Integrating Supervisor...Symeon Papadopoulos
 
On the Navigability of Social Tagging Systems
On the Navigability of Social Tagging SystemsOn the Navigability of Social Tagging Systems
On the Navigability of Social Tagging SystemsMarkus Strohmaier
 
Recruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security FeaturesRecruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security Featurestheijes
 
Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957IJMER
 
Grouping techniques for facing Volume and Velocity in the Big Data
Grouping techniques for facing Volume and Velocity in the Big DataGrouping techniques for facing Volume and Velocity in the Big Data
Grouping techniques for facing Volume and Velocity in the Big DataFacultad de Informática UCM
 
Improving Personal Tagging Consistency Through Visualization Of Tag
Improving Personal Tagging Consistency Through Visualization Of TagImproving Personal Tagging Consistency Through Visualization Of Tag
Improving Personal Tagging Consistency Through Visualization Of TagQin Gao
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 
Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"ieee_cis_cyprus
 
Bibliography (Microsoft Word, 61k)
Bibliography (Microsoft Word, 61k)Bibliography (Microsoft Word, 61k)
Bibliography (Microsoft Word, 61k)butest
 
Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Angelo Salatino
 
Advanced Probabilistic Modeling Algorithms for Clustering ...
Advanced Probabilistic Modeling Algorithms for Clustering ...Advanced Probabilistic Modeling Algorithms for Clustering ...
Advanced Probabilistic Modeling Algorithms for Clustering ...butest
 
Applications Of Clustering Techniques In Data Mining A Comparative Study
Applications Of Clustering Techniques In Data Mining  A Comparative StudyApplications Of Clustering Techniques In Data Mining  A Comparative Study
Applications Of Clustering Techniques In Data Mining A Comparative StudyFiona Phillips
 

Similaire à A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies (20)

Bs31267274
Bs31267274Bs31267274
Bs31267274
 
Data Clustering Using Swarm Intelligence Algorithms An Overview
Data Clustering Using  Swarm Intelligence Algorithms  An OverviewData Clustering Using  Swarm Intelligence Algorithms  An Overview
Data Clustering Using Swarm Intelligence Algorithms An Overview
 
HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...
HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...
HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers B...
 
Cataloguing of learning objects using social tagging
Cataloguing of learning objects using social taggingCataloguing of learning objects using social tagging
Cataloguing of learning objects using social tagging
 
Learning Relations from Social Tagging Data
Learning Relations from Social Tagging DataLearning Relations from Social Tagging Data
Learning Relations from Social Tagging Data
 
Social Event Detection using Multimodal Clustering and Integrating Supervisor...
Social Event Detection using Multimodal Clustering and Integrating Supervisor...Social Event Detection using Multimodal Clustering and Integrating Supervisor...
Social Event Detection using Multimodal Clustering and Integrating Supervisor...
 
On the Navigability of Social Tagging Systems
On the Navigability of Social Tagging SystemsOn the Navigability of Social Tagging Systems
On the Navigability of Social Tagging Systems
 
Recruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security FeaturesRecruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security Features
 
Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957
 
Grouping techniques for facing Volume and Velocity in the Big Data
Grouping techniques for facing Volume and Velocity in the Big DataGrouping techniques for facing Volume and Velocity in the Big Data
Grouping techniques for facing Volume and Velocity in the Big Data
 
Improving Personal Tagging Consistency Through Visualization Of Tag
Improving Personal Tagging Consistency Through Visualization Of TagImproving Personal Tagging Consistency Through Visualization Of Tag
Improving Personal Tagging Consistency Through Visualization Of Tag
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
ISA - a short overview - Dec 2013
ISA - a short overview - Dec 2013ISA - a short overview - Dec 2013
ISA - a short overview - Dec 2013
 
Kmeans
KmeansKmeans
Kmeans
 
Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"
 
Geometric Deep Learning
Geometric Deep Learning Geometric Deep Learning
Geometric Deep Learning
 
Bibliography (Microsoft Word, 61k)
Bibliography (Microsoft Word, 61k)Bibliography (Microsoft Word, 61k)
Bibliography (Microsoft Word, 61k)
 
Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics
 
Advanced Probabilistic Modeling Algorithms for Clustering ...
Advanced Probabilistic Modeling Algorithms for Clustering ...Advanced Probabilistic Modeling Algorithms for Clustering ...
Advanced Probabilistic Modeling Algorithms for Clustering ...
 
Applications Of Clustering Techniques In Data Mining A Comparative Study
Applications Of Clustering Techniques In Data Mining  A Comparative StudyApplications Of Clustering Techniques In Data Mining  A Comparative Study
Applications Of Clustering Techniques In Data Mining A Comparative Study
 

Plus de Symeon Papadopoulos

DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...
DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...
DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...Symeon Papadopoulos
 
Deepfakes: An Emerging Internet Threat and their Detection
Deepfakes: An Emerging Internet Threat and their DetectionDeepfakes: An Emerging Internet Threat and their Detection
Deepfakes: An Emerging Internet Threat and their DetectionSymeon Papadopoulos
 
Knowledge-based Fusion for Image Tampering Localization
Knowledge-based Fusion for Image Tampering LocalizationKnowledge-based Fusion for Image Tampering Localization
Knowledge-based Fusion for Image Tampering LocalizationSymeon Papadopoulos
 
Deepfake Detection: The Importance of Training Data Preprocessing and Practic...
Deepfake Detection: The Importance of Training Data Preprocessing and Practic...Deepfake Detection: The Importance of Training Data Preprocessing and Practic...
Deepfake Detection: The Importance of Training Data Preprocessing and Practic...Symeon Papadopoulos
 
COVID-19 Infodemic vs Contact Tracing
COVID-19 Infodemic vs Contact TracingCOVID-19 Infodemic vs Contact Tracing
COVID-19 Infodemic vs Contact TracingSymeon Papadopoulos
 
Similarity-based retrieval of multimedia content
Similarity-based retrieval of multimedia contentSimilarity-based retrieval of multimedia content
Similarity-based retrieval of multimedia contentSymeon Papadopoulos
 
Twitter-based Sensing of City-level Air Quality
Twitter-based Sensing of City-level Air QualityTwitter-based Sensing of City-level Air Quality
Twitter-based Sensing of City-level Air QualitySymeon Papadopoulos
 
Aggregating and Analyzing the Context of Social Media Content
Aggregating and Analyzing the Context of Social Media ContentAggregating and Analyzing the Context of Social Media Content
Aggregating and Analyzing the Context of Social Media ContentSymeon Papadopoulos
 
Verifying Multimedia Content on the Internet
Verifying Multimedia Content on the InternetVerifying Multimedia Content on the Internet
Verifying Multimedia Content on the InternetSymeon Papadopoulos
 
A Web-based Service for Image Tampering Detection
A Web-based Service for Image Tampering DetectionA Web-based Service for Image Tampering Detection
A Web-based Service for Image Tampering DetectionSymeon Papadopoulos
 
Learning to detect Misleading Content on Twitter
Learning to detect Misleading Content on TwitterLearning to detect Misleading Content on Twitter
Learning to detect Misleading Content on TwitterSymeon Papadopoulos
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersSymeon Papadopoulos
 
Verifying Multimedia Use at MediaEval 2016
Verifying Multimedia Use at MediaEval 2016Verifying Multimedia Use at MediaEval 2016
Verifying Multimedia Use at MediaEval 2016Symeon Papadopoulos
 
Placing Images with Refined Language Models and Similarity Search with PCA-re...
Placing Images with Refined Language Models and Similarity Search with PCA-re...Placing Images with Refined Language Models and Similarity Search with PCA-re...
Placing Images with Refined Language Models and Similarity Search with PCA-re...Symeon Papadopoulos
 
In-depth Exploration of Geotagging Performance
In-depth Exploration of Geotagging PerformanceIn-depth Exploration of Geotagging Performance
In-depth Exploration of Geotagging PerformanceSymeon Papadopoulos
 
Perceived versus Actual Predictability of Personal Information in Social Netw...
Perceived versus Actual Predictability of Personal Information in Social Netw...Perceived versus Actual Predictability of Personal Information in Social Netw...
Perceived versus Actual Predictability of Personal Information in Social Netw...Symeon Papadopoulos
 
Web and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News ProfessionalsWeb and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News ProfessionalsSymeon Papadopoulos
 
Predicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsPredicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsSymeon Papadopoulos
 
Finding Diverse Social Images at MediaEval 2015
Finding Diverse Social Images at MediaEval 2015Finding Diverse Social Images at MediaEval 2015
Finding Diverse Social Images at MediaEval 2015Symeon Papadopoulos
 




A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies

  • 1. A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali Bilbao, Spain 30 Aug – 3 Sep CERTH ITI AUTH
  • 2. overview • tag clustering / intro • existing solutions - limitations • hybrid graph clustering (HGC) – core set detection – (μ,ε)-space exploration – core set expansion • evaluation • conclusions Symeon Papadopoulos (CERTH-ITI, AUTH) 2
  • 4. tag clustering • starting point: folksonomy, i.e. the annotation scheme produced by the users, resources and tags of a social tagging system, e.g. delicious, flickr, BibSonomy (Mika, 2005) • observation I: folksonomies → a direct encoding of users' views on how content items should be organized, expressed through a flexible annotation scheme • observation II: tags used to describe the same resources → tags related to each other (meaningful semantic association) Mika, P.: Ontologies are us: A unified model of social networks and semantics. ISWC 2005, LNCS 3729, 522-536, Springer-Verlag (2005)
  • 5. why is tag clustering useful? • information exploration and navigation (Begelman et al., 2006; Simpson, 2008) • automatic content annotation (Brooks, 2006) • user profiling (Gemmell, 2008) • content clustering (Giannakidou, 2008) • tag sense disambiguation (Au Yeung, 2009) Begelman, G., Keller, P., Smadja, F.: Automated Tag Clustering: Improving search and exploration in the tag space. Online article: http://www.pui.ch/phred/automated_tag_clustering (2006) Simpson, E.: Clustering Tags in Enterprise and Web Folksonomies. Technical Report HPL-2008-18 (2008) Au Yeung, C. M., Gibbins, N., Shadbolt, N.: Contextualising Tags in Collaborative Tagging Systems. Proceedings of 20th ACM Conference on Hypertext and Hypermedia, pages 251-260, Turin, Italy, 29 June - 1 July, ACM (2009) Giannakidou, E., Koutsonikola, V. A., Vakali, A., Kompatsiaris, Y.: Co-Clustering Tags and Social Data Sources. Proceedings of WAIM 2008: 9th International Conference on Web-Age Information Management. IEEE, 317-324 (2008) Brooks, C. H., Montanez, N.: Improved annotation of the blogosphere via autotagging and hierarchical clustering. Proceedings of WWW '06: 15th International Conference on World Wide Web. ACM, New York, NY, 625-632 (2006) Gemmell, J., Shepitsen, A., Mobasher, B., Burke, R.: Personalizing Navigation in Folksonomies Using Hierarchical Tag Clustering. Data Warehousing and Knowledge Discovery 5182, 196-205 (2008)
  • 7. existing solutions (i) :: conventional clustering • conventional clustering schemes represent tags in some feature space and apply a standard clustering method, e.g.: • k-means (Giannakidou et al., 2008) • hierarchical agglomerative clustering (HAC) (Brooks et al., 2006; Gemmell et al., 2008)
  • 8. existing solutions (i) :: conventional clustering • problems with conventional clustering • needs the number of clusters to be defined in advance: very hard even to estimate in large-scale tagging systems • not easily scalable: – k-means (Lloyd's): O(I · C · n · D) – HAC: O(n² · log n), where n: number of tags, I: number of iterations, C: number of clusters, D: number of dimensions • HAC is hardly applicable at scale, since it requires O(n²) memory for storing the dissimilarity matrix
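The O(n²) memory claim can be made concrete with a little arithmetic. The sketch below assumes 8-byte floating-point dissimilarities and purely illustrative tag counts; neither figure comes from the slides.

```python
def dissimilarity_matrix_bytes(n_tags: int, bytes_per_value: int = 8,
                               condensed: bool = False) -> int:
    """Bytes needed to hold all pairwise dissimilarities between n_tags tags.

    condensed=True stores only the n*(n-1)/2 entries above the diagonal,
    which halves the footprint but does not change the O(n^2) growth.
    """
    if condensed:
        n_values = n_tags * (n_tags - 1) // 2
    else:
        n_values = n_tags * n_tags
    return n_values * bytes_per_value


# Hypothetical vocabulary sizes, chosen only to show the growth rate.
for n in (10_000, 100_000, 1_000_000):
    full_gb = dissimilarity_matrix_bytes(n) / 1e9
    half_gb = dissimilarity_matrix_bytes(n, condensed=True) / 1e9
    print(f"n = {n:>9,}: full = {full_gb:,.1f} GB, condensed = {half_gb:,.1f} GB")
```

Every tenfold increase in the number of tags multiplies the storage by roughly one hundred, which is why the matrix stops fitting in memory long before the tag vocabulary of a large tagging system is reached.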
  • 9. existing solutions (ii) :: community detection • community detection methods are applied to tag graphs (derived from folksonomies) to find groups of tags that are more densely connected to each other than to the rest of the graph • community detection methods (Begelman et al., 2006; Simpson, 2008; Au Yeung et al., 2009) largely address the shortcomings of conventional clustering schemes – efficient: complexity O(n · log n) – do not require the number of communities to be provided as input (typically use modularity maximization)
  • 10. existing solutions (ii) :: community detection • existing community detection schemes also suffer from problems – modularity maximization typically leads to a highly skewed cluster size distribution (Simpson, 2008): a few gigantic clusters and numerous small ones → gigantic clusters (sometimes containing as many as half of the objects) are not useful for IR – not possible to leave noisy objects out of the cluster structure – not possible to have overlap among clusters (which is useful in the context of tag clustering)
  • 12. hybrid graph clustering • our solution builds on a structure-connected community detection approach (Xu et al., 2007) that relies on the concepts of structural similarity and (μ,ε)-cores: – nodes on the graph are structurally similar when they have many neighbors in common – a (μ,ε)-core is a node that has at least μ neighboring nodes with which its structural similarity is at least ε • extended in two ways: – parameter space exploration → removes the need for setting parameters – core community expansion → permits overlap among communities Xu, X., Yuruk, N., Feng, Z., Schweiger, T. A.: SCAN: A Structural Clustering Algorithm for Networks. Proceedings of KDD '07: 13th International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, 824-833 (2007)
  • 13. hybrid graph clustering • hybrid scheme: – (μ,ε)-core identification and structure-connected cluster extraction (original approach) – (μ,ε)-parameter space exploration makes the scheme completely parameter-free – cluster expansion increases coverage and permits overlap among clusters
  • 14. structure connected cluster extraction • definitions used (following Xu et al., 2007), for nodes u, w on a graph G = {V, E}: structural similarity between u and w; ε-neighborhood; (μ,ε)-core; direct structure reachability of w w.r.t. core u • cluster extraction (Xu et al., 2007): starting from a (μ,ε)-core node, grow the cluster to contain all nodes that are directly structure reachable from it, or reachable through a chain of nodes that are directly structure reachable from each other
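For reference, the four definitions named on this slide can be written as follows. This follows the SCAN paper (Xu et al., 2007) rather than the slide itself, whose formulas are not part of this transcript, so the notation here (in particular Γ and the subscript order μ,ε) is an assumption and may differ from the original figures.

```latex
\begin{align*}
\Gamma(v) &= \{\, w \in V \mid (v, w) \in E \,\} \cup \{v\}
  && \text{(neighborhood of } v \text{, including } v \text{ itself)} \\
\sigma(u, w) &= \frac{|\Gamma(u) \cap \Gamma(w)|}{\sqrt{|\Gamma(u)| \, |\Gamma(w)|}}
  && \text{(structural similarity)} \\
N_{\varepsilon}(u) &= \{\, w \in \Gamma(u) \mid \sigma(u, w) \ge \varepsilon \,\}
  && (\varepsilon\text{-neighborhood}) \\
\mathrm{CORE}_{\mu,\varepsilon}(u) &\iff |N_{\varepsilon}(u)| \ge \mu
  && ((\mu,\varepsilon)\text{-core}) \\
\mathrm{DirREACH}_{\mu,\varepsilon}(u, w) &\iff \mathrm{CORE}_{\mu,\varepsilon}(u) \,\wedge\, w \in N_{\varepsilon}(u)
  && \text{(direct structure reachability)}
\end{align*}
```

Because Γ(u) contains u and σ(u, u) = 1, a node always belongs to its own ε-neighborhood under these definitions.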
  • 15. structure connected cluster extraction (example figure) • edge labels denote structural similarity values between nodes • blue nodes are (μ,ε)-cores for μ = 5 and ε = 0.65 • gray nodes are directly structure reachable from (μ,ε)-cores • the remaining nodes are left out of the cluster structure
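The extraction step illustrated on the last two slides can be sketched in a few lines of Python. This is a minimal reading of SCAN-style cluster extraction under the definitions of Xu et al. (2007), not the authors' implementation; the function names, the toy graph and the choice μ = 4, ε = 0.7 are all assumptions made for illustration.

```python
from collections import deque
from math import sqrt


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, w) pairs."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def structural_similarity(adj, u, w):
    """Shared closed neighbourhood, normalised by the geometric mean of sizes."""
    gu, gw = adj[u] | {u}, adj[w] | {w}
    return len(gu & gw) / sqrt(len(gu) * len(gw))


def eps_neighborhood(adj, v, eps):
    """Members of v's closed neighbourhood with similarity >= eps (includes v)."""
    return {w for w in adj[v] | {v} if structural_similarity(adj, v, w) >= eps}


def is_core(adj, v, mu, eps):
    return len(eps_neighborhood(adj, v, eps)) >= mu


def extract_clusters(adj, mu, eps):
    """Grow one cluster from every not-yet-assigned core node."""
    assigned, clusters = set(), []
    for seed in adj:
        if seed in assigned or not is_core(adj, seed, mu, eps):
            continue
        members, queue = {seed}, deque([seed])
        assigned.add(seed)
        while queue:
            v = queue.popleft()
            if not is_core(adj, v, mu, eps):
                continue  # border node: kept in the cluster, but does not grow it
            for w in eps_neighborhood(adj, v, eps):
                if w not in assigned:
                    assigned.add(w)
                    members.add(w)
                    queue.append(w)
        clusters.append(members)
    return clusters


# Toy graph (hypothetical): two 4-cliques joined by edge 3-4, plus pendant node 8.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),
         (3, 4), (0, 8)]
adj = build_adjacency(edges)
for cluster in extract_clusters(adj, mu=4, eps=0.7):
    print(sorted(cluster))
```

On this toy graph the bridge edge 3-4 has a low structural similarity, so the two cliques stay separate, and the pendant node 8 is left out of the cluster structure, mirroring the unassigned nodes in the slide's figure. A border node that is reachable from two cores is given to whichever cluster reaches it first, which is a simplification.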
  • 16. parameter space exploration • the original approach needs its parameters to be set, which is troublesome for complex datasets • parameter interpretation: – μ: a high value for μ will lead to fewer and larger clusters, since only nodes with degree of at least μ will be considered to be cores – ε: a high value for ε will make the cluster extraction process stricter, i.e. fewer nodes will be assigned to clusters • in fact, a single (μ,ε) parameter pair is unlikely to discover all interesting clusters
  • 17. parameter space exploration • search for clusters at multiple parameter pairs • identify the highest-quality clusters first (high μ, high ε), then proceed to less pronounced clusters • exclude nodes that have already been assigned to a cluster from being re-assigned → makes the process faster • log-sampling along the μ axis for faster exploration
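The exploration loop described above might look roughly like the following sketch. The slides do not give the parameter ranges, the sampling base, how ε is sampled, or the nesting order of the two loops, so all of those are assumptions here; `extract` stands for any cluster extraction routine with the hypothetical signature `extract(adj, mu, eps)`.

```python
def log_sampled_mu(mu_min: int, mu_max: int, base: float = 2.0) -> list:
    """Distinct integer mu values from mu_max down to mu_min, shrinking geometrically."""
    if mu_max < mu_min:
        raise ValueError("mu_max must be >= mu_min")
    values, mu = [], float(mu_max)
    while mu >= mu_min:
        m = int(round(mu))
        if not values or m != values[-1]:
            values.append(m)
        mu /= base
    if values[-1] != mu_min:
        values.append(mu_min)
    return values


def eps_values(eps_min: float, eps_max: float, step: float) -> list:
    """Evenly spaced eps values from eps_max down to eps_min (assumed sampling)."""
    count = int(round((eps_max - eps_min) / step)) + 1
    return [round(eps_max - i * step, 6) for i in range(count)]


def remaining_graph(adj, assigned):
    """Adjacency restricted to nodes that no earlier parameter pair has claimed."""
    return {v: nbrs - assigned for v, nbrs in adj.items() if v not in assigned}


def explore(adj, extract, mu_list, eps_list):
    """Visit (mu, eps) pairs from strict to lenient, never re-assigning a node."""
    assigned, found = set(), []
    for mu in mu_list:
        for eps in eps_list:
            sub = remaining_graph(adj, assigned)
            for cluster in extract(sub, mu, eps):
                cluster = set(cluster)
                found.append((mu, eps, cluster))
                assigned |= cluster
    return found


print(log_sampled_mu(2, 64))
print(eps_values(0.5, 0.9, 0.1))
```

Removing already assigned nodes before each pass is what makes later passes cheaper, and it is also the source of the distortion listed under caveats in the conclusions: clusters found at lenient settings are computed on a graph that no longer contains the earlier clusters.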
  • 18. cluster expansion • the original structure-connected approach may be too strict and thus leave too many nodes out of the clustering structure • an expansion process attempts to mitigate this weakness • for each extracted core cluster, a local expansion process attaches neighboring nodes • the expansion is a simple greedy maximization of a local cluster density measure called subgraph modularity, M(S) (Luo et al., 2006) • nodes with very high degree (the top 10% of the degree distribution) are not considered in this process, to keep the expansion efficient Luo, F., Wang, J. Z., Promislow, E.: Exploring Local Community Structures in Large Networks. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, 233-239 (2006)
  • 19. cluster expansion (example figure) • (a) before attaching node 11: M(S) = 1.429 • (b) after attaching node 11: M(S) = 2.4
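The expansion step can be sketched as below, assuming the subgraph modularity of Luo et al. (2006) is the ratio of edges inside S to edges leaving S. That reading matches the two values on the example slide if, say, S has 10 internal and 7 outgoing edges before node 11 is attached (10/7 ≈ 1.429) and 12 internal and 5 outgoing afterwards (12/5 = 2.4), but the figure is not part of this transcript, so those counts are only one consistent possibility. The function names, the `excluded` parameter and the toy graph are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, w) pairs."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def subgraph_modularity(adj, members):
    """M(S) = (edges with both ends in S) / (edges with exactly one end in S)."""
    inside = sum(1 for v in members for w in adj[v] if w in members) // 2
    outside = sum(1 for v in members for w in adj[v] if w not in members)
    return float("inf") if outside == 0 else inside / outside


def expand_cluster(adj, core, excluded=frozenset()):
    """Greedily attach the neighbour that raises M(S) most; stop when none does.

    `excluded` holds nodes that must never be attached, e.g. the very
    high-degree nodes that the slides leave out of the expansion.
    """
    members = set(core)
    while True:
        frontier = {w for v in members for w in adj[v]} - members - set(excluded)
        best_node, best_score = None, subgraph_modularity(adj, members)
        for w in sorted(frontier):  # sorted for determinism; assumes orderable labels
            score = subgraph_modularity(adj, members | {w})
            if score > best_score:
                best_node, best_score = w, score
        if best_node is None:
            return members
        members.add(best_node)


# Toy graph (hypothetical): triangle 0-1-2, node 3 tied to 0 and 1, hub 4 beyond node 2.
adj = build_adjacency([(0, 1), (0, 2), (1, 2), (3, 0), (3, 1),
                       (2, 4), (4, 5), (4, 6), (4, 7)])
core = {0, 1, 2}
expanded = expand_cluster(adj, core)
print(subgraph_modularity(adj, core), sorted(expanded), subgraph_modularity(adj, expanded))
```

In the toy graph, node 3 is attached because both of its edges point into the cluster, while the hub node 4 is rejected because attaching it would add more outgoing edges than internal ones.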
  • 21. evaluation :: overview • goal: compare the quality of the tag clusters produced by our method (HGC) with that of clusters produced by state-of-the-art methods, namely: (a) the modularity-maximization method of Clauset et al., 2004 (CNM) (b) the original structure-connected graph clustering of Xu et al., 2007 (SCAN) • two kinds of evaluation: • direct small-scale evaluation: subjective assessment of the produced tag clusters, checking by inspection whether tags belonging to the same cluster are related • indirect large-scale evaluation: evaluate how useful the produced cluster structure is for an IR task, namely tag recommendation → if the tag clusters are good, tag recommendation based on them will perform well too
  • 22. evaluation :: datasets • three different folksonomy datasets of various sizes (table) • resulting tag graphs (large component), described by average degree and average clustering coefficient (table)
  • 23. direct evaluation (i) examples of unrelated tags placed in the same gigantic community by CNM
  • 24. direct evaluation (ii) examples of interesting HGC communities
  • 25. indirect evaluation :: setup (i) • process – simple tag recommender based on tag clusters: • take an input tag • find the community that contains it • recommend the most frequent tags of the same community – a naïve technique, but a fair one for comparing the effectiveness of the underlying tag cluster structure – the three competing tag cluster structures (CNM, SCAN, HGC) were each used by the recommender – historic tagging data were used as ground truth • for each user, one tag was used as input and the rest were considered the “correct” output • very frequent tags (top 5%) were left out of this process, so that trivial (very generic) recommendations would not mask the actual results
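The recommender described above is simple enough to sketch directly. The function name, the cut-off `k`, the tie-breaking rule and the example data are assumptions; the slides only state the three steps.

```python
def recommend(input_tag, clusters, tag_frequency, k=5):
    """Recommend the k most frequent tags that share a cluster with input_tag.

    clusters      : iterable of sets of tags (they may overlap, as HGC permits)
    tag_frequency : dict mapping tag -> how often it was used in the folksonomy
    """
    candidates = set()
    for cluster in clusters:
        if input_tag in cluster:
            candidates |= cluster
    candidates.discard(input_tag)
    # Most frequent first; ties broken alphabetically so the output is stable.
    ranked = sorted(candidates, key=lambda t: (-tag_frequency.get(t, 0), t))
    return ranked[:k]


# Hypothetical cluster structure and usage counts, for illustration only.
clusters = [{"python", "django", "flask"}, {"paris", "louvre", "france"}]
frequency = {"python": 90, "django": 40, "flask": 25,
             "paris": 70, "louvre": 10, "france": 55}
print(recommend("python", clusters, frequency))
print(recommend("louvre", clusters, frequency, k=1))
```

A tag that belongs to no cluster yields an empty list, which is how tags left out of the cluster structure by SCAN or HGC would behave in this sketch.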
  • 26. indirect evaluation :: setup (ii) • measures – RTP: number of correct recommendations per recommender instance – UTP: number of unique correct recommendations – P: precision, i.e. ratio of correct recommendations over total recommendations per recommender instance – R: recall, i.e. ratio of correct recommendations of a recommender instance over all correct tags according to the ground truth – F-measure – P@1, P@5: precision in the top-1/top-5 recommendations
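For a single recommender instance, the per-instance measures listed above could be computed as in this sketch. It assumes the F-measure is the balanced F1 (the slides do not say which variant was used), that P@k divides by k even when fewer than k tags are recommended, and that the recommendation list contains no repeated tags; the example tags are made up.

```python
def evaluate(recommended, relevant, ks=(1, 5)):
    """Precision, recall, F1 and P@k for one ranked recommendation list."""
    relevant = set(relevant)
    hits = [tag for tag in recommended if tag in relevant]
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    scores = {"correct": len(hits), "P": precision, "R": recall, "F": f1}
    for k in ks:
        top_k = recommended[:k]
        scores[f"P@{k}"] = sum(tag in relevant for tag in top_k) / k
    return scores


# Hypothetical instance: four ranked recommendations, three ground-truth tags.
print(evaluate(["a", "b", "c", "d"], {"a", "c", "e"}))
```

Here the "correct" count corresponds to the per-instance number of correct recommendations; a unique count across instances, as in UTP, would need the hits of all instances to be pooled into a set first.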
  • 27. indirect evaluation :: results • for SCAN, we used the (μ,ε)-pair that yielded the highest F-measure • both SCAN and HGC perform considerably better than CNM • HGC results in more unique correct recommendations and higher recall • the cluster expansion step was responsible for the largest increase in recall and the corresponding drop in precision • conclusion: given the task and the evaluation setup, we would prefer HGC, since: (a) it is parameter-free, (b) it leads to more correct recommendations
  • 29. conclusions • contributions: an efficient tag clustering scheme that addresses several shortcomings of previous approaches – no need for setting the number of clusters – no gigantic communities – noisy tags left out of the cluster structure – possibility for overlap among communities • caveats: – despite being efficient compared to conventional clustering schemes, the method is still much slower than the original SCAN (Xu et al., 2007) – the fact that previously assigned nodes are not taken into account when a new (μ,ε) pair is explored distorts the actual clustering results • future work: – investigate means of making parameter exploration more efficient – evaluate the value of permitting overlap among communities
  • 30. questions