SlideShare une entreprise Scribd logo
1  sur  44
Télécharger pour lire hors ligne
IFCS 2009
           Dresden – 17/03/2009



Visualising a text with a tree cloud

      Philippe Gambette, Jean Véronis
Outline

• Tag and word clouds
• Enhanced tag clouds
• Tree clouds
• Construction steps
• Quality control
Tag clouds

• Built from a set of tags
• Font size related to frequency




                                    What is considered the
                                        first tag cloud, from
                                   D. Coupland: Microserfs,
                                    HarperCollins, Toronto,
                                                        1995
Tag clouds

• Built from a set of tags
• Font size related to frequency
• Gained popularity with Flickr




                                   Flickr's all time most popular tags
Word clouds

• Built from a set of words from a text
• Font size related to frequency
• Gained popularity with Wordle
Word clouds

• Built from a set of words from a text
• Font size related to frequency
• Gained popularity with Wordle




GoogleImage(obama inaugural address wordle)
Word clouds

• Built from a set of words from a text
• Font size related to frequency
• Gained popularity with Wordle




GoogleImage(obama inaugural address wordle)
Word clouds

• Built from a set of words from a text
• Font size related to frequency
• Gained popularity with Wordle




GoogleImage(obama inaugural address wordle)
Enhanced tag / word clouds

Add information from the text:
• color intensity to express recency in Amazon
• shared tags in red on del.icio.us
• group together cooccurring tags on the same line
                        Hassan-Montero & Herrero-Solana, InScit'06
• optimize blank space and semantic proximity
                                         Kaser & Lemire, WWW'07
• “topigraphy”: 2D placement according to cooccurrence
       Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
Enhanced tag / word clouds

Add information from the text:
• color intensity to express recency in Amazon
• shared tags in red on del.icio.us
• group together cooccurring tags on the same line
                        Hassan-Montero & Herrero-Solana, InScit'06
• optimize blank space and semantic proximity
                                         Kaser & Lemire, WWW'07
• “topigraphy”: 2D placement according to cooccurrence
       Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
Enhanced tag / word clouds

Add information from the text:
• color intensity to express recency in Amazon
• shared tags in red on del.icio.us
• group together cooccurring tags on the same line
                        Hassan-Montero & Herrero-Solana, InScit'06
• optimize blank space and semantic proximity
                                         Kaser & Lemire, WWW'07
• “topigraphy”: 2D placement according to cooccurrence
       Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
Enhanced tag / word clouds

Add information from the text:
• color intensity to express recency in Amazon
• shared tags in red on del.icio.us
• group together cooccurring tags on the same line
                        Hassan-Montero & Herrero-Solana, InScit'06
• optimize blank space and semantic proximity
                                         Kaser & Lemire, WWW'07
• “topigraphy”: 2D placement according to cooccurrence
       Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
Enhanced tag / word clouds

Add information from the text:
• color intensity to express recency in Amazon
• shared tags in red on del.icio.us
• group together cooccurring tags on the same line
                        Hassan-Montero & Herrero-Solana, InScit'06
• optimize blank space and semantic proximity
                                         Kaser & Lemire, WWW'07
• “topigraphy”: 2D placement according to cooccurrence
       Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
Extract semantic information from a text
• literature analysis:
philological approach: only consider the text
                                                            Brody
• discourse analysis:
tree analysis or cooccurrence graph, geodesic projection
                             Brunet (Hyperbase), Viprey (Astartex)
• text mining:
semantic graph
                                           Grimmer (Wordmapper)
• natural language processing:
sense desamibiguation
                                                Véronis (Hyperlex)
Extract semantic information from a text
• literature analysis:
philological approach: only consider the text
                                                                    Brody
• discourse analysis:
tree analysis or cooccurrence graph, geodesic projection
                             Brunet (Hyperbase), Viprey (Astartex)
• text mining:
semantic graph
                                           Grimmer (Wordmapper)
• natural language processing:
sense desamibiguation
                                                 Véronis (Hyperlex)


                                              Mayaffre, Quand travail, famille,
                                                et patrie cooccurrent dans le
                                                          discours de Nicolas
                                                            Sarkozy, JADT'08
Extract semantic information from a text
• literature analysis:
philological approach: only consider the text
                                                                  Brody
• discourse analysis:
tree analysis or cooccurrence graph, geodesic projection
                             Brunet (Hyperbase), Viprey (Astartex)
• text mining:
semantic graph
                                           Grimmer (Wordmapper)
• natural language processing:
sense desamibiguation
                                                Véronis (Hyperlex)




                                              Brunet, Les séquences (suite),
                                                                  JADT'08
Extract semantic information from a text
• literature analysis:
philological approach: only consider the text
                                                                  Brody
• discourse analysis:
tree analysis or cooccurrence graph, geodesic projection
                             Brunet (Hyperbase), Viprey (Astartex)
• text mining:
semantic graph
                                           Grimmer (Wordmapper)
• natural language processing:
word sense disamibiguation
                                                Véronis (Hyperlex)


                                                              Barry, Viprey,
                                                    Approche comparative
                                                 des résultats d'exploration
                                                     textuelle des discours
                                                  de deux leaders africains
                                                             Keita et Touré,
                                                                   JADT'08
Extract semantic information from a text
• literature analysis:
philological approach: only consider the text
                                                                 Brody
• discourse analysis:
tree analysis or cooccurrence graph, geodesic projection
                             Brunet (Hyperbase), Viprey (Astartex)
• text mining:
semantic graph
                                           Grimmer (Wordmapper)
• natural language processing:
word sense disamibiguation
                                                Véronis (Hyperlex)




                                                           Peyrat-Guillard,
                                                     Analyse du discours
                                                  syndical sur l’entreprise,
                                                                   JADT'08
Extract semantic information from a text
• literature analysis:
philological approach: only consider the text
                                                                  Brody
• discourse analysis:
tree analysis or cooccurrence graph, geodesic projection
                             Brunet (Hyperbase), Viprey (Astartex)
• text mining:
semantic graph
                                           Grimmer (Wordmapper)
• natural language processing:
word sense disambiguation
                                                Véronis (Hyperlex)




                                                    Disambiguation of word
                                                   “barrage”: dam, play-off,
                                                  roadblock, police cordon.

                                                        Véronis, HyperLex:
                                                    Lexical Cartography for
                                                Information Retrieval, 2004
Tag cloud + tree = tree cloud




                                 SplitsTree: Huson 1998,
                                   Huson & Bryant 2006
Built with
TreeCloud and                    GPL-licensed Treecloud in Python,
                                available at http://www.treecloud.org
The first tree cloud




                              Tree cloud of the blog posts containing “Laurence Ferrari”
                                        from 25/11/2007 to 10/12/2007, by Jean Véronis
                http://aixtal.blogspot.com/2007/12/actu-une-ferrari-dans-un-arbre.html
Building a tree cloud – extracting the words
Extract words with frequency:

• stoplist?
       without stoplist         with stoplist
Building a tree cloud – extracting the words
Extract words with frequency:

• lemmatization?

• groups words with similar meaning?




                                        no lemmatization...
                                        sometimes interesting


                                                70 most frequent words in
                                            Obama's campaign speeches,
                                       winsize=30, distance=dice, NJ-tree.
Building a tree cloud – dissimilarity matrix

Many semantic distance formulas based on cooccurrence
Building a tree cloud – dissimilarity matrix

Many semantic distance formulas based on cooccurrence

 Text
  sliding
  window S sliding
           step s
  width w

 cooccurrence matrices           semantic dissimilarity
 O11, O12, O21, O22              matrix
                                 chi squared, mutual
                                 information, liddel, dice,
                                 jaccard, gmean, hyperlex,
                                 minimum sensitivity, odds
                                 ratio, zscore, log likelihood,
                                 poisson-stirling...

                                                                    Evert,
                                       Statistics of words cooccurrences,
                                                        PhD Thesis, 2005
Building a tree cloud – dissimilarity matrix

Transformations needed on the dissimilarity:

• transform similarity into dissimilarity

• linear normalization for positive matrices to get distances
in [0,1]

• affine normalization for matrices with positive or negative
numbers, to get distances in [α,1] (for example α=0.1)
Building a tree cloud – tree reconstruction

Many existing methods:

• Neighbor-Joining
                                  Saitou & Nei, 1987


• Addtree variants
                           Barthelemy & Luong, 1987

• Quartet heuristic
                             Cilibrasi & Vitanyi, 2007
Building a tree cloud – tree decoration

Choice of word sizes:

• computed directly from frequency (apply a log!)

or

• computed from frequency ranking (exponential distribution)

or

• statistical significance with respect to a reference corpus
Building a tree cloud – tree decoration

Colors: chronology old
                   recent




150 most frequent words in            Built with
Obama's campaign speeches,            TreeCloud and
winsize=30, distance=oddsratio,
color=chronology, NJ-tree.
Building a tree cloud – tree decoration

Colors: dispersion sparse
                   dense
standard
deviation
of the position




                                      too blue?


150 most frequent words in
Obama's campaign speeches,
winsize=30, distance=oddsratio,
color=dispersion, NJ-tree.
Building a tree cloud – tree decoration

Colors: dispersion sparse
                   dense
standard
deviation
of the position
word frequency




                                          too red?


150 most frequent words in
Obama's campaign speeches,
winsize=30, distance=oddsratio,
color=norm-dispersion, NJ-tree.
Building a tree cloud – tree decoration

Edge color or thickness:
quality of the induced
cluster.




150 most frequent words in
Obama's campaign speeches,
winsize=30, distance=oddsratio,
color=chronology, NJ-tree.
Quality control

Is there an objective quality measure of tree clouds?
Quality control

Is there an objective quality measure of tree clouds?

What is the best method to build a tree cloud from my data?
Quality control

Is there an objective quality measure of tree clouds?

What is the best method to build a tree cloud from my data?

Tree cloud variations if small changes?
       bootstrap to evaluate:
             - stability of the result
             - robustness of the method
Quality control

Is there an objective quality measure of tree clouds?

What is the best method to build a tree cloud from my data?

Tree cloud variations if small changes?
       bootstrap to evaluate:
             - stability of the result
             - robustness of the method

Still, is there a more direct method?
         arboricity to show whether the distance matrix
         fits with a tree, which should imply stability?
                                       Guénoche & Garreta, 2001
                                        Guénoche & Darlu, 2009
Quality control – bootstrap

• Randomly delete words with probability 50%.

• Built tree cloud of original text, and altered text.

• Compute similarity of both trees (1-normalized RobinsonFoulds)
           similarity
                0,95


                 0,9


                0,85


                 0,8


                0,75


                 0,7


                0,65


                 0,6


                0,55


                 0,5
                                                                                                                        4 altered versions of 10
                                    mi             dice           gmean         ms         zscore     poissonstirling
                                                                                                                            Obama's speeches,
                       chisquared        liddell          jaccard      hyperlex    oddsratio   loglikelihood
                                                                                                                        3000 words in average,
                                                                                                                              width=30, NJ-tree.
Quality control – arboricity

Relationship between “bootstrap quality” and arboricity:
                80
    arboricity (%)

                70




                60




                50




                40




                30




                20




                10




                 0
                     0,52   0,54   0,56   0,58   0,6   0,62   0,64     0,66   0,68   0,7
                                                                     bootstrap quality
Quality control – arboricity

Relationship between “bootstrap quality” and arboricity:
                      80
          arboricity (%)

                      70



  correlation         60

  coefficient:
  0.64                50




  arboricity          40


  below
  50%:                30


  danger!
                      20




                      10




                       0
                       0,52   0,54   0,56   0,58   0,6   0,62   0,64    0,66   0,68   0,7
                                                                       bootstrap quality
Perspectives

• Make the tool available on a web interface    http://www.treecloud.org



• Evaluate tree clouds for discourse analysis

• Build the daily tree cloud of people popular on blogs,

with




                                                    http://labs.wikio.net
Thank you for your attention!




                                      Tree cloud of the words appearing
                           twice or more in the IFCS 2009 call for paper
                        lemmatization, width=20, distance=dice, NJ-tree.
                                 http://www.treecloud.org - http://www.splitstree.org
Tree clouds focused on one word




                       Tree cloud of the neighborhood of “McCain” in
                                        Obama's campaign speeches
                             http://www.treecloud.org - http://www.splitstree.org
Tree clouds focused on one word




                        Tree cloud of the neighborhood of “Bush” in
                                      Obama's campaign speeches
                            http://www.treecloud.org - http://www.splitstree.org
Tree clouds focused on one word




                        Tree cloud of the neighborhood of “world” in
                                      Obama's campaign speeches
                            http://www.treecloud.org - http://www.splitstree.org

Contenu connexe

Plus de Philippe Gambette

Practical use of combinatorial methods for phylogenetic network reconstruction
Practical use of combinatorial methods for phylogenetic network reconstructionPractical use of combinatorial methods for phylogenetic network reconstruction
Practical use of combinatorial methods for phylogenetic network reconstructionPhilippe Gambette
 
Méthodes combinatoires de reconstruction de réseaux phylogénétiques
Méthodes combinatoires de reconstruction de réseaux phylogénétiquesMéthodes combinatoires de reconstruction de réseaux phylogénétiques
Méthodes combinatoires de reconstruction de réseaux phylogénétiquesPhilippe Gambette
 
Quadruplets et réseaux non enracinés de niveau k
Quadruplets et réseaux non enracinés de niveau kQuadruplets et réseaux non enracinés de niveau k
Quadruplets et réseaux non enracinés de niveau kPhilippe Gambette
 
Utilisation de la visualisation en nuage arboré pour l'analyse littéraire
Utilisation de la visualisation en nuage arboré pour l'analyse littéraireUtilisation de la visualisation en nuage arboré pour l'analyse littéraire
Utilisation de la visualisation en nuage arboré pour l'analyse littérairePhilippe Gambette
 
Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...
Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...
Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...Philippe Gambette
 
The Structure of Level-k Phylogenetic Networks
The Structure of Level-k Phylogenetic NetworksThe Structure of Level-k Phylogenetic Networks
The Structure of Level-k Phylogenetic NetworksPhilippe Gambette
 
Estimation du nombre de citations de papillotes et de blagues Carambar
Estimation du nombre de citations de papillotes et de blagues CarambarEstimation du nombre de citations de papillotes et de blagues Carambar
Estimation du nombre de citations de papillotes et de blagues CarambarPhilippe Gambette
 
On restrictions of balanced 2-interval graphs
On restrictions of balanced 2-interval graphsOn restrictions of balanced 2-interval graphs
On restrictions of balanced 2-interval graphsPhilippe Gambette
 
Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...
Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...
Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...Philippe Gambette
 

Plus de Philippe Gambette (9)

Practical use of combinatorial methods for phylogenetic network reconstruction
Practical use of combinatorial methods for phylogenetic network reconstructionPractical use of combinatorial methods for phylogenetic network reconstruction
Practical use of combinatorial methods for phylogenetic network reconstruction
 
Méthodes combinatoires de reconstruction de réseaux phylogénétiques
Méthodes combinatoires de reconstruction de réseaux phylogénétiquesMéthodes combinatoires de reconstruction de réseaux phylogénétiques
Méthodes combinatoires de reconstruction de réseaux phylogénétiques
 
Quadruplets et réseaux non enracinés de niveau k
Quadruplets et réseaux non enracinés de niveau kQuadruplets et réseaux non enracinés de niveau k
Quadruplets et réseaux non enracinés de niveau k
 
Utilisation de la visualisation en nuage arboré pour l'analyse littéraire
Utilisation de la visualisation en nuage arboré pour l'analyse littéraireUtilisation de la visualisation en nuage arboré pour l'analyse littéraire
Utilisation de la visualisation en nuage arboré pour l'analyse littéraire
 
Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...
Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...
Codage des voisinages et parcours en largeur en temps O(n) des graphes d'inte...
 
The Structure of Level-k Phylogenetic Networks
The Structure of Level-k Phylogenetic NetworksThe Structure of Level-k Phylogenetic Networks
The Structure of Level-k Phylogenetic Networks
 
Estimation du nombre de citations de papillotes et de blagues Carambar
Estimation du nombre de citations de papillotes et de blagues CarambarEstimation du nombre de citations de papillotes et de blagues Carambar
Estimation du nombre de citations de papillotes et de blagues Carambar
 
On restrictions of balanced 2-interval graphs
On restrictions of balanced 2-interval graphsOn restrictions of balanced 2-interval graphs
On restrictions of balanced 2-interval graphs
 
Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...
Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...
Reconstruction de reseaux phylogenetiques a structure arboree depuis un ensem...
 

Dernier

16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdf16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
Emerging issues in migration policies.ppt
Emerging issues in migration policies.pptEmerging issues in migration policies.ppt
Emerging issues in migration policies.pptNandinituteja1
 
Political-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptxPolitical-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptxSasikiranMarri
 
Foreign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptxForeign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptxunark75
 
14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdf14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdf12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdf13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdf15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdf11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...The Lifesciences Magazine
 
Geostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.pptGeostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.pptUsmanKaran
 
Power in International Relations (Pol 5)
Power in International Relations (Pol 5)Power in International Relations (Pol 5)
Power in International Relations (Pol 5)ssuser583c35
 
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road ConnectivityTransforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivitynarsireddynannuri1
 
lok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptxlok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptxdigiyvbmrkt
 

Dernier (14)

16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdf16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdf
 
Emerging issues in migration policies.ppt
Emerging issues in migration policies.pptEmerging issues in migration policies.ppt
Emerging issues in migration policies.ppt
 
Political-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptxPolitical-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptx
 
Foreign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptxForeign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptx
 
14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdf14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdf
 
12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdf12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdf
 
13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdf13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdf
 
15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdf15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdf
 
11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdf11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdf
 
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
 
Geostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.pptGeostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.ppt
 
Power in International Relations (Pol 5)
Power in International Relations (Pol 5)Power in International Relations (Pol 5)
Power in International Relations (Pol 5)
 
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road ConnectivityTransforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivity
 
lok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptxlok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptx
 

Visualising a text with a tree cloud

  • 1. IFCS 2009 Dresden – 17/03/2009 Visualising a text with a tree cloud Philippe Gambette, Jean Véronis
  • 2. Outline • Tag and word clouds • Enhanced tag clouds • Tree clouds • Construction steps • Quality control
  • 3. Tag clouds • Built from a set of tags • Font size related to frequency What is considered the first tag cloud, from D. Coupland: Microserfs, HarperCollins, Toronto, 1995
  • 4. Tag clouds • Built from a set of tags • Font size related to frequency • Gained popularity with Flickr Flickr's all time most popular tags
  • 5. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle
  • 6. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
  • 7. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
  • 8. Word clouds • Built from a set of words from a text • Font size related to frequency • Gained popularity with Wordle GoogleImage(obama inaugural address wordle)
  • 9. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  • 10. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  • 11. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  • 12. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  • 13. Enhanced tag / word clouds Add information from the text: • color intensity to express recency in Amazon • shared tags in red on del.icio.us • group together cooccurring tags on the same line Hassan-Montero & Herrero-Solana, InScit'06 • optimize blank space and semantic proximity Kaser & Lemire, WWW'07 • “topigraphy”: 2D placement according to cooccurrence Fujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08
  • 14. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex)
  • 15. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex) Mayaffre, Quand travail, famille, et patrie cooccurrent dans le discours de Nicolas Sarkozy, JADT'08
  • 16. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: sense desamibiguation Véronis (Hyperlex) Brunet, Les séquences (suite), JADT'08
  • 17. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disamibiguation Véronis (Hyperlex) Barry, Viprey, Approche comparative des résultats d'exploration textuelle des discours de deux leaders africains Keita et Touré, JADT'08
  • 18. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disamibiguation Véronis (Hyperlex) Peyrat-Guillard, Analyse du discours syndical sur l’entreprise, JADT'08
  • 19. Extract semantic information from a text • literature analysis: philological approach: only consider the text Brody • discourse analysis: tree analysis or cooccurrence graph, geodesic projection Brunet (Hyperbase), Viprey (Astartex) • text mining: semantic graph Grimmer (Wordmapper) • natural language processing: word sense disambiguation Véronis (Hyperlex) Disambiguation of word “barrage”: dam, play-off, roadblock, police cordon. Véronis, HyperLex: Lexical Cartography for Information Retrieval, 2004
  • 20. Tag cloud + tree = tree cloud SplitsTree: Huson 1998, Huson & Bryant 2006 Built with TreeCloud and GPL-licensed Treecloud in Python, available at http://www.treecloud.org
  • 21. The first tree cloud Tree cloud of the blog posts containing “Laurence Ferrari” from 25/11/2007 to 10/12/2007, by Jean Véronis http://aixtal.blogspot.com/2007/12/actu-une-ferrari-dans-un-arbre.html
  • 22. Building a tree cloud – extracting the words Extract words with frequency: • stoplist? without stoplist with stoplist
  • 23. Building a tree cloud – extracting the words Extract words with frequency: • lemmatization? • groups words with similar meaning? no lemmatization... sometimes interesting 70 most frequent words in Obama's campaign speeches, winsize=30, distance=dice, NJ-tree.
  • 24. Building a tree cloud – dissimilarity matrix Many semantic distance formulas based on cooccurrence
  • 25. Building a tree cloud – dissimilarity matrix Many semantic distance formulas based on cooccurrence Text sliding window S sliding step s width w cooccurrence matrices semantic dissimilarity O11, O12, O21, O22 matrix chi squared, mutual information, liddel, dice, jaccard, gmean, hyperlex, minimum sensitivity, odds ratio, zscore, log likelihood, poisson-stirling... Evert, Statistics of words cooccurrences, PhD Thesis, 2005
  • 26. Building a tree cloud – dissimilarity matrix Transformations needed on the dissimilarity: • transform similarity into dissimilarity • linear normalization for positive matrices to get distances in [0,1] • affine normalization for matrices with positive or negative numbers, to get distances in [α,1] (for example α=0.1)
  • 27. Building a tree cloud – tree reconstruction Many existing methods: • Neighbor-Joining Saitou & Nei, 1987 • Addtree variants Barthelemy & Luong, 1987 • Quartet heuristic Cilibrasi & Vitanyi, 2007
  • 28. Building a tree cloud – tree decoration Choice of word sizes: • computed directly from frequency (apply a log!) or • computed from frequency ranking (exponential distribution) or • statistical significance with respect to a reference corpus
  • 29. Building a tree cloud – tree decoration Colors: chronology old recent 150 most frequent words in Built with Obama's campaign speeches, TreeCloud and winsize=30, distance=oddsratio, color=chronology, NJ-tree.
  • 30. Building a tree cloud – tree decoration Colors: dispersion sparse dense standard deviation of the position too blue? 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=dispersion, NJ-tree.
  • 31. Building a tree cloud – tree decoration Colors: dispersion sparse dense standard deviation of the position word frequency too red? 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=norm-dispersion, NJ-tree.
  • 32. Building a tree cloud – tree decoration Edge color or thickness: quality of the induced cluster. 150 most frequent words in Obama's campaign speeches, winsize=30, distance=oddsratio, color=chronology, NJ-tree.
  • 33. Quality control Is there an objective quality measure of tree clouds?
  • 34. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data?
  • 35. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data? Tree cloud variations if small changes? bootstrap to evaluate: - stability of the result - robustness of the method
  • 36. Quality control Is there an objective quality measure of tree clouds? What is the best method to build a tree cloud from my data? Tree cloud variations if small changes? bootstrap to evaluate: - stability of the result - robustness of the method Still, is there a more direct method? arboricity to show whether the distance matrix fits with a tree, which should imply stability? Guénoche & Garreta, 2001 Guénoche & Darlu, 2009
  • 37. Quality control – bootstrap • Randomly delete words with probability 50%. • Built tree cloud of original text, and altered text. • Compute similarity of both trees (1-normalized RobinsonFoulds) similarity 0,95 0,9 0,85 0,8 0,75 0,7 0,65 0,6 0,55 0,5 4 altered versions of 10 mi dice gmean ms zscore poissonstirling Obama's speeches, chisquared liddell jaccard hyperlex oddsratio loglikelihood 3000 words in average, width=30, NJ-tree.
  • 38. Quality control – arboricity Relationship between “bootstrap quality” and arboricity: 80 arboricity (%) 70 60 50 40 30 20 10 0 0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7 bootstrap quality
  • 39. Quality control – arboricity Relationship between “bootstrap quality” and arboricity: 80 arboricity (%) 70 correlation 60 coefficient: 0.64 50 arboricity 40 below 50%: 30 danger! 20 10 0 0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7 bootstrap quality
  • 40. Perspectives • Make the tool available on a web interface http://www.treecloud.org • Evaluate tree clouds for discourse analysis • Build the daily tree cloud of people popular on blogs, with http://labs.wikio.net
  • 41. Thank you for your attention! Tree cloud of the words appearing twice or more in the IFCS 2009 call for paper lemmatization, width=20, distance=dice, NJ-tree. http://www.treecloud.org - http://www.splitstree.org
  • 42. Tree clouds focused on one word Tree cloud of the neighborhood of “McCain” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org
  • 43. Tree clouds focused on one word Tree cloud of the neighborhood of “Bush” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org
  • 44. Tree clouds focused on one word Tree cloud of the neighborhood of “world” in Obama's campaign speeches http://www.treecloud.org - http://www.splitstree.org