ESWC 2016 talk on computing types (ontology classes) for literals in RDF, adding semantics to make them richer, and then utilizing those types in an entity summarization use case.
Gleaning Types for Literals in RDF with Application to Entity Summarization
1. Gleaning Types for Literals in RDF Triples with Application to Entity Summarization
Kalpa Gunaratna 1, Krishnaprasad Thirunarayan 1, Amit Sheth 1, Gong Cheng 2
1 Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, USA
2 National Key Laboratory for Novel Software Technology, Nanjing University, China
13th Extended Semantic Web Conference (ESWC 2016), Greece, May 31, 2016
2. o Literals and background of Entity Summarization
o Typing literals in knowledge graphs
o Entity Summarization (FACES-E)
o Evaluation
– Typing
– Entity Summarization with datatype properties
o Conclusion and Future Work
Talk Overview
3. o A considerable amount of information is captured in datatype properties.
– 1600 datatype properties vs. 1079 object properties in DBpedia.
o Many literals can be "easily typed" for proper interpretation and use.
– Example: in DBpedia, http://dbpedia.org/property/location has ~100,000 unique simple literals that can be directly mapped to entities.
o The added semantics can be used in practical and useful applications such as (i) entity summarization, (ii) property alignment, (iii) data integration, and (iv) dataset profiling.
Motivating Facts – Literals and Semantics
4. o Datasets and knowledge graphs on the web continue to grow in number and size.
– DBpedia (3.9) has around 200 triples per entity on average.
o It is difficult to process all the facts of an entity when browsing.
o Better presentation is required. Good-quality summaries can help!
Let's focus on entity summarization now…
5. Importance of Entities and Summaries
Google has its own knowledge graph, the Google Knowledge Graph (GKG), to facilitate search.
Google made summarization their second priority in building the GKG*.
* Singhal, A. 2012. Introducing the Knowledge Graph: Things, Not Strings. Official Google Blog, May 2012.
6. o Introduced the FACES (FACeted Entity Summaries) approach*.
o FACES follows two main steps.
o First, it groups "conceptually" similar features.
– Different groups contain different facts from each other.
o Second, it picks features (property-value pairs) from these groups, improving diversity, for the summaries.
Diversity-Aware Entity Summaries (the FACES Approach) – Background
* Kalpa Gunaratna, Krishnaprasad Thirunarayan, and Amit Sheth. "FACES: Diversity-Aware Entity Summarization Using Incremental Hierarchical Conceptual Clustering." 29th AAAI Conference on Artificial Intelligence (AAAI 2015), AAAI, 2015.
7. Faceted Entity Summary – Example
[Diagram: the entity Marie Curie with its features, e.g., spouse → Pierre_Curie, birthPlace → Warsaw, deathPlace → Passy,_Haute-Savoie, almaMater → ESPCI_ParisTech, workInstitutions → University_of_Paris, knownFor → Radioactivity, field → Chemistry; conceptually similar features share a color.]
A concise and comprehensive summary could be {f1, f2, f6}; another summary could be {f4, f6, f7}.
8. o FACES utilizes the type semantics of objects when grouping features.
o Literals in RDF triples do not have "semantic" types; they only have datatypes (e.g., date, integer, string).
o Can we add semantic types to literals? How?
Information coming from literals?
9. o FACES can only handle object-property-based features.
o Why?
– The values of these features are not URIs and have no "semantic" types.
– Hence, the adapted grouping algorithm (Cobweb) cannot get types for the property values.
– It cannot create the partitions needed for faceted entity summaries.
o Our contributions are to:
– First, compute types for the values of datatype-property-based features (data enrichment).
– Then, adapt and improve the ranking algorithms (summarization).
The result is the FACES-E system.
Typing Literals in RDF Triples for Entity Summarization
10. Typing Datatype Property Values – Example
[Diagram: dbr:Barack_Obama dbp:vicePresident dbr:Joe_Biden, where dbr:Joe_Biden has rdf:type dbo:Politician. Similarly, dbr:Barack_Obama dbp:shortDescription "44th President of the United States"^^xsd:string and dbr:Calvin_Coolidge dbp:orderInOffice "48th Governor of Massachusetts"^^xsd:string; these literals can be typed as dbo:President and dbo:Governor respectively, both rdfs:subClassOf dbo:Politician.]
11. o Unlike a URI, the focus of a literal is not clear.
o It may contain several entities or labels matching ontology classes.
o The literal can be long.
– In this work, we focus on literals that are one sentence long.
– For paragraph-like text, finding a single focus is hard and needs different techniques.
Why is it Hard?
[Example: the literal "44th President of the United States" admits several candidate interpretations (options 1–3).]
12. o We expect the focus of the sentence or phrase to lead to the representative entity/type of the sentence.
o There is prominent work on identifying the head word of a sentence/phrase.
– Example: the head word of "member of committee" is "member".
o We use existing head word detection algorithms to identify the focus term.
– Collins' head word detection algorithm.
Focus term identification
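As an illustration of focus-term extraction: the talk uses Collins' head word detection over parse trees; the following minimal sketch approximates it with a dependency-parse root instead, which is an assumption for illustration, not the authors' implementation.

import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")

def focus_term(literal: str) -> str:
    # The root of the dependency parse approximates the head word
    # that Collins-style head rules would pick for short phrases.
    doc = nlp(literal)
    return next(tok.text for tok in doc if tok.dep_ == "ROOT")

print(focus_term("44th President of the United States"))  # expected: "President"
print(focus_term("member of committee"))                  # expected: "member"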
13. We filter out date and numeric values.
1. Exact matching of the focus term to class labels.
– E.g., "48th Governor of Massachusetts" → Governor (class).
2. Get the n-grams and look for a matching class using n-gram and focus-term overlap (maximal match).
I. Check for a class matching an overlapping n-gram.
II. If a type is not found, spot entities in the n-grams and get their types.
• "United States Senate": the 3-gram matches the entity in DBpedia.
3. Semantic matching of the focus term to class labels.
– We compare the pairwise similarity of the focus term with all the class labels and pick the highest (we utilize the UMBC semantic similarity service).
Deriving type (class) from head word
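A minimal sketch of the three-step cascade above, assuming toy lookup tables in place of the DBpedia ontology, entity spotting, and the UMBC similarity service (class_labels, entity_index, entity_types, and sim are illustrative stand-ins):

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def derive_types(literal, focus, class_labels, entity_index, entity_types, sim):
    # Step 1: exact match of the focus term against ontology class labels.
    if focus.lower() in class_labels:
        return {class_labels[focus.lower()]}
    # Step 2: n-grams overlapping the focus term, longest first (maximal match).
    tokens = literal.split()
    for n in range(len(tokens), 0, -1):
        for gram in ngrams(tokens, n):
            if focus.lower() not in gram.lower():
                continue
            if gram.lower() in class_labels:   # 2(I): the n-gram names a class
                return {class_labels[gram.lower()]}
            if gram in entity_index:           # 2(II): the n-gram names an entity
                return set(entity_types[entity_index[gram]])
    # Step 3: fall back to the semantically closest class label.
    best = max(class_labels, key=lambda label: sim(focus, label))
    return {class_labels[best]}

# Toy usage:
class_labels = {"governor": "dbo:Governor", "legislature": "dbo:Legislature"}
entity_index = {"United States Senate": "dbr:United_States_Senate"}
entity_types = {"dbr:United_States_Senate": ["dbo:Legislature"]}
print(derive_types("48th Governor of Massachusetts", "Governor",
                   class_labels, entity_index, entity_types, lambda a, b: 0.0))
# -> {'dbo:Governor'}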
16. o The ranking mechanism for objects (in FACES) does not work here.
– Why? Two literals can be unique even if their types and main entities are the same.
• Example: "United States President" vs. "President of the United States" (frequency counting is affected).
• Searching using the whole phrase is not desirable.
– Hence, use entities.
– A literal can contain several entities. Which one should we choose?
Ranking Datatype Property Features
17. o We observe that humans recognize popular entities.
– Entities can appear in literals with variations.
o We use the popular entities in literals, not the literals themselves, for ranking.
o Functions:
– ES(v) returns all entities present in the value v.
– max(ES(v)) returns the most popular entity in ES(v).
Idea for Ranking
v = "44th President of the United States"
ES(v) = {db:President, db:United_States}
max(ES(v)) = db:United_States
Remember: the goal of ranking is disjoint from the typing mechanism.
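A sketch of ES(v) and max(ES(v)) over toy tables; the spotting results and popularity counts below are made up for illustration (in the talk, entities are spotted in the literal and popularity comes from the dataset):

SPOTS = {"44th President of the United States":
         ["db:President", "db:United_States"]}
POPULARITY = {"db:President": 12000, "db:United_States": 450000}

def ES(v):
    # All entities spotted in the literal value v.
    return SPOTS.get(v, [])

def max_ES(v):
    # The most popular entity among those spotted in v.
    return max(ES(v), key=POPULARITY.get, default=None)

print(max_ES("44th President of the United States"))  # -> db:United_States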
18. Modified Ranking Equations
If you really wanted to know…
[The equations appear as figures on the slide; their captions:]
– Inf(f)': informativeness is inversely proportional to the number of entities that are associated with overlapping values containing the most popular entity of feature f.
– Po(v)': the frequency of the most popular entity in v.
– Rank(f)': a tf-idf-based ranking score.
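The equation images themselves do not survive in this text version. As a hedged reconstruction, based only on the captions above and on the tf-idf style score Rank(f) = Inf(f) · Po(v) of the original FACES work, the modified scores for a feature f = (p, v) of an entity in a dataset E of N entities might take the form:

\[ Inf(f)' = \log \frac{N}{\bigl|\{\, e \in E \;:\; \exists\,(p, v') \in FS(e) \text{ with } \max(ES(v)) \in ES(v') \,\}\bigr|} \]
\[ Po(v)' = \log \bigl|\{\, \text{triples with value } v' \text{ such that } \max(ES(v)) \in ES(v') \,\}\bigr| \]
\[ Rank(f)' = Inf(f)' \cdot Po(v)' \]

This sketches the structure implied by the captions, not the paper's exact formulas.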
19. o Aggregate the feature ranking scores for each facet.
o Rank the facets based on the aggregated scores.
Facet Ranking
Rank(f) is the original function and Rank(f)' is the modified one for datatype-property-based features.
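A one-function sketch of this facet ranking; summation is assumed as the aggregation here, since the slide text does not pin it down:

def facet_rank(facets, rank):
    # facets: iterable of facets, each a collection of features
    # rank: a per-feature scoring function (Rank or Rank')
    # Aggregate the feature scores per facet, then order facets best-first.
    return sorted(facets, key=lambda C: sum(rank(f) for f in C), reverse=True)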
20. 1. Extract the features of the entity e.
2. Enrich each feature to get its WordSet WS(f).
3. The enriched feature set FS(e) is input to the partitioning algorithm, which produces the facet set F(e).
4. First compute the feature ranking scores R(f), then compute the facet ranking scores FacetRank(F(e)) for each facet.
5. Pick the top-ranked features from the top-ranked facets, in order, to form the faceted entity summary. The constraints defined for the faceted entity summary hold.
FACES-E Entity Summary Generation
[Pipeline figure annotating steps (1)–(5), with "Enriching Literals" and "Modified Ranking" marking the new components.]
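An end-to-end sketch of steps (1)–(5). Here FS is the extracted feature set from step (1), and enrich, partition, and rank_feature are hypothetical stand-ins for the WordSet enrichment, the Cobweb-based partitioning, and the modified ranking described above:

def faces_e_summary(FS, k, enrich, partition, rank_feature):
    ws = {f: enrich(f) for f in FS}          # (2) WordSet WS(f) per feature
    facets = partition(FS, ws)               # (3) facet set F(e)
    facets = sorted(facets,                  # (4) facet scores from feature scores
                    key=lambda C: sum(rank_feature(f) for f in C),
                    reverse=True)
    pools = [sorted(C, key=rank_feature, reverse=True) for C in facets]
    summary = []                             # (5) round-robin over ranked facets,
    while len(summary) < k and any(pools):   #     respecting the summary constraints
        for pool in pools:
            if pool and len(summary) < k:
                summary.append(pool.pop(0))
    return summary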
21. Literal → Computed types
United States Ambassador to the United Nations → Agent, Ambassador, Person
Chairman of the Republican National Committee → Agent, Politician, Person, President
United States Navy → Agent, Organisation, Military Unit
Member of the New York State Senate → Agent, OrganisationMember, Person
Senate Minority Leader → Agent, Politician, Person, President
United States Senate → Agent, Organisation, Legislature
from Virginia → Administrative Region, Place, Region, Populated Place
Denison, Texas, U.S. → Administrative Region, Place, Country, Region, Populated Place
Type Computation Samples (with supertypes, excluding owl:Thing)
22. o Type Set TS(v) is the generated set of types for the value v.
Evaluation – Type Generation Metrics
n is the total number of features.
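The metric formulas appear as figures on the slide. One plausible reading, consistent with TS(v) and with n as the total number of features (a hedged reconstruction, not the paper's exact definitions): a type set counts toward Mean Precision in proportion to its correct types, toward Any Mean Precision if at least one of its types is correct, and toward Coverage if it is non-empty:

\[ MP = \frac{1}{n} \sum_{i=1}^{n} \frac{\bigl|\{\, t \in TS(v_i) : t \text{ judged correct} \,\}\bigr|}{|TS(v_i)|} \]
\[ AMP = \frac{1}{n} \bigl|\{\, i : \exists\, t \in TS(v_i) \text{ judged correct} \,\}\bigr| \]
\[ Coverage = \frac{1}{n} \bigl|\{\, i : TS(v_i) \neq \emptyset \,\}\bigr| \]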
23. o DBpedia Spotlight is used as the baseline; the evaluation set had 1117 unique property-value pairs (features).
o 118 pairs (consisting of labelling properties and noisy features) were removed.
o The results convey that special care should be taken in deciding types for literals.
Evaluation – Type Generation
System | Mean Precision (MP) | Any Mean Precision (AMP) | Coverage
Our approach | 0.8290 | 0.8829 | 0.8529
Baseline | 0.4867 | 0.5825 | 0.5533
24. Evaluation – Summarization Metrics
[Metric formulas appear as figures; their captions:]
– Agreement: average pairwise overlap of the ideal summaries.
– Quality: average overlap between the system-generated summary and the ideal summaries.
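Since the formulas are figures, here is a reconstruction using the standard definitions from the entity summarization literature (RELIN, FACES), with Summ(e) the system summary, Summ_I^i(e) the i-th ideal summary, and n the number of ideal summaries for entity e; treat it as a hedged sketch:

\[ Agreement(e) = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \bigl| Summ_I^{i}(e) \cap Summ_I^{j}(e) \bigr| \]
\[ Quality(Summ(e)) = \frac{1}{n} \sum_{i=1}^{n} \bigl| Summ(e) \cap Summ_I^{i}(e) \bigr| \]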
25. o The gold standard consists of 20 random entities used in FACES, taken from DBpedia 3.9, and 60 random entities taken from DBpedia 2015-04.
o 17 human users created the ideal summaries (900 in total). Each entity received at least 4 ideal summaries for each summary length.
Evaluation – FACES-E Summary Generation
System | Avg. Quality (k = 5) | % Increase | Avg. Quality (k = 10) | % Increase
FACES-E | 1.5308 | – | 4.5320 | –
RELIN | 0.9611 | 59% | 3.0988 | 46%
RELINM | 1.0251 | 49% | 3.6514 | 24%
Avg. Agreement | 2.1168 | | 5.4363 |
k is the summary length.
26. o Consider the meaning of the property name when computing types.
o Literals and properties are noisy.
– Identify noisy ones automatically so they can be filtered out.
– Filter out labelling properties (automatic identification). This is hard.
o A formal model to capture semantic types for literals in RDF.
– Without changing their original representation (as literals).
Future Work
29. o Entities are described by features.
o Feature: a property-value pair.
o Feature Set: all the features that describe an entity.
o Entity Summary of size k: a subset of the feature set of an entity, constrained to size k.
Preliminaries
Entity – Marie Curie
Feature set FS:
Feature | Property | Value
f1 | spouse | Pierre_Curie
f2 | birthPlace | Warsaw
f3 | deathPlace | Passy_Haute-Savoie
f4 | almaMater | ESPCI_ParisTech
f5 | workInstitutions | University_of_Paris
f6 | knownFor | Radioactivity
f7 | field | Chemistry
Entity summaries for k=3: {f1, f2, f5}, {f4, f6, f7}, {f3, f4, f5}, …
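A minimal code sketch of these preliminaries, using the Marie Curie feature set from the table above:

from itertools import combinations

FS = {  # feature set of Marie Curie: feature id -> (property, value)
    "f1": ("spouse", "Pierre_Curie"),
    "f2": ("birthPlace", "Warsaw"),
    "f3": ("deathPlace", "Passy_Haute-Savoie"),
    "f4": ("almaMater", "ESPCI_ParisTech"),
    "f5": ("workInstitutions", "University_of_Paris"),
    "f6": ("knownFor", "Radioactivity"),
    "f7": ("field", "Chemistry"),
}

# Any k-subset of the feature set is an entity summary of size k.
summaries_k3 = list(combinations(FS, 3))
print(len(summaries_k3))  # 35 candidate size-3 summaries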
30. Facets (partition)
Given an entity e, a set of facets F(e) of e is a partition of the feature set FS(e). That is, F(e) = {C1, C2, …, Cn} such that F(e) satisfies:
(i) Non-empty: ∅ ∉ F(e).
(ii) Collectively exhaustive: C1 ∪ C2 ∪ … ∪ Cn = FS(e).
(iii) Mutually (pairwise) disjoint: if Ci ≠ Cj, then Ci ∩ Cj = ∅.
Faceted entity summary
Given an entity e and a positive integer k < |FS(e)|, a faceted entity summary of e of size k, FSumm(e,k), is a collection of features such that FSumm(e,k) ⊂ FS(e) and |FSumm(e,k)| = k. Further, either (i) k > |F(e)| and ∀X ∈ F(e), X ∩ FSumm(e,k) ≠ ∅, or (ii) k ≤ |F(e)| and ∀X ∈ F(e), |X ∩ FSumm(e,k)| ≤ 1 holds, where F(e) is a set of facets of FS(e).
Faceted Entity Summary
Faceted entity summary,
k=2: {f1, f6}
k=3: {f1, f2, f6}
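A small sketch that checks the constraints of this definition, with facets encoded as a list of sets of feature ids (a toy encoding, not the authors' code):

def is_faceted_summary(summ, facets, k):
    FS = set().union(*facets)
    if not (summ < FS and len(summ) == k):  # FSumm(e,k) ⊂ FS(e), |FSumm(e,k)| = k
        return False
    if k > len(facets):                     # (i): every facet is represented
        return all(summ & X for X in facets)
    return all(len(summ & X) <= 1 for X in facets)  # (ii): at most one per facet

# Toy usage:
facets = [{"f1"}, {"f2", "f3"}, {"f4", "f5"}, {"f6", "f7"}]
print(is_faceted_summary({"f1", "f2", "f6"}, facets, 3))  # -> True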
Speaker notes
Property facts taken from DBpedia 3.9 statistics (I believe)
Easily typed literal examples: "California", "Greece", etc., related to the location property.
We talk about entity summarization as a use case from here onwards, together with typing.
DBpedia 2.0 had 1.95 million things and DBpedia 2015-04 has 5.9 million things.
LOD had 295 datasets in 2011 and in 2014 it had 1014 datasets.
Entity – A real world thing (e.g., person, book, place) at the data level that encapsulates facts and is represented by a URI.
Entity summary is a subset of facts that represent the entity.
Explain conceptual idea -> we want facts talking about one “theme” to be grouped together.
Conceptually similar features are colored in the same color.
Refer to the previous slide on grouping using types. Explain with an example: the almaMater and workInstitutions properties both talk about places.
Possible reasons for creating most of the literals instead of URI resources:
(i) the creator was unable to find a suitable entity URI for the object value, and hence chose to use a literal instead,
(ii) the creator of the triple did not want to attach more details to the value and hence represented it in plain text,
(iii) the value contains only basic implementation types like integer, boolean, and date, and hence not meaningful to create an entity, or
(iv) the value has a lengthy description spanning several sentences (e.g., dbo:abstract property in DBpedia) that covers a diverse set of entities and facts.
Option 1 or option 2 seems to be the right pick (one of them).
Another example: "48th Governor of Massachusetts" → person and populated place.
Another one: "United States Ambassador to the United Nations".
We used the DBpedia 2015-04 dataset at the time of processing.
For numeric and date values, we cannot derive more than these types.
("Governor" matches the class Governor; the entity Governor is of type owl:Thing.)
Case 2:
"Senate" matches an entity of class Thing, so it did not get a type from step 1.
"United States Senate" matches Legislature.
The focus term of "Harvard Law School" is "School"; the 3-gram leads to the entity Harvard_Law_School → Educational Institution.
Head word detection – Collins' head word detection algorithm.
Directly match the head word to a class.
Match n-grams and the head word to class labels; or else, match entities to the n-grams and head word, and then get their types.
Semantic matching of the head word using the UMBC matching service.
Inf(f)' – counts the number of entities having the feature; the property should match, but the value has to contain the most popular entity of the input feature's value.
Po(v)' – counts the number of triples whose matching feature contains the most popular entity of the value.
F(e) is the facet set
Colored parts are the new additions/modifications
We included superclasses other than owl:Thing in this evaluation, for both the baseline and our approach.
Recall is not measured because it is hard to do so (too many pairs to check).
For DBpedia 2015-04 version
Summ(e) is the system generated summary
SummI(e) is the ideal summary
For this evaluation we used both DBpedia 3.9 and DBpedia 2015-04 versions.
Labelling properties should not be typed; they are probably there just for referencing as a label. For example, typing the value of the "name" property as a Person looks odd.