Knowledge Graph Construction and the Role of DBpedia
1. KNOWLEDGE GRAPHS AND THE ROLE OF DBPEDIA
Paul Groth @pgroth
pgroth.com
Thanks to Joao Moura
Elsevier Labs @elsevierlabs
6th DBpedia Community Meeting in The Hague 2016
Feb. 12, 2016
4. ELSEVIER LABS - INTRO
WORLD LEADER IN DIGITAL INFO SOLUTIONS
• Published over 330,000 articles in 2013
• Founded over 130 years ago
• Work with over 30 million scientists, students, health & information professionals
• Employ over 7,000 employees in 24 countries
• Received over 1 million submissions in 2013
• Over the last 50 years the majority of Nobel Laureates have published with Elsevier
• Over 53 million items indexed by Scopus
• Elsevier eBooks, Online Journals, Databases: publishes over 2,200 online journals & over 10,000 e-books
SOLUTIONS
• Elsevier R+D Solutions: Helps corporate researchers, R+D professionals, and engineers improve how they interact with, share, and apply information to solve problems using our digital workflow tools, analytics, and data.
• Elsevier Research Intelligence: Provides universities, governments, and research institutions with the resources and insights to improve institutional research strategy, management, and performance.
• Elsevier Clinical Solutions: Helps medical professionals apply trusted data and sophisticated tools to make better clinical decisions, deliver better care, and produce better healthcare outcomes.
• Elsevier Education: Helps educate highly-skilled, effective healthcare professionals, using the most advanced pedagogical tools and reference works.
CONTENT
CAPABILITIES
PLATFORMS
9. BUILDING BETTER TAXONOMIES
• Ontologies and taxonomies help organize and query content
• Annotation
• Classification / Navigation
• Autocomplete
• Suggestion & Recommendation
• We have lots of taxonomies/ontologies
• Journal Classification for Scopus
• Mendeley classification system
• ScienceDirect Subject classification
• Reference Modules Hierarchies for Books
• Submission system Journal classifications
• …
• Connect to external ontologies (e.g. MeSH)
• Ontology Maintenance, Usage and Mapping
11. TAXONOMY INDUCTION
Starting with a very shallow hierarchy of syntactical concepts with almost no intersections:
1. Match concepts against a well-accepted target taxonomy and DBpedia.
• Problems: the same concept may have different names or terminology in different branches; multiple languages; etc.
2. Check for partial orders between these concepts, using the hierarchy of the target taxonomy and DBpedia (skos:broader).
3. Find and complete missing links between concepts.
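Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used at Elsevier: the BROADER map, its concept names and the helper functions are all hypothetical, and the map stands in for the broader-than links one would read from the target taxonomy or DBpedia.

```python
from collections import deque

# Hypothetical toy hierarchy: concept -> set of direct broader concepts.
# In practice these links would come from the target taxonomy or DBpedia.
BROADER = {
    "model checking": {"formal methods"},
    "theorem proving": {"formal methods"},
    "formal methods": {"software engineering"},
    "software engineering": {"computer science"},
}


def ancestors(concept, broader):
    """Return every concept reachable by following broader links."""
    seen = set()
    queue = deque(broader.get(concept, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # also guards against cycles in a noisy hierarchy
        seen.add(current)
        queue.extend(broader.get(current, ()))
    return seen


def is_narrower_than(child, parent, broader):
    """Step 2: is `child` below `parent` in the partial order?"""
    return parent in ancestors(child, broader)


def missing_links(flat_concepts, broader):
    """Step 3: propose (child, parent) pairs among our own concepts
    that the reference hierarchy implies."""
    return sorted(
        (child, parent)
        for child in flat_concepts
        for parent in flat_concepts
        if child != parent and is_narrower_than(child, parent, broader)
    )


if __name__ == "__main__":
    ours = ["model checking", "theorem proving", "computer science"]
    for child, parent in missing_links(ours, BROADER):
        print(child, "is narrower than", parent)
```

The proposed pairs are candidates only; a curator would still review them before they are added to the shallow hierarchy.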
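12. EXAMPLE
Given two concepts, check whether they form a parent-child relation:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
select distinct * where {
# Resolve the child concept through redirects, then fetch its categories
<http://dbpedia.org/resource/Model-checking>
dbo:wikiPageRedirects* ?conceptChild .
?conceptChild dbo:wikiPageRedirects* ?redirectedChild .
?redirectedChild dct:subject ?subjectChild .
# Same for the candidate parent concept
<http://dbpedia.org/resource/Formal_methods>
dbo:wikiPageRedirects* ?conceptParent .
?conceptParent dbo:wikiPageRedirects* ?redirectedParent .
?redirectedParent dct:subject ?subjectParent .
# A category of the child must have a category of the parent as skos:broader
?subjectChild skos:broader ?subjectChildsParent .
filter(?subjectChildsParent = ?subjectParent)
}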
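To run the same check for many concept pairs, the query can be generated from labels. The sketch below only builds the query text and does not contact any endpoint. The function names are hypothetical, the label-to-IRI rule (spaces to underscores, then percent-encoding) is a simplifying assumption about DBpedia resource naming, and ASK is used instead of SELECT because only a yes/no answer is needed.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

# Doubled braces are literal braces for str.format.
QUERY_TEMPLATE = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
ASK WHERE {{
  <{child}> dbo:wikiPageRedirects* ?redirectedChild .
  ?redirectedChild dct:subject ?subjectChild .
  <{parent}> dbo:wikiPageRedirects* ?redirectedParent .
  ?redirectedParent dct:subject ?subjectParent .
  ?subjectChild skos:broader ?subjectParent .
}}
"""


def resource_iri(label):
    """Turn a label such as 'Formal methods' into a resource IRI.

    Assumption: spaces become underscores and other unsafe characters
    are percent-encoded; real DBpedia naming has more rules than this.
    """
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"), safe="_-(),'")


def parent_child_query(child_label, parent_label):
    """Build an ASK query testing whether the child concept has a category
    whose skos:broader is a category of the parent concept."""
    return QUERY_TEMPLATE.format(
        child=resource_iri(child_label),
        parent=resource_iri(parent_label),
    )


if __name__ == "__main__":
    print(parent_child_query("Model-checking", "Formal methods"))
```

The resulting string can be sent to a SPARQL endpoint with whichever client is already in use.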
15. ANNOTATION
• http://www.slideshare.net/SparkSummit/dictionary-based-annotation-at-scale-with-spark-by-sujit-pal
• What is the problem?
• Annotate millions of documents from different corpora.
• 14M docs from ScienceDirect alone.
• More from other corpora, dependency parsing, etc.
• Critical step for Machine Reading and Knowledge Graph applications.
• Why is this such a big deal?
• Takes advantage of existing linked data.
• No model training for multiple complex STM domains.
• However, it is only simple until done at scale.
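A single-machine sketch shows why dictionary-based annotation needs no model training. It uses greedy longest-match lookup over word tokens. The dictionary entries and the "ex:" identifiers are made up for illustration, and a plain Python dict stands in for the index a production system would use.

```python
import re

# Hypothetical dictionary: surface form -> entity identifier.
DICTIONARY = {
    "glaucoma": "ex:Glaucoma",
    "visual field": "ex:VisualField",
    "visual field loss": "ex:VisualFieldLoss",
}

TOKEN = re.compile(r"\w+")


def build_index(dictionary):
    """Map lower-cased token tuples to entity ids; also return the
    length in tokens of the longest dictionary entry."""
    index = {}
    for surface, entity_id in dictionary.items():
        key = tuple(token.lower() for token in TOKEN.findall(surface))
        if key:
            index[key] = entity_id
    max_len = max((len(key) for key in index), default=0)
    return index, max_len


def annotate(text, index, max_len):
    """Return (start, end, matched_text, entity_id) tuples, preferring
    the longest dictionary entry that starts at each token."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]
    annotations = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token[0] for token in tokens[i:i + n])
            if key in index:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                annotations.append((start, end, text[start:end], index[key]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return annotations


if __name__ == "__main__":
    index, max_len = build_index(DICTIONARY)
    for annotation in annotate("Glaucoma causes visual field loss.", index, max_len):
        print(annotation)
```

With millions of documents and millions of dictionary entries, the lookup structure and the distribution of work become the hard part, which is the point of the last bullet.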
17. DICTIONARY BASED NE ANNOTATOR (SODA)
• Part of Document Annotation Pipeline.
• Annotates text with Named Entities from external Dictionaries.
• Why we have to scale: Wikipedia KBs have 8 million entities
• Built with Open Source Components
• Apache Solr – Highly reliable, scalable and fault-tolerant search index.
• SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.
• Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.
• Apache Spark – Lightning fast, large scale data processing.
• Uses ideas from other Open Source libraries
• FuzzyWuzzy – Fuzzy String Matching like a boss.
• Contributed back to Open Source
• https://github.com/elsevierlabs-os/soda
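The fuzzy-matching idea can be illustrated with the Python standard library alone. The sketch below is a stand-in in the spirit of token-sort matching, not FuzzyWuzzy's or SoDA's actual API. The function names and the 85-point cut-off are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case, drop punctuation and sort the tokens alphabetically."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(sorted(tokens))


def token_sort_similarity(a, b):
    """Similarity in [0, 100] that ignores case, punctuation and word order."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return round(100 * ratio)


def best_match(mention, candidates, min_score=85):
    """Return (candidate, score) for the closest dictionary entry, or
    (None, score) when nothing reaches the assumed cut-off."""
    scored = [(token_sort_similarity(mention, c), c) for c in candidates]
    score, candidate = max(scored, default=(0, None))
    if score >= min_score:
        return candidate, score
    return None, score


if __name__ == "__main__":
    entries = ["glaucoma", "visual field loss", "cataract"]
    print(best_match("loss, visual field", entries))
    print(best_match("glaucom", entries))
```

Sorting tokens before comparison lets inverted forms such as "loss, visual field" match the dictionary entry exactly, while the character-level ratio tolerates small spelling differences.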
21. PREDICTED RELATIONS: GLAUCOMA
• At threshold = 0.08
• 22 unseen relations
• F1 = 0.71
• Applications beyond knowledge graph construction
• Taxonomy and ontology maintenance
• Entity search in task-specific and/or mobile context
• Question answering
glaucoma | developed many years after | chronic inflammation of uveal tract
glaucoma | develop following | chronic inflammation of uveal tract
glaucoma | can appear soon in | family history of glaucoma
glaucoma | can appear soon in | age over 40
glaucoma | the risk of | functional visual field loss
glaucoma | contributing causes of | functional visual field loss
glaucoma | contributed to | functional visual field loss
glaucoma | is considered the second leading cause of | functional visual field loss
glaucoma | remains the second leading cause of | functional visual field loss
This is a unique entity, not a string
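How a score threshold and an F1 figure relate can be shown with a small sketch. The triples, scores and gold set below are invented for illustration and are not the data behind the numbers on this slide. Keeping relations with score >= threshold is also an assumption about how the cut-off is applied.

```python
def select_relations(scored, threshold):
    """Keep every candidate relation whose score reaches the threshold."""
    return {triple for triple, score in scored.items() if score >= threshold}


def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall and F1 = 2PR / (P + R)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical scored candidates: (subject, relation, object) -> score.
    scored = {
        ("glaucoma", "cause_of", "functional visual field loss"): 0.31,
        ("glaucoma", "follows", "chronic inflammation of uveal tract"): 0.12,
        ("glaucoma", "risk_factor", "age over 40"): 0.09,
        ("glaucoma", "treated_by", "family history of glaucoma"): 0.02,
    }
    # Hypothetical gold standard for evaluation.
    gold = {
        ("glaucoma", "cause_of", "functional visual field loss"),
        ("glaucoma", "follows", "chronic inflammation of uveal tract"),
        ("glaucoma", "risk_factor", "family history of glaucoma"),
    }
    predicted = select_relations(scored, threshold=0.08)
    print(precision_recall_f1(predicted, gold))
```

Lowering the threshold admits more candidate relations, which tends to raise recall and lower precision; F1 summarises that trade-off in one number.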
22. A DBPEDIA IDEA?
• Connect to the Scholarly Ecosystem
• Crossref & DataCite DOIs + ORCIDs
23. CONCLUSION
• DBpedia and Wikipedia KBs are great reference sources
• Beyond expected use for…
• Internal knowledge curation
• Stress testing
• We’re hiring
“Mendeley Suggest” is our personalised article recommender. It is based on what users have in their libraries and recommends other related articles. It uses taxonomies.