1. Prof. Dr. Sören Auer
Knowledge Graphs Winter School
February 23rd, 2021
Introduction to Knowledge Graphs
2. Page 2
About me - Prof. Dr. Sören Auer
Now: Professor for Data Science and Digital Libraries, Leibniz University of Hannover
Director TIB Leibniz Information Center for Science & Technology
• With >500 employees, TIB is the largest science and technology information centre worldwide
• Strategy: organizing research data and information using knowledge graphs
• Member of the board of L3S research center – a world-leading responsible AI
Previously: U Bonn, Fraunhofer, U Leipzig, U Pennsylvania, Ural State Uni Ekaterinburg, TU Dresden
Publications in major venues: Web Conf., IJCAI, AAAI, ISWC, ESWC, K-CAP, TPDL, JWS, SWJ, JDIQ
H-index: 57, >21,000 citations, >15 best paper awards incl. test-of-time and 10-year awards
Major scientific contributions:
• Technology platforms: OntoWiki & DBpedia, LOD2 Linked Data and BigDataEurope software stacks
• Acquisition of >20M€ for my research groups in Leipzig, Bonn and Hannover
• Strategic projects: ERC ScienceGraph, LOD2, BigDataEurope, Marie Curie ITN WDAqua
Impact & Transfer: W3C standards, 5 students now professors, successful spin-off company,
portfolio of open-source software, Int. Data Spaces Initiative, Big Data Value Association
3. Page 3
Zuse Z3: the beginning of computing, close to the hardware
Photo: Konrad Zuse Internet Archiv / Deutsches Museum / DFG
5. Page 5
We can make things more intuitive
Picture: The illustrated recipes of Lucy Eldridge, http://thefoxisblack.com/2013/07/18/the-illustrated-recipes-of-lucy-eldridge/
11. Page 11
Machine Learning and Big Data
http://www.spacemachine.net/views/2016/3/datasets-over-algorithms
AI is not just the next hype after Big Data; Big Data is the reason why we have AI!
13. Page 13
Tackling the Variety Dimension
with the FAIR and Linked Data Principles
1. Use URIs to identify the “things” in your data
2. Use http:// URIs so people (and machines) can look
them up on the web
3. When a URI is looked up, return a description of the thing in the W3C Resource Description Framework (RDF)
4. Include links to related things
http://www.w3.org/DesignIssues/LinkedData.html
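Principles 2 and 3 can be sketched as an HTTP lookup with content negotiation. This is a minimal sketch, assuming the server honours the Accept header; the DBpedia URI is only an illustration, and the request is built but not sent so that the snippet runs offline.

```python
from urllib.request import Request


def build_rdf_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare an HTTP GET that asks for an RDF description of `uri`."""
    # The Accept header states which RDF serialization the client prefers.
    return Request(uri, headers={"Accept": media_type})


# Illustrative URI; any http(s) URI that identifies a "thing" would do.
request = build_rdf_request("http://dbpedia.org/resource/Paderborn")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
# Sending it would be urllib.request.urlopen(request); left out to stay offline.
```

A server that follows the principles would answer such a request with a description of the thing, including links to related things (principle 4).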
14. Page 14
RDF & Linked Data in a Nutshell
1. Graph-based RDF data model consisting of S-P-O (subject, predicate, object) statements (facts)
[Figure: graph with the nodes KnowGraphs, WinterSchool, dbpedia:Paderborn and 23.02.2021, connected by the edges conf:organizes, conf:starts and conf:takesPlaceIn]
2. Serialised as RDF triples (subject, predicate, object):
KnowGraphs conf:organizes WinterSchool .
WinterSchool conf:starts "2021-02-23"^^xsd:date .
WinterSchool conf:takesPlaceIn dbpedia:Paderborn .
3. Publication under a URL on the Web, intranet or extranet
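The three statements can also be held as plain data and written out one triple per line. A minimal sketch, assuming hypothetical namespace URIs for conf: and for the unprefixed names (the slide shows only the prefixes); the output follows the N-Triples line format, in which every term is written in full.

```python
from typing import Iterable, NamedTuple

# Hypothetical namespaces: the slide shows only the prefixes, not their URIs.
EX = "http://example.org/"
CONF = "http://example.org/conf/"
DBPEDIA = "http://dbpedia.org/resource/"
XSD = "http://www.w3.org/2001/XMLSchema#"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # a URI or a literal, already in N-Triples syntax


def uri(value: str) -> str:
    return f"<{value}>"


def typed_literal(value: str, datatype: str) -> str:
    return f'"{value}"^^<{datatype}>'


graph = [
    Triple(uri(EX + "KnowGraphs"), uri(CONF + "organizes"), uri(EX + "WinterSchool")),
    Triple(uri(EX + "WinterSchool"), uri(CONF + "starts"),
           typed_literal("2021-02-23", XSD + "date")),
    Triple(uri(EX + "WinterSchool"), uri(CONF + "takesPlaceIn"),
           uri(DBPEDIA + "Paderborn")),
]


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize one statement per line, each terminated by ' .'."""
    return "\n".join(f"{t.subject} {t.predicate} {t.obj} ." for t in triples)


print(to_ntriples(graph))
```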
15. Page 15
Creating Knowledge Graphs with RDF / Linked Data
[Figure: example graph with the nodes DHL, Post Tower, Bonn and Logistics, the values "DHL International GmbH", "162.5 m", "Logistik" and "物流", and the edge labels full name, industry, headquarters, height, located in and label]
16. Page 16
RDF Data Model (a bit more technical)
Graph consists of:
• Resources (identified via URIs)
• Literals: data values with a data type (URI) or a language tag (multilinguality integrated)
• Attributes of resources are also URI-identified (from vocabularies)
• Various data sources and vocabularies can be arbitrarily mixed and meshed
• URIs can be shortened with namespace prefixes; e.g. dbp: → http://dbpedia.org/resource/
[Figure: the example graph in RDF terms, with the resources dbp:DHL_International_GmbH, dbp:Post_Tower, dbp:Bonn, dbp:Logistics and unit:Meter, the literals "DHL International GmbH"^^xsd:string, "162.5"^^xsd:decimal, "Logistik"@de and "物流"@zh, and the properties foaf:name, dbo:industry, ex:headquarters, gn:locatedIn, ex:height, rdf:value, ex:unit and rdfs:label]
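Prefix expansion and the two kinds of literals can be sketched in a few lines. The dbp: mapping is the one given on the slide; the rdfs:, foaf: and xsd: URIs are the standard namespaces; ex: is a hypothetical example namespace. The helper names expand and literal are illustrative only.

```python
from typing import Optional

PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",            # given on the slide
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "ex": "http://example.org/",                      # hypothetical namespace
}


def expand(curie: str) -> str:
    """Turn a prefixed name such as 'dbp:Bonn' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {curie!r}")
    return PREFIXES[prefix] + local


def literal(value: str, datatype: Optional[str] = None,
            lang: Optional[str] = None) -> str:
    """Format a literal with either a datatype or a language tag."""
    if datatype and lang:
        raise ValueError("a literal has either a datatype or a language tag")
    if lang:
        return f'"{value}"@{lang}'
    if datatype:
        return f'"{value}"^^<{expand(datatype)}>'
    return f'"{value}"'


print(expand("dbp:Post_Tower"))
print(literal("162.5", datatype="xsd:decimal"))
print(literal("Logistik", lang="de"))
```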
17. Page 17
Knowledge Graph Example: DBpedia
• Automatically extracted
from Wikipedia infoboxes
• Crystallization point
of the LOD Cloud
https://lod-cloud.net/
18. Vocabularies – Breaking the mold!
• Semantic data virtualization allows for continuous expansion and enhancement of data and metadata across data sources without losing the overall perspective
Relational data models: 1:1 relation between data model and application
Graph-based data model: 1:n relation between data model and application
[Figure: chain of Subject → Predicate → Object/Subject → Predicate → Object/Subject]
19. RDF mediates between different Data Models & bridges between Conceptual and Operational Layers
Tabular/Relational Data
Id    Title    Screen
5624  SmartTV  104cm
5627  Tablet   21cm
Prod:5624 rdf:type Electronics
Prod:5624 rdfs:label "SmartTV"
Prod:5624 hasScreenSize "104"^^unit:cm
...
Taxonomic/Tree Data
[Tree: Electronics; Vehicle with the children Car, Bus, Truck]
Vehicle rdf:type owl:Thing
Car rdfs:subClassOf Vehicle
Bus rdfs:subClassOf Vehicle
...
Logical Axioms / Schema
Male rdfs:subClassOf Human
Female rdfs:subClassOf Human
Male owl:disjointWith Female
...
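The tabular case can be sketched as a row-to-triples function. This is a sketch under assumptions: every row of the table is typed as Electronics (the slide shows this only for Prod:5624), and the screen value always ends in a two-letter unit.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

rows = [
    {"Id": "5624", "Title": "SmartTV", "Screen": "104cm"},
    {"Id": "5627", "Title": "Tablet", "Screen": "21cm"},
]


def row_to_triples(row: Dict[str, str], rdf_class: str = "Electronics") -> List[Triple]:
    """Lift one table row into subject-predicate-object statements."""
    subject = f"Prod:{row['Id']}"  # the primary key becomes the subject
    # Split "104cm" into the number "104" and the unit "cm".
    size, unit = row["Screen"][:-2], row["Screen"][-2:]
    return [
        (subject, "rdf:type", rdf_class),
        (subject, "rdfs:label", f'"{row["Title"]}"'),
        (subject, "hasScreenSize", f'"{size}"^^unit:{unit}'),
    ]


triples = [t for row in rows for t in row_to_triples(row)]
for s, p, o in triples:
    print(s, p, o)
```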
20. Page 20
Example: Mapping of Research Data to Ontologies
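Data Layer
Source table 1 (German column names):
Krankheit (disease)  Symptom (symptom)   Prävalenz (prevalence)
Grippe (influenza)   Fieber (fever)      1000
Krebs (cancer)       Blutung (bleeding)  30
...                  ...                 ...
Source table 2:
Disease    ICD-10   Symptoms  Medication
Influenza  J10      Fever     Amantadin
Cancer     C00-C97  Bleeding  Chemotherapy
...        ...      ...       ...
Vocabulary Layer
Concepts: Symptom, Disease, Drug
Attributes: ICD-10 Code, Prevalence, Type, Name, Classification
Relations: hasSymptom, hasTreatment
Mappings connect the columns of the data layer to the concepts, attributes and relations of the vocabulary layer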
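Such a mapping can be sketched as two lookup tables, one for columns and one for values. The concrete assignments below (for example Prävalenz → prevalence, Grippe → Influenza) and the function name lift are assumptions for illustration; the slide shows only that mappings exist, not how they are written.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, int]
Triple = Tuple[str, str, Value]

# Assumed schema mapping: source column -> vocabulary property.
COLUMN_TO_PROPERTY = {"Symptom": "hasSymptom", "Prävalenz": "prevalence"}
# Assumed value mapping: source term -> shared concept name.
VALUE_TO_CONCEPT = {
    "Grippe": "Influenza", "Krebs": "Cancer",
    "Fieber": "Fever", "Blutung": "Bleeding",
}

source_rows = [
    {"Krankheit": "Grippe", "Symptom": "Fieber", "Prävalenz": 1000},
    {"Krankheit": "Krebs", "Symptom": "Blutung", "Prävalenz": 30},
]


def lift(row: Dict[str, Value]) -> List[Triple]:
    """Map one source row onto the vocabulary layer."""
    disease = VALUE_TO_CONCEPT[row["Krankheit"]]
    triples: List[Triple] = [(disease, "rdf:type", "Disease")]
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row[column]
        # Terms with a known concept are replaced; other values pass through.
        triples.append((disease, prop, VALUE_TO_CONCEPT.get(value, value)))
    return triples


for row in source_rows:
    print(lift(row))
```

After lifting, rows from the German table and from the English table describe the same Disease resources and can be queried together.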
22. Page 22
Vocabulary Example: Schema.org
• collaborative, community activity to create, maintain, and promote schemas for structured data on the Internet
• can be used with many different encodings, including RDFa, Microdata and JSON-LD
• covers entities, relationships between entities and actions
• can easily be extended through a well-documented extension model
• >10 million sites use Schema.org to mark up their web pages and email messages
• Founded by Google, Microsoft, Yahoo and Yandex
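As an illustration of the JSON-LD encoding, the winter school itself could be described with the Schema.org types Event and Place. This is a sketch: the property values are taken from the title slide and the RDF example, and the choice of properties is an assumption, not markup published by the organizers.

```python
import json

# Event, Place, name, startDate and location are Schema.org terms.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Knowledge Graphs Winter School",
    "startDate": "2021-02-23",
    "location": {"@type": "Place", "name": "Paderborn"},
}

markup = json.dumps(event, indent=2)
# JSON-LD is usually embedded in a web page inside a script element.
html_snippet = f'<script type="application/ld+json">\n{markup}\n</script>'
print(html_snippet)
```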
23. The Semantic Web Layer Cake 2001
http://www.w3.org/2001/10/03-sww-1/slide7-0.html
• Monolithic, based on XML
• Focus on heavyweight semantics (ontologies, logic, reasoning)
24. The Semantic Web Layer Cake now – Bridging between Data
[Figure: layer cake. Base: Unicode, URIs. Source formats XML, JSON, CSV, RDB and HTML are bridged to RDF via RDF/XML, JSON-LD, CSV2RDF, R2RML and RDFa. On top of RDF: RDF Data Shapes, RDF-Schema, Vocabularies, SKOS Thesauri, Ontologies, Logic, Rules, SPARQL. Alongside: (access control), signature, encryption (HTTPS/CERT/DANE)]
• Lingua franca of data integration with many technology interfaces (XML, HTML, JSON, CSV, RDB, …)
• Focus on lightweight vocabularies, rules, thesauri etc.
• Less “invasive”
25. RDF - the Lingua Franca of Data Integration
• RDF is simple
• We can easily encode and combine all kinds of data models (relational, taxonomic, graphs,
object-oriented, …)
• RDF supports distributed data and schema
• We can seamlessly evolve simple semantic representations (vocabularies) to more complex
ones (e.g. ontologies)
• Small representational units (URI/IRIs, triples) facilitate mixing and mashing
• RDF can be viewed from many perspectives: facts, graphs, ER, logical axioms, objects
• RDF integrates well with other formalisms - HTML (RDFa), XML (RDF/XML), JSON (JSON-LD),
CSV, …
• Linking and referencing between different knowledge bases, systems and platforms facilitates
the creation of sustainable data ecosystems
26. Page 26
Knowledge Graphs – A definition
• Fabric of concept, class, property, relationships, entity descriptions
• Uses a knowledge representation formalism (typically RDF, RDF-Schema, OWL)
• Holistic knowledge (multi-domain, source, granularity):
  • instance data (ground truth),
  • open (e.g. DBpedia, WikiData), private (e.g. supply chain data), closed data (product models),
  • derived, aggregated data,
  • schema data (vocabularies, ontologies),
  • meta-data (e.g. provenance, versioning, documentation, licensing),
  • comprehensive taxonomies to categorize entities,
  • links between internal and external data,
  • mappings to data stored in other systems and databases
Smart Data for Machine Learning
27. Page 27
Knowledge Graph Creation
Manual
• Curation / Crowdsourcing
Markup
• schema.org
Mapping Structured Data
• R2RML/RML
Leveraging Natural Language Processing (NLP) from text
• Named Entity Recognition
• Relation Extraction
Ignaz Wanders: Build your own Knowledge Graph: From unstructured dark data to valuable business insights, https://medium.com/vectrconsulting/build-your-own-knowledge-graph-975cf6dde67f
28. Page 28
Querying Knowledge Graphs
Graph Patterns
Corresponding SPARQL Query:
PREFIX : <http://example.org/>  # example namespace (assumption)
SELECT ?ev ?vn1 ?vn2 WHERE {
  ?ev a :Food_Festival .
  ?ev :venue ?vn1 .
  ?ev :venue ?vn2 .
}
A. Hogan, E. Blomqvist, M. Cochez, C. d'Amato, G. de Melo, C. Gutierrez, J. E. Labra Gayo, S. Kirrane, S. Neumaier, A. Polleres, R. Navigli, A.-C. Ngonga Ngomo, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann: Knowledge Graphs, arXiv:2003.02320 [cs.AI]
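How a graph pattern is matched against the data can be sketched without a SPARQL engine. The following is a naive nested-loop evaluation of a basic graph pattern over an in-memory list of triples; the festival data is hypothetical, and real engines use indexes and join optimization instead.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None  # variable is already bound to another value
        elif term != value:
            return None  # constant does not match
    return extended


def evaluate(patterns: List[Triple], graph: List[Triple]) -> List[Binding]:
    """Join the triple patterns one after another."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions: List[Binding] = []
        for binding in solutions:
            for triple in graph:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Hypothetical data for illustration.
graph = [
    ("ex:FestivalA", "a", "Food_Festival"),
    ("ex:FestivalA", "venue", "ex:Park"),
    ("ex:FestivalA", "venue", "ex:Harbour"),
    ("ex:ConcertB", "a", "Concert"),
    ("ex:ConcertB", "venue", "ex:Hall"),
]
query = [("?ev", "a", "Food_Festival"), ("?ev", "venue", "?vn1"), ("?ev", "venue", "?vn2")]

for solution in evaluate(query, graph):
    print(solution)
```

Note that ?vn1 and ?vn2 may bind to the same venue, as in SPARQL; excluding such solutions would need an additional filter.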
30. Page 30
Knowledge Graph Refinement
Completion
• Filling missing edges
• Often addressed with link prediction
• Special tasks: type and identity prediction
Correction
• Fact validation
• Inconsistency repair
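One simple form of inconsistency detection, as a first step towards repair, checks class assignments against disjointness axioms such as Male owl:disjointWith Female. A minimal sketch with hypothetical instance data; it only reports violations and leaves the repair decision open.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

# Disjointness axioms, each stored as an unordered pair of class names.
disjoint: Set[FrozenSet[str]] = {frozenset({"Male", "Female"})}

# Hypothetical rdf:type assertions per entity.
types: Dict[str, Set[str]] = {
    "ex:alice": {"Female", "Human"},
    "ex:bob": {"Male", "Female"},  # deliberately inconsistent
}


def find_violations(types: Dict[str, Set[str]],
                    disjoint: Set[FrozenSet[str]]) -> List[Tuple[str, str, str]]:
    """Return (entity, class_a, class_b) for every violated axiom."""
    violations = []
    for entity, classes in sorted(types.items()):
        for a, b in combinations(sorted(classes), 2):
            if frozenset({a, b}) in disjoint:
                violations.append((entity, a, b))
    return violations


print(find_violations(types, disjoint))
```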
31. Page 31
Knowledge Graph Quality
[1] Zaveri, Rula, Maurino, Auer, Lehmann: Quality Assessment for Linked Open Data. Semantic Web Journal, 2015
Data quality is “fitness for use”
Use cases vary, hence various quality criteria/measures, organized along various dimensions
Availability
A1: server responds to a SPARQL query
A2: RDF dump is available
A3: detection of dereferenceability of URIs
A4: HTTP response header with appropriate content type
A5: dereferenceability of all forward links
Completeness
CM1: schema completeness: ratio of represented classes/properties
CM2: property completeness
CM3: population completeness: ratio of real-world objects
CM4: interlinking completeness: ratio of interlinked instances
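A metric such as CM4 reduces to a simple ratio once "interlinked" is defined. The sketch below assumes that an instance counts as interlinked if it has at least one owl:sameAs link; the survey may define the measure differently, and the data is hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def interlinking_completeness(graph: List[Triple], instances: Set[str],
                              link_predicate: str = "owl:sameAs") -> float:
    """Share of instances with at least one outgoing link of the given kind."""
    if not instances:
        return 0.0
    linked = {s for s, p, _ in graph if p == link_predicate and s in instances}
    return len(linked) / len(instances)


# Hypothetical data: two of the four instances carry a sameAs link.
graph = [
    ("ex:Bonn", "owl:sameAs", "dbp:Bonn"),
    ("ex:Bonn", "rdfs:label", '"Bonn"'),
    ("ex:Paderborn", "owl:sameAs", "dbp:Paderborn"),
    ("ex:Hannover", "rdfs:label", '"Hannover"'),
    ("ex:Leipzig", "rdfs:label", '"Leipzig"'),
]
instances = {"ex:Bonn", "ex:Paderborn", "ex:Hannover", "ex:Leipzig"}

print(interlinking_completeness(graph, instances))
```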
33. Page 33
Instances in DBpedia & Wikidata
Knowledge Graphs on the Web -- an Overview
N. Heist, S. Hertling, D. Ringler, H. Paulheim
34. Page 34
Search Engine Optimization & Web-Commerce
Schema.org used by >20% of Web sites
Major search engines exploit semantic descriptions
Pharma, Lifesciences
Mature, comprehensive vocabularies and ontologies
Billions of disease, drug, clinical trial descriptions
Digital Libraries
Many established vocabularies (DublinCore, FRBR,
EDM)
Millions of records aggregated from thousands of memory institutions in Europeana, German Digital Library
Emerging Knowledge Graphs & Data Spaces
35. Page 35
Initiatives for decentral, semantic data spaces
 | Web/Ecommerce | Digital Libraries | Life Sciences | Industry | Open Government Data
Vocabularies | schema.org | Europeana Data Model | DCAT, DC, PROV-O, FOAF, VoiD | DCAT, IDS Vocabulary | DCAT
Participants | ~30% of Web pages | Memory institutions (2000 in Germany) | Pharma companies | 80 companies (SAP, Siemens, Telekom, PWC) | EU, countries, cities, counties
License / Governance | CC-BY-SA; GitHub, Google, Microsoft, Yandex... | CC0; Europeana Association | CC-BY-SA | IDS Association | Open Data
Applications | Google Knowledge Graph (products, persons, ...) | DDB.de, Europeana.eu | OpenPhacts.org | Industrial Data Space | Transparency, mobility, budget, planning
36. Page 36
The Trinity of Semantic Integration
Knowledge Graphs
• Complex fabric of concepts
& relationships
• Focus on heterogeneous, multi-domain knowledge representation
Data Spaces
• Community of
organizations agreeing on
standards for data access/
security/ semantics/
governance/ licenses
• Focus on data sharing &
exchange
Semantic Data Lakes
• Storage facility for
enterprise/research data
• Use Big Data (HDFS)
management
• Focus on scalable data
access
Use in a single organization vs. inter-organizational use
38. Page 38
Knowledge Graph Challenges & Opportunities
Knowledge graphs typically cover
• Multiple domains
• Various levels of granularity
• Data from multiple sources
• Various degrees of structure
Challenges
• Quality
• Coherence
• Co-evolution
• Update propagation
• Curation & interaction
Opportunities
• Background knowledge for various
applications (e.g. question answering,
data integration, machine learning)
• Facilitate intra-organizational data
exchange (data value chains)
Knowledge Graphs on the Web -- an Overview
N. Heist, S. Hertling, D. Ringler, H. Paulheim
DBpedia YAGO WikiData BabelNet
Cyc NELL CaLiGraph Voldemort
39. Page 39
Comparison of various enterprise data integration paradigms
Paradigm | Data model | Integr. strategy | Conceptual/operational | Heterogeneous data | Intern./extern. data | No. of sources | Type of integr. | Domain coverage | Semantic repres.
XML Schema | DOM trees | LaV | operational | [?] | [?] | medium | both | medium | high
Data Warehouse | relational | GaV | operational | - | partially | medium | physical | small | medium
Data Lake | various | LaV | operational | [?] | [?] | large | physical | high | medium
MDM | UML | GaV | conceptual | - | - | small | physical | small | medium
PIM / PCS | trees | GaV | operational | partially | partially | - | physical | medium | medium
Enterprise search | document | - | operational | [?] | partially | large | virtual | high | low
EKG | RDF | LaV | both | [?] | [?] | medium | both | high | very high
[?] marks a cell whose content did not survive in the slide text.
[1] M. Galkin, S. Auer, M.-E. Vidal, S. Scerri: Enterprise Knowledge Graphs: A Semantic Approach for Knowledge
Management in the Next Generation of Enterprise Information Systems. ICEIS (2) 2017: 88-98
42. Page 42
Perspectives on data turn into silos
Parts of data are being curated, duplicated, annotated and simply
changed over time, making reconciliation and interpretation a challenge
Engineering Manufactur. Logistics Marketing
. . .
44. Page 44
Knowledge Graph based Enterprise Data Innovation Architecture
The future of data management is semantic!
The Problem today
• Data access limited to connected source
• Exploding cost of ETL
Enterprise Integration with a Semantic Data Lake
• Full access to all data
• Lean architecture
• Great synergies in data lifting
[Figure: both architectures shown with the applications App. 1, App. 2 and App. 3]
45. Page 45
High Level Architecture Corporate Memory
• Inbound Raw Data Store
• Knowledge Graph for Meta Data, KPI Definition and Data Models
• Frontend to Access Relationship and KPI Definition / Documentation
• Frontend to Access (ad hoc) Reports
• Outbound Data Delivery to Target Systems
[Figure labels: Inbound Data Sources; Outbound and Consumption; Management, Accounting, Risk Management, Regulatory Reporting, Treasury, Marketing; Corporate Memory; Big Data DWH Infrastructure]
49. Page 49
How did information flows change
in the digital era?
50. Page 50
How does it work today?
The World of Publishing & Communication has profoundly changed
• New means adapted to the new possibilities were developed, e.g. "zooming", dynamics
• Business models changed completely
• More focus on data, interlinking of data / services and
search in the data
• Integration, crowdsourcing, data curation play an
important role
52. Page 52
Scholarly Communication has not changed
(much)
17th century 19th century 20th century 21st century
Meanwhile, other information-intense domains were completely disrupted:
53. Page 53
Challenges we are facing:
• Digitalisation of Science: data integration and analysis, digital collaboration
• Monopolisation by commercial actors: publisher lock-in effects, maximization of profits [1]
• Reproducibility Crisis: the majority of experiments are hard or impossible to reproduce [2]
• Proliferation of publications: publication output doubled within a decade and continues to rise [3]
• Deficiency of Peer Review: deteriorating quality [4], predatory publishing
We need to rethink the way research is represented and communicated
[1] http://thecostofknowledge.com, https://www.projekt-deal.de
[2] M. Baker: 1,500 scientists lift the lid on reproducibility, Nature, 2016.
[3] Science and Engineering Publication Output Trends, National Science Foundation, 2018.
[4] J. Couzin-Frankel: Secretive and Subjective, Peer Review Proves Resistant to Study. Science, 2013.
54. Page 54
Root Cause – Deficiency of Scholarly Communication?
Lack of…
• Transparency: information is hidden in text
• Integrability: fitting different research results together
• Machine assistance: unstructured content is hard to process
• Identifiability of concepts beyond metadata
• Collaboration: one brain barrier
• Overview: scientists look for the needle in the haystack
55. Page 55
How good is CRISPR (with respect to precision, safety, cost)?
What are the specifics of genome editing in insects?
Who has applied it to butterflies?
Search for CRISPR: > 238,000 results
Source: https://scholar.google.de/scholar?hl=de&as_sdt=0%2C5&q=CRISPR&btnG=, 04.2019
58. Page 58
Towards Cognitive Knowledge Graphs
KGs are proven to capture factual knowledge
Research Challenge: Manage
• Uncertainty & disagreement
• Varying semantic granularity
• Emergence, evolution & provenance
• Integrating existing domain models
But maintain flexibility and simplicity
Cognitive Knowledge Graphs for scholarly knowledge
• Fabric of knowledge molecules: compact, relatively simple, structured units of knowledge
• Can be incrementally enriched, annotated, interlinked …
59. Page 59
From Factual Knowledge Graphs
 | Factual (today)
Base entities | Real world
Granularity | Atomic entities
Evolution | Addition/deletion of facts
Collaboration | Fact enrichment
60. Page 60
From Factual to Cognitive Knowledge Graphs
 | Factual (today) | Cognitive (needed for SKG)
Base entities | Real world | Conceptual
Granularity | Atomic entities | Interlinked descriptions (molecules) with annotations (provenance)
Evolution | Addition/deletion of facts | Concept drift, varying aggregation levels
Collaboration | Fact enrichment | Emergent semantics
62. Page 62
Chemistry Example: Populating the Graph
1. Original Publication
2. Adaptive Graph Curation & Completion
Author: Robert Reed
Research Problem: Genome editing in Lepidoptera
Methods: CRISPR / cas9
Applied on: Lepidoptera
Experimental Data: https://doi.org/10.5281/zenodo.896916
3. Graph representation
[Figure: graph with the nodes "CRISPR / cas9 editing in Lepidoptera" (https://doi.org/10.1101/130344), Robert Reed (https://orcid.org/0000-0002-6065-6728), "Genome editing in Lepidoptera", Experimental Data (https://doi.org/10.5281/zenodo.896916), CRISPR/cas9 and Genome editing (https://www.wikidata.org/wiki/Q24630389), and the edge labels addresses and isEvaluatedWith]
63. Page 63
Exploration and Question Answering
Research Challenge:
• Intuitive exploration leveraging the rich semantic representations
• Answer natural language questions
Pipeline: Question parsing → Named Entity Recognition (NER) & Linking (NEL) → Relation extraction → Query construction → Query execution → Result rendering
Q: How do different genome editing techniques compare?
PREFIX : <http://example.org/>  # example namespace (assumption)
SELECT ?approach ?feature WHERE {
  ?approach :addresses :GenomeEditing .
  ?approach :hasFeature ?feature .
}
[1] K. Singh, S. Auer et al.: Why Reinvent the Wheel? Let's Build Question Answering Systems Together. The Web Conference (WWW 2018).
64. Page 64
Result: Automatic Generation of Comparisons / Surveys
Q: How do different genome editing techniques compare?
Engineered nucleases | Site-specificity | Safety | Ease-of-use / costs / speed
zinc finger nucleases (ZFN) | ++ (9-18 nt) | + | -- ($$$: screening, testing to define efficiency)
transcription activator-like effector nucleases (TALENs) | +++ (9-16 nt) | ++ | ++ (easy to engineer; 1 week / a few hundred dollars)
engineered meganucleases | +++ (12-40 nt) | 0 | -- ($$$: protein engineering, high-throughput screening)
CRISPR system/cas9 | ++ (5-12 nt) | - | +++ (easy to engineer; a few days / less than 200 dollars)
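Generating such a comparison from a graph amounts to pivoting (approach, property, value) statements into rows and columns. A minimal sketch using the ratings from the slide; the flat tuple layout, the shortened property names and the function names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Fact = Tuple[str, str, str]

facts: List[Fact] = [
    ("ZFN", "Site-specificity", "++"), ("ZFN", "Safety", "+"), ("ZFN", "Ease-of-use", "--"),
    ("TALENs", "Site-specificity", "+++"), ("TALENs", "Safety", "++"),
    ("TALENs", "Ease-of-use", "++"),
    ("Meganucleases", "Site-specificity", "+++"), ("Meganucleases", "Safety", "0"),
    ("Meganucleases", "Ease-of-use", "--"),
    ("CRISPR/cas9", "Site-specificity", "++"), ("CRISPR/cas9", "Safety", "-"),
    ("CRISPR/cas9", "Ease-of-use", "+++"),
]


def build_comparison(facts: List[Fact]) -> Tuple[List[str], Dict[str, Dict[str, str]]]:
    """Pivot statements into {approach: {property: value}}, keeping first-seen order."""
    properties: List[str] = []
    table: Dict[str, Dict[str, str]] = {}
    for approach, prop, value in facts:
        table.setdefault(approach, {})[prop] = value
        if prop not in properties:
            properties.append(prop)
    return properties, table


def render(properties: List[str], table: Dict[str, Dict[str, str]]) -> str:
    """Render the pivoted data as pipe-separated text rows."""
    lines = [" | ".join(["Approach"] + properties)]
    for approach, values in table.items():
        # A missing property is rendered as an empty cell.
        lines.append(" | ".join([approach] + [values.get(p, "") for p in properties]))
    return "\n".join(lines)


print(render(*build_comparison(facts)))
```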
72. Page 72
Cognitive Data challenges where Knowledge Graphs can make a difference
Hybrid AI – combination of smart data (knowledge graphs) and smart analytics
Distributed semantic technologies – knowledge representation using vocabularies, ontologies
Question Answering
• Open Question Answering architecture – flexible, knowledge-based integration architecture for QA components and pipelines
• Dialogue Systems – combination of language models and goal-driven question answering
Integration with Crowdsourcing
Knowledge Graphs, Semantic Data Lakes
Robotics – usage of semantics for actuation
Agile Interoperability – leveraging community-driven vocabulary development
73. Page 73
The Team
Prof. (Univ. S. Bolivar)
Dr. Maria Esther Vidal
Software Development
Dr. Kemele Endris
Collaborators TIB Scientific Data Mgmt.
Group Leaders PostDocs
Project Management
Doctoral Researchers
Dr. Markus Stocker Dr. Gábor Kismihók Dr. Javad Chamanara Dr. Jennifer D’Souza
Allard Oelen Yaser Jaradeh Manuel Prinz
Alex Garatzogianni
Collaborators InfAI Leipzig / AKSW
Dr. Michael Martin Natanael Arndt
Dr. Lars Vogt
Vitalis Wiens Kheir Eddine Farfar
Muhammad Haris
Administration
Katja Bartel Simone Matern
The Z3 was the world's first functional digital computer and was built in Berlin in 1941 by Konrad Zuse in cooperation with Helmut Schreyer. The Z3 was implemented in electromagnetic relay technology, with 600 relays for the arithmetic unit and 1400 relays for the memory unit.
Longquan stoneware incense burner, China, 12th-13th century AD. Part of the Percival David Collection of Chinese Ceramics.
Breakthroughs in AI come after data is available, not after algorithmic discoveries
If you think about AI, think about the data, not algorithms
Fun fact: most major AI companies share their internal deep learning toolkits
Map the silos to their domain appropriate schemas
Link the nodes (Linked Data)
The schema can be virtual – multiple schemas/views may be appropriate
You could argue that MDM & BI hub-and-spoke systems have had the objective of the “Solution Tomorrow”, but were never able to fulfill this promise: their reliance on the relational paradigm prevents them from having the flexibility to provide an unlimited number of perspectives on the same data. On the contrary, MDM & BI hubs have required all perspectives to be aligned with the one single truth that was physically incorporated in the backbone and paradigm of these approaches.
Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal, Sören Auer: MULDER: Querying the Linked Data Web by Bridging RDF Molecule Templates. DEXA (1) 2017: 3-18
D. Diefenbach, K. Singh, A. Both, D. Cherix, C. Lange, S. Auer. 2017. The Qanary Ecosystem: Getting New Insights by Composing Question Answering Pipelines. Int. Conf. on Web Engineering ICWE 2017.
K. Singh, A. Sethupat, A. Both, S. Shekarpour, I. Lytra, R. Usbeck, A. Vyas, A. Khikmatullaev, D. Punjani, C. Lange, M.-E. Vidal, J. Lehmann, S. Auer: Why Reinvent the Wheel? Let's Build Question Answering Systems Together. The Web Conference (WWW 2018).
S. Shekarpour, E. Marx, S. Auer, A. P. Sheth: RQUERY: Rewriting Natural Language Queries on Knowledge Graphs to Alleviate the Vocabulary Mismatch Problem. AAAI 2017: 3936-3943
D. Lukovnikov, A. Fischer, J. Lehmann, S. Auer: Neural Network-based Question Answering over Knowledge Graphs on Word and Character Level. WWW 2017: 1211-1220