Assessing and Refining Mappings to RDF
to Improve Dataset Quality
Kontokostas@informatik.uni-leipzig.de
@jimkont
Anastasia Dimou1, Dimitris Kontokostas2, Markus Freudenberg2,
Ruben Verborgh1, Jens Lehmann2, Erik Mannens1,
Sebastian Hellmann2, Rik Van de Walle1
Anastasia.Dimou@UGent.be
@natadimou
1Ghent University – iMinds – MMLab
2AKSW – Leipzig University
http://RML.io ● http://RDFUnit.aksw.org
Linked Open Data
semantically annotated using
different vocabularies or ontologies
and interlinked data representations
published in the form of RDF datasets
derive from originally heterogeneous
(semi-)structured data
RDF Dataset Quality
varies significantly ranging
from expensively curated
to relatively low quality datasets
RDF Dataset Quality - Intrinsic Dimension
determines the RDF Dataset Quality
by assessing it for possible violations
with respect to
accuracy (e.g. malformed datatype literals)
consistency (e.g. disjoint classes/properties)
RDF Dataset Quality Assessment (DQA)
DQA with RDFUnit
Mappings Quality Assessment (MQA)
MQA with RDFUnit over RML
Mapping & Dataset Quality Assessment Workflow
Mapping Refinements
Mappings & Quality Assessment Results
Violations
Most frequent violations are
related to the dataset's schema
(vocabularies or ontologies)
dbo:birthDate range  xsd:date
dbo:birthDate domain  dbo:Person
<http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
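The two mismatches on this slide (wrong class for the domain, wrong datatype for the range) can be reproduced with a very small check. The sketch below is plain Python for illustration only, not RDFUnit; the tuple layout and the names `SCHEMA`, `RESOURCES` and `find_violations` are assumptions made for the example, and it compares classes literally without any subclass reasoning.

```python
# Hypothetical, simplified schema: property -> expected domain class and range datatype.
SCHEMA = {
    "dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"},
}

# The resource from the slide: (subject, class, property, literal value, literal datatype).
RESOURCES = [
    ("http://example.com/Chuck_Bednarik", "dbo:Event",
     "dbo:birthDate", "1925-05-01", "xsd:gYear"),
]


def find_violations(resources, schema):
    """Return (subject, kind, found, expected) for every schema mismatch."""
    violations = []
    for subject, rdf_class, prop, _value, datatype in resources:
        rule = schema.get(prop)
        if rule is None:
            continue  # no constraint known for this property
        if rdf_class != rule["domain"]:
            violations.append((subject, "domain", rdf_class, rule["domain"]))
        if datatype != rule["range"]:
            violations.append((subject, "range", datatype, rule["range"]))
    return violations


for violation in find_violations(RESOURCES, SCHEMA):
    print(violation)
```

Under these assumptions the single resource yields one domain violation and one range violation.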
RDF DQA with RDFUnit
test-driven data-debugging framework
based on SPARQL-patterns
http://rdfunit.aksw.org
D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri.
Test-driven evaluation of linked data quality.
In Proceedings of the 23rd International Conference on World Wide Web.
RDF DQA with RDFUnit
Pattern:
…WHERE { ?resource %%P1%% ?c.
         FILTER (DATATYPE(?c) != %%D1%%) }
Instantiated test case:
…WHERE { ?resource dbo:birthDate ?c.
         FILTER (DATATYPE(?c) != xsd:date) }
Violating triple:
<http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
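The step from the pattern to the concrete test case is a placeholder substitution. The following is a minimal sketch of that idea in plain Python; the function name `instantiate` and the bindings dictionary are hypothetical, and RDFUnit's real implementation may work differently. The elided query head ("…") is kept exactly as it appears on the slide.

```python
# Pattern copied from the slide; the leading "…" is elided there and is kept as-is.
PATTERN = (
    "…WHERE { ?resource %%P1%% ?c.\n"
    "FILTER (DATATYPE(?c) != %%D1%%) }"
)


def instantiate(pattern, bindings):
    """Replace every %%NAME%% placeholder with its bound value."""
    query = pattern
    for name, value in bindings.items():
        query = query.replace(f"%%{name}%%", value)
    return query


# Bindings derived from the schema axiom "dbo:birthDate range xsd:date".
test_case = instantiate(PATTERN, {"P1": "dbo:birthDate", "D1": "xsd:date"})
print(test_case)
```

One pattern can be bound once per schema axiom, which is how a single pattern can cover many properties.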
Violations
Most frequent violations are
related to the dataset's schema
(vocabularies or ontologies)
Similar violations occur repeatedly
within a single RDF dataset
<http://example.com/Chuck_Bednarik>  a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
<http://example.com/Matt_McBride>    a dbo:Event ; dbo:birthDate "1985-05-23"^^xsd:gYear .
<http://example.com/Steve_Meilinger> a dbo:Event ; dbo:birthDate "1930-12-12"^^xsd:gYear .
<http://example.com/Brick_Bronsky>   a dbo:Event ; dbo:birthDate "1964"^^xsd:gYear .
<http://example.com/Giddeon_Massie>  a dbo:Event ; dbo:birthDate "1981-08-27"^^xsd:gYear .
sets of triples of a dataset have repetitive patterns
<http://example.com/{Name}_{Surname}> a dbo:Event ; dbo:birthDate "Birth"^^xsd:gYear .
Mapping languages
formalize patterns into rules
to generate the RDF dataset
from the original data
Instead of applying Quality Assessment
to the already published RDF dataset
as part of data consumption
Apply Quality Assessment to the Mappings
that generate the RDF dataset
Incorporate Quality Assessment
in the publishing workflow
DQA: Dataset Quality Assessment
is applied by third parties
to an already published RDF dataset
Adjustments to the dataset
are applied manually, and only rarely
are not made at the root of the violation (which is hard to identify)
are overwritten when a new version of
the original data is mapped & published
sets of triples of a dataset have repetitive patterns
Name     Surname    Birth
Chuck    Bednarik   1925-05-01
Matt     McBride    1985-05-23
Steve    Meilinger  1930-12-12
Brick    Bronsky    1964
Giddeon  Massie     1981-08-27
<http://example.com/{Name}_{Surname}> a dbo:Event ; dbo:birthDate "Birth"^^xsd:gYear .
RDF Mapping Language (RML)
specifies mapping definitions that
generate RDF representations
from heterogeneous data sources
extends the W3C-recommended R2RML
http://rml.io
A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle.
RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data.
In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.
RDF Mapping Language (RML)
<#Mapping>
  rr:subjectMap [ rr:class dbo:Event ;
                  rr:template "http://example.com/{Name}_{Surname}" ] ;
  rr:predicateObjectMap [ rr:predicate dbo:birthDate ;
                          rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] .
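To make the effect of such a rule concrete, the sketch below applies a plain-Python stand-in for the mapping to the rows of the example table. It is not an RML processor: the dictionary `MAPPING` and the function `apply_mapping` are hypothetical names that only mirror the template, class, predicate, reference and datatype shown in the mapping.

```python
# Rows from the example table in the slides.
ROWS = [
    {"Name": "Chuck", "Surname": "Bednarik", "Birth": "1925-05-01"},
    {"Name": "Matt", "Surname": "McBride", "Birth": "1985-05-23"},
    {"Name": "Steve", "Surname": "Meilinger", "Birth": "1930-12-12"},
    {"Name": "Brick", "Surname": "Bronsky", "Birth": "1964"},
    {"Name": "Giddeon", "Surname": "Massie", "Birth": "1981-08-27"},
]

# Plain-Python stand-in for the mapping rule (assumed structure, not RML syntax).
MAPPING = {
    "template": "http://example.com/{Name}_{Surname}",
    "class": "dbo:Event",
    "predicate": "dbo:birthDate",
    "reference": "Birth",
    "datatype": "xsd:gYear",
}


def apply_mapping(mapping, rows):
    """Generate one type triple and one literal triple per input row."""
    triples = []
    for row in rows:
        subject = "<" + mapping["template"].format(**row) + ">"
        literal = '"{}"^^{}'.format(row[mapping["reference"]], mapping["datatype"])
        triples.append((subject, "a", mapping["class"]))
        triples.append((subject, mapping["predicate"], literal))
    return triples


for triple in apply_mapping(MAPPING, ROWS):
    print(" ".join(triple), ".")
```

Because every row passes through the same rule, a wrong class or datatype in the rule is copied into every generated resource, which is why the same violation repeats across the dataset.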
RDF Mapping Language (RML)
[Figure: the data and the mapping document (map doc) are fed to the Mapping Processor; DQA reports violations]
DQA: Dataset Quality Assessment
MQA with RDFUnit over RML
<http://example.com/{Name}_{Surname}> a dbo:Event ; dbo:birthDate "Birth"^^xsd:gYear .
…WHERE { ?resource %%P1%% ?c.
FILTER (DATATYPE(?c) != %%D1%%) }
…WHERE { ?resource dbo:birthDate ?c.
FILTER (DATATYPE(?c) != xsd:date) }
… WHERE {
?resource rr:predicateObjectMap ?poMap.
?poMap rr:predicate %%P1%%;
rr:objectMap ?objM.
?objM rr:datatype ?c.
FILTER (?c != %%D1%%) }
<#Mapping>
  rr:subjectMap [ rr:class dbo:Event ;
                  rr:template "http://example.com/{Name}_{Surname}" ] ;
  rr:predicateObjectMap [ rr:predicate dbo:birthDate ;
                          rr:objectMap [ rml:reference "Age" ; rr:datatype xsd:gYear ] ] .
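The same datatype test can be run against the mapping instead of the generated data. The sketch below illustrates this in plain Python under stated assumptions: `MAPPING_DOC` is a hypothetical in-memory form of the mapping document, `EXPECTED_DATATYPE` stands for the schema axiom "dbo:birthDate range xsd:date", and the keys of the returned record borrow the property names used in the RDFUnit result shown later in the deck. It is not how RDFUnit evaluates the SPARQL pattern.

```python
# Hypothetical in-memory form of the mapping document: one entry per object map.
MAPPING_DOC = [
    {"objectMap": "<#ObjectMap>", "predicate": "dbo:birthDate", "datatype": "xsd:gYear"},
]

# Expected datatypes taken from the schema (dbo:birthDate range xsd:date).
EXPECTED_DATATYPE = {"dbo:birthDate": "xsd:date"}


def assess_mapping(mapping_doc, expected):
    """Report object maps whose rr:datatype differs from the expected range."""
    results = []
    for entry in mapping_doc:
        wanted = expected.get(entry["predicate"])
        if wanted is not None and entry["datatype"] != wanted:
            results.append({
                "violationRoot": entry["objectMap"],
                "violationPath": "rr:datatype",
                "violationValue": entry["datatype"],
                "missingValue": wanted,
            })
    return results


for result in assess_mapping(MAPPING_DOC, EXPECTED_DATATYPE):
    print(result)
```

The check inspects one rule rather than every generated triple, so it reports a single violation and does so before any data has been mapped.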
[Figure: the data and the mapping document (map doc) are fed to the Mapping Processor; MQA reports violations]
MQA: Mapping Quality Assessment
[Figure: the same workflow, with MDQA reporting violations]
MDQA: Uniform Mapping & Dataset
Quality Assessment
MQA: Mapping Quality Assessment
discover violations before
they are even generated
specify the origin of the violation
RDFUnit over RML
… WHERE {
  ?resource rr:predicateObjectMap ?poMap.
  ?poMap rr:predicate dbo:birthDate;
         rr:objectMap ?objM.
  ?objM rr:datatype ?c.
  FILTER (?c != xsd:date) }
<#Result>
  rut:testCase rut:datatypeError ;
  spin:violationRoot <#ObjectMap> ;
  spin:violationPath rr:datatype ;
  spin:violationValue xsd:gYear ;
  rut:missingValue xsd:date .
<http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
MQA: Mapping Quality Assessment
discover violations before
they are even generated
specify the origin of the violation
easily apply structural adjustments
to the mapping definitions
MDQA: Uniform Mapping & Dataset
Quality Assessment
[Figure: the data and the mapping document (map doc) are fed to the Mapping Processor; MDQA reports violations]
<#Result>
  rut:testCase rut:datatypeError ;
  spin:violationRoot <#ObjectMap> ;
  spin:violationPath rr:datatype ;
  spin:violationValue xsd:gYear ;
  rut:missingValue xsd:date .
DEL: <#ObjectMap> rr:datatype xsd:gYear
ADD: <#ObjectMap> rr:datatype xsd:date
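The DEL/ADD pair follows mechanically from the fields of the violation result. The sketch below shows one way this could be derived and applied, assuming the mapping document is held as a set of (subject, predicate, object) statements; the names `refinement` and `apply_refinement` are hypothetical and the code is an illustration, not the authors' implementation.

```python
# Violation result, using the field names from the slide.
RESULT = {
    "violationRoot": "<#ObjectMap>",
    "violationPath": "rr:datatype",
    "violationValue": "xsd:gYear",
    "missingValue": "xsd:date",
}

# Assumed representation: the mapping document as a set of statements.
MAPPING_DOC = {
    ("<#ObjectMap>", "rml:reference", '"Birth"'),
    ("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
}


def refinement(result):
    """Turn one violation result into a (DEL, ADD) pair of statements."""
    root, path = result["violationRoot"], result["violationPath"]
    return (root, path, result["violationValue"]), (root, path, result["missingValue"])


def apply_refinement(mapping_doc, result):
    """Return a new mapping document with the DEL statement swapped for ADD."""
    delete, add = refinement(result)
    return (set(mapping_doc) - {delete}) | {add}


delete, add = refinement(RESULT)
print("DEL:", *delete)
print("ADD:", *add)
refined = apply_refinement(MAPPING_DOC, RESULT)
```

Returning a new set rather than editing in place keeps the original mapping document available for comparison with the refined one.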
Uniform Mapping & Dataset
Quality Assessment Workflow
[Figure: data and map doc enter the Mapping Processor; MDQA reports violations, which drive Mapping Refinements]
MQA with RDFUnit over RML
DEL: <#ObjectMap> rr:datatype xsd:gYear
ADD: <#ObjectMap> rr:datatype xsd:date
<http://example.com/Chuck_Bednarik> a dbo:Person ; dbo:birthDate "1925-05-01"^^xsd:date .
<#Result>
  rut:testCase rut:datatypeError ;
  spin:violationRoot <#ObjectMap> ;
  spin:violationPath rr:datatype ;
  spin:violationValue xsd:float ;
  rut:missingValue xsd:int .
Uniform Mapping & Dataset
Quality Assessment Workflow
[Figure: data and map doc enter the Mapping Processor; MDQA reports violations; Mapping Refinements produce a new map doc; one step is marked (optional)]
Beyond Mapping Quality Assessment
certain test cases inevitably
require the RDF dataset:
cardinality,
functionality,
symmetry
these depend on the data and
are NOT affected by the mapping definitions
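A functionality test shows why the data is needed: whether a resource ends up with two values for the same property depends on the input rows, not on the mapping rule. The sketch below is a plain-Python illustration under loud assumptions: treating `dbo:birthDate` as functional is assumed for the example, the `ex:` prefix is shorthand, and the second birth date for Chuck_Bednarik is invented solely to trigger the check.

```python
from collections import defaultdict

# Hypothetical generated triples; the conflicting second value is invented for this example.
TRIPLES = [
    ("ex:Chuck_Bednarik", "dbo:birthDate", "1925-05-01"),
    ("ex:Chuck_Bednarik", "dbo:birthDate", "1925-05-02"),
    ("ex:Matt_McBride", "dbo:birthDate", "1985-05-23"),
]


def functional_violations(triples, prop):
    """Return subjects that have more than one distinct value for prop."""
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == prop:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


print(functional_violations(TRIPLES, "dbo:birthDate"))
```

Both conflicting triples could come from the same, perfectly valid mapping rule, so a test like this can only run after the dataset has been generated.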
Mapping Quality Assessment (MQA)
prevents violations from being generated
prevents the same violations from appearing
repeatedly across distinct entities
allows different ontologies and vocabularies
to be combined intuitively
Dataset vs. Mapping Quality Assessment: Number of Violations

             Dataset Quality Assessment          Mapping Quality Assessment
Dataset      #failed test cases   #violations    #failed test cases   #violations
DBpedia EN   1,128                3.2M           1                    160
DBpedia NL   683                  815k           1                    124
DBLP         7                    8.1M           2                    8

* DBpedia and D2RQ mappings were translated to RML mappings
Dataset vs. Mapping Quality Assessment: Time

             Dataset Quality Assessment    Mapping Quality Assessment
Dataset      size      time                size      time
DBpedia EN   62M       16h                 115K      11s
DBpedia NL   21M       1.5h                53K       6s
DBLP         12M       12h                 368       12s
CEUR-WS*     2.4k      6s                  702       5s
iLastic      150k      12s                 825       15s

* CEUR-WS submission to the ESWC Semantic Publishing Challenge (2014 vs. 2015)
Mapping Quality Assessment

Dataset       size    time
DBpedia EN    115K    11s
DBpedia NL    53K     6s
DBpedia All   511K    32s

* http://mappings.dbpedia.org/validation
DBpedia Mapping Quality Assessment results are updated live every night!
Violations
Most frequent violations are
related to the dataset's schema
(vocabularies or ontologies)
Similar violations occur repeatedly
within a single RDF dataset
The situation worsens as more
ontologies and vocabularies
are reused and combined
Quality Assessment
shifted from data consumption
to data publication
integrated systematically
in the publishing workflow
violations are identified,
resolved and will not re-appear
RDF dataset of higher Quality is generated

Assessing and Refining Mappings to RDF to Improve Dataset Quality

  • 1. Assessing and Refining Mappings to RDF to Improve Dataset Quality Kontokostas@informatik.uni-leipzig.de @jimkont Anastasia Dimou1, Dimitris Kontokostas2, Markus Freudenberg2, Ruben Verborgh1, Jens Lehmann2, Erik Mannens1, Sebastian Hellmann2, Rik Van de Walle1 Anastasia.Dimou@UGent.be @natadimou 1Ghent University – iMinds – MMLab 2AKSW – Leipzig University http://RML.io ● http://RDFUnit.aksw.org
  • 2. Linked Open Data: semantically annotated (using different vocabularies or ontologies) and interlinked data representations, published in the form of RDF datasets, that derive from originally heterogeneous (semi-)structured data.
  • 3. RDF Dataset Quality varies significantly, ranging from expensively curated to relatively low-quality datasets.
  • 4. RDF Dataset Quality - Intrinsic Dimension: determines the RDF dataset quality by assessing it for possible violations with respect to accuracy (e.g. malformed datatype literals) and consistency (e.g. disjoint classes/properties).
  • 5. Outline: RDF Dataset Quality Assessment (DQA); DQA with RDFUnit; Mappings Quality Assessment (MQA); MQA with RDFUnit over RML; Mapping & Dataset Quality Assessment Workflow; Mapping Refinements; Mappings & Quality Assessment Results
  • 7. Violations: Most frequent violations are related to the dataset's schema (vocabularies or ontologies). Schema: dbo:birthDate has range xsd:date and domain dbo:Person. Data: <http://example.com/Chuck_Bednarik> is typed dbo:Event and has dbo:birthDate "1925-05-01"^^xsd:gYear, so both the domain and the range are violated.
  • 8. RDF DQA with RDFUnit test-driven data-debugging framework based on SPARQL-patterns http://rdfunit.aksw.org D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri Test-driven evaluation of linked data quality In Proceedings of the 23rd International Conference on World Wide Web
  • 9. RDF DQA with RDFUnit (http://rdfunit.aksw.org): test-driven data-debugging framework based on SPARQL patterns. Example data: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
  • 10. RDF DQA with RDFUnit (http://rdfunit.aksw.org)
      Pattern: …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }
      Test case: …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
      Example data: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
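The step on slide 10 from the generic pattern to the concrete test case amounts to binding the `%%P1%%` and `%%D1%%` placeholders. The following is a minimal Python sketch of that substitution, not RDFUnit's actual implementation; `bind_pattern` is a hypothetical helper, and the part of the query before `WHERE` is left out because the slide elides it.

```python
import re

# Query fragment as shown on the slide; whatever precedes WHERE is elided there.
PATTERN = "WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }"


def bind_pattern(pattern: str, bindings: dict) -> str:
    """Replace every %%NAME%% placeholder with its bound value.

    Hypothetical helper for illustration; raises KeyError on an unbound placeholder.
    """
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for placeholder %%{name}%%")
        return bindings[name]

    return re.sub(r"%%(\w+)%%", substitute, pattern)


# Bind the pattern to the schema axiom "dbo:birthDate range xsd:date".
query = bind_pattern(PATTERN, {"P1": "dbo:birthDate", "D1": "xsd:date"})
print(query)
```

Binding the same pattern to another property and its declared range would yield another test case, which is how one pattern can cover many schema axioms.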
  • 11. Violations: Most frequent violations are related to the dataset's schema (vocabularies or ontologies). Similar violations occur repeatedly within a single RDF dataset.
  • 13. Sets of triples of a dataset have repetitive patterns:
      <http://example.com/Brick_Bronsky> a dbo:Event ; dbo:birthDate "1964"^^xsd:gYear .
      <http://example.com/Steve_Meilinger> a dbo:Event ; dbo:birthDate "1930-12-12"^^xsd:gYear .
      <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
      <http://example.com/Matt_McBride> a dbo:Event ; dbo:birthDate "1985-05-23"^^xsd:gYear .
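Because all four resources on slide 13 follow the same pattern, a dataset-level datatype check reports the same violation once per resource. The Python sketch below illustrates this with plain tuples instead of a SPARQL engine; `datatype_violations` and the tuple layout are assumptions made for this example.

```python
# (subject, predicate, lexical value, datatype), copied from the slide.
TRIPLES = [
    ("http://example.com/Brick_Bronsky", "dbo:birthDate", "1964", "xsd:gYear"),
    ("http://example.com/Steve_Meilinger", "dbo:birthDate", "1930-12-12", "xsd:gYear"),
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", "1925-05-01", "xsd:gYear"),
    ("http://example.com/Matt_McBride", "dbo:birthDate", "1985-05-23", "xsd:gYear"),
]

# Expected range per property, taken from the schema (dbo:birthDate range xsd:date).
EXPECTED_RANGE = {"dbo:birthDate": "xsd:date"}


def datatype_violations(triples, expected_range):
    """Return the triples whose literal datatype differs from the property's range."""
    return [
        t for t in triples
        if t[1] in expected_range and t[3] != expected_range[t[1]]
    ]


violations = datatype_violations(TRIPLES, EXPECTED_RANGE)
print(len(violations))  # 4: one violation per resource, all with the same cause
```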
  • 15. Sets of triples of a dataset have repetitive patterns. Pattern: subject http://example.com/{Name}_{Surname}, typed dbo:Event, with dbo:birthDate taken from "Birth" and typed xsd:gYear. Mapping languages formalize patterns into rules to generate the RDF dataset from the original data.
  • 16. Instead of applying Quality Assessment to the already published RDF dataset as part of data consumption: apply Quality Assessment to the Mappings that generate the RDF dataset, and incorporate Quality Assessment in the publishing workflow.
  • 17. DQA: Dataset Quality Assessment is applied by third parties to the already published RDF dataset. Diagram labels: DQA, violations.
  • 18. DQA: Dataset Quality Assessment. Adjustments to the dataset are applied manually, but rarely; they are not applied at the root of the violation, which is hard to identify; and they are overwritten if a new version of the original data is mapped & published. Diagram labels: DQA, violations.
  • 20. Sets of triples of a dataset have repetitive patterns. Pattern: subject http://example.com/{Name}_{Surname}, typed dbo:Event, with dbo:birthDate taken from "Birth" and typed xsd:gYear. Mapping languages formalize patterns into rules to generate the RDF dataset from the original data.
  • 21. Sets of triples of a dataset have repetitive patterns. Source data:
      Name      Surname     Birth
      Chuck     Bednarik    1925-05-01
      Matt      McBride     1985-05-23
      Steve     Meilinger   1930-12-12
      Brick     Bronsky     1964
      Giddeon   Massie      1981-08-27
      Pattern: subject http://example.com/{Name}_{Surname}, typed dbo:Event, with dbo:birthDate taken from "Birth" and typed xsd:gYear.
  • 22. RDF Mapping Language (RML) specify the mapping definitions to generate RDF representation from heterogeneous data sources extends the W3C-recommended R2RML http://rml.io A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.
  • 23. RDF Mapping Language (RML) http://rml.io
      <#Mapping>
        rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}_{Surname}" ] ;
        rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] .
      Pattern: subject http://example.com/{Name}_{Surname}, typed dbo:Event, with dbo:birthDate taken from "Birth" and typed xsd:gYear.
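To make the role of the mapping on slide 23 concrete, the sketch below mimics in Python what a mapping processor does with it: each input row is expanded through the subject template and the "Birth" reference into two triples. This is an illustration under simplifying assumptions, not the RML processor; the `MAPPING` dictionary layout and the `generate` function are hypothetical, and prefixed names are kept unexpanded for brevity.

```python
# Two rows of the source table from the slides.
ROWS = [
    {"Name": "Chuck", "Surname": "Bednarik", "Birth": "1925-05-01"},
    {"Name": "Matt", "Surname": "McBride", "Birth": "1985-05-23"},
]

# Hypothetical dictionary encoding of the mapping definition on slide 23.
MAPPING = {
    "subject_template": "http://example.com/{Name}_{Surname}",
    "subject_class": "dbo:Event",
    "predicate": "dbo:birthDate",
    "object_reference": "Birth",
    "object_datatype": "xsd:gYear",
}


def generate(rows, mapping):
    """Yield one type triple and one predicate-object triple per input row."""
    for row in rows:
        subject = "<" + mapping["subject_template"].format(**row) + ">"
        yield f'{subject} rdf:type {mapping["subject_class"]} .'
        value = row[mapping["object_reference"]]
        yield f'{subject} {mapping["predicate"]} "{value}"^^{mapping["object_datatype"]} .'


triples = list(generate(ROWS, MAPPING))
for line in triples:
    print(line)
```

Every generated resource inherits the class and datatype chosen in the mapping, which is why a single wrong choice there is repeated across the whole dataset.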
  • 26. MQA with RDFUnit over RML. Pattern: subject http://example.com/{Name}_{Surname}, typed dbo:Event, with dbo:birthDate taken from "Birth" and typed xsd:gYear.
      Dataset pattern: …WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }
      Dataset test case: …WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
      Mapping pattern: … WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate %%P1%%; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != %%D1%%) }
      Mapping:
      <#Mapping>
        rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}_{Surname}" ] ;
        rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Age" ; rr:datatype xsd:gYear ] ] .
  • 28. MDQA: Uniform Mapping & Dataset Quality Assessment. Diagram labels: data, map doc, Mapping Processor, MDQA, violations.
  • 29. MQA: Mapping Quality Assessment discover violations before they are even generated specify the origin of the violation
  • 30. RDFUnit over RML. Example data: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
      Mapping pattern: … WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate %%P1%%; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != %%D1%%) }
  • 31. RDFUnit over RML. Example data: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
      Mapping test case: … WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate dbo:birthDate; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != xsd:date) }
      Result:
      <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:gYear ; rut:missingValue xsd:date .
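The mapping-level test case on slide 31 walks from the predicate-object map to the object map and compares the declared `rr:datatype` with the expected one. Here is a small Python sketch of that traversal over the mapping encoded as triples; it is not RDFUnit itself, the blank-node label `_:poMap` and the function names are assumptions, and the result dictionary only mirrors the property names of the `<#Result>` shown on the slide.

```python
# The mapping of slide 23 as (subject, predicate, object) triples.
# "_:poMap" is a hypothetical label for the anonymous predicate-object map.
MAPPING = {
    ("<#Mapping>", "rr:predicateObjectMap", "_:poMap"),
    ("_:poMap", "rr:predicate", "dbo:birthDate"),
    ("_:poMap", "rr:objectMap", "<#ObjectMap>"),
    ("<#ObjectMap>", "rml:reference", "Birth"),
    ("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
}


def objects(triples, subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def check_datatype(triples, prop, expected):
    """Report object maps for `prop` whose rr:datatype differs from `expected`."""
    results = []
    po_maps = sorted(o for _, p, o in triples if p == "rr:predicateObjectMap")
    for po_map in po_maps:
        if prop not in objects(triples, po_map, "rr:predicate"):
            continue
        for obj_map in objects(triples, po_map, "rr:objectMap"):
            for declared in objects(triples, obj_map, "rr:datatype"):
                if declared != expected:
                    results.append({
                        "violationRoot": obj_map,
                        "violationPath": "rr:datatype",
                        "violationValue": declared,
                        "missingValue": expected,
                    })
    return results


results = check_datatype(MAPPING, "dbo:birthDate", "xsd:date")
print(results)
```

The check inspects only the mapping definition, so it yields a single result that points at the object map, regardless of how many rows the source data contains.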
  • 32. MQA: Mapping Quality Assessment discover violations before they are even generated specify the origin of the violation easily apply structural adjustments to the mapping definitions
  • 34. MDQA: Uniform Mapping & Dataset Quality Assessment. Diagram labels: data, map doc, Mapping Processor, MDQA, violations.
      Result:
      <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:gYear ; rut:missingValue xsd:date .
  • 35. MDQA: Uniform Mapping & Dataset Quality Assessment. Diagram labels: data, map doc, Mapping Processor, MDQA, violations.
      Result:
      <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:gYear ; rut:missingValue xsd:date .
      Refinement:
      DEL: <#ObjectMap> rr:datatype xsd:gYear
      ADD: <#ObjectMap> rr:datatype xsd:date
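The DEL/ADD pair on slide 35 can be read as a set difference followed by a set union over the mapping's triples. The Python sketch below shows that reading; `refine` is a hypothetical helper, and only the two object-map triples relevant to the refinement are included.

```python
# Object-map triples of the mapping before refinement.
MAPPING = frozenset({
    ("<#ObjectMap>", "rml:reference", "Birth"),
    ("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
})

# The refinement from the slide, as triples to delete and triples to add.
DELETE = frozenset({("<#ObjectMap>", "rr:datatype", "xsd:gYear")})
ADD = frozenset({("<#ObjectMap>", "rr:datatype", "xsd:date")})


def refine(mapping, delete, add):
    """Return a new mapping with `delete` removed and `add` inserted.

    Refuses to delete triples that are not in the mapping, so a stale
    refinement is noticed instead of being silently ignored.
    """
    missing = delete - mapping
    if missing:
        raise ValueError(f"cannot delete triples not in the mapping: {sorted(missing)}")
    return (mapping - delete) | add


refined = refine(MAPPING, DELETE, ADD)
print(sorted(refined))
```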
  • 36. Uniform Mapping & Dataset Quality Assessment Workflow. Diagram labels: data, map doc, Mapping Processor, Mapping Refinements, MDQA, violations.
  • 37. MQA with RDFUnit over RML. Refined data: <http://example.com/Chuck_Bednarik> a dbo:Person ; dbo:birthDate "1925-05-01"^^xsd:date .
      Refinement:
      DEL: <#ObjectMap> rr:datatype xsd:gYear
      ADD: <#ObjectMap> rr:datatype xsd:date
      Result:
      <#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:float ; rut:missingValue xsd:int .
  • 40. Beyond Mapping Quality Assessment: certain test cases inevitably require the RDF dataset (cardinality, functionality, symmetricity).
  • 41. Beyond Mapping Quality Assessment: certain test cases inevitably require the RDF dataset (cardinality, functionality, symmetricity). They reflect the data and are NOT affected by the mapping definitions.
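A functionality check illustrates why such test cases need the generated dataset: whether a resource ends up with two values depends on the source rows, not on the mapping rule. The Python sketch below uses made-up data (the second birth date for Chuck_Bednarik is hypothetical) and treats dbo:birthDate as functional purely for the sake of the example.

```python
from collections import defaultdict

# Generated triples; the second value for Chuck_Bednarik is invented to
# show a violation that only the data can reveal.
TRIPLES = [
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", "1925-05-01"),
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", "1925-05-11"),
    ("http://example.com/Matt_McBride", "dbo:birthDate", "1985-05-23"),
]


def functionality_violations(triples, functional_property):
    """Map each subject with more than one distinct value to its sorted values."""
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == functional_property:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


violations = functionality_violations(TRIPLES, "dbo:birthDate")
print(violations)
```

The mapping for both resources is identical, so no inspection of the mapping alone could separate the violating resource from the valid one.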
  • 42. Mapping Quality Assessment (MQA): prevents the generation of violations; prevents the same violations from appearing repeatedly over distinct entities; allows intuitively combining different ontologies and vocabularies.
  • 44. Dataset vs. Mapping Quality Assessment: Number of Violations
                    Dataset Quality Assessment       Mapping Quality Assessment
                    #fail test cases  #violations    #fail test cases  #violations
      DBpedia EN    1,128             3.2M           1                 160
      DBpedia NL    683               815k           1                 124
      DBLP          7                 8.1M           2                 8
      * DBpedia and D2RQ mappings were translated to RML mappings
  • 45. Dataset vs. Mapping Quality Assessment: Time
                    Dataset Quality Assessment    Mapping Quality Assessment
                    size      time                size      time
      DBpedia EN    62M       16h                 115K      11s
      DBpedia NL    21M       1.5h                53K       6s
      DBLP          12M       12h                 368       12s
      CEUR-WS*      2.4k      6s                  702       5s
      iLastic       150k      12s                 825       15s
      * CEUR-WS submission to the ESWC Semantic Publishing Challenge (2014 vs. 2015)
  • 46. Mapping Quality Assessment
                     size      time
      DBpedia EN     115K      11s
      DBpedia NL     53K       6s
      DBpedia All    511K      32s
      * http://mappings.dbpedia.org/validation
      Live update of DBpedia Mapping Quality Assessment results every night!
  • 47. Violations: Most frequent violations are related to the dataset's schema (vocabularies or ontologies). Similar violations occur repeatedly within a single RDF dataset. The situation worsens as more ontologies and vocabularies are reused and combined.
  • 48. Quality Assessment is shifted from data consumption to data publication and integrated systematically in the publishing workflow; violations are identified and resolved, and will not re-appear; an RDF dataset of higher quality is generated.