Validating statistical index data
represented in RDF using
SPARQL queries
WESO Research Group
University of Oviedo, Spain
Jose Emilio Labra Gayo, Jose María Álvarez Rodríguez
Motivation
The WebIndex Project
Measure impact of the Web in different countries
First publication: 2012, 2 web sites:
Visualization
http://thewebindex.org
Data portal
http://data.webfoundation.org
WebIndex Workflow
Raw Data
Excel
Computed data
Excel
External
sources
All data
RDF Datastore
Visualizations
Technical details
Index made from
61 countries (100 planned for 2013)
85 indicators:
51 Primary (questionnaires)
34 Secondary (external sources)
WebIndex RDF data
Modeled on top of RDF Data Cube
More than 1 million triples
Linked data: DBpedia, Organizations, etc.
Records data computation
WebIndex
computation process (1)
Simplified with one indicator, 3 years and 4 countries
Raw Data
Country   2009     2010     2011
Spain     4        5        3
Finland   4        missing  6
Armenia   1 (a single value; its year is not recoverable from the slide text)
Chile     6        8        missing
Imputed Data
Country   2009     2010     2011
Spain     4        5        3
Finland   4        5        6
Armenia   1        1        1
Chile     6        8        10.6
Filtered Data (Indicator A)
Country   2009     2010     2011
Spain     4        5        3
Finland   4        5        6
Chile     6        8        10.6
Normalized Data (z-scores)
Country   2009     2010     2011
Spain     -0.57    -0.57    -0.92
Finland   -0.57    -0.57    -0.14
Chile      1.15     1.15     1.06
More details can be found here: http://thewebindex.org/about/methodology/computation/
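The normalization step can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function name `z_scores_by_year` is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption inferred from the numbers on the slide. The input is the imputed data for the three countries that appear in the normalized table.

```python
from statistics import mean, stdev

# Imputed values for the three countries kept after filtering (Indicator A).
filtered = {
    "Spain":   {2009: 4, 2010: 5, 2011: 3},
    "Finland": {2009: 4, 2010: 5, 2011: 6},
    "Chile":   {2009: 6, 2010: 8, 2011: 10.6},
}

def z_scores_by_year(table):
    """Return {country: {year: z}}, normalizing each year column separately."""
    years = sorted({year for row in table.values() for year in row})
    result = {country: {} for country in table}
    for year in years:
        column = [row[year] for row in table.values()]
        mu = mean(column)
        sigma = stdev(column)  # sample standard deviation (n - 1): an assumption
        for country, row in table.items():
            result[country][year] = (row[year] - mu) / sigma
    return result

normalized = z_scores_by_year(filtered)
for country, row in normalized.items():
    print(country, [round(row[year], 3) for year in sorted(row)])
```

Under that assumption the computed values agree with the slide's z-scores to within 0.01; the slide appears to truncate rather than round (for example -0.577 is shown as -0.57).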
WebIndex
computation Process (2)
Simplified with one indicator, 3 years and 4 countries
Normalized Data (z-scores)
Country   2009     2010     2011
Spain     -0.57    -0.57    -0.92
Finland   -0.57    -0.57    -0.14
Chile      1.15     1.15     1.06
Adjusted data
Country A B C D ...
Spain 8 7 9.1 7.1 ...
Finland 7 8 7.1 8 ...
Chile 8 9 7.6 6 ...
Group indicators
Country Readiness Impact Web Composite
Spain 5.7 3.5 5.1 4.5
Finland 5.5 3.9 7.1 4.9
Chile 6.7 4.5 7.6 5.1
Rankings
Country Readiness Impact Web Composite
Spain 2 3 3 3
Finland 3 2 2 2
Chile 1 1 1 1
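The last step, from group indicators to rankings, can be reproduced with a short Python sketch. The function name `rank_column` is hypothetical, and "highest score ranks first" is an assumption that happens to match the table above.

```python
# Group indicator scores from the slide.
scores = {
    "Spain":   {"Readiness": 5.7, "Impact": 3.5, "Web": 5.1, "Composite": 4.5},
    "Finland": {"Readiness": 5.5, "Impact": 3.9, "Web": 7.1, "Composite": 4.9},
    "Chile":   {"Readiness": 6.7, "Impact": 4.5, "Web": 7.6, "Composite": 5.1},
}

def rank_column(table, column):
    """Rank countries on one column, highest value first (1 = best)."""
    ordered = sorted(table, key=lambda country: table[country][column], reverse=True)
    return {country: position for position, country in enumerate(ordered, start=1)}

columns = ["Readiness", "Impact", "Web", "Composite"]
rankings = {column: rank_column(scores, column) for column in columns}
for country in scores:
    print(country, [rankings[column][country] for column in columns])
```

This sketch gives tied values distinct, arbitrary positions; the slides do not say how the index treats ties.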
Example of data
representation in RDF
Example:
obs:obsM23 a qb:Observation ;
  cex:computation
    [ a cex:Z-Score ;
      cex:observation obs:obsA23 ;
      cex:slice slice:sliceA09 ;
    ] ;
  cex:value -0.57 ;
  cex:md5-checksum "2917835203..." ;
  cex:indicator indicator:A ;
  cex:concept country:ESP ;
  qb:dataSet dataset:A-Normalized ;
  # ... other declarations omitted for brevity
  .
It declares that the value of this observation was obtained as the
z-score of obs:obsA23 over slice:sliceA09
Each observation follows the RDF Data Cube vocabulary, extended with
metadata about how it was obtained
Computex
Vocabulary for statistical computations
Extends RDF Data Cube
Ontology at: http://purl.org/weso/computex
Some terms:
cex:Concept
cex:Indicator
cex:Computation
cex:WeightSchema
qb:Observation
...
Validation process
Last year (2012)
Template based validation
MD5 checksum of each observation and some fields
This year (2013)
SPARQL based validation
3 levels of validation
RDF Data Cube
Shapes of data
Computation process
Ultimate goal: automatically compute the index
Validation approach
SPARQL CONSTRUCT queries instead of ASK
IF no error THEN empty model
ELSE RDF graph with error information
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam
      [ cex:name "obs" ; cex:value ?obs ] ,
      [ cex:name "value1" ; cex:value ?value1 ] ,
      [ cex:name "value2" ; cex:value ?value2 ] ;
    cex:msg "Observation has two different values" ] .
} WHERE {
  ?obs a qb:Observation .
  ?obs cex:value ?value1 .
  ?obs cex:value ?value2 .
  FILTER ( ?value1 != ?value2 )
}
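The "empty model means valid" convention can be illustrated outside SPARQL with a minimal Python sketch. Everything here is hypothetical: triples are plain tuples, and `find_conflicting_values` stands in for the CONSTRUCT query. Unlike the query, which would produce one error per ordered pair of differing values, the sketch reports one error per observation.

```python
from collections import defaultdict

def find_conflicting_values(triples):
    """Return one error record per observation that has two or more distinct values."""
    values = defaultdict(set)
    observations = set()
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj == "qb:Observation":
            observations.add(subject)
        elif predicate == "cex:value":
            values[subject].add(obj)
    errors = []
    for obs in sorted(observations):
        if len(values[obs]) > 1:
            errors.append({
                "msg": "Observation has two different values",
                "obs": obs,
                "values": sorted(values[obs]),
            })
    return errors

# Toy graph: obs:o2 carries two different values.
graph = [
    ("obs:o1", "rdf:type", "qb:Observation"),
    ("obs:o1", "cex:value", 4.0),
    ("obs:o2", "rdf:type", "qb:Observation"),
    ("obs:o2", "cex:value", 5.0),
    ("obs:o2", "cex:value", 5.5),
]
errors = find_conflicting_values(graph)
print("valid" if not errors else errors)
```

Returning the offending data rather than a boolean is what makes informative error messages possible, which is the stated reason for preferring CONSTRUCT over ASK.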
SPARQL queries
RDF Data Cube
We converted RDF Data Cube integrity
constraints from ASK to CONSTRUCT queries
Original constraint (ASK):
ASK WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}
Converted constraint (CONSTRUCT):
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam [ cex:name "dim" ; cex:value ?dim ] ;
    cex:msg "Every Dimension must have a declared range"
  ] .
} WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}
SPARQL expressivity
SPARQL can express complex validation patterns.
Example: Mean
CONSTRUCT {
  [ a cex:Error ;
    # cex:errorParam ... omitted
    cex:msg "Mean value does not match" ] .
} WHERE {
  ?obs a qb:Observation ;
       cex:computation ?comp ;
       cex:value ?val .
  ?comp a cex:Mean .
  { SELECT (AVG(?value) AS ?mean) ?comp WHERE {
      ?comp cex:observation ?obs1 .
      ?obs1 cex:value ?value .
    } GROUP BY ?comp }
  FILTER( abs(?mean - ?val) > 0.0001 )
}
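The same check reads as follows in plain Python. This is a sketch under assumptions: the dictionary layout and the name `check_means` are hypothetical, and only the 0.0001 tolerance is taken from the query.

```python
from statistics import mean

def check_means(observations, tolerance=0.0001):
    """Return (obs_id, declared, expected) for every declared mean that is off.

    observations: {obs_id: {"value": float, "mean_of": [obs_id, ...]}}
    Entries without "mean_of" are plain observations and are not checked.
    """
    errors = []
    for obs_id, obs in observations.items():
        inputs = obs.get("mean_of")
        if not inputs:
            continue
        expected = mean([observations[i]["value"] for i in inputs])
        if abs(expected - obs["value"]) > tolerance:
            errors.append((obs_id, obs["value"], expected))
    return errors

observations = {
    "a1": {"value": 4.0},
    "a2": {"value": 5.0},
    "a3": {"value": 6.0},
    "m_ok": {"value": 5.0, "mean_of": ["a1", "a2", "a3"]},
    "m_bad": {"value": 5.2, "mean_of": ["a1", "a2", "a3"]},
}
errors = check_means(observations)
print(errors)
```

The tolerance plays the role of the FILTER threshold: published values are rounded, so exact equality would flag correct data.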
Limitations of
SPARQL expressivity
Some built-in functions are not standardized
Example: z-score employs standard deviation.
It requires built-in function: sqrt
Available in some SPARQL implementations
Ranking of values was not obvious (*)
2 solutions:
Using GROUP_CONCAT
Check a value against all the other values
Neither solution is efficient
(*) http://stackoverflow.com/questions/17313730/how-to-rank-values-in-sparql
SPARQL expressivity
CONSTRUCT { # ... omitted for brevity
} WHERE {
  ?obs cex:computation [ a cex:AverageGrowth ; cex:observations ?ls ] ;
       cex:value ?val .
  ?ls rdf:first [ cex:value ?v1 ] .
  { SELECT ?ls ( SUM(?v_n / ?v_n1)/COUNT(*) AS ?meanGrowth )
    WHERE {
      ?ls rdf:rest* [ rdf:first [ cex:value ?v_n ] ;
                      rdf:rest [ rdf:first [ cex:value ?v_n1 ] ] ] .
    } GROUP BY ?ls
  }
  FILTER ( abs(?meanGrowth * ?v1 - ?val) > 0.001 )
}
* This solution was provided by Joshua Taylor
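The arithmetic that this query validates can be sketched in Python. The function name `average_growth_estimate` is hypothetical, and the assumption that the list runs from the most recent value backwards is inferred from the Chile example (6 in 2009, 8 in 2010, imputed 10.6 in 2011), not stated on the slide.

```python
def average_growth_estimate(values):
    """Return mean(v_n / v_{n+1}) * v_1 for a list ordered as in cex:observations.

    With the list in reverse chronological order (an assumption), each ratio
    is "newer / older", so the product extrapolates one step forward.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to compute a growth ratio")
    ratios = [current / following for current, following in zip(values, values[1:])]
    mean_growth = sum(ratios) / len(ratios)
    return mean_growth * values[0]

# Chile, most recent first: 8 (2010), then 6 (2009).
estimate = average_growth_estimate([8, 6])
print(round(estimate, 2))
```

For Chile this gives 8 * (8 / 6), about 10.67, which the slide shows as 10.6.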
RDF Profiles
Descriptions (with constraints) of dataset families
Can be used to validate (and compute) RDF graphs
Examples:
Cube (Statistical data)
Normalization Update queries +
Integrity constraints
Computex (statistical computations)
Adds more queries
Other user defined profiles
Profiles shown in the diagram: Cube, Computex, WebIndex, CloudIndex
Cube Profile
W3C Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/)
1 base ontology + RDF Schema inference
2 update queries (normalization algorithm)
21 integrity constraint queries
cex:cube a cex:ValidationProfile ;
  cex:name "RDF Data Cube" ;
  cex:ontologyBase cube:cube.ttl ;
  cex:expandSteps (
    [ cex:name "Closure" ; cex:uri cube:closure.ru ]
    [ cex:name "Flatten" ; cex:uri cube:flatten.ru ]
  ) ;
  cex:integrityQuery
    [ cex:name "IC-1. Unique DataSet" ; cex:uri cube:q1.sparql ] ;
  # ...
  .
Computex Profile
1 base ontology
http://purl.org/weso/computex
Imports RDF Data Cube
cex:computex a cex:ValidationProfile ;
  cex:name "Computex" ;
  cex:ontologyBase cex:computex.ttl ;
  cex:import profile:cube.ttl ;
  cex:expandSteps
    ( [ cex:name "Copy Raw" ; cex:uri cex:copyRaw.sparql ]
      # Other UPDATE queries
    ) ;
  cex:integrityQuery
    [ cex:name "Adjusted" ; cex:uri cex:adjusted.sparql ] ,
    # Other integrity queries
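How a profile with imports might be resolved into a flat list of steps can be sketched in Python. This is not the Computex tool's implementation (which is written in Scala): the `Profile` class, the `collect` function, and the choice to run imported profiles first are all assumptions made for illustration. Only the step and query names come from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Hypothetical in-memory form of a cex:ValidationProfile."""
    name: str
    expand_steps: list = field(default_factory=list)
    integrity_queries: list = field(default_factory=list)
    imports: list = field(default_factory=list)

def collect(profile, seen=None):
    """Return (expand_steps, integrity_queries), imported profiles first."""
    seen = set() if seen is None else seen
    if profile.name in seen:  # guard against circular imports
        return [], []
    seen.add(profile.name)
    steps, queries = [], []
    for imported in profile.imports:
        imported_steps, imported_queries = collect(imported, seen)
        steps += imported_steps
        queries += imported_queries
    steps += profile.expand_steps
    queries += profile.integrity_queries
    return steps, queries

cube = Profile("RDF Data Cube", ["Closure", "Flatten"], ["IC-1. Unique DataSet"])
computex = Profile("Computex", ["Copy Raw"], ["Adjusted"], [cube])
steps, queries = collect(computex)
print(steps)
print(queries)
```

In this reading, validating against Computex would first run the Cube normalization updates, then the Computex updates, and finally every integrity query from both profiles.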
Computex Validation tool
Available at:
http://computex.herokuapp.com
Features
Informative error messages
Option to generate EARL report
2 validation profiles (Cube and Computex)
Source code in Scala available at:
http://github.com/weso/computex
Based on profiles
Conclusions
WebIndex = use case for RDF Validation
SPARQL queries as an implementation approach
Challenges:
Expressivity limits of SPARQL
Complexity of some queries
Concept of RDF Profiles
Declarative, Turtle syntax
Intermediate level between OWL and SPARQL
END OF PRESENTATION
COMPLEMENTARY SLIDES
Limits of
SPARQL expressivity
Ranking of values using GROUP_CONCAT
SELECT ?x ?v ?ranking {
  ?x :value ?v .
  { SELECT (GROUP_CONCAT(?x; separator="") AS ?ordered) {
      { SELECT ?x {
          ?x :value ?v .
        } ORDER BY DESC(?v)
      }
    }
  }
  BIND (str(?x) AS ?xName)
  BIND (strbefore(?ordered, ?xName) AS ?before)
  BIND ((strlen(?before) / strlen(?xName)) + 1 AS ?ranking)
} ORDER BY ?ranking
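The string trick behind this query is easier to see in Python. The sketch below mirrors it step by step; the function name `rank_by_concatenation` and the sample names are hypothetical. It makes explicit an assumption the query relies on silently: every name must have the same length, since the rank is derived from a character offset.

```python
def rank_by_concatenation(values):
    """Rank names by value (highest first) using only string operations.

    values: {name: value}. All names must have the same length, because the
    rank is the offset of the name in the concatenated string divided by
    that length.
    """
    names = list(values)
    if not names:
        return {}
    width = len(names[0])
    if any(len(name) != width for name in names):
        raise ValueError("the string trick only works for equal-length names")
    # Equivalent of GROUP_CONCAT over the ORDER BY DESC(?v) subquery.
    ordered = "".join(sorted(names, key=lambda name: values[name], reverse=True))
    # Equivalent of strlen(strbefore(?ordered, ?xName)) / strlen(?xName) + 1.
    return {name: ordered.index(name) // width + 1 for name in names}

composite = {"ex:ESP": 4.5, "ex:FIN": 4.9, "ex:CHL": 5.1}
ranking = rank_by_concatenation(composite)
print(ranking)
```

Like the query, the sketch can also go wrong if one name happens to occur across the boundary between two others in the concatenated string, which is one more reason the slides call this solution unsatisfactory.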
Limits of
SPARQL expressivity
Ranking of values comparing all values
SELECT ?x ?v (COUNT(*) as ?ranking) WHERE {
?x :value ?v .
[] :value ?u .
FILTER( ?v <= ?u )
}
GROUP BY ?x ?v
ORDER BY ?ranking
* This solution was provided by Joshua Taylor
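A direct Python transcription of this counting approach shows both its cost and its behaviour on ties. The function name `rank_by_counting` is hypothetical.

```python
def rank_by_counting(values):
    """rank(x) = number of entries whose value is >= the value of x (x included).

    Mirrors COUNT(*) over all pairs with ?v <= ?u, so it performs one
    comparison per pair of entries.
    """
    return {
        name: sum(1 for other in values.values() if value <= other)
        for name, value in values.items()
    }

composite = {"Spain": 4.5, "Finland": 4.9, "Chile": 5.1}
print(rank_by_counting(composite))

tied = {"a": 5, "b": 5, "c": 3}
print(rank_by_counting(tied))
```

With distinct values the result is the usual ranking. With ties, every tied entry receives the lowest rank of its group (here both "a" and "b" get 2 and nothing gets 1), and the number of comparisons grows quadratically with the number of values.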

Scanning the Internet for External Cloud Exposures via SSL Certs
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

  • 1. Validating statistical index data represented in RDF using SPARQL queries
      WESO Research Group, University of Oviedo, Spain
      Jose Emilio Labra Gayo, Jose María Álvarez Rodríguez
  • 2. Motivation: The WebIndex Project
      Measure impact of the Web in different countries
      First publication: 2012, 2 web sites:
        Visualization: http://thewebindex.org
        Data portal: http://data.webfoundation.org
  • 3. WebIndex Workflow (diagram labels)
      Raw Data (Excel)
      Computed data (Excel)
      External sources
      All data (RDF Datastore)
      Visualizations
  • 4. Technical details
      Index made from:
        61 countries (100 planned for 2013)
        85 indicators: 51 Primary (questionnaires), 34 Secondary (external sources)
      WebIndex RDF data:
        Modeled on top of RDF Data Cube
        More than 1 million triples
        Linked data: DBPedia, Organizations, etc.
        Records data computation
  • 5. WebIndex computation process (1)
      Simplified with one indicator, 3 years and 4 countries
      Stages: Raw Data, Imputed Data, Filtered Data (Indicator A), Normalized Data (z-scores)

      Raw Data (some values missing)
        Country 2009 2010 2011
        Spain 4 5 3
        Finland 4 6
        Armenia 1
        Chile 6 8

      Imputed Data
        Country   2009   2010   2011
        Spain     4      5      3
        Finland   4      5      6
        Armenia   1      1      1
        Chile     6      8      10.6

      Normalized Data (z-scores)
        Country   2009    2010    2011
        Spain     -0.57   -0.57   -0.92
        Finland   -0.57   -0.57   -0.14
        Chile     1.15    1.15    1.06

      More details can be found here: http://thewebindex.org/about/methodology/computation/
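
The z-score step in the tables above can be reproduced with a short Python sketch. It assumes the filtered data is the imputed table with Armenia excluded (suggested by the three-country z-score table, not stated on the slide) and that each year's column is normalized with the sample standard deviation; the function name `z_scores_by_year` is hypothetical.

```python
from statistics import mean, stdev

# Imputed values for Indicator A with Armenia excluded (assumed filtering step).
filtered = {
    "Spain":   {2009: 4, 2010: 5, 2011: 3},
    "Finland": {2009: 4, 2010: 5, 2011: 6},
    "Chile":   {2009: 6, 2010: 8, 2011: 10.6},
}

def z_scores_by_year(table):
    """Normalize each year's column: (value - mean) / sample standard deviation."""
    years = sorted({year for row in table.values() for year in row})
    result = {country: {} for country in table}
    for year in years:
        column = [table[country][year] for country in table]
        mu, sigma = mean(column), stdev(column)  # stdev uses the n-1 denominator
        for country in table:
            result[country][year] = (table[country][year] - mu) / sigma
    return result

normalized = z_scores_by_year(filtered)
for country, row in normalized.items():
    print(country, {year: round(z, 3) for year, z in row.items()})
```

Under these assumptions the results agree with the normalized table to within rounding, which suggests the slide values are truncated rather than rounded.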
  • 6. WebIndex computation process (2)
      Simplified with one indicator, 3 years and 4 countries

      Normalized Data (z-scores)
        Country   2009    2010    2011
        Spain     -0.57   -0.57   -0.92
        Finland   -0.57   -0.57   -0.14
        Chile     1.15    1.15    1.06

      Adjusted data
        Country   A   B   C     D     ...
        Spain     8   7   9.1   7.1   ...
        Finland   7   8   7.1   8     ...
        Chile     8   9   7.6   6     ...

      Group indicators
        Country   Readiness   Impact   Web   Composite
        Spain     5.7         3.5      5.1   4.5
        Finland   5.5         3.9      7.1   4.9
        Chile     6.7         4.5      7.6   5.1

      Rankings
        Country   Readiness   Impact   Web   Composite
        Spain     2           3        3     3
        Finland   3           2        2     2
        Chile     1           1        1     1

      More details can be found here: http://thewebindex.org/about/methodology/computation/
  • 7. Example of data representation in RDF
      Each observation follows the RDF Data Cube vocabulary extended with metadata about how it was obtained.
      Example:

        obs:obsM23 a qb:Observation ;
          cex:computation [ a cex:Z-Score ;
                            cex:observation obs:obsA23 ;
                            cex:slice slice:sliceA09 ;
                          ] ;
          cex:value -0.57 ;
          cex:md5-checksum "2917835203..." ;
          cex:indicator indicator:A ;
          cex:concept country:ESP ;
          qb:dataSet dataset:A-Normalized ;
          # ... other declarations omitted for brevity

      It declares that the value of this observation was obtained as the z-score of obs:obsA23 over slice:sliceA09.
  • 8. Computex
      Vocabulary for statistical computations
      Extends RDF Data Cube
      Ontology at: http://purl.org/weso/computex
      Some terms: cex:Concept, cex:Indicator, cex:Computation, cex:WeightSchema, qb:Observation, ...
  • 9. Validation process
      Last year (2012):
        Template based validation
        MD5 checksum of each observation and some fields
      This year (2013):
        SPARQL based validation
        3 levels of validation: RDF Data Cube, Shapes of data, Computation process
      Ultimate goal: automatically compute the index
  • 10. Validation approach
      SPARQL CONSTRUCT queries instead of ASK
      IF no error THEN empty model ELSE RDF graph with error information

        CONSTRUCT {
          [ a cex:Error ;
            cex:errorParam [ cex:name "obs" ;    cex:value ?obs ] ,
                           [ cex:name "value1" ; cex:value ?value1 ] ,
                           [ cex:name "value2" ; cex:value ?value2 ] ;
            cex:msg "Observation has two different values"
          ] .
        } WHERE {
          ?obs a qb:Observation .
          ?obs cex:value ?value1 .
          ?obs cex:value ?value2 .
          FILTER ( ?value1 != ?value2 )
        }
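
The "empty result means valid, otherwise return error information" idea can be sketched outside SPARQL as well. The following Python sketch is an illustration only, not the Computex implementation: triples are plain tuples, the function name `check_single_value` and the observation identifiers are hypothetical, and each offending observation is reported once rather than once per ordered pair of values as the query would.

```python
from collections import defaultdict

def check_single_value(triples):
    """Return one error record per observation that carries two or more distinct values.

    An empty list plays the role of the empty model returned by the CONSTRUCT query.
    """
    values = defaultdict(set)
    for obs, prop, value in triples:
        if prop == "cex:value":
            values[obs].add(value)
    return [
        {
            "type": "cex:Error",
            "obs": obs,
            "values": sorted(vals),
            "msg": "Observation has two different values",
        }
        for obs, vals in sorted(values.items())
        if len(vals) > 1
    ]

# Hypothetical data: obs:o2 has two conflicting values.
triples = [
    ("obs:o1", "cex:value", 4),
    ("obs:o2", "cex:value", 5),
    ("obs:o2", "cex:value", 6),
]
errors = check_single_value(triples)
print(errors)
```

Returning structured error records instead of a boolean is what makes informative error messages possible, which is the advantage of CONSTRUCT over ASK that the slide points to.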
  • 11. SPARQL queries: RDF Data Cube
      We converted RDF Data Cube integrity constraints from ASK to CONSTRUCT queries.

      Original ASK query:

        ASK WHERE {
          ?dim a qb:DimensionProperty .
          FILTER NOT EXISTS { ?dim rdfs:range [] }
        }

      Converted CONSTRUCT query:

        CONSTRUCT {
          [ a cex:Error ;
            cex:errorParam [ cex:name "dim" ; cex:value ?dim ] ;
            cex:msg "Every Dimension must have a declared range"
          ] .
        } WHERE {
          ?dim a qb:DimensionProperty .
          FILTER NOT EXISTS { ?dim rdfs:range [] }
        }
  • 12. SPARQL expressivity
      SPARQL can express complex validation patterns. Example: Mean

        CONSTRUCT {
          [ a cex:Error ;
            cex:errorParam # ...omitted
            cex:msg "Mean value does not match"
          ] .
        } WHERE {
          ?obs a qb:Observation ;
               cex:computation ?comp ;
               cex:value ?val .
          ?comp a cex:Mean .
          { SELECT (AVG(?value) as ?mean) ?comp
            WHERE {
              ?comp cex:observation ?obs1 .
              ?obs1 cex:value ?value ;
            }
            GROUP BY ?comp
          }
          FILTER ( abs(?mean - ?val) > 0.0001 )
        }
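
The logic of the mean check can be restated as a small Python sketch: recompute the average of the source observations and report a mismatch beyond the tolerance. The data layout, the function name `check_means`, and the observation identifiers are hypothetical; only the tolerance and the error message come from the query.

```python
from statistics import mean

def check_means(values, mean_of, tolerance=0.0001):
    """values: obs id -> declared value; mean_of: obs id -> ids it claims to average."""
    errors = []
    for obs, sources in mean_of.items():
        computed = mean([values[s] for s in sources])
        if abs(computed - values[obs]) > tolerance:
            errors.append({
                "obs": obs,
                "declared": values[obs],
                "computed": computed,
                "msg": "Mean value does not match",
            })
    return errors

# Hypothetical observations: 5 is the mean of 4 and 6, 7 is not.
values = {"obs:a": 4, "obs:b": 6, "obs:ok": 5, "obs:bad": 7}
mean_of = {"obs:ok": ["obs:a", "obs:b"], "obs:bad": ["obs:a", "obs:b"]}
errors = check_means(values, mean_of)
print(errors)
```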
  • 13. Limitations of SPARQL expressivity
      Some built-in functions are not standardized
        Example: z-score employs standard deviation, which requires the built-in function sqrt
        Available in some SPARQL implementations
      Ranking of values was not obvious (*)
        2 solutions:
          Using GROUP_CONCAT
          Check a value against all the other values
        Neither solution is efficient
      (*) http://stackoverflow.com/questions/17313730/how-to-rank-values-in-sparql
  • 14. SPARQL expressivity

        CONSTRUCT {
          # ... omitted for brevity
        } WHERE {
          ?obs cex:computation [ a cex:AverageGrowth ; cex:observations ?ls ] ;
               cex:value ?val .
          ?ls rdf:first [ cex:value ?v1 ] .
          { SELECT ( SUM(?v_n / ?v_n1)/COUNT(*) as ?meanGrowth )
            WHERE {
              ?ls rdf:rest* [ rdf:first [ cex:value ?v_n ] ;
                              rdf:rest  [ rdf:first [ cex:value ?v_n1 ] ] ] .
            }
          }
          FILTER ( abs(?meanGrowth * ?v1 - ?val) > 0.001 )
        }

      * This solution was provided by Joshua Taylor
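
The arithmetic behind the average growth check can be sketched in Python: average the ratio of each list element to its successor, then multiply by the first element. The sketch assumes the RDF list is ordered most recent first (the query does not say so); the function name `average_growth_estimate` is hypothetical.

```python
def average_growth_estimate(values_newest_first):
    """Mean ratio between consecutive list elements, applied to the first element."""
    pairs = list(zip(values_newest_first, values_newest_first[1:]))
    if not pairs:
        raise ValueError("at least two values are needed to compute a growth ratio")
    mean_growth = sum(v_n / v_n1 for v_n, v_n1 in pairs) / len(pairs)
    return mean_growth * values_newest_first[0]

# Chile's raw values from the simplified example, newest first: 8 (2010), 6 (2009).
estimate = average_growth_estimate([8, 6])
print(round(estimate, 2))
```

With these inputs the estimate is about 10.67, which is consistent with the 10.6 shown for Chile in 2011 in the imputed table, although the slides do not state which imputation method produced that value.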
  • 15. RDF Profiles
      Descriptions (with constraints) of dataset families
      Can be used to validate (and compute) RDF graphs
      Examples:
        Cube (statistical data): normalization update queries + integrity constraints
        Computex (statistical computations): adds more queries
        Other user defined profiles
      Diagram labels: Cube, Computex, WebIndex, CloudIndex
  • 16. Cube Profile
      W3C Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/)
      1 base ontology + RDF Schema inference
      2 update queries (normalization algorithm)
      21 integrity constraint queries

        cex:cube a cex:ValidationProfile ;
          cex:name "RDF Data Cube" ;
          cex:ontologyBase cube:cube.ttl ;
          cex:expandSteps (
            [ cex:name "Closure" ; cex:uri cube:closure.ru ]
            [ cex:name "Flatten" ; cex:uri cube:flatten.ru ]
          ) ;
          cex:integrityQuery [ cex:name "IC-1. Unique DataSet" ; cex:uri cube:q1.sparql ] ;
          # ...
          .
  • 17. Computex Profile
      1 base ontology: http://purl.org/weso/computex
      Imports RDF Data Cube

        cex:computex a cex:ValidationProfile ;
          cex:name "Computex" ;
          cex:ontologyBase cex:computex.ttl ;
          cex:import profile:cube.ttl ;
          cex:expandSteps (
            [ cex:name "Copy Raw" ; cex:uri cex:copyRaw.sparql ]
            # Other UPDATE queries
          ) ;
          cex:integrityQuery [ cex:name "Adjusted" ; cex:uri cex:adjusted.sparql ] ,
          # Other integrity queries
  • 18. Computex Validation tool
      Available at: http://computex.herokuapp.com
      Features:
        Informative error messages
        Option to generate EARL report
        2 validation profiles (Cube and Computex)
      Source code in Scala available at: http://github.com/weso/computex
      Based on profiles
  • 20. Conclusions
      WebIndex = use case for RDF Validation
      SPARQL queries as an implementation approach
      Challenges:
        Expressivity limits of SPARQL
        Complexity of some queries
      Concept of RDF Profiles:
        Declarative, Turtle syntax
        Intermediate level between OWL and SPARQL
  • 23. Limits of SPARQL expressivity
      Ranking of values using GROUP_CONCAT

        SELECT ?x ?v ?ranking {
          ?x :value ?v .
          { SELECT (GROUP_CONCAT(?x;separator="") as ?ordered) {
              { SELECT ?x {
                  ?x :value ?v .
                } ORDER BY DESC(?v)
              }
            }
          }
          BIND (str(?x) as ?xName)
          BIND (strbefore(?ordered,?xName) as ?before)
          BIND ((strlen(?before) / strlen(?xName)) + 1 as ?ranking)
        } ORDER BY ?ranking
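
The string trick in the GROUP_CONCAT query can be mirrored in Python to see why it works: concatenate the names in descending value order, then derive each rank from the length of the text before the name. The sketch assumes all names have the same length, as the query implicitly does; the three-letter keys and the function name `rank_by_concatenation` are hypothetical, and the values are the Readiness scores from the group indicators table.

```python
def rank_by_concatenation(values):
    """values: name -> number; every name must have the same length."""
    if len({len(name) for name in values}) > 1:
        raise ValueError("all names must have the same length")
    # Equivalent of GROUP_CONCAT over the names ordered by DESC(value).
    ordered = "".join(sorted(values, key=lambda name: values[name], reverse=True))
    ranking = {}
    for name in values:
        before = ordered[: ordered.index(name)]  # mirrors strbefore(?ordered, ?xName)
        ranking[name] = len(before) // len(name) + 1
    return ranking

readiness = {"ESP": 5.7, "FIN": 5.5, "CHL": 6.7}
ranking = rank_by_concatenation(readiness)
print(ranking)
```

The approach is fragile: it breaks if names differ in length or if one name happens to occur across the boundary of two others in the concatenated string.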
  • 24. Limits of SPARQL expressivity
      Ranking of values comparing all values

        SELECT ?x ?v (COUNT(*) as ?ranking) WHERE {
          ?x :value ?v .
          [] :value ?u .
          FILTER( ?v <= ?u )
        }
        GROUP BY ?x ?v
        ORDER BY ?ranking

      * This solution was provided by Joshua Taylor
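
The comparison-based ranking amounts to counting, for each value, how many values are greater than or equal to it. A minimal Python sketch of that counting follows, using the Impact scores from the group indicators table; the function name `rank_by_comparison` is hypothetical.

```python
def rank_by_comparison(values):
    """Rank of each item = number of items whose value is >= its own (itself included)."""
    return {
        name: sum(1 for u in values.values() if v <= u)
        for name, v in values.items()
    }

impact = {"Spain": 3.5, "Finland": 3.9, "Chile": 4.5}
ranking = rank_by_comparison(impact)
print(ranking)
```

Every value is compared with every other value, so the work grows quadratically with the number of observations, which matches the remark that neither ranking solution is efficient. In this sketch tied values share the larger rank, for example two equal top values both get rank 2.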