The WebDataCommons Microdata, RDFa, and Microformat Dataset Series
Robert Meusel, Petar Petrovski, and Christian Bizer
HTML-embedded Structured Data on the Web 
More and more websites semantically mark up the content of their HTML pages.
RDFa 
Microdata 
Microformats 
The WebDataCommons Microdata, RDFa, and Microformats Dataset Series
Dataset Creation
▪ Common Crawl Foundation Corpora of 2010, 2012 and 2013
• Snapshot of popular pages of the Web
• Continuously new crawls available
▪ Parsing the HTML pages using Apache Any23
• Using a distributed framework on 100 parallel EC2 instances
1. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
2. _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
3. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
4. _:node1 <http://schema.org/Offer/price> "€ 219,95"@de .
5. _:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
6. …
The framework is easy to adapt and is publicly available at: 
http://webdatacommons.org/framework/ 
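The extracted statements are line-based, so they can be read one line at a time. The following is a minimal sketch, not the WebDataCommons framework itself: it assumes each line holds a subject, predicate, object and an optional fourth term (the graph, for example the page URL), terminated by a dot. The names `Quad` and `parse_statement` are hypothetical, and the pattern is deliberately permissive rather than a full N-Quads grammar.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    """One statement; graph is None when the line has only three terms."""
    subject: str
    predicate: str
    obj: str
    graph: Optional[str]


# One term: an IRI in angle brackets, a blank node, or a quoted literal
# with an optional language tag or datatype. Permissive on purpose:
# it does not check which kind of term is allowed in which position.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_LINE = re.compile(rf'^\s*{_TERM}\s+{_TERM}\s+{_TERM}(?:\s+{_TERM})?\s*\.\s*$')


def parse_statement(line: str) -> Optional[Quad]:
    """Return the terms of one statement line, or None if it does not match."""
    match = _LINE.match(line)
    if match is None:
        return None
    return Quad(match.group(1), match.group(2), match.group(3), match.group(4))


if __name__ == "__main__":
    example = ('_:node1 <http://schema.org/Product/name> '
               '"Predator Instinct FG Fußballschuh"@de .')
    print(parse_statement(example))
```

For real workloads a dedicated RDF parser is the safer choice; the sketch only illustrates the shape of the data.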
Dataset Series Overview
▪ The series contains three datasets, from 2010, 2012 and 2013
▪ Altogether over 30 billion RDF quads
▪ Each dataset is further split into subsets, each containing the quads extracted for one particular markup language
Overview of the 2013 dataset
▪ Over 1.7 million domains use at least one markup language
▪ Over 17 billion quads with over 4 billion records (typed entities)
▪ hCard is still the most dominant among domains
▪ Microdata contains the largest number of quads
Divergence in Class and Property Usage in 2013
▪ A small number of classes and properties is used by a large number of domains
▪ RDFa: 646k classes and 27k properties, but <1k classes and ~2k properties are used by at least two different domains
▪ Microdata (MD): 15k classes and 170k properties, but ~1.2k classes and <13k properties are used by at least two different domains
Classes and properties used by only one domain are mostly typos
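The "used by at least two different domains" filter can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the input is assumed to be already reduced to (domain, term) pairs, and the function name `shared_terms` and the sample domains are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def shared_terms(observations: Iterable[Tuple[str, str]],
                 min_domains: int = 2) -> Set[str]:
    """Return the terms (classes or properties) seen on at least min_domains distinct domains."""
    domains_by_term: Dict[str, Set[str]] = defaultdict(set)
    for domain, term in observations:
        domains_by_term[term].add(domain)
    return {term for term, domains in domains_by_term.items()
            if len(domains) >= min_domains}


if __name__ == "__main__":
    # Hypothetical sample: one class shared by two domains, one misspelled class on a single domain.
    sample = [
        ("shop-a.example", "http://schema.org/Product"),
        ("shop-b.example", "http://schema.org/Product"),
        ("shop-a.example", "http://schema.org/Prodcut"),
    ]
    print(shared_terms(sample))
```

Terms that fall below the threshold are candidates for the typo category mentioned on the slide, though a single-domain term is not necessarily a typo.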
RDFa Insights 2013
▪ Usage of various vocabularies to describe information:
• Strong presence of the Open Graph Protocol (e.g. Facebook)
• FOAF and SIOC (blog software such as Drupal)
▪ Largest topics covered are:
• Articles and Documents (Blogs and News portals)
• Products, Reviews and Ratings
• Organizations
Microdata Insights 2013 and 2012
▪ Clear increase in development compared to 2012
▪ Still two vocabularies deployed: data-vocabulary and schema.org
▪ Largest topical areas:
• Postal Addresses and Locations
• Products, Offers and Ratings
• Organizations and Persons
• Articles and Blogs
• Breadcrumbs
Focus on Schema.org/Product
▪ One of the largest publicly available product collections
▪ Almost 100 million records described with name, offer and image
▪ 34 million records contain a further description
▪ 11% of all product records include a brand
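A coverage figure such as "11% of all product records include a brand" is the share of records that carry a given property. A minimal sketch, assuming the records have already been grouped into a mapping from a record identifier to the set of properties it uses; `property_coverage` and the sample records are hypothetical and do not reproduce the numbers on the slide.

```python
from typing import Dict, Set


def property_coverage(records: Dict[str, Set[str]], prop: str) -> float:
    """Return the fraction of records that use the given property (0.0 for no records)."""
    if not records:
        return 0.0
    hits = sum(1 for props in records.values() if prop in props)
    return hits / len(records)


if __name__ == "__main__":
    # Hypothetical product records keyed by blank-node label.
    products = {
        "_:p1": {"name", "offers", "image"},
        "_:p2": {"name", "offers", "image", "brand"},
        "_:p3": {"name", "description"},
        "_:p4": {"name"},
    }
    print(property_coverage(products, "brand"))
```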
Microformats Insights 2013
▪ Most dominant vocabulary is hCard
▪ Still a very solid deployment
▪ Topics are:
• Persons & Organizations
• Events
• Products and reviews
• Recipes
Opportunities & Challenges
Opportunities
▪ Vast amounts of free data, created by people all over the world
▪ Large topical coverage, from broad areas (such as products) to niches (such as recipes)
▪ Highly up-to-date information, as popular pages potentially update their content frequently
Challenges
▪ Data quality assessment, as the data is created by both experts and novices
▪ Further information extraction, as a flat schema and a rather small number of properties are used
▪ Identity resolution, as the data hardly contains any identifiers
Possible Application Domains
▪ Enriching existing knowledge bases
• E.g. mapping DBpedia classes and properties to the corresponding classes and properties within the available vocabularies to add missing information and extend entity knowledge
• As shown by Lehmberg et al., winners of the Semantic Web Challenge (Big Data Track) 2014, this data can be used as an additional source (among others) to gather and return wider search results
▪ Design and adaptation of algorithms and methods to cope with the characteristics of such web data
• Training of data extraction methods to gather data that is not marked up within the HTML pages
• Further extraction of additional information from the raw data, e.g. extraction of skills, requirements etc. from job posting descriptions
▪ Starting point for further data discovery
• The dataset can be used as a starting point for further data crawling, as in most cases not all pages from a domain are included
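Using the dataset as a starting point for crawling could begin by grouping the page URLs that contain structured data by host. The sketch below assumes the page URLs have already been collected (for example from the graph term of each quad); the function name `seeds_by_host` and the sample URLs are hypothetical, and grouping is by full host name rather than by registered domain.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set
from urllib.parse import urlsplit


def seeds_by_host(page_urls: Iterable[str]) -> Dict[str, Set[str]]:
    """Group page URLs by host name, skipping strings without a host."""
    seeds: Dict[str, Set[str]] = defaultdict(set)
    for url in page_urls:
        host = urlsplit(url).hostname
        if host:
            seeds[host].add(url)
    return dict(seeds)


if __name__ == "__main__":
    # Hypothetical page URLs taken from quads.
    urls = [
        "http://shop.example/p/1",
        "http://shop.example/p/2",
        "http://blog.example/post",
    ]
    for host, pages in seeds_by_host(urls).items():
        print(host, len(pages))
```

A crawler could then expand each host from its known pages to reach the pages that the original crawl did not include.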
Thank you! Questions? Feedback? 
Data and more statistics can be found at: 
http://webdatacommons.org/structureddata/index.html 
More interesting datasets and analyses can be found on the WebDataCommons website:
http://webdatacommons.org/index.html 
Acknowledgement 
The extraction and analysis of the datasets were supported by an AWS in Education Grant and the EU FP7 project LOD2. Special thanks to SWSA for supporting the travel to ISWC 2014.

The Web Data Commons Microdata, RDFa, and Microformat Dataset Series @ ISWC2014
