Big data, selection bias, and ways
to correct for it
Piet Daas, Bart Buelens
Thanks to: Jan van den Brakel, Marco Puts, Martijn Tennekes,
Chang Sun, Jade Cock and Agata Troost
Use of Big Data
– Statistics Netherlands has been studying the potential
application and use of Big Data for a number of years
– How have we used Big Data so far?
– Three types of Big Data use
‐ 1) Combined with survey (or admin) data
‐ 2) Single source, but complete (census like)
‐ 3) Single source, but incomplete (part of population)
– Important considerations
– Quality of the data (and metadata)
– Coverage and ‘selectivity’ of the population
1. Type of Big Data use
– 1) Survey based, Big Data as additional source
‐ Consumer confidence + sentiment in social media
‐ CPI traditional + scanner data + web collected prices
‐ Survey methodology is the basis
‐ Methodological considerations:
‐ For some Big Data sources information needs to be
extracted first, e.g.
• Determining sentiment of social media messages
• Using pictures to identify products on the web
1. Consumer confidence + social media
(~10%) (~80%)
- Combined sentiment of public Dutch Facebook and Twitter messages per month
correlates ~0.9 with (monthly) Consumer Confidence survey data
- Raw monthly aggregates of both series cointegrate
- Social media sentiment improves precision of survey based Consumer Confidence
estimate (Van den Brakel et al. (2017) Survey Methodology, forthcoming)
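The correlation mentioned above can be illustrated with a minimal sketch. The monthly values below are made up for illustration and are not Statistics Netherlands data; the function name `pearson` is likewise an assumption. The sketch only computes a Pearson correlation between two monthly series; it does not test for cointegration and does not reproduce the model of Van den Brakel et al.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values (12 months), for illustration only.
survey_confidence = [-30, -28, -25, -27, -20, -15, -12, -14, -8, -5, -6, -2]
social_sentiment = [-12, -10, -11, -9, -7, -6, -3, -5, -2, 0, -1, 1]

r = pearson(survey_confidence, social_sentiment)
print(f"correlation between the two monthly series: {r:.2f}")
```

A high correlation between two trending series can be spurious, which is presumably why the slide also reports that the raw monthly aggregates cointegrate.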
2. Type of Big Data use
– 2) Big Data as the main/single source, Census approach
‐ Road sensor based traffic intensity statistics
‐ CPI fully based on web collected prices
‐ Land use statistics based on satellite images
‐ AIS data of ships for maritime statistics
‐ These Big Data sources have in common that:
• Target population is completely included (i.e. census)
(e.g. roads, products, country, vessels)
• Variable in source is identical/very similar/can be converted to
the one needed!
2. Dutch highways
2. Dutch highways + road sensors
2. Road sensor based intensity estimates
Figure: number of vehicles (y-axis) over time in years (x-axis)
- Five quality indicators are used to select which sensors' (daily) data are used
- Missing data is the biggest problem (~40% of expected data is absent)
- Vehicle estimates are calculated per road segment with sensor weights
- Low sensor coverage of highways in the first half of 2010 results in poor estimates
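A per-segment estimate with sensor weights could look like the sketch below. The slides do not say how the weights are defined or how missing data are handled, so everything here is an assumption: the function name `segment_estimate`, the weight values, and the choice to renormalise the weights over the sensors that delivered data on a given day.

```python
def segment_estimate(counts, weights):
    """Weighted mean of the sensor counts available on one road segment.

    counts  -- dict sensor_id -> daily vehicle count, or None when missing
    weights -- dict sensor_id -> non-negative weight (hypothetical values)
    Returns None when no sensor on the segment delivered usable data.
    """
    # Keep only sensors that reported a count and carry a positive weight.
    available = [
        s for s, c in counts.items()
        if c is not None and weights.get(s, 0) > 0
    ]
    total_weight = sum(weights[s] for s in available)
    if total_weight == 0:
        return None
    # Renormalise over the available sensors so missing ones do not bias
    # the estimate towards zero.
    return sum(weights[s] * counts[s] for s in available) / total_weight


# Hypothetical segment with three sensors, one of which has no data today.
counts = {"s1": 41000, "s2": None, "s3": 43000}
weights = {"s1": 1.0, "s2": 2.0, "s3": 3.0}
estimate = segment_estimate(counts, weights)
print(estimate)
```

When every sensor on a segment is missing, the function returns None rather than a number, which mirrors the point that low sensor coverage leads to poor (or no) estimates.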
3. Type of Big Data use
– 3) Big Data as the main source, but population not complete
‐ Social tension indicator using social media
‐ ‘Day time population’ using mobile phone data
‐ Tourism statistics using mobile phone data
‐ Energy statistics using smart meters
‐ …
‐ Part of the target population is included
‐ Need to find ways to deal with/correct for missing part
3. Type of Big Data use
– 3) Try to ‘deal’ with missing part of ‘population’
‐ Social tension monitor using social media
• Detect relevant messages with keywords
• The relative number of messages per day is used
‐ ‘Day time population’ using mobile phone data (1 provider)
• Assume 1/3 of the population uses this provider
• Use age distribution of provider population for correction
• Future: Verify findings with data of another provider
‐ Tourism statistics using mobile phone data (1 provider)
• Not done yet: Change of foreign phones accessing the provider's
network
‐ It’s essential to find ways to obtain characteristics of the
population included in the Big Data source!
• This is challenging because directly available
background characteristics are sometimes absent
• Look for features (=measurable properties)
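The mobile phone correction described above (a market share of about 1/3, adjusted with the age distribution of the provider's customers) can be sketched as follows. This is only an illustration under stated assumptions: all numbers are hypothetical, the function name `corrected_count` is invented, and it is assumed that observed devices can be broken down by the age group of the subscriber. The slides do not specify the actual correction method.

```python
def corrected_count(observed, provider_customers, population):
    """Scale observed devices per age group by the inverse provider coverage.

    observed           -- dict age_group -> devices seen in the area
    provider_customers -- dict age_group -> customers of the provider
    population         -- dict age_group -> persons in the target population
    """
    total = 0.0
    for group, devices in observed.items():
        # Share of this age group that is a customer of the provider.
        coverage = provider_customers[group] / population[group]
        if coverage <= 0:
            raise ValueError(f"no provider coverage in age group {group!r}")
        total += devices / coverage
    return total


# Hypothetical figures, for illustration only.
observed = {"15-29": 300, "30-64": 500, "65+": 100}
provider_customers = {"15-29": 50000, "30-64": 50000, "65+": 20000}
population = {"15-29": 100000, "30-64": 200000, "65+": 100000}

# Naive estimate: assume a flat market share of 1/3 for everyone.
naive = sum(observed.values()) * 3
# Age-specific estimate: correct each group with its own coverage.
corrected = corrected_count(observed, provider_customers, population)
print(naive, corrected)
```

The two estimates differ whenever the provider's customers are not spread evenly over the age groups, which is the selectivity the flat 1/3 assumption ignores.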
3. Selectivity of mobile phone data
Figure: number of people in the city of Assen
Motor race (TT): 90,000 visitors
Truckstar festival: 55,000 visitors
Overestimating the number of visitors based on mobile phone data
of a single provider
Big Data based statistics
– It’s possible, but depends on type of use
– 1) Survey based -> Need to ‘link’ the Big Data source
– 2) Big Data, census-like -> Coverage (units) and
comparability (variable)
– 3) Big Data, incomplete -> Selectivity, coverage and stability
of the population in the source
Type 3 in particular requires more methodological research
- Find ways to determine coverage and correct for selectivity
by extracting and studying ‘features’
- Find other data sources to increase coverage of target
population
Thank you for your attention! @pietdaas

Isi 2017 presentation on Big Data and bias
