SlideShare une entreprise Scribd logo
1  sur  15
Jim Harris
               Blogger‐in‐Chief
             www.ocdqblog.com
Jim Harris
             Digitally signed by Jim Harris
             DN: cn=Jim Harris, o=Obsessive-Compulsive Data Quality (OCDQ), ou, email=jim.harris@ocdqblog.
             com, c=US
             Date: 2010.03.04 10:55:20 -06'00'
Jim Harris
        Blogger‐in‐Chief
       www.ocdqblog.com



     E‐mail
     jim.harris@ocdqblog.com

     Twitter
     twitter.com/ocdqblog

     LinkedIn
     linkedin.com/in/jimharris




Adventures in Data Profiling     Copyright © 2010, Jim Harris. All rights reserved.   2
Let the Adventures Begin . . .
 This will be a vendor‐neutral presentation:

     Focusing on general methodology of data profiling 
     and common functionality of data profiling tools 

     Discussing how a data profiling tool helps automate 
     some of the work needed for preliminary data analysis

     Reviewing fictional data and results produced by a 
     fictional data profiling tool to illustrate basic concepts 


Adventures in Data Profiling      Copyright © 2010, Jim Harris. All rights reserved.   3
Understanding Your Data
     Understanding your data is essential to using it 
     effectively and improving its quality

     You need a reality check for the perceptions and 
     assumptions you have about the quality of your data 

     You need to prepare meaningful questions to ask your 
     business analysts and subject matter experts

     There is simply no substitute for data analysis 

Adventures in Data Profiling     Copyright © 2010, Jim Harris. All rights reserved.   4
Profiling Your Data
 Data profiling includes many types of analysis such as:

     Verify data matches the metadata that describes it

     Identify representations for the absence of data

     Identify potential default and invalid values

     Check data formats for inconsistencies

     Assess domain, structural, and relational integrity

Adventures in Data Profiling     Copyright © 2010, Jim Harris. All rights reserved.   5
Getting Your Data Freq On
     Data profiling tools can help you by automating some 
     of the grunt work needed to begin your data analysis

     One of their basic features is the ability to generate 
     statistical summaries and frequency distributions for 
     the unique values and formats found within your fields

  Therefore, I like to refer to using a data profiling tool as:
              “Getting Your Data Freq On”

Adventures in Data Profiling     Copyright © 2010, Jim Harris. All rights reserved.   6
Let Me Count The Ways


NULL – record count of NULL values          Cardinality – count of the number of 
                                            distinct actual values
Missing – record count of Missing values 
(i.e., non‐NULL absence of data)            Uniqueness – percentage calculated as 
                                            Cardinality divided by total record count
Actual – record count of Actual values  
(i.e., non‐NULL and non‐Missing)            Distinctness – percentage calculated as 
                                            Cardinality divided by Actual
Completeness – percentage calculated as 
Actual divided by total record count
  Adventures in Data Profiling              Copyright © 2010, Jim Harris. All rights reserved.   7
You Uniquely Complete Me
                                   Completeness and 
                                   Uniqueness are useful in 
                                   evaluating potential key 
                                   fields and especially a 
                                   single primary key, 
                                   which should be both: 
                                         100% Complete
                                         100% Unique




Adventures in Data Profiling   Copyright © 2010, Jim Harris. All rights reserved.   8
It’s a Distinct Possibility



                                 Distinctness can be useful 
                                 in evaluating the potential
                                 for duplicate records
                                 < 100% Distinct means some 
                                 distinct actual values occur on 
                                 more than one record
Adventures in Data Profiling   Copyright © 2010, Jim Harris. All rights reserved.   9
Gimme the lo down, Drill‐down




Adventures in Data Profiling   Copyright © 2010, Jim Harris. All rights reserved.   10
Freq’ing Distribution of Values



                                        Frequency distribution of 
                                        values is useful for fields 
                                        with a low cardinality
                                        Extremely low cardinality 
                                        might be an indication of 
                                        default values

Adventures in Data Profiling   Copyright © 2010, Jim Harris. All rights reserved.   11
Reviewing the Top N List




Reviewing the Top N most 
frequently occurring values
 Adventures in Data Profiling   Copyright © 2010, Jim Harris. All rights reserved.   12
Freq’ing Distribution of Formats




Frequency distribution of formats is useful for fields having 
both a high cardinality and free‐form values
Adventures in Data Profiling   Copyright © 2010, Jim Harris. All rights reserved.   13
Unlocking the Combination




Combination of values 
and formats can help with 
unlocking the mystery of 
more complex fields

  Adventures in Data Profiling   Copyright © 2010, Jim Harris. All rights reserved.   14
. . . the Adventures Conclude
What can just your analysis of data tell you about it?

    Understand your data better by first looking at it from a 
    starting point of blissful ignorance and curiosity

    A tool can help automate some of the grunt work, but 
    the actual data analysis can not be automated

    Your analytical goal is not to try to find answers, but to 
    discover the right questions

Adventures in Data Profiling      Copyright © 2010, Jim Harris. All rights reserved.   15

Contenu connexe

Tendances

Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratchdmurph4
 
Data quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeData quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeStefan Kühn
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architectureanicewick
 
Sound Data Quality for CRM
Sound Data Quality for CRMSound Data Quality for CRM
Sound Data Quality for CRMDivya Malik
 
Data profiling-best-practices
Data profiling-best-practicesData profiling-best-practices
Data profiling-best-practicesBlaise Cheuteu
 
Data Quality Technical Architecture
Data Quality Technical ArchitectureData Quality Technical Architecture
Data Quality Technical ArchitectureHarshendu Desai
 
Lecture 22
Lecture 22Lecture 22
Lecture 22Shani729
 
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...University of Twente
 
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data QualityBig Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data QualityBigDataExpo
 
Lecture 21
Lecture 21Lecture 21
Lecture 21Shani729
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Denny Lee
 
Lecture 23
Lecture 23Lecture 23
Lecture 23Shani729
 
Foundation of data quality
Foundation of data qualityFoundation of data quality
Foundation of data qualityKhaled Mosharraf
 
Enterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for HealthcareEnterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for HealthcareDATA360US
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management BasicKhaled Mosharraf
 

Tendances (20)

Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratch
 
Data quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeData quality - The True Big Data Challenge
Data quality - The True Big Data Challenge
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architecture
 
Sound Data Quality for CRM
Sound Data Quality for CRMSound Data Quality for CRM
Sound Data Quality for CRM
 
Data profiling-best-practices
Data profiling-best-practicesData profiling-best-practices
Data profiling-best-practices
 
Data Quality Technical Architecture
Data Quality Technical ArchitectureData Quality Technical Architecture
Data Quality Technical Architecture
 
Lecture 22
Lecture 22Lecture 22
Lecture 22
 
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data QualityBig Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
 
Lecture 21
Lecture 21Lecture 21
Lecture 21
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
 
Lecture 23
Lecture 23Lecture 23
Lecture 23
 
Tamr overview
Tamr overviewTamr overview
Tamr overview
 
Data analytics
Data analyticsData analytics
Data analytics
 
Foundation of data quality
Foundation of data qualityFoundation of data quality
Foundation of data quality
 
Enterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for HealthcareEnterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for Healthcare
 
Data analytics
Data analyticsData analytics
Data analytics
 
Machine Learning and Multi Drug Resistant(MDR) Infections case study
Machine Learning and Multi Drug Resistant(MDR) Infections case studyMachine Learning and Multi Drug Resistant(MDR) Infections case study
Machine Learning and Multi Drug Resistant(MDR) Infections case study
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management Basic
 

En vedette

Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profilingShailja Khurana
 
2007 Tidc India Profiling
2007 Tidc India Profiling2007 Tidc India Profiling
2007 Tidc India Profilingdanrinkes
 
Marketing Data Utilization
Marketing Data UtilizationMarketing Data Utilization
Marketing Data UtilizationCRMIT
 
Uncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data ProfilingUncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data ProfilingJosiah Renaudin
 
NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010Jim Stafford
 
Bitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor QualityBitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor QualityRSky215
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
Exposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of PerformanceExposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of PerformanceJuran Global
 
Cost of poor quality
Cost of  poor qualityCost of  poor quality
Cost of poor qualityAbdullah Sasy
 
Cost of poor quality presentation5
Cost of poor quality presentation5Cost of poor quality presentation5
Cost of poor quality presentation5Imran Jamil
 
Cost of-poor-quality - juran institute
Cost of-poor-quality - juran instituteCost of-poor-quality - juran institute
Cost of-poor-quality - juran instituteManish Chaurasia
 
Fifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with DataFifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with DataAmit Kapoor
 
Quality is a cost
Quality is a costQuality is a cost
Quality is a costbatch18
 
Crafting Visual Stories with Data
Crafting Visual Stories with DataCrafting Visual Stories with Data
Crafting Visual Stories with DataAmit Kapoor
 

En vedette (20)

Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
2007 Tidc India Profiling
2007 Tidc India Profiling2007 Tidc India Profiling
2007 Tidc India Profiling
 
Marketing Data Utilization
Marketing Data UtilizationMarketing Data Utilization
Marketing Data Utilization
 
Uncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data ProfilingUncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data Profiling
 
NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010
 
Bitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor QualityBitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor Quality
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Telling Visual Stories About Data
Telling Visual Stories About DataTelling Visual Stories About Data
Telling Visual Stories About Data
 
Exposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of PerformanceExposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of Performance
 
Cost of poor quality
Cost of  poor qualityCost of  poor quality
Cost of poor quality
 
Cost of poor quality presentation5
Cost of poor quality presentation5Cost of poor quality presentation5
Cost of poor quality presentation5
 
Cost of-poor-quality - juran institute
Cost of-poor-quality - juran instituteCost of-poor-quality - juran institute
Cost of-poor-quality - juran institute
 
Fifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with DataFifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with Data
 
Quality is a cost
Quality is a costQuality is a cost
Quality is a cost
 
QUALITY & COST
QUALITY & COSTQUALITY & COST
QUALITY & COST
 
Crafting Visual Stories with Data
Crafting Visual Stories with DataCrafting Visual Stories with Data
Crafting Visual Stories with Data
 
Big Data Profiling
Big Data Profiling Big Data Profiling
Big Data Profiling
 
Cost of Poor quality
Cost of  Poor qualityCost of  Poor quality
Cost of Poor quality
 
Cost of quality
Cost of qualityCost of quality
Cost of quality
 
Cost of quality
Cost of qualityCost of quality
Cost of quality
 

Similaire à Getting Your Data Freq On with Data Profiling

Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons LearnedPrivacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons LearnedKrishnaram Kenthapadi
 
Analytics At Work T. Davenport
Analytics At Work T. DavenportAnalytics At Work T. Davenport
Analytics At Work T. DavenportSaurabh Shah
 
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...SAS Institute India Pvt. Ltd
 
Information Management and Analytics
Information Management and Analytics Information Management and Analytics
Information Management and Analytics AKAGroup
 
Big Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The OpportunityBig Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The OpportunityJaime Fitzgerald
 
Week2day2 communicating data for impact
Week2day2 communicating data for impactWeek2day2 communicating data for impact
Week2day2 communicating data for impactNishant Kumar
 
Why does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docxWhy does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docxfranknwest27899
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographicIntellspot
 
How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...B2B Marketing
 
Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018Amit Sharma
 
2017 06-14-getting started with data science
2017 06-14-getting started with data science2017 06-14-getting started with data science
2017 06-14-getting started with data scienceThinkful
 
For the Love of Big Data
For the Love of Big DataFor the Love of Big Data
For the Love of Big DataRobert Sutor
 
4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI Models4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI ModelsInnodata, Inc
 
HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016Andrey Karpov
 
Healthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & AnalyticsHealthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & AnalyticsDale Sanders
 
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)Gianluca Tarasconi
 

Similaire à Getting Your Data Freq On with Data Profiling (20)

Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons LearnedPrivacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
 
IDOL presentation
IDOL presentationIDOL presentation
IDOL presentation
 
Analytics At Work T. Davenport
Analytics At Work T. DavenportAnalytics At Work T. Davenport
Analytics At Work T. Davenport
 
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
 
Information Management and Analytics
Information Management and Analytics Information Management and Analytics
Information Management and Analytics
 
Big Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The OpportunityBig Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The Opportunity
 
Week2day2 communicating data for impact
Week2day2 communicating data for impactWeek2day2 communicating data for impact
Week2day2 communicating data for impact
 
Why does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docxWhy does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docx
 
Big Data v2.pptx
Big Data v2.pptxBig Data v2.pptx
Big Data v2.pptx
 
Complete Guide to Data Quality
Complete Guide to Data QualityComplete Guide to Data Quality
Complete Guide to Data Quality
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographic
 
How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...
 
Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018
 
2017 06-14-getting started with data science
2017 06-14-getting started with data science2017 06-14-getting started with data science
2017 06-14-getting started with data science
 
For the Love of Big Data
For the Love of Big DataFor the Love of Big Data
For the Love of Big Data
 
4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI Models4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI Models
 
HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016
 
Healthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & AnalyticsHealthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & Analytics
 
Dark data
Dark dataDark data
Dark data
 
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
 

Dernier

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Dernier (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Getting Your Data Freq On with Data Profiling

  • 1. Jim Harris Blogger‐in‐Chief www.ocdqblog.com Jim Harris Digitally signed by Jim Harris DN: cn=Jim Harris, o=Obsessive-Compulsive Data Quality (OCDQ), ou, email=jim.harris@ocdqblog. com, c=US Date: 2010.03.04 10:55:20 -06'00'
  • 2. Jim Harris Blogger‐in‐Chief www.ocdqblog.com E‐mail jim.harris@ocdqblog.com Twitter twitter.com/ocdqblog LinkedIn linkedin.com/in/jimharris Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 2
  • 3. Let the Adventures Begin . . . This will be a vendor‐neutral presentation: Focusing on general methodology of data profiling  and common functionality of data profiling tools  Discussing how a data profiling tool helps automate  some of the work needed for preliminary data analysis Reviewing fictional data and results produced by a  fictional data profiling tool to illustrate basic concepts  Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 3
  • 4. Understanding Your Data Understanding your data is essential to using it  effectively and improving its quality You need a reality check for the perceptions and  assumptions you have about the quality of your data  You need to prepare meaningful questions to ask your  business analysts and subject matter experts There is simply no substitute for data analysis  Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 4
  • 5. Profiling Your Data Data profiling includes many types of analysis such as: Verify data matches the metadata that describes it Identify representations for the absence of data Identify potential default and invalid values Check data formats for inconsistencies Assess domain, structural, and relational integrity Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 5
  • 6. Getting Your Data Freq On Data profiling tools can help you by automating some  of the grunt work needed to begin your data analysis One of their basic features is the ability to generate  statistical summaries and frequency distributions for  the unique values and formats found within your fields Therefore, I like to refer to using a data profiling tool as: “Getting Your Data Freq On” Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 6
  • 7. Let Me Count The Ways NULL – record count of NULL values Cardinality – count of the number of  distinct actual values Missing – record count of Missing values  (i.e., non‐NULL absence of data) Uniqueness – percentage calculated as  Cardinality divided by total record count Actual – record count of Actual values   (i.e., non‐NULL and non‐Missing) Distinctness – percentage calculated as  Cardinality divided by Actual Completeness – percentage calculated as  Actual divided by total record count Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 7
  • 8. You Uniquely Complete Me Completeness and  Uniqueness are useful in  evaluating potential key  fields and especially a  single primary key,  which should be both:  100% Complete 100% Unique Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 8
  • 9. It’s a Distinct Possibility Distinctness can be useful  in evaluating the potential for duplicate records < 100% Distinct means some  distinct actual values occur on  more than one record Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 9
  • 10. Gimme the lo down, Drill‐down Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 10
  • 11. Freq’ing Distribution of Values Frequency distribution of  values is useful for fields  with a low cardinality Extremely low cardinality  might be an indication of  default values Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 11
  • 12. Reviewing the Top N List Reviewing the Top N most  frequently occurring values Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 12
  • 15. . . . the Adventures Conclude What can just your analysis of data tell you about it? Understand your data better by first looking at it from a  starting point of blissful ignorance and curiosity A tool can help automate some of the grunt work, but  the actual data analysis can not be automated Your analytical goal is not to try to find answers, but to  discover the right questions Adventures in Data Profiling Copyright © 2010, Jim Harris. All rights reserved. 15