Data Management & Warehousing


     Data Quality:
Common Problems & Checks
              Date: 24 April 2009
           Location: Zagreb, Croatia

                  David M. Walker
               davidw@datamgmt.com
   +44 (0) 7050 028 911 - http://www.datamgmt.com
Agenda
• Introduction
• Common Problems
• Automated Checking
• Profiling Checks
• Conclusions



24 April 2009        © 2009 Data Management & Warehousing   Page 2
Introduction
• Data Quality problems are a SOURCE SYSTEM
  issue and an ETL issue
        – They just manifest themselves in the data warehouse
• Prevention is better than cure
        – Fixing the source system or the ETL is ALWAYS
          cheaper and more effective than cleaning the data in
          the ETL or in the Data Warehouse itself
• Data Quality is a continuous process
        – It is never finished and always needs to be monitored

The Impact of Poor Data Quality

• Devalues the data warehouse
        – Discourages people from trusting or using the
          system, which curtails the life of the
          data warehouse
• Highlights failings in the source system
  and/or the business process
        – Businesses would rather fix the data at any cost in the
          data warehouse and pretend that there isn’t a
          source system problem
Common Problems
• 11 types of problem account for most
  data quality issues
• They usually reflect poor design and/or
  implementation of systems
• Most can be fixed, or monitored and
  managed to limit the impact



Referential Issues
• Keys that are not unique
        – Sources that do not enforce a unique
          primary key, including flat files and spreadsheets
        – Also generated by ETL that creates a
          surrogate key incorrectly
• Referential Integrity Failures
        – Where referential integrity is not enforced,
          values are created in the child table that are
          not in the parent table
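Both referential problems can be counted with a few lines of code. The following is a minimal sketch, not a definitive implementation: the `customers` and `orders` rows, their column names and the two helper functions are hypothetical and stand in for whatever parent and child tables are being checked.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def orphan_keys(child_rows, fk, parent_rows, pk):
    """Return foreign-key values in the child that have no parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_keys)


# Hypothetical sample data: id 2 is duplicated, customer 3 does not exist.
customers = [{"id": 1}, {"id": 2}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

print(duplicate_keys(customers, "id"))
print(orphan_keys(orders, "customer_id", customers, "id"))
```

Both functions return an empty list when the data is clean, which fits the "result should be zero" style of automated check.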
Data Type Issues
• Format Errors
        – Typically in Date/Time type fields
        – 02/04/2009 is 2nd April (UK) or 4th Feb (US)
• Inappropriate Data Types
        – Storing Dates in Character Strings,
          e.g. 20090624 as a YYYYMMDD format string
        – But what about 20090230? A string column will
          accept a 30th February that a date column would reject
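The date format ambiguity is easy to reproduce. In this small sketch the same string parses without error under both conventions, so the parser cannot tell which one the source system meant; the variable names are illustrative only.

```python
from datetime import datetime

raw = "02/04/2009"

# Day-first (UK) and month-first (US) readings of the same string.
as_uk = datetime.strptime(raw, "%d/%m/%Y").date()
as_us = datetime.strptime(raw, "%m/%d/%Y").date()

print(as_uk.isoformat(), as_us.isoformat())
```

Neither call fails, which is why the format has to be agreed with the source system rather than inferred from the data.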


Data Model Issues
• De-normalised tables
        – Commonly created for performance reasons
        – Inherently duplicate data
        – Often get out of sync
• Data/Column Retirement
        – Upgrade to system retires a column
        – ETL continues to use the old column
• Poor Table/Column Naming
        – Don’t assume that a column does what it says
        – Don’t assume that a column is still being used for its original
          purpose

Data Content Issues
• Null Values
        – Systems that have many optional fields will
          often have missing values
        – Null values allow rows to be silently omitted
          from queries
• Inappropriate Values
        – Databases allow special characters and/or
          leading/trailing white space
        – “Data<space>Quality” != “Data<tab>Quality”
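A short sketch of how nulls drop rows silently, using Python's built-in `sqlite3` module. The `account` table and its values are hypothetical. The "closed" and "not closed" counts do not add up to the total because a comparison against NULL is neither true nor false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, "open"), (2, "closed"), (3, None)],  # account 3 has a NULL status
)


def count(sql):
    """Run a single-value COUNT query and return the number."""
    return conn.execute(sql).fetchone()[0]


total = count("SELECT COUNT(*) FROM account")
closed = count("SELECT COUNT(*) FROM account WHERE status = 'closed'")
not_closed = count("SELECT COUNT(*) FROM account WHERE status <> 'closed'")

# The NULL row appears in neither filtered count.
print(total, closed, not_closed)

# A space and a tab look alike on screen but never compare equal.
print("Data Quality" == "Data\tQuality")
```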
Data Feed Issues
• Missing Data
        – Where a stream of files is loaded by ETL, a
          dropped file can go unnoticed
        – Common with CDR type loads in Telcos
• Late Data
        – A short term data quality issue
        – Leaves users believing there is a problem
        – Produces inconsistent reporting over time
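A dropped file is easy to detect if each file can be tied to a period. The sketch below assumes one file per calendar day, which is an assumption and not something the slides state; `missing_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def missing_days(loaded, start, end):
    """Return the dates in [start, end] for which no file was loaded."""
    loaded = set(loaded)
    span = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(span))
    return [day for day in expected if day not in loaded]


# Hypothetical load log: the files for 22 and 24 April never arrived.
loaded = [date(2009, 4, 20), date(2009, 4, 21), date(2009, 4, 23)]
gaps = missing_days(loaded, date(2009, 4, 20), date(2009, 4, 24))
print(gaps)
```

Run against the most recent day only, the same check separates late data from missing data: a late file closes its gap on the next run, a missing one does not.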

Automated Checking
• Regularly run checks
• Broad Coverage across systems
        – 100s and 1000s not 10s of queries
        – Run in a low priority loop in the background
• Used against:
        – Sources
        – Data Staging
        – Data Warehouse
• No Product Required
        – We often implement this as a controlling shell script and lots of
          small scripts, one for each check
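The controller-plus-small-checks structure can be sketched as follows. This example uses a Python function in place of the controlling shell script, and the three named checks are hypothetical placeholders that would normally run a query against a source, staging area or warehouse.

```python
def run_checks(checks):
    """Run every check and return {name: result}.

    A check that raises is recorded as an error so that one broken
    check does not stop the rest of the loop.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            results[name] = f"ERROR: {exc}"
    return results


# Hypothetical checks; each one is small and returns a single value.
checks = {
    "staging_row_count": lambda: 1200,
    "accounts_without_status": lambda: 0,
    "broken_check": lambda: 1 / 0,
}

results = run_checks(checks)
for name, value in results.items():
    print(name, value)
```

Keeping each check tiny and independent is what makes it practical to grow the set into the hundreds or thousands.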

Trending
• Absolute Trending
        – Track an expected value over time
                • e.g. Returned Mail is usually less than 500 items per day
                • If the value is 500 or less the status is green, 501 to 1000 amber
                  and above 1000 red
• Statistical Process Control (SPC) Trending
        – Track an expected value where the value changes over time
                • e.g. Telco CDRs: expect more as the company grows
                • Don’t want to be continuously changing the threshold
                • Compare current load to the historical mean
                • If current load is within 2 Standard Deviations of the mean: Green,
                  within 3 Standard Deviations: Amber,
                  4 or more Standard Deviations: Red
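Both kinds of trending reduce to a small status function. This is a sketch under stated assumptions: a value of exactly 500 is treated as green, and anything more than 3 standard deviations from the mean is treated as red, since the slide leaves those boundaries open. The function names and the history values are hypothetical.

```python
from statistics import mean, stdev


def absolute_status(value, amber_above=500, red_above=1000):
    """Fixed thresholds: green up to amber_above, amber up to red_above."""
    if value > red_above:
        return "red"
    if value > amber_above:
        return "amber"
    return "green"


def spc_status(current, history):
    """Compare the current value with the mean of earlier loads."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation, needs 2+ points
    if sigma == 0:
        return "green" if current == mu else "red"
    deviations = abs(current - mu) / sigma
    if deviations <= 2:
        return "green"
    if deviations <= 3:
        return "amber"
    return "red"


history = [100, 102, 98, 101, 99]  # hypothetical daily load counts
print(absolute_status(450), absolute_status(750), absolute_status(1200))
print(spc_status(101, history), spc_status(104, history), spc_status(110, history))
```

Because the SPC thresholds are derived from the history, they move with the business and do not need to be edited as volumes grow.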

Flow Control
• Flow Control
        – ETL manipulates data
                • Joins, De-duplicates, Filters, Aggregates, etc.
        – Use the formula:
                Source Count – Filtered Count – DeDup Count – Target Count = 0

• Trusted Source
        – Compare the result with a third system
        – e.g. Does the Count of Switch CDRs =
          Count of those processed by the billing system =
          Count of those processed in the DWH?
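The flow control formula and the trusted source comparison are both simple arithmetic once the counts have been collected. A minimal sketch, with hypothetical function names and made-up counts:

```python
def flow_balance(source, filtered, deduplicated, target):
    """Rows unaccounted for by the ETL; zero means the load reconciles."""
    return source - filtered - deduplicated - target


def trusted_source_matches(counts):
    """True when every system reports the same count."""
    return len(set(counts.values())) == 1


# 1000 rows read, 50 filtered out, 25 removed as duplicates, 925 loaded.
print(flow_balance(1000, 50, 25, 925))

# Hypothetical CDR counts: the warehouse is 5 short of the other systems.
cdr_counts = {"switch": 925, "billing": 925, "dwh": 920}
print(trusted_source_matches(cdr_counts))
```

A non-zero balance says how many rows are unexplained, not which ones, so it works best as a trigger for a closer look.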

Business Rules Based
• Specific checks that match known business
  rules
        – Account holders > 18 (Sys Date – DoB)
        – Account holders < 115
        – Credit Card numbers are 16 digits long
        – Number of accounts without a status
• Each check counts violations, so the result should be zero
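The four example rules can be written as one pass that counts violations per rule. This is a sketch with assumptions: the account records and field names are hypothetical, and an age of exactly 18 or 115 is treated as valid.

```python
from datetime import date


def age_on(dob, today):
    """Age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def count_violations(accounts, today):
    """Return the number of rows breaking each rule; all zeros is a pass."""
    counts = {"under_18": 0, "over_115": 0, "card_not_16_digits": 0, "no_status": 0}
    for acc in accounts:
        age = age_on(acc["dob"], today)
        if age < 18:
            counts["under_18"] += 1
        if age > 115:
            counts["over_115"] += 1
        card = acc["card"]
        if not (card.isascii() and card.isdigit() and len(card) == 16):
            counts["card_not_16_digits"] += 1
        if not acc["status"]:
            counts["no_status"] += 1
    return counts


# Hypothetical records: the second one breaks three rules.
accounts = [
    {"dob": date(1970, 5, 1), "card": "4111111111111111", "status": "open"},
    {"dob": date(2000, 1, 1), "card": "41111111", "status": None},
]
violations = count_violations(accounts, date(2009, 4, 24))
print(violations)
```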


Automated Checking - Ops
• Managed by exception
        – Red are given priority
        – Amber are always followed up
• Massive number of checks
        – 100s are good
        – 1000s are better
• Presentation
        – Alerts, RAG (Red/Amber/Green), Graphical, Numerical, etc.
Data Profiling Checks
• Run manually because they need to be
  interpreted by a human
• Lead to new business rules being added
  to the automated checks
• Can be done with simple reporting tools or
  commercial data profiling tools



Frequency Outliers
• Count discrete values in a table and check
  items with many more or fewer than normal
        – e.g. DoB 01-01-01 many times more common
          than any other value indicates a source default
          and something that needs work
        – e.g. Count of SMS messages significantly
          lower on a given day may equate to a genuine
          system failure and therefore not a DQ
          problem
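A frequency count makes a source default stand out immediately. The sketch below flags any value whose count is at least ten times the median count. That ratio is an arbitrary assumption for illustration, and the dates of birth are made up.

```python
from collections import Counter
from statistics import median


def dominant_values(values, ratio=10):
    """Return {value: count} for values far more common than is typical."""
    counts = Counter(values)
    typical = median(counts.values())
    return {v: n for v, n in counts.items() if n >= ratio * typical}


# Hypothetical column: 50 rows carry the default date, 3 carry real ones.
dobs = ["01-01-01"] * 50 + ["12-03-75", "30-07-68", "05-11-82"]
outliers = dominant_values(dobs)
print(outliers)
```

The output still needs a human to interpret it, which is why this sits with the profiling checks and not the automated ones.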
Maximum & Minimum
• Determine what a valid range for any
  value should be
        – e.g. age between 18 and 115
        – Immediately finds individual data quality
          issues that can be resolved
        – Allows an analyst to create new business
          rules to prevent future problems
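A range profile needs only the observed minimum and maximum plus the list of values outside the agreed range. A minimal sketch with hypothetical ages:

```python
def out_of_range(values, low, high):
    """Return the values that fall outside the inclusive range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


ages = [34, 17, 52, 203, 18, 115]  # hypothetical sample
print(min(ages), max(ages))

bad_ages = out_of_range(ages, 18, 115)
print(bad_ages)
```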



Sequential Keys
• If a system has a sequential key with no gaps:

      Max Value – Min Value + 1 – Count = 0

• If this is true: is it too perfect for an
  operational system, and therefore test data?
• If this is false: what has caused the gaps,
  and are the deletions intentional?
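The gap test can be sketched as below. Counting both endpoints, a run of keys from Min to Max with no gaps holds Max − Min + 1 values, so the sketch subtracts the distinct key count from that figure. The function names and key values are hypothetical.

```python
def missing_key_count(keys):
    """Number of keys absent between the smallest and largest key."""
    distinct = set(keys)
    return max(distinct) - min(distinct) + 1 - len(distinct)


def missing_keys(keys):
    """List the absent keys themselves, for follow-up."""
    present = set(keys)
    return [k for k in range(min(present), max(present) + 1) if k not in present]


print(missing_key_count([1, 2, 3, 4, 5]))
print(missing_key_count([1, 2, 4, 7]))
print(missing_keys([1, 2, 4, 7]))
```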
Data Types
• Validation of mis-used data types before
  loading
        – e.g. Dates in Character fields
        – Format: YYYYMMDD
        – Check MM between 01 and 12
        – Check DD between 01 and 31
        – Check MMDD does not include 0230, 0231
        – etc.
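Instead of hand-coding each month and day rule, one option is to let a date parser apply the calendar. This sketch uses Python's standard `datetime.strptime`; the function name is hypothetical, and the length and digit test is there so that only the exact 8-digit layout is accepted.

```python
from datetime import datetime


def is_valid_yyyymmdd(text):
    """True only for an 8-digit string that is a real calendar date."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


for candidate in ["20090624", "20090230", "20091301", "2009-06-24"]:
    print(candidate, is_valid_yyyymmdd(candidate))
```

The parser also covers the cases hidden behind "etc.", such as 31st April or 29th February in a non-leap year.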

Skewed Pattern Profiling
• Looking for specific patterns in data
        – e.g. UK National Insurance Numbers (?) have
          the format AA 99 99 99 A
        – Pattern match all values looking for
          exceptions
• Number Lengths are a special case
        – e.g. Credit Card Numbers are 16 digits long
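Pattern profiling maps naturally onto regular expressions. The pattern below encodes only the AA 99 99 99 A shape given on the slide, with single spaces as separators. It is an assumption for illustration and does not implement any official validation rules; the sample values are made up.

```python
import re

# Two letters, three pairs of digits, one letter, separated by single spaces.
NI_SHAPE = re.compile(r"[A-Z]{2} [0-9]{2} [0-9]{2} [0-9]{2} [A-Z]")


def pattern_exceptions(values, pattern):
    """Return the values that do not match the whole pattern."""
    return [v for v in values if not pattern.fullmatch(v)]


def wrong_length(values, length):
    """Return values that are not exactly `length` ASCII digits."""
    return [v for v in values
            if not (v.isascii() and v.isdigit() and len(v) == length)]


ni_numbers = ["AB 12 34 56 C", "AB123456C", "A1 12 34 56 C"]
cards = ["4111111111111111", "41111111", "4111-1111-1111-1111"]

print(pattern_exceptions(ni_numbers, NI_SHAPE))
print(wrong_length(cards, 16))
```

An exception is not automatically an error: "AB123456C" may be the same number stored without spaces, which is itself a finding worth recording.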


Content Checking
• Content Checking is the manual review of
  character strings
• Needs a good understanding of the nature
  of the data
• Often determines the need to do analysis
  of other types



Nulls & White Space
• Nulls
        – Fields that have a large proportion of nulls are usually
          not useful
        – Also common is a default status of null
          (e.g. Account is either closed or null)
• White Space
        – Not Null fields with a single space
        – Tab instead of space
        – Leading/Trailing white space
        – Double White Space: “David<space><space>Walker”
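Each of the white space cases can be tested per value, and the proportion of nulls per column. A minimal sketch; the function names, the issue labels and the sample values are hypothetical.

```python
def null_ratio(values):
    """Fraction of values that are None (0.0 for an empty column)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def whitespace_issues(value):
    """Return a list of the white space problems found in one value."""
    if value is None:
        return ["null"]
    if value.strip() == "":
        return ["blank"]  # empty, a single space, or only white space
    issues = []
    if "\t" in value:
        issues.append("tab")
    if value != value.strip():
        issues.append("leading/trailing white space")
    if "  " in value:
        issues.append("double space")
    return issues


print(null_ratio([None, "closed", None, None]))
for sample in ["David Walker", "David  Walker", " David", "Data\tQuality", " ", None]:
    print(repr(sample), whitespace_issues(sample))
```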
Punctuation & Control Chars
• Punctuation
        – CSV files that are not properly quoted cause
          field shifts
        – Address lines with extra commas
• Control Characters
        – Data fields that contain character codes
          0 to 31 and 127 to 159 (the ASCII and Latin-1 control
          ranges) are often ‘invisible’ when viewed in queries
          but cause failures
        – Also be aware of ‘code-page’ specifics
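Both problems can be detected before loading. The sketch below reports rows whose field count differs from the expected width, using Python's standard `csv` module, and lists the position and code of any character in the 0 to 31 or 127 to 159 ranges. The helper names and sample data are hypothetical.

```python
import csv
import io


def rows_with_wrong_width(csv_text, expected):
    """Return 1-based row numbers whose field count is not `expected`."""
    reader = csv.reader(io.StringIO(csv_text))
    return [n for n, row in enumerate(reader, start=1) if len(row) != expected]


def control_characters(text):
    """Return (position, code) for each control character in the text."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if ord(ch) <= 31 or 127 <= ord(ch) <= 159]


# Row 2 quotes its address; row 3 does not, so its comma shifts the fields.
sample = (
    "id,name,address\n"
    '1,Ann,"1 High St, Leeds"\n'
    "2,Bob,2 Low Rd, York\n"
)
print(rows_with_wrong_width(sample, 3))
print(control_characters("Zagreb\x00\x9f"))
```

Note that tab, carriage return and line feed fall inside the 0 to 31 range, so a free-text field may need an allow-list for those.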
Problem Management Matrix (figure not reproduced)
Continuous DQ Process (figure not reproduced)
Quality is FREE …
… as long as you are prepared to
  INVEST HEAVILY in it

Philip Crosby 1980

Especially true of
Data Quality


Contenu connexe

Tendances

Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...DATAVERSITY
 
LDM Webinar: Data Modeling & Business Intelligence
LDM Webinar: Data Modeling & Business IntelligenceLDM Webinar: Data Modeling & Business Intelligence
LDM Webinar: Data Modeling & Business IntelligenceDATAVERSITY
 
Approaching Data Quality
Approaching Data QualityApproaching Data Quality
Approaching Data QualityDATAVERSITY
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 
Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsBoris Otto
 
Glossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data GovernanceGlossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data GovernanceDATAVERSITY
 
Chapter 1: The Importance of Data Assets
Chapter 1: The Importance of Data AssetsChapter 1: The Importance of Data Assets
Chapter 1: The Importance of Data AssetsAhmed Alorage
 
‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence Management
‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence Management‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence Management
‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence ManagementAhmed Alorage
 
Talend Data Quality
Talend Data QualityTalend Data Quality
Talend Data QualityTalend
 
Etl process in data warehouse
Etl process in data warehouseEtl process in data warehouse
Etl process in data warehouseKomal Choudhary
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing conceptspcherukumalla
 
DAS Slides: Master Data Management – Aligning Data, Process, and Governance
DAS Slides: Master Data Management – Aligning Data, Process, and GovernanceDAS Slides: Master Data Management – Aligning Data, Process, and Governance
DAS Slides: Master Data Management – Aligning Data, Process, and GovernanceDATAVERSITY
 
‏‏Chapter 8: Reference and Master Data Management
‏‏Chapter 8: Reference and Master Data Management ‏‏Chapter 8: Reference and Master Data Management
‏‏Chapter 8: Reference and Master Data Management Ahmed Alorage
 
Data Management Maturity Assessment
Data Management Maturity AssessmentData Management Maturity Assessment
Data Management Maturity AssessmentFiras Hamdan
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata ManagementDATAVERSITY
 
Data Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step ApproachData Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step ApproachFindWhitePapers
 
Data-Ed Webinar: Data Governance Strategies
Data-Ed Webinar: Data Governance StrategiesData-Ed Webinar: Data Governance Strategies
Data-Ed Webinar: Data Governance StrategiesDATAVERSITY
 
Building a Data Governance Strategy
Building a Data Governance StrategyBuilding a Data Governance Strategy
Building a Data Governance StrategyAnalytics8
 

Tendances (20)

DAMA International DMBOK V2 - Comparison with V1
DAMA International DMBOK V2 - Comparison with V1DAMA International DMBOK V2 - Comparison with V1
DAMA International DMBOK V2 - Comparison with V1
 
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
 
LDM Webinar: Data Modeling & Business Intelligence
LDM Webinar: Data Modeling & Business IntelligenceLDM Webinar: Data Modeling & Business Intelligence
LDM Webinar: Data Modeling & Business Intelligence
 
Approaching Data Quality
Approaching Data QualityApproaching Data Quality
Approaching Data Quality
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Strategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management SystemsStrategic Business Requirements for Master Data Management Systems
Strategic Business Requirements for Master Data Management Systems
 
Glossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data GovernanceGlossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data Governance
 
Chapter 1: The Importance of Data Assets
Chapter 1: The Importance of Data AssetsChapter 1: The Importance of Data Assets
Chapter 1: The Importance of Data Assets
 
‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence Management
‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence Management‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence Management
‏‏‏‏Chapter 9: Data Warehousing and Business Intelligence Management
 
Talend Data Quality
Talend Data QualityTalend Data Quality
Talend Data Quality
 
Etl process in data warehouse
Etl process in data warehouseEtl process in data warehouse
Etl process in data warehouse
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
 
DAS Slides: Master Data Management – Aligning Data, Process, and Governance
DAS Slides: Master Data Management – Aligning Data, Process, and GovernanceDAS Slides: Master Data Management – Aligning Data, Process, and Governance
DAS Slides: Master Data Management – Aligning Data, Process, and Governance
 
‏‏Chapter 8: Reference and Master Data Management
‏‏Chapter 8: Reference and Master Data Management ‏‏Chapter 8: Reference and Master Data Management
‏‏Chapter 8: Reference and Master Data Management
 
Data Management Maturity Assessment
Data Management Maturity AssessmentData Management Maturity Assessment
Data Management Maturity Assessment
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
 
Data Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step ApproachData Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step Approach
 
Data-Ed Webinar: Data Governance Strategies
Data-Ed Webinar: Data Governance StrategiesData-Ed Webinar: Data Governance Strategies
Data-Ed Webinar: Data Governance Strategies
 
080827 abramson inmon vs kimball
080827 abramson   inmon vs kimball080827 abramson   inmon vs kimball
080827 abramson inmon vs kimball
 
Building a Data Governance Strategy
Building a Data Governance StrategyBuilding a Data Governance Strategy
Building a Data Governance Strategy
 

En vedette

Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernAmin Chowdhury
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management BasicKhaled Mosharraf
 
Data Quality Rules introduction
Data Quality Rules introductionData Quality Rules introduction
Data Quality Rules introductiondatatovalue
 
Data quality overview
Data quality overviewData quality overview
Data quality overviewAlex Meadows
 
Data Quality Dashboards
Data Quality DashboardsData Quality Dashboards
Data Quality DashboardsWilliam Sharp
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratchdmurph4
 
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality ConundrumEnterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality ConundrumRTTS
 
Data Quality Challenges & Solution Approaches in Yahoo!’s Massive Data
Data Quality Challenges & Solution Approaches in Yahoo!’s Massive DataData Quality Challenges & Solution Approaches in Yahoo!’s Massive Data
Data Quality Challenges & Solution Approaches in Yahoo!’s Massive DataDATAVERSITY
 
List of personal protective equipment to have
List of personal protective equipment to haveList of personal protective equipment to have
List of personal protective equipment to haveChristopher Dill
 
Infographic - Procurement Trends 2016
Infographic - Procurement Trends 2016Infographic - Procurement Trends 2016
Infographic - Procurement Trends 2016Jonathan Betts
 
Inside the circle of trust: Data management for modern enterprises
Inside the circle of trust: Data management for modern enterprisesInside the circle of trust: Data management for modern enterprises
Inside the circle of trust: Data management for modern enterprisesExperian Data Quality
 
RDAP 15 Navigating the Rocky Road to Research Data Acceptance
RDAP 15 Navigating the Rocky Road to Research Data AcceptanceRDAP 15 Navigating the Rocky Road to Research Data Acceptance
RDAP 15 Navigating the Rocky Road to Research Data AcceptanceASIS&T
 
Spend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth ListeningSpend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth ListeningSAP Ariba
 
Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs
Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs
Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs MEASURE Evaluation
 
Informatica Command Line Statements
Informatica Command Line StatementsInformatica Command Line Statements
Informatica Command Line Statementsmnsk80
 
Open Source Business Intelligence Overview
Open Source Business Intelligence OverviewOpen Source Business Intelligence Overview
Open Source Business Intelligence OverviewAlex Meadows
 
Dimensional modelling-mod-3
Dimensional modelling-mod-3Dimensional modelling-mod-3
Dimensional modelling-mod-3Malik Alig
 

En vedette (20)

Data Quality Presentation
Data Quality PresentationData Quality Presentation
Data Quality Presentation
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management Basic
 
Data Quality Definitions
Data Quality DefinitionsData Quality Definitions
Data Quality Definitions
 
Data Quality Rules introduction
Data Quality Rules introductionData Quality Rules introduction
Data Quality Rules introduction
 
Data quality overview
Data quality overviewData quality overview
Data quality overview
 
Data Quality Dashboards
Data Quality DashboardsData Quality Dashboards
Data Quality Dashboards
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratch
 
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality ConundrumEnterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
 
Data Quality Challenges & Solution Approaches in Yahoo!’s Massive Data
Data Quality Challenges & Solution Approaches in Yahoo!’s Massive DataData Quality Challenges & Solution Approaches in Yahoo!’s Massive Data
Data Quality Challenges & Solution Approaches in Yahoo!’s Massive Data
 
List of personal protective equipment to have
List of personal protective equipment to haveList of personal protective equipment to have
List of personal protective equipment to have
 
Infographic - Procurement Trends 2016
Infographic - Procurement Trends 2016Infographic - Procurement Trends 2016
Infographic - Procurement Trends 2016
 
Big Data Pitfalls
Big Data PitfallsBig Data Pitfalls
Big Data Pitfalls
 
Inside the circle of trust: Data management for modern enterprises
Inside the circle of trust: Data management for modern enterprisesInside the circle of trust: Data management for modern enterprises
Inside the circle of trust: Data management for modern enterprises
 
RDAP 15 Navigating the Rocky Road to Research Data Acceptance
RDAP 15 Navigating the Rocky Road to Research Data AcceptanceRDAP 15 Navigating the Rocky Road to Research Data Acceptance
RDAP 15 Navigating the Rocky Road to Research Data Acceptance
 
Spend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth ListeningSpend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
 
Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs
Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs
Adapting Data Quality Assurance Approaches and Tools to Meet Local Needs
 
Informatica Command Line Statements
Informatica Command Line StatementsInformatica Command Line Statements
Informatica Command Line Statements
 
Open Source Business Intelligence Overview
Open Source Business Intelligence OverviewOpen Source Business Intelligence Overview
Open Source Business Intelligence Overview
 
Dimensional modelling-mod-3
Dimensional modelling-mod-3Dimensional modelling-mod-3
Dimensional modelling-mod-3
 

Similaire à Data Quality Checks Automated & Profiling

Tamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael StonebrakerTamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael StonebrakerTamr_Inc
 
Data science tips for data engineers
Data science tips for data engineersData science tips for data engineers
Data science tips for data engineersIBM Analytics
 
ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)Huibert Aalbers
 
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...Amazon Web Services
 
Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7MarketingArrowECS_CZ
 
[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...
[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...
[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...DataScienceConferenc1
 
ETL Testing - Introduction to ETL Testing
ETL Testing - Introduction to ETL TestingETL Testing - Introduction to ETL Testing
ETL Testing - Introduction to ETL TestingVibrant Event
 
ETL Testing - Introduction to ETL testing
ETL Testing - Introduction to ETL testingETL Testing - Introduction to ETL testing
ETL Testing - Introduction to ETL testingVibrant Event
 
Introduction to Datawarehousing
Introduction to  DatawarehousingIntroduction to  Datawarehousing
Introduction to Datawarehousingkarunakar81987
 
PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...
PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...
PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...Adlib - The PDF Experts
 
Various Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.pptVarious Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.pptRafiulHasan19
 

Similaire à Data Quality Checks Automated & Profiling (20)

Tamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael StonebrakerTamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael Stonebraker
 
Chap3.pptx
Chap3.pptxChap3.pptx
Chap3.pptx
 
Data science tips for data engineers
Data science tips for data engineersData science tips for data engineers
Data science tips for data engineers
 
ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)
 
DWH_Session_1.pptx
DWH_Session_1.pptxDWH_Session_1.pptx
DWH_Session_1.pptx
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
 
Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7
 
Escape From Spreadsheet Hell
Escape From Spreadsheet HellEscape From Spreadsheet Hell
Escape From Spreadsheet Hell
 
[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...
[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...
[DSC Europe 23][AICommerce] Ayoub Fakir Critical components for a successful ...
 
DW (1).ppt
DW (1).pptDW (1).ppt
DW (1).ppt
 
OLAP
OLAPOLAP
OLAP
 
Olap queries
Olap queriesOlap queries
Olap queries
 
ETL Testing - Introduction to ETL testing
ETL Testing - Introduction to ETL testingETL Testing - Introduction to ETL testing
ETL Testing - Introduction to ETL testing
 
ETL Testing - Introduction to ETL Testing
ETL Testing - Introduction to ETL TestingETL Testing - Introduction to ETL Testing
ETL Testing - Introduction to ETL Testing
 
ETL Testing - Introduction to ETL testing
ETL Testing - Introduction to ETL testingETL Testing - Introduction to ETL testing
ETL Testing - Introduction to ETL testing
 
Challenges in an E Commerce World
Challenges in an E Commerce World Challenges in an E Commerce World
Challenges in an E Commerce World
 
Introduction to Datawarehousing
Introduction to  DatawarehousingIntroduction to  Datawarehousing
Introduction to Datawarehousing
 
PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...
PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...
PRESENTATION: Capture. Compliance. Centralization. How Advanced Rendering Del...
 
Various Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.pptVarious Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.ppt
 

Data Quality Checks Automated & Profiling

  • 1. Data Management & Warehousing. Data Quality: Common Problems & Checks
      Date: 24 April 2009
      Location: Zagreb, Croatia
      David M. Walker, davidw@datamgmt.com, +44 (0) 7050 028 911, http://www.datamgmt.com
  • 2. Agenda
      • Introduction
      • Common Problems
      • Automated Checking
      • Profiling Checks
      • Conclusions
  • 3. Introduction
      • Data Quality problems are a SOURCE SYSTEM issue and an ETL issue
          – They just manifest themselves in the data warehouse
      • Prevention is better than cure
          – Fixing the source system or the ETL is ALWAYS cheaper and more effective than cleaning the data in the ETL or in the Data Warehouse itself
      • Data Quality is a continuous process
          – It is never finished and always needs to be monitored
  • 4. The Impact of Poor Data Quality
      • Devalues the data warehouse
          – Discourages people from trusting or using the system, and therefore curtails the life of the data warehouse
      • Highlights failings in the source system and/or the business process
          – Businesses would rather fix at any cost in the data warehouse and pretend that there isn’t a source system problem
  • 5. Common Problems
      • 11 types of problem account for most data quality issues
      • They usually reflect poor design and/or implementation of systems
      • Most can be fixed, or monitored and managed to limit the impact
  • 6. Referential Issues
      • Keys that are not unique
          – Systems that do not enforce a unique primary key, or flat files or spreadsheets
          – Also generated by ETL that creates a surrogate key incorrectly
      • Referential Integrity Failures
          – Where referential integrity is not enforced, values are created in the child table that are not in the parent table
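Both referential problems can be detected with simple set and count logic. The following is a minimal sketch, assuming rows are available as Python dictionaries; the table and column names (`customers`, `orders`, `customer_id`) are hypothetical and not taken from the slides.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return key values that occur more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def orphan_keys(child_rows, fk, parent_rows, pk):
    """Return foreign-key values in the child that have no matching parent."""
    parent_keys = {row[pk] for row in parent_rows}
    child_keys = {row[fk] for row in child_rows}
    return sorted(child_keys - parent_keys)


# Hypothetical sample data: customer 2 is duplicated, order 11 has no parent.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

print(duplicate_keys(customers, "customer_id"))
print(orphan_keys(orders, "customer_id", customers, "customer_id"))
```

In a database the same checks would normally be a `GROUP BY ... HAVING COUNT(*) > 1` query and an anti-join against the parent table.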
  • 7. Data Type Issues
      • Format Errors
          – Typically in Date/Time type fields
          – 02/04/2009 is 2nd April (UK) or 4th Feb (US)
      • Inappropriate Data Types
          – Storing dates in character strings, e.g. 20090624 as a YYYYMMDD format string
          – But what about 20090230?
  • 8. Data Model Issues
      • De-normalised tables
          – Commonly created for performance reasons
          – Inherently duplicates data
          – Often gets out of sync
      • Data/Column Retirement
          – An upgrade to the system retires a column
          – ETL continues to use the old column
      • Poor Table/Column Naming
          – Don’t assume that a column does what it says
          – Don’t assume that a column is still being used for its original purpose
  • 9. Data Content Issues
      • Null Values
          – Systems that have many optional fields will often have missing values
          – Null values allow rows to be silently omitted from queries
      • Inappropriate Values
          – Databases allow special characters and/or leading/trailing white space
          – “Data<space>Quality” != “Data<tab>Quality”
  • 10. Data Feed Issues
      • Missing Data
          – Where a stream of files is loaded by ETL, a dropped file can go unnoticed
          – Common with CDR type loads in Telcos
      • Late Data
          – A short term data quality issue
          – Leaves users believing there is a problem
          – Produces inconsistent reporting over time
  • 11. Automated Checking
      • Regularly run checks
      • Broad coverage across systems
          – 100s and 1000s, not 10s, of queries
          – Run in a low priority loop in the background
      • Used against:
          – Sources
          – Data Staging
          – Data Warehouse
      • No Product Required
          – We often implement this as a controlling shell script and lots of small scripts, one for each check
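The "controlling script plus many small checks" arrangement can be illustrated in a few lines. The slides describe shell scripts; this is a minimal Python sketch of the same idea, assuming each check is a small function that returns a RAG status. The check names and the `run_checks` and `exceptions_only` helpers are hypothetical.

```python
def run_checks(checks):
    """Run every check and collect (name, status) pairs.

    Each check is a zero-argument callable returning "GREEN", "AMBER" or "RED".
    A check that raises is reported as RED so one failure cannot stop the loop.
    """
    results = []
    for name, check in checks.items():
        try:
            status = check()
        except Exception:
            status = "RED"
        results.append((name, status))
    return results


def exceptions_only(results):
    """Keep only the results that need attention (anything not GREEN)."""
    return [(name, status) for name, status in results if status != "GREEN"]


def broken_check():
    # Stands in for a check whose query fails to run.
    raise RuntimeError("source unavailable")


# Hypothetical checks; real ones would each run one small query.
sample_checks = {
    "staging_row_count": lambda: "GREEN",
    "returned_mail_volume": lambda: "AMBER",
    "billing_feed_present": broken_check,
}

for name, status in exceptions_only(run_checks(sample_checks)):
    print(f"{status:5} {name}")
```

Keeping each check tiny and independent is what makes it practical to grow the suite to hundreds or thousands of checks.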
  • 12. Trending
      • Absolute Trending
          – Track an expected value over time
              • e.g. Returned Mail is usually less than 500 items per day
              • If the value is 500 or less the status is green, 501 to 1000 amber and more than 1000 red
      • Statistical Process Control (SPC) Trending
          – Track an expected value where the value changes over time
              • e.g. Telco CDRs: expect more as the company grows
              • Don’t want to be continuously changing the threshold
              • Compare current load to historical means
              • If the current load is within 2 Standard Deviations: Green; 3 Standard Deviations: Amber; 4 or more Standard Deviations: Red
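The two trending styles differ only in where the thresholds come from: fixed numbers, or the history of the measure itself. A minimal sketch follows. The function names are hypothetical, and two assumptions are made where the slide leaves a gap: a value of exactly 500 is treated as green, and any load more than 3 standard deviations from the mean is treated as red.

```python
from statistics import mean, stdev


def absolute_status(value, amber_above=500, red_above=1000):
    """Fixed thresholds, e.g. returned mail items per day."""
    if value > red_above:
        return "RED"
    if value > amber_above:
        return "AMBER"
    return "GREEN"


def spc_status(current, history):
    """Compare the current load with the historical mean.

    Assumption: within 2 standard deviations is GREEN, within 3 is AMBER,
    anything further out is RED.
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs 2+ data points
    if sigma == 0:
        return "GREEN" if current == mu else "RED"
    distance = abs(current - mu) / sigma
    if distance <= 2:
        return "GREEN"
    if distance <= 3:
        return "AMBER"
    return "RED"


# Hypothetical daily CDR counts (in thousands) for the previous five loads.
history = [100, 102, 98, 101, 99]
print(absolute_status(750), spc_status(104, history))
```

Because `spc_status` recomputes the mean and deviation from recent history, the thresholds drift with the business and do not need to be edited as volumes grow.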
  • 13. Flow Control
      • Flow Control
          – ETL manipulates data
              • Joins, De-duplicates, Filters, Aggregates, etc.
          – Use the formula: Source Count - Filtered Count - DeDup Count - Target Count = 0
      • Trusted Source
          – Compare the result with a third system
          – e.g. Does the Count of Switch CDRs = Count of those processed by the billing system = Count of those processed in the DWH?
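The flow control formula is plain arithmetic over counts that the ETL already knows. A minimal sketch, assuming the four counts are captured per load; the function names and sample figures are hypothetical.

```python
def flow_balance(source, filtered, deduped, target):
    """Rows unaccounted for by the ETL; zero means the load reconciles."""
    return source - filtered - deduped - target


def counts_agree(*counts):
    """Trusted-source check: every system reports the same count."""
    return len(set(counts)) <= 1


# Hypothetical load: 1000 rows read, 50 filtered out, 20 duplicates removed.
print(flow_balance(source=1000, filtered=50, deduped=20, target=930))
print(flow_balance(source=1000, filtered=50, deduped=20, target=925))
print(counts_agree(930, 930, 930))
```

A non-zero balance does not say where the rows went, only that a step in the flow needs investigating.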
  • 14. Business Rules Based
      • Specific rules to match known business rules
          – Account holders > 18 (age = Sys Date - DoB)
          – Account holders < 115
          – Credit Card numbers are 16 digits long
          – Number of accounts without a status
      • Result should yield zero (no rows break the rule)
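Each business rule becomes a count of violating rows, and a healthy system returns zero for every rule. The sketch below assumes account records held as dictionaries with hypothetical `dob`, `card` and `status` fields, and reads the age rules literally (strictly over 18, strictly under 115).

```python
import re
from datetime import date


def age_on(dob, as_of):
    """Whole years between dob and as_of."""
    before_birthday = (as_of.month, as_of.day) < (dob.month, dob.day)
    return as_of.year - dob.year - before_birthday


def rule_violations(accounts, as_of):
    """Count the rows that break each rule; every count should be zero."""
    return {
        "holder_not_over_18": sum(
            age_on(a["dob"], as_of) <= 18 for a in accounts
        ),
        "holder_not_under_115": sum(
            age_on(a["dob"], as_of) >= 115 for a in accounts
        ),
        "card_not_16_digits": sum(
            re.fullmatch(r"[0-9]{16}", a["card"]) is None for a in accounts
        ),
        "missing_status": sum(a["status"] is None for a in accounts),
    }


# Hypothetical accounts: one clean row and two rows that break rules.
accounts = [
    {"dob": date(1970, 1, 1), "card": "4111111111111111", "status": "open"},
    {"dob": date(2000, 6, 1), "card": "1234", "status": None},
    {"dob": date(1890, 1, 1), "card": "4111111111111111", "status": "closed"},
]

print(rule_violations(accounts, as_of=date(2009, 4, 24)))
```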
  • 15. Automated Checking - Ops
      • Managed by exception
          – Red given priority
          – Amber are always followed up
      • Massive number of checks
          – 100s are good
          – 1000s are better
      • Presentation
          – Alerts, RAG, Graphical, Numerical, etc.
  • 16. Data Profiling Checks
      • Run manually because they need to be interpreted by a human
      • Leads to new business rules being added to the automated checks
      • Can be done with simple reporting tools or commercial data profiling tools
  • 17. Frequency Outliers
      • Count discrete values in a table and check items with many more or fewer occurrences than normal
          – e.g. DoB 01-01-01 many times more common than any other value indicates a source default and something that needs work
          – e.g. Count of SMS messages significantly lower on a given day may equate to a genuine system failure and therefore not a DQ problem
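Counting discrete values and comparing each count with a typical count is enough to surface defaults such as the repeated date of birth. This is a minimal sketch: the `frequency_outliers` name, the use of the median as "normal" and the factor of 5 are all assumptions, and it only flags values that are unusually common, not unusually rare.

```python
from collections import Counter
from statistics import median


def frequency_outliers(values, ratio=5):
    """Return (value, count) pairs occurring at least ratio times the median count."""
    counts = Counter(values)
    typical = median(counts.values())
    return [(v, n) for v, n in counts.most_common() if n >= ratio * typical]


# Hypothetical column: a default date of birth swamps the genuine values.
dobs = ["01-01-01"] * 20 + ["14-03-75", "02-11-82", "30-07-90", "09-05-68"]

print(frequency_outliers(dobs))
```

The output is a list for a person to interpret: a flagged value may be a source default, or it may be genuine.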
  • 18. Maximum & Minimum
      • Determine what a valid range for any value should be
          – e.g. age between 18 and 115
          – Immediately finds individual data quality issues that can be resolved
          – Allows an analyst to create new business rules to prevent future problems
  • 19. Sequential Keys
      • If a system has a sequential key: Max Value - Min Value + 1 - Count = 0
      • If this is true: is it too perfect for an operational system, and therefore test data?
      • If this is false: what has caused the gaps, and are the deletions intentional?
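The sequential key test is a one-line calculation. For a gap-free run of integer keys the count equals max - min + 1, so this sketch includes the + 1 when computing the expected count. The function names are hypothetical.

```python
def key_gap_count(keys):
    """Number of key values missing between the smallest and largest key."""
    distinct = set(keys)
    expected = max(distinct) - min(distinct) + 1
    return expected - len(distinct)


def missing_keys(keys):
    """List the missing key values so the gaps can be investigated."""
    present = set(keys)
    return [k for k in range(min(present), max(present) + 1) if k not in present]


print(key_gap_count([1, 2, 3, 4]))
print(key_gap_count([1, 2, 4, 7]), missing_keys([1, 2, 4, 7]))
```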
  • 20. Data Types
      • Validation of mis-used data types before loading
          – e.g. Dates in Character fields
          – Format: YYYYMMDD
          – Check MM between 01 and 12
          – Check DD between 01 and 31
          – Check MMDD does not include 0230, 0231
          – etc.
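Rather than coding each month and day rule by hand, a date parser can apply all of them at once, including leap years. A minimal sketch using the Python standard library; the `is_valid_yyyymmdd` name is hypothetical.

```python
from datetime import datetime


def is_valid_yyyymmdd(text):
    """True if text is exactly eight ASCII digits forming a real calendar date."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        # Raised for impossible dates such as month 13 or 30 February.
        return False
    return True


for candidate in ["20090624", "20090230", "20091301", "2009062"]:
    print(candidate, is_valid_yyyymmdd(candidate))
```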
  • 21. Skewed Pattern Profiling
      • Looking for specific patterns in data
          – e.g. UK National Insurance Numbers (?) have the format AA 99 99 99 A
          – Pattern match all values looking for exceptions
      • Number lengths are a special case
          – e.g. Credit Card Numbers are 16 digits long
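Pattern profiling maps naturally onto regular expressions: define the expected shape once, then list everything that does not match. This sketch takes the AA 99 99 99 A layout literally from the slide as an assumption; it is not the official validation rule for National Insurance numbers, and the names used are hypothetical.

```python
import re

# Two letters, three space-separated pairs of digits, one letter.
NI_PATTERN = re.compile(r"[A-Z]{2} [0-9]{2} [0-9]{2} [0-9]{2} [A-Z]")
# Number length as a special case: exactly 16 digits.
CARD_PATTERN = re.compile(r"[0-9]{16}")


def pattern_exceptions(values, pattern):
    """Return the values that do not match the whole pattern."""
    return [v for v in values if not pattern.fullmatch(v)]


ni_values = ["AB 12 34 56 C", "AB123456C", "A1 12 34 56 C"]
card_values = ["4111111111111111", "411111111111111", "4111 1111 1111 1111"]

print(pattern_exceptions(ni_values, NI_PATTERN))
print(pattern_exceptions(card_values, CARD_PATTERN))
```

Exceptions are not necessarily wrong data: "AB123456C" may simply be stored without spaces, which is itself a finding worth recording.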
  • 22. Content Checking
      • Content Checking is the manual review of character strings
      • Needs a good understanding of the nature of the data
      • Often determines the need to do analysis of other types
  • 23. Nulls & White Space
      • Nulls
          – Fields that have a large proportion of nulls are usually not useful
          – Also common is a default status of null (e.g. Account is either closed or null)
      • White Space
          – Not Null fields with a single space
          – Tab instead of space
          – Leading/Trailing white space
          – Double White Space: “David<space><space>Walker”
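Each of these null and white space cases can be tested mechanically, which makes them good candidates for profiling a whole column at once. A minimal sketch; the function names and issue labels are hypothetical.

```python
def null_fraction(values):
    """Proportion of values that are None."""
    return sum(v is None for v in values) / len(values)


def whitespace_issues(value):
    """Return labels for the white space problems found in one field value."""
    if value is None:
        return ["null"]
    if value.strip() == "":
        return ["blank"]  # empty, a single space, or only white space
    issues = []
    if "\t" in value:
        issues.append("tab")
    if value != value.strip():
        issues.append("leading_trailing")
    if "  " in value:
        issues.append("double_space")
    return issues


for sample in ["David Walker", "David  Walker", " David Walker", "Data\tQuality", " "]:
    print(repr(sample), whitespace_issues(sample))
```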
  • 24. Punctuation & Control Chars
      • Punctuation
          – CSV files that are not properly quoted cause field shifts
          – Address lines with extra commas
      • Control Characters
          – Data fields that contain character codes 0 to 31 and 127 to 159 (the ASCII and C1 control ranges) are often ‘invisible’ when viewed in queries but cause failures
          – Also be aware of ‘code-page’ specifics
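Both problems are easy to demonstrate. The sketch below shows how a comma inside an address shifts fields when a line is split naively but not when a quote-aware CSV parser is used, and then locates control characters in the code ranges given on the slide. The sample data and the `control_characters` name are hypothetical.

```python
import csv
import io


def control_characters(text):
    """Return (position, code) for characters with codes 0-31 or 127-159."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if ord(ch) < 32 or 127 <= ord(ch) <= 159
    ]


# A quoted address containing a comma.
line = '1,"12 High Street, Flat 2",London'
naive_fields = line.split(",")                       # ignores the quotes
parsed_fields = next(csv.reader(io.StringIO(line)))  # honours the quotes

print(len(naive_fields), len(parsed_fields))
print(control_characters("ab\x00c\x7f"))
```

Note that the 0 to 31 range includes tab, carriage return and line feed, so a field that legitimately contains line breaks will also be reported.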
  • 25. Problem Management Matrix
  • 26. Continuous DQ Process
  • 27. Quality is FREE … as long as you are prepared to INVEST HEAVILY in it (Philip Crosby, 1980)
      • Especially true of Data Quality
  • 28. Thank You
      David M. Walker, davidw@datamgmt.com, +44 (0) 7050 028 911, http://www.datamgmt.com