Data Vault Automation at
de Bijenkorf
PRESENTED BY
ROB WINTERS
ANDREI SCORUS
Presentation agenda
◦ Project objectives
◦ Architectural overview
◦ The data warehouse data model
◦ Automation in the data warehouse
◦ Successes and failures
◦ Conclusions
About the presenters
Rob Winters
Head of Data Technology, the Bijenkorf
Project role:
◦ Project Lead
◦ Systems architect and administrator
◦ Data modeler
◦ Developer (ETL, predictive models, reports)
◦ Stakeholder manager
◦ Joined project September 2014
Andrei Scorus
BI Consultant, Incentro
Project role:
◦ Main ETL Developer
◦ Modeling support
◦ Source system expert
◦ Joined project November 2014
Project objectives
◦ Information requirements
  ◦ Have one place as the source for all reports
  ◦ Security and privacy
  ◦ Information management
  ◦ Integrate with production
◦ Non-functional requirements
  ◦ System quality
  ◦ Extensibility
  ◦ Scalability
  ◦ Maintainability
  ◦ Security
  ◦ Flexibility
  ◦ Low cost
Technical Requirements
• One environment to quickly generate customer insights
• Then feed those insights back to production
• Then measure the impact of those changes in near real time
Source system landscape
• Oracle DB: 2 sources (e.g. Virgo ERP), loaded 2x/hour, partial 3NF
• MySQL: 3 sources (Product DB, Web Orders, DWH), loaded 10x/hour, 3NF (Web Orders), improperly normalized otherwise
• Event bus: 1 source (web/email events), loaded 1x/minute, tab-delimited with JSON fields
• Webhook: 1 source (transactional emails), loaded 1x/minute, JSON
• REST APIs: 5+ sources (GA, DotMailer), loaded 1x/hour to 1x/day, JSON
• SOAP APIs: 5+ sources (AdWords, Pricing), loaded 1x/day, XML
Architectural overview
Tools
AWS
◦ S3
◦ Kinesis
◦ Elasticache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB
Open Source
◦ Snowplow Event Tracker
◦ Rundeck Scheduler
◦ Jenkins Continuous Integration
◦ Pentaho PDI
Other
◦ HP Vertica
◦ Tableau
◦ Github
◦ RStudio Server
DWH internal architecture
• Traditional three-tier DWH
• ODS generated automatically from staging
• Ops mart reflects data in original source form; helps offload queries from source systems
• Business marts materialized exclusively from the vault
Bijenkorf Data Vault overview
Data volumes
• ~1 TB base volume
• 10-12 GB daily
• ~250 source tables
Aligned to Data Vault 2.0
• Hash keys
• Hashes used for CDC
• Parallel loading
• Maximum utilization of available resources
• Data loaded into the vault unchanged
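The hash-key and hash-based CDC approach above can be sketched in Python. This is an illustrative sketch only: the normalization rules, the choice of MD5, and the column names are assumptions, not the project's actual implementation.

```python
import hashlib


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hub/link key from business key parts (DV 2.0 style)."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def hash_diff(record: dict) -> str:
    """Hash all descriptive attributes of a row; comparing against the stored
    satellite hashdiff detects changes without column-by-column comparison."""
    payload = "||".join(str(record[k]) for k in sorted(record))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# A row only produces a new satellite record when its hashdiff changes:
old = hash_diff({"name": "J. Jansen", "city": "Amsterdam"})
new = hash_diff({"name": "J. Jansen", "city": "Rotterdam"})
assert hash_key("SKU-123") == hash_key(" sku-123 ")  # normalization keeps keys stable
assert old != new
```

Because the keys are pure functions of the business keys, hubs, links, and satellites can be loaded in parallel without sequence lookups, which is what enables the parallel loading bullet above.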
Some statistics
18 hubs
• 34 loading scripts
27 links
• 43 loading scripts
39 satellites
• 43 loading scripts
13 reference tables
• 1 script per table
Model contains
• Sales transactions
• Customer and corporate locations
• Customers
• Products
• Payment methods
• E-mail
• Phone
• Product grouping
• Campaigns
• deBijenkorf card
• Social media
Excluded from the vault
◦ Event streams
◦ Server logs
◦ Unstructured data
Deep dive: Transactions in DV
(transaction data model diagram)
Deep dive: Customers in DV
(customer data model diagram; same as the link on customer)
Challenges encountered during data modeling
Source issues
• Issue: source systems and original data unavailable for most information; data often transformed 2-4 times before access was available; business keys (e.g. SKU) typically replaced with sequences
• Resolution: business keys rebuilt in staging prior to vault loading

Modeling returns
• Issue: retail returns can appear in the ERP in 1-3 ways across multiple tables with inconsistent keys; online returns appear as a state change on the original transaction and may or may not appear in the ERP
• Resolution: the original model showed sale state on the line-item satellite; the revised model recorded “negative sale” transactions and used a new link to connect to the original sale when possible

Fragmented knowledge
• Issue: information about the systems was held by multiple people, and documentation was out of date
• Resolution: talking to as many people as possible and testing hypotheses on the data
Targeted benefits of DWH automation
Speed of development
• Integration of new sources or data from existing sources takes 1-2 steps
• Adding a new vault dependency takes one step

Simplicity
• Five jobs handle all ETL processes across the DWH

Traceability
• Every record/source file is traced in the database, and every row is automatically identified by source file in the ODS

Code simplification
• Replaced most common key definitions with dynamic variable replacement

File management
• Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date
• Entire source systems, periods, etc. can be replayed in minutes
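The archiving convention above (sorted by source, table, and date) might map to S3 keys along these lines. The bucket prefix and path layout are assumptions for illustration, not the deck's actual naming scheme.

```python
from datetime import date


def archive_key(source: str, table: str, file_name: str, load_date: date) -> str:
    """Build an S3 key so any source/table/period can be replayed by prefix listing."""
    return (f"dwh-archive/{source.lower()}/{table.lower()}/"
            f"{load_date:%Y/%m/%d}/{file_name}")


print(archive_key("OMS", "CUSTOMER", "OMS_CUSTOMER_20150601.csv", date(2015, 6, 1)))
```

With this layout, replaying an entire source, a single table, or one day is just a prefix listing (e.g. everything under `dwh-archive/oms/customer/2015/06/`), which is what makes replays a matter of minutes.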
Source loading automation
o Design of loader focused on process abstraction, traceability, and minimization of “moving parts”
o Final process consisted of two base jobs working in tandem: one for generating incremental extracts from source systems, one for loading flat files from all sources to staging tables
o Replication was desired but rejected due to limited access to source systems

The resulting workflow:
1. Source tables duplicated in staging with the addition of loadTs and sourceFile columns
2. Metadata for the source file added
3. Loader automatically generates the ODS and begins tracking source files for duplication and data quality
4. Query generator automatically executes a full duplication on first execution and incrementals afterward
CREATE TABLE stg_oms.customer
(
customerId int
, customerName varchar(500)
, customerAddress varchar(5000)
, loadTs timestamp NOT NULL
, sourceFile varchar(255) NOT NULL
)
ORDER BY customerId
PARTITION BY date(loadTs)
;
INSERT INTO meta.source_to_stg_mapping
(targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
VALUES
('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')
;
The DDL and metadata insert above are all that is needed to add an additional table from an existing source.
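A minimal sketch of how a metadata-driven loader might turn such a mapping row into a Vertica COPY statement. The mapping fields come from the INSERT above; the COPY options, the delimiter lookup, and the rejected-data table naming are illustrative assumptions, not the actual loader code.

```python
# One row of meta.source_to_stg_mapping, as inserted above.
mapping = {
    "targetSchema": "stg_oms",
    "targetTable": "customer",
    "fileNamePattern": "OMS_CUSTOMER",
    "delimiter": "TAB",
    "nullField": "NULL",
}

# Map the symbolic delimiter names stored in metadata to COPY syntax.
DELIMITERS = {"TAB": "E'\\t'", "COMMA": "','", "PIPE": "'|'"}


def copy_statement(mapping: dict, source_file: str) -> str:
    """Generate a Vertica COPY for one flat file whose name matched fileNamePattern."""
    return (
        f"COPY {mapping['targetSchema']}.{mapping['targetTable']} "
        f"FROM LOCAL '{source_file}' "
        f"DELIMITER {DELIMITERS[mapping['delimiter']]} "
        f"NULL '{mapping['nullField']}' "
        f"REJECTED DATA AS TABLE "
        f"{mapping['targetSchema']}.{mapping['targetTable']}_rejects;"
    )


print(copy_statement(mapping, "/data/inbound/OMS_CUSTOMER_20150601.csv"))
```

Because the loader only reads metadata, integrating a new table really is the two steps shown above: create the staging table, insert the mapping row.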
Vault loading automation
1. All staging tables checked for changes
   • New sources automatically added
   • Last-change epoch based on load stamps, advanced each time all dependencies execute successfully
2. List of dependent vault loads identified
   • Dependencies declared at time of job creation
   • Load prioritization possible but not utilized
3. Loads planned in hub, link, sat order
   • Jobs parallelized across tables but serialized per job
   • Dynamic job queueing ensures appropriate execution order
4. Loads executed
   • Variables automatically identified and replaced
   • Each load records performance statistics and error messages
o Loader is fully metadata driven, with a focus on horizontal scalability and management simplicity
o To support speed of development and performance, variable-driven SQL templates are used throughout
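The variable-driven SQL templates could work along these lines; the template text, placeholder syntax, and variable names here are invented for illustration.

```python
import re

# Hypothetical hub-loading template; {{name}} marks a variable.
TEMPLATE = """
INSERT INTO vault.{{hub_table}} (hashKey, businessKey, loadTs, recordSource)
SELECT DISTINCT {{hash_expr}}, {{business_key}}, loadTs, '{{source}}'
FROM {{staging_table}} s
WHERE NOT EXISTS (SELECT 1 FROM vault.{{hub_table}} h
                  WHERE h.hashKey = {{hash_expr}});
"""


def render(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder; fail loudly on anything undefined,
    so a missing variable is caught before the SQL reaches the database."""
    def sub(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined template variable: {name}")
        return variables[name]
    return re.sub(r"\{\{(\w+)\}\}", sub, template)


sql = render(TEMPLATE, {
    "hub_table": "hub_customer",
    "hash_expr": "MD5(UPPER(TRIM(s.customerId)))",
    "business_key": "s.customerId",
    "source": "OMS",
    "staging_table": "stg_oms.customer",
})
print(sql)
```

One template per vault object type (hub, link, satellite) plus a variable dictionary per table is enough to generate the full set of loading scripts counted in the statistics slide.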
Design goals for mart loading automation
• Requirement: simple, standardized models. Solution: metadata-driven Pentaho PDI. Benefit: easy development using parameters and variables.
• Requirement: easily extensible. Solution: plugin framework. Benefit: rapid integration of new functionality.
• Requirement: rapid new job development. Solution: recycle standardized jobs and transformations. Benefit: limited moving parts, easy modification.
• Requirement: low administration overhead. Solution: leverage built-in logging and tracking. Benefit: mart-loading reporting easily integrated with other ETL reports.
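The plugin-framework requirement could follow a simple registry pattern, sketched below. This is entirely illustrative: the step names and functions are hypothetical, not the actual Pentaho PDI mechanism.

```python
# Registry of mart-loading steps, keyed by name.
PLUGINS = {}


def plugin(name):
    """Register a mart-loading step under a name, so new functionality can be
    added without touching the core job runner."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register


@plugin("truncate_load")
def truncate_load(table):
    return f"TRUNCATE TABLE {table}; -- then full reload"


@plugin("incremental")
def incremental(table):
    return f"-- merge changed keys into {table}"


def run(step, table):
    """The core runner dispatches purely by step name."""
    return PLUGINS[step](table)


print(run("incremental", "dm.fct_sales"))
```

The benefit claimed on the slide follows directly: adding a command means registering one new function, which is consistent with the "45 minutes at most" figure later in the deck.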
Information mart automation flow
1. Retrieve commands
   • Each dimension and fact is processed independently
2. Get dependencies
   • Based on the defined transformation, get all related vault tables: links, satellites, or hubs
3. Retrieve changed data
   • From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension
   • Store the data in the database until further processing
4. Execute transformations
   • Multiple Pentaho transformations can be processed per command, using the data captured in previous steps
5. Maintenance
   • Logging happens throughout the whole process
   • Cleanup after all commands have been processed
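The "retrieve changed data" step amounts to collecting the unique keys whose satellite rows changed since the mart's last successful load. A toy sketch with in-memory rows (the table shape and timestamps are assumptions):

```python
from datetime import datetime

# Hypothetical satellite rows: (business hashKey, loadTs).
sat_rows = [
    ("a1", datetime(2015, 6, 1, 9, 0)),
    ("b2", datetime(2015, 6, 1, 10, 30)),
    ("a1", datetime(2015, 6, 1, 11, 15)),
]


def changed_keys(rows, last_mart_load):
    """Unique keys touched since the mart's last successful load; only these
    keys need to be re-projected into the fact or dimension."""
    return sorted({key for key, load_ts in rows if load_ts > last_mart_load})


print(changed_keys(sat_rows, datetime(2015, 6, 1, 10, 0)))
```

Restricting the mart refresh to this key list is what keeps the dimensional loads incremental rather than full rebuilds.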
Primary uses of Bijenkorf DWH
Customer Analysis
• Provided first unified data model of customer activity
• 80% reduction in unique customer keys
• Allowed for segmentation of customers based on a combination of in-store and online activity

Personalization
• DV drives the recommendation engine and customer recommendations (updated nightly)
• Data pipeline supports near-real-time updating of customer recommendations based on web activity

Business Intelligence
• DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
• IT-driven reporting being replaced with self-service BI
Biggest drivers of success
AWS Infrastructure
• Cost: entire infrastructure for less than one server in the data center
• Toolset: most services available off the shelf, minimizing administration
• Freedom: no dependency on IT for development support
• Scalability: systems automatically scaled to match DWH demands

Automation
• Speed: enormous time savings after the initial investment
• Simplicity: able to run and monitor 40k+ queries per day with minimal effort
• Auditability: enforced tracking and archiving without developer involvement

PDI framework
• Ease of use: adding new commands takes at most 45 minutes
• Agile: building the framework took 1 day
• Low profile: average memory usage of 250 MB
Biggest mistakes along the way
Reliance on documentation and requirements over expert users
• Initial integration design was based on provided documentation/models, which were rarely accurate
• Current users of sources should have been engaged earlier to explain undocumented caveats

Late utilization of templates and variables
• Variables were utilized late in development, slowing progress significantly and creating consistency issues
• Good initial design of templates will significantly reduce development time in the mid/long run

Aggressive overextension of resources
• We attempted to design and populate the entire data vault prior to focusing on customer deliverables like reports (in addition to other projects)
• We have shifted focus to continuous release of new information rather than waiting for completeness
Primary takeaways
◦ Sources are like cars: the older they are, the more idiosyncrasies they have. Be cautious with design automation!
◦ Automation can enormously simplify/accelerate data warehousing. Don’t be afraid to roll your own
◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
◦ Cloud architecture built on column-store DBs is extremely scalable, cheap, and highly performant
◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!
Rob Winters
WintersRD@gmail.com
Andrei Scorus
andrei.scorus@incentro.com
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 

Dernier (20)

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 

Data Vault Automation at the Bijenkorf

  • 1. Data Vault Automation at de Bijenkorf
Presented by Rob Winters and Andrei Scorus
  • 2. Presentation agenda
◦ Project objectives
◦ Architectural overview
◦ The data warehouse data model
◦ Automation in the data warehouse
◦ Successes and failures
◦ Conclusions
  • 3. About the presenters
Rob Winters, Head of Data Technology, the Bijenkorf
Project role:
◦ Project lead
◦ Systems architect and administrator
◦ Data modeler
◦ Developer (ETL, predictive models, reports)
◦ Stakeholder manager
◦ Joined project September 2014
Andrei Scorus, BI Consultant, Incentro
Project role:
◦ Main ETL developer
◦ Modeling support
◦ Source system expert
◦ Joined project November 2014
  • 4. Project objectives
Information requirements
◦ Have one place as the source for all reports
◦ Security and privacy
◦ Information management
◦ Integrate with production
Non-functional requirements
◦ System quality
◦ Extensibility
◦ Scalability
◦ Maintainability
◦ Security
◦ Flexibility
◦ Low cost
Technical requirements
• One environment to quickly generate customer insights
• Then feed those insights back to production
• Then measure the impact of those changes in near real time
  • 5. Source system landscape
• Oracle DB (2 sources, e.g. Virgo ERP): loaded 2x/hour; partial 3NF
• MySQL (3 sources: product DB, web orders, DWH): loaded 10x/hour; 3NF (web orders), improperly normalized
• Event bus (1 source: web/email events): loaded 1x/minute; tab delimited with JSON fields
• Webhook (1 source: transactional emails): loaded 1x/minute; JSON
• REST APIs (5+ sources, e.g. GA, DotMailer): loaded 1x/hour to 1x/day; JSON
• SOAP APIs (5+ sources, e.g. AdWords, pricing): loaded 1x/day; XML
  • 6. Architectural overview: tools
AWS
◦ S3
◦ Kinesis
◦ Elasticache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB
Open source
◦ Snowplow event tracker
◦ Rundeck scheduler
◦ Jenkins continuous integration
◦ Pentaho PDI
Other
◦ HP Vertica
◦ Tableau
◦ GitHub
◦ RStudio Server
  • 7. DWH internal architecture
• Traditional three-tier DWH
• ODS generated automatically from staging
• Ops mart reflects data in original source form
• Helps offload queries from source systems
• Business marts materialized exclusively from the vault
  • 8. Bijenkorf Data Vault overview
Data volumes
• ~1 TB base volume
• 10-12 GB daily
• ~250 source tables
Aligned to Data Vault 2.0
• Hash keys
• Hashes used for CDC
• Parallel loading
• Maximum utilization of available resources
• Data loaded into the vault unchanged
Some statistics
• 18 hubs (34 loading scripts)
• 27 links (43 loading scripts)
• 39 satellites (43 loading scripts)
• 13 reference tables (1 script per table)
Model contains
• Sales transactions
• Customer and corporate locations
• Customers
• Products
• Payment methods
• E-mail
• Phone
• Product grouping
• Campaigns
• de Bijenkorf card
• Social media
Excluded from the vault
◦ Event streams
◦ Server logs
◦ Unstructured data
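The DV 2.0 conventions above (hash keys for hubs and links, hashdiffs for satellite change detection) can be sketched as follows. The MD5-over-concatenated-business-keys scheme is a common Data Vault 2.0 convention and is shown here as an assumption, not the exact Bijenkorf implementation.

```python
import hashlib

def hash_key(*business_keys):
    # Hub/link hash key: business keys normalized (trimmed, upper-cased),
    # joined with a delimiter, and MD5-hashed -- a common DV 2.0 pattern.
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def hash_diff(attributes):
    # Hashdiff over a satellite's descriptive attributes, used for CDC:
    # a new satellite row is loaded only when this value changes.
    payload = "||".join("" if v is None else str(v).strip() for v in attributes)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Change detection: same customer, changed address -> new satellite row needed.
old = hash_diff(["Jan Jansen", "Amsterdam"])
new = hash_diff(["Jan Jansen", "Rotterdam"])
assert old != new
```

Because the key is a pure function of the business key, hubs and links can be loaded in parallel from many sources without sequence coordination, which is what enables the parallel loading mentioned above.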
  • 9. Deep dive: Transactions in DV
  • 10. Deep dive: Customers in DV (same as the link on customer)
  • 11. Challenges encountered during data modeling
Source issues
• Issue: source systems and original data were unavailable for most information; data was often transformed 2-4 times before access was available; business keys (e.g. SKU) were typically replaced with sequences
• Resolution: business keys rebuilt in staging prior to vault loading
Modeling returns
• Issue: retail returns can appear in the ERP in 1-3 ways across multiple tables with inconsistent keys; online returns appear as a state change on the original transaction and may or may not appear in the ERP
• Resolution: the original model showed sale state on the line-item satellite; the revised model recorded "negative sale" transactions and used a new link to connect to the original sale when possible
Fragmented knowledge
• Issue: information about the systems was held by multiple people, and documentation was out of date
• Resolution: talking to as many people as possible and testing hypotheses on the data
  • 12. Targeted benefits of DWH automation
Speed of development
• Integration of new sources or data from existing sources takes 1-2 steps
• Adding a new vault dependency takes one step
Simplicity
• Five jobs handle all ETL processes across the DWH
Traceability
• Every record/source file is traced in the database, and every row is automatically identified by source file in the ODS
Code simplification
• Most common key definitions replaced with dynamic variable replacement
File management
• Every source file automatically archived to Amazon S3, sorted by source, table, and date
• Entire source systems, periods, etc. can be replayed in minutes
  • 13. Source loading automation
◦ Design of the loader focused on process abstraction, traceability, and minimization of "moving parts"
◦ Final process consisted of two base jobs working in tandem: one generating incremental extracts from source systems, one loading flat files from all sources into staging tables
◦ Replication was desired but rejected due to limited access to source systems
Workflow of source integration
• Source tables are duplicated in staging with the addition of loadTs and sourceFile columns
• Metadata for the source file is added
• The loader automatically generates the ODS and begins tracking source files for duplication and data quality
• The query generator automatically executes a full duplication on the first execution and incrementals afterward
Example: add an additional table from an existing source

    CREATE TABLE stg_oms.customer (
        customerId int
      , customerName varchar(500)
      , customerAddress varchar(5000)
      , loadTs timestamp NOT NULL
      , sourceFile varchar(255) NOT NULL
    )
    ORDER BY customerId
    PARTITION BY date(loadTs);

    INSERT INTO meta.source_to_stg_mapping
      (targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
    VALUES
      ('stg_oms', 'customer', 'OMS', 'OMS_CUSTOMER', 'TAB', 'NULL');
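The extract side of the query generator can be sketched roughly as below: a full extract on the first run, then incrementals filtered on a change column. The metadata fields, table names, and change-column convention here are illustrative assumptions, not the actual meta.source_to_stg_mapping schema.

```python
# Hypothetical metadata-driven extract-query generator.
def build_extract_query(meta, last_load_ts=None):
    cols = ", ".join(meta["columns"])
    query = f"SELECT {cols} FROM {meta['sourceTable']}"
    if last_load_ts is not None:
        # Incremental run: only rows changed since the last successful load.
        query += f" WHERE {meta['changeColumn']} > '{last_load_ts}'"
    return query

meta = {
    "sourceTable": "oms.customer",
    "columns": ["customerId", "customerName", "customerAddress"],
    "changeColumn": "updatedAt",
}
print(build_extract_query(meta))                         # first run: full extract
print(build_extract_query(meta, "2015-04-01 00:00:00"))  # later runs: incremental
```

Keeping the extract logic in metadata rather than hand-written jobs is what makes "adding a table takes 1-2 steps" possible: a new table needs only a metadata row, not a new ETL job.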
  • 14. Vault loading automation
◦ The loader is fully metadata driven, with a focus on horizontal scalability and management simplicity
◦ To support speed of development and performance, variable-driven SQL templates are used throughout
All staging tables checked for changes
• New sources automatically added
• Last-change epoch based on load stamps, advanced each time all dependencies execute successfully
List of dependent vault loads identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized
Loads planned in hub, link, sat order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order
Loads executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages
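The planning step above can be sketched minimally as follows, assuming the only ordering constraint is the hub, link, satellite layering named on the slide; job names are hypothetical.

```python
from collections import defaultdict

# Loads within one layer may run in parallel; layers are serialized so that
# hubs finish before links, and links before satellites.
LAYER_ORDER = {"hub": 0, "link": 1, "sat": 2}

def plan_loads(jobs):
    """jobs: list of (job_name, layer) tuples -> ordered list of parallel batches."""
    batches = defaultdict(list)
    for name, layer in jobs:
        batches[LAYER_ORDER[layer]].append(name)
    return [sorted(batches[i]) for i in sorted(batches)]

jobs = [("sat_customer", "sat"), ("hub_customer", "hub"),
        ("lnk_order_customer", "link"), ("hub_order", "hub")]
print(plan_loads(jobs))
# First batch: both hubs (parallel), then the link, then the satellite.
```

Serializing by layer rather than by full dependency graph keeps the scheduler simple, at the cost of some parallelism; that trade-off matches the slide's emphasis on management simplicity.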
  • 15. Design goals for mart loading automation
Simple, standardized models
• Solution: metadata-driven Pentaho PDI
• Benefit: easy development using parameters and variables
Easily extensible
• Solution: plugin framework
• Benefit: rapid integration of new functionality
Rapid new job development
• Solution: recycle standardized jobs and transformations
• Benefit: limited moving parts, easy modification
Low administration overhead
• Solution: leverage built-in logging and tracking
• Benefit: easily integrated mart loading reporting with other ETL reports
  • 16. Information mart automation flow
Retrieve commands
• Each dimension and fact is processed independently
Get dependencies
• Based on the defined transformation, get all related vault tables: links, satellites, or hubs
Retrieve changed data
• From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension
• Store the data in the database until further processing
Execute transformations
• Multiple Pentaho transformations can be processed per command using the data captured in previous steps
Maintenance
• Logging happens throughout the whole process
• Cleanup after all commands have been processed
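The "retrieve changed data" step can be sketched as follows, assuming each related vault table exposes (business key, load timestamp) pairs; the table shapes and names are illustrative.

```python
# Union the business keys that changed in any related vault table since the
# mart's last refresh; only those dimension/fact rows need rebuilding.
def changed_keys(related_tables, last_refresh):
    keys = set()
    for table in related_tables:
        for key, load_ts in table["rows"]:
            if load_ts > last_refresh:
                keys.add(key)
    return keys

sat = {"rows": [("C1", "2015-04-02"), ("C2", "2015-03-01")]}
lnk = {"rows": [("C3", "2015-04-03")]}
print(changed_keys([sat, lnk], "2015-04-01"))  # only C1 and C3 need rebuilding
```

This is why the flow processes each dimension and fact independently: each has its own set of related vault tables and its own last-refresh watermark.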
  • 17. Primary uses of the Bijenkorf DWH
Customer analysis
• Provided the first unified data model of customer activity
• 80% reduction in unique customer keys
• Allowed segmentation of customers based on a combination of in-store and online activity
Personalization
• The DV drives the recommendation engine and customer recommendations (updated nightly)
• The data pipeline supports near-real-time updating of customer recommendations based on web activity
Business intelligence
• DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
• IT-driven reporting being replaced with self-service BI
  • 18. Biggest drivers of success
AWS infrastructure
• Cost: entire infrastructure for less than one server in the data center
• Toolset: most services available off the shelf, minimizing administration
• Freedom: no dependency on IT for development support
• Scalability: systems automatically scaled to match DWH demands
Automation
• Speed: enormous time savings after the initial investment
• Simplicity: able to run and monitor 40k+ queries per day with minimal effort
• Auditability: enforced tracking and archiving without developer involvement
PDI framework
• Ease of use: adding new commands takes at most 45 minutes
• Agile: building the framework took 1 day
• Low profile: average memory usage of 250 MB
  • 19. Biggest mistakes along the way
Reliance on documentation and requirements over expert users
• Initial integration design was based on provided documentation and models, which were rarely accurate
• Current users of the sources should have been engaged earlier to explain undocumented caveats
Late utilization of templates and variables
• Variables were utilized late in development, slowing progress significantly and creating consistency issues
• Good initial design of templates significantly reduces development time in the mid/long run
Aggressive overextension of resources
• We attempted to design and populate the entire data vault (in addition to other projects) before focusing on customer deliverables like reports
• We have shifted focus to continuous release of new information rather than waiting for completeness
  • 20. Primary takeaways
◦ Sources are like cars: the older they are, the more idiosyncrasies. Be cautious with design automation!
◦ Automation can enormously simplify and accelerate data warehousing. Don't be afraid to roll your own
◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
◦ Cloud-based architecture built on column-store DBs is extremely scalable, cheap, and highly performant
◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!

Editor's notes

  1. One of the focus points will be the return satellite; perhaps the whole connection to the return location and customer should have been modeled as a link? The return satellite is an active satellite.