SlideShare une entreprise Scribd logo
1  sur  33
Kelly Stirman
VP Strategy
@kstirman
Why data prep?
Analytics on modern
data is incredibly hard
Unprecedented complexity
The demands for data
are growing rapidly
Increasing demands
Reporting
New products
Forecasting
Threat detection
BI
Machine
Learning
Segmenting
Fraud prevention
Your analysts are hungry for data
SQL
Today you engineer data flows and reshaping
Data Staging
• Custon ETL
• Fragile transforms
• Slow moving
SQL
Today you engineer data flows and reshaping
Data Staging
Data Warehouse
• $$$
• High overhead
• Proprietary lock in
• Custon ETL
• Fragile transforms
• Slow moving
SQL
Today you engineer data flows and reshaping
Data Staging
Data Warehouse
Cubes, BI Extracts &
Aggregation Tables • Data sprawl
• Governance issues
• Slow to update
• $$$
• High overhead
• Proprietary lock in
• Custon ETL
• Fragile transforms
• Slow moving
SQL
+
+
+
+
+
+
+
+
+
Data Prep
Source: Forrester
Old v. new approaches
Data Integration v. Data Prep
Data Integration Data Prep
Primary user IT Business Analyst
User works from Metadata Data samples
Prioritizes Governance, security Ease of use, time to insight
Sample vendors Informatica, IBM, SAS, SQL tools Alteryx, Trifacta, Paxata
Data Integration is the standard
• For 25+ years, Data Integration has been an essential tool for IT
• Pros
• Mature, robust
• Deep integrations to enterprise standards
• Security and governance controls
• Server-based: scalable, centralized
• Cons
• IT users only
• Assumes minimal data quality
• Mature for enterprise sources
• Less mature for cloud, 3rd party apps, Hadoop, NoSQL
• Complex, expensive
Data Prep prioritizes speed, ease of use
• Newer entrants, architected for modern resources
• Pros
• User experience works for both IT, Business
• Data-centric model vs. metadata-centric model
• Support for Hadoop, NoSQL, Cloud, machine learning
• Can leverage Hadoop and/or cloud for processing, storage
• Faster time to value
• Cons
• Less mature tech stack
• Small vendors, limited ecosystem of integrations and skills
• Security integrations less comprehensive
• Assumes governance, authority, lineage handled elsewhere
• Still need IT on board and coordinating process
Gartner 2016 Forrester 2016 Bloor 2017
Analyst coverage (see references)
Open source alternatives
• RDBMS
• Pros: SQL based; mature; ecosystem
• Cons: non-relational sources; scalability; ease of use
• Apache Hive
• Pros: scalabilty; SQL based; Hadoop integrations;
• Cons: latency; ease of use; integrations
• Apache Spark
• Pros: scalability; performance; Python/R integration; ML
• Cons: ease of use; integrations; maturity
• Python Pandas
• Pros: performance; pervasive skills; ecosystem; flexibility
• Cons: scale out is complex; ease of use
• R dplyr
• Pros: performance; pervasive skills; ecosystem; flexibility
• Cons: scale out is complex; ease of use
Screenshot commentary
How to decide?
Category Good Fit Primary User Model Scalability
ETL Tools
Static, predictable
integrations between
enterprise tech
IT
Data Pipeline,
metadata-based
Single server
BI Tools ”Last Mile” data prep Business Embedded Desktop
Trifacta, Paxata
Scalable, collaborative
data prep for business
users
Business
Spreadsheet,
sample-based
Hadoop cluster
Custom Scripts Maximum flexibility IT
Data Pipeline,
metadata-based
Single server
Alteryx, Datawatch
Building BI extracts,
easier to use than ETL
IT
Data Pipeline,
metadata-based
Desktop (single
server optional)
SAS Data Loader IT users IT
Data Pipeline,
metadata-based
Single server
Tamr
Human-aided ML for
data cleansing
Business
Spreadsheet,
sample-based
Single server
Important questions to ask
• Usability – knowing data is more important than knowing tech
• Collaboration – essential feature for business users
• Data sources – ODBC for NoSQL, cloud, Hadoop not enough
• License model – will influence how you adopt the tool
• Governance – solving problems or creating new ones?
• Complexity – how many moving parts for your end to end analytical
process
• Vendor viability – crowded market of small players
• Ecosystem – no technology is an island
Market predictions
• BI tools build integrated capabilities
• But customers want one solution for all tools
• ETL vendors try to become “business friendly”
• Legacy technology stack is an impediment, not an enabler
• Hadoop vendors acquire emerging data prep players
• What about data outside of Hadoop?
• Opportunity for new approach
• Truly self service for the business (no IT required)
• Works with all data sources (relational, cloud, NoSQL, Hadoop)
• Works with all analytical tools (BI, SQL, R, Python, Spark)
• Integrates all layers of the analytical stack
References
• Gartner (Market Guide) https://www.gartner.com/doc/3418832/market-guide-selfservice-data-preparation
• Forrester (Wave) https://www.forrester.com/report/The+Forrester+Wave+Data+Preparation+Tools+Q1+2017/-/E-RES128464
• Forrester (Vendor Landscape) https://www.forrester.com/report/Vendor+Landscape+Data+Preparation+Tools/-/E-RES128561
• Bloor Research: http://www.bloorresearch.com/technology/data-preparation-self-service/
• Informatica Demo: https://youtu.be/UBsUrJjggwc
• Alteryx Demo: https://youtu.be/LwO6VL1ScXk?t=1m25s
• SAS (data prep) Demo: https://youtu.be/9e_uxQBUPsQ?t=2m34s
• Trifacta Demo: https://www.youtube.com/watch?v=4VpW6oJ3cQI
• Paxata Demo: https://youtu.be/TR1smNYB4ks?t=18m6s
• Datawatch Demo: https://youtu.be/6hc_cafMsCs?t=2m22s
• Tableau (data prep) Demo: https://youtu.be/vlwfD9VyJME?t=20m49s
• Tamr Demo: https://youtu.be/PI_EqvIX45o
Kelly Stirman
VP Strategy
@kstirman
Want to try a new approach?
Contact me about the Dremio Beta Program
kelly@dremio.com

Contenu connexe

Tendances

Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeBizTalk360
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
Integration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data LakeIntegration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data LakeTom Kerkhove
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Databricks
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?Vincent Terrasi
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannDatabricks
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lakeMykola Zerniuk
 
Not only SQL - Database Choices
Not only SQL - Database ChoicesNot only SQL - Database Choices
Not only SQL - Database ChoicesLynn Langit
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data FactoryBizTalk360
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine LearningMark Tabladillo
 
Data Lakes with Azure Databricks
Data Lakes with Azure DatabricksData Lakes with Azure Databricks
Data Lakes with Azure DatabricksData Con LA
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleDatabricks
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks
 

Tendances (20)

Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
Integration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data LakeIntegration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data Lake
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Not only SQL - Database Choices
Not only SQL - Database ChoicesNot only SQL - Database Choices
Not only SQL - Database Choices
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
 
Introduction to Azure HDInsight
Introduction to Azure HDInsightIntroduction to Azure HDInsight
Introduction to Azure HDInsight
 
Data Lakes with Azure Databricks
Data Lakes with Azure DatabricksData Lakes with Azure Databricks
Data Lakes with Azure Databricks
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 

En vedette

Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too muchJulian Hyde
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 

En vedette (10)

Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 

Similaire à Options for Data Prep - A Survey of the Current Market

Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleDatabricks
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse OptimizationCloudera, Inc.
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningProvectus
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...confluent
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataAshnikbiz
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
Derfor skal du bruge en DataLake
Derfor skal du bruge en DataLakeDerfor skal du bruge en DataLake
Derfor skal du bruge en DataLakeMicrosoft
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise AnalyticsDATAVERSITY
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalVMware Tanzu Korea
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarioskcmallu
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureVenu Anuganti
 

Similaire à Options for Data Prep - A Survey of the Current Market (20)

Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse Optimization
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
Derfor skal du bruge en DataLake
Derfor skal du bruge en DataLakeDerfor skal du bruge en DataLake
Derfor skal du bruge en DataLake
 
Retail & CPG
Retail & CPGRetail & CPG
Retail & CPG
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 

Dernier

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 

Dernier (20)

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 

Options for Data Prep - A Survey of the Current Market

  • 2.
  • 4. Analytics on modern data is incredibly hard Unprecedented complexity
  • 5. The demands for data are growing rapidly Increasing demands Reporting New products Forecasting Threat detection BI Machine Learning Segmenting Fraud prevention
  • 6. Your analysts are hungry for data SQL
  • 7. Today you engineer data flows and reshaping Data Staging • Custon ETL • Fragile transforms • Slow moving SQL
  • 8. Today you engineer data flows and reshaping Data Staging Data Warehouse • $$$ • High overhead • Proprietary lock in • Custon ETL • Fragile transforms • Slow moving SQL
  • 9. Today you engineer data flows and reshaping Data Staging Data Warehouse Cubes, BI Extracts & Aggregation Tables • Data sprawl • Governance issues • Slow to update • $$$ • High overhead • Proprietary lock in • Custon ETL • Fragile transforms • Slow moving SQL + + + + + + + + +
  • 11.
  • 13. Old v. new approaches
  • 14. Data Integration v. Data Prep Data Integration Data Prep Primary user IT Business Analyst User works from Metadata Data samples Prioritizes Governance, security Ease of use, time to insight Sample vendors Informatica, IBM, SAS, SQL tools Alteryx, Trifacta, Paxata
  • 15. Data Integration is the standard • For 25+ years, Data Integration has been an essential tool for IT • Pros • Mature, robust • Deep integrations to enterprise standards • Security and governance controls • Server-based: scalable, centralized • Cons • IT users only • Assumes minimal data quality • Mature for enterprise sources • Less mature for cloud, 3rd party apps, Hadoop, NoSQL • Complex, expensive
  • 16. Data Prep prioritizes speed, ease of use • Newer entrants, architected for modern resources • Pros • User experience works for both IT, Business • Data-centric model vs. metadata-centric model • Support for Hadoop, NoSQL, Cloud, machine learning • Can leverage Hadoop and/or cloud for processing, storage • Faster time to value • Cons • Less mature tech stack • Small vendors, limited ecosystem of integrations and skills • Security integrations less comprehensive • Assumes governance, authority, lineage handled elsewhere • Still need IT on board and coordinating process
  • 17. Gartner 2016 Forrester 2016 Bloor 2017 Analyst coverage (see references)
  • 18. Open source alternatives • RDBMS • Pros: SQL based; mature; ecosystem • Cons: non-relational sources; scalability; ease of use • Apache Hive • Pros: scalabilty; SQL based; Hadoop integrations; • Cons: latency; ease of use; integrations • Apache Spark • Pros: scalability; performance; Python/R integration; ML • Cons: ease of use; integrations; maturity • Python Pandas • Pros: performance; pervasive skills; ecosystem; flexibility • Cons: scale out is complex; ease of use • R dplyr • Pros: performance; pervasive skills; ecosystem; flexibility • Cons: scale out is complex; ease of use
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 29. Category Good Fit Primary User Model Scalability ETL Tools Static, predictable integrations between enterprise tech IT Data Pipeline, metadata-based Single server BI Tools ”Last Mile” data prep Business Embedded Desktop Trifacta, Paxata Scalable, collaborative data prep for business users Business Spreadsheet, sample-based Hadoop cluster Custom Scripts Maximum flexibility IT Data Pipeline, metadata-based Single server Alteryx, Datawatch Building BI extracts, easier to use than ETL IT Data Pipeline, metadata-based Desktop (single server optional) SAS Data Loader IT users IT Data Pipeline, metadata-based Single server Tamr Human-aided ML for data cleansing Business Spreadsheet, sample-based Single server
  • 30. Important questions to ask • Usability – knowing data is more important than knowing tech • Collaboration – essential feature for business users • Data sources – ODBC for NoSQL, cloud, Hadoop not enough • License model – will influence how you adopt the tool • Governance – solving problems or creating new ones? • Complexity – how many moving parts for your end to end analytical process • Vendor viability – crowded market of small players • Ecosystem – no technology is an island
  • 31. Market predictions • BI tools build integrated capabilities • But customers want one solution for all tools • ETL vendors try to become “business friendly” • Legacy technology stack is an impediment, not an enabler • Hadoop vendors acquire emerging data prep players • What about data outside of Hadoop? • Opportunity for new approach • Truly self service for the business (no IT required) • Works with all data sources (relational, cloud, NoSQL, Hadoop) • Works with all analytical tools (BI, SQL, R, Python, Spark) • Integrates all layers of the analytical stack
  • 32. References • Gartner (Market Guide) https://www.gartner.com/doc/3418832/market-guide-selfservice-data-preparation • Forrester (Wave) https://www.forrester.com/report/The+Forrester+Wave+Data+Preparation+Tools+Q1+2017/-/E-RES128464 • Forrester (Vendor Landscape) https://www.forrester.com/report/Vendor+Landscape+Data+Preparation+Tools/-/E-RES128561 • Bloor Research: http://www.bloorresearch.com/technology/data-preparation-self-service/ • Informatica Demo: https://youtu.be/UBsUrJjggwc • Alteryx Demo: https://youtu.be/LwO6VL1ScXk?t=1m25s • SAS (data prep) Demo: https://youtu.be/9e_uxQBUPsQ?t=2m34s • Trifacta Demo: https://www.youtube.com/watch?v=4VpW6oJ3cQI • Paxata Demo: https://youtu.be/TR1smNYB4ks?t=18m6s • Datawatch Demo: https://youtu.be/6hc_cafMsCs?t=2m22s • Tableau (data prep) Demo: https://youtu.be/vlwfD9VyJME?t=20m49s • Tamr Demo: https://youtu.be/PI_EqvIX45o
  • 33. Kelly Stirman VP Strategy @kstirman Want to try a new approach? Contact me about the Dremio Beta Program kelly@dremio.com

Notes de l'éditeur

  1. BI assumes single relational database, but… Data in non-relational technologies Data fragmented across many systems Massive scale and velocity
  2. Data is the business, and… Era of impatient smartphone natives Rise of self-service BI Accelerating time to market Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle: Slow or non-responsive IT “Shadow Analytics” Data governance risk Illusive data engineers Immature software Competing strategic initiatives
  3. Here’s the problem everyone is trying to solve today. You have consumers of data with their favorite tools. BI products like Tableau, PowerBI, Qlik, as well as data science tools like Python, R, Spark, and SQL. Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud like S3. So how are you going to get the data to the people asking for it?
  4. Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
  5. Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too. Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
  6. Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too. Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so … You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts. In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change. But worst of all, you’re left with a dynamic where every time a consumer of the data wants a new piece of data: They open a ticket with IT IT begins an engineering project to build another set of pipelines, over several weeks or months