Data comes in many shapes and sizes, and every company struggles to transform, validate, and enrich it for multiple purposes. The problem is as old as data itself, and the market offers an overwhelming number of options. In this presentation we look at the problem and at the key options from vendors in the market today. Dremio is a new approach that eliminates the need for standalone data prep tools.
5. The demands for data are growing rapidly
Increasing demands:
• Reporting
• New products
• Forecasting
• Threat detection
• BI
• Machine learning
• Segmenting
• Fraud prevention
7. Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
SQL
8. Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
Data Warehouse
• $$$
• High overhead
• Proprietary lock-in
SQL
9. Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
Data Warehouse
• $$$
• High overhead
• Proprietary lock-in
Cubes, BI Extracts & Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
SQL
14. Data Integration v. Data Prep
                | Data Integration                 | Data Prep
Primary user    | IT                               | Business Analyst
User works from | Metadata                         | Data samples
Prioritizes     | Governance, security             | Ease of use, time to insight
Sample vendors  | Informatica, IBM, SAS, SQL tools | Alteryx, Trifacta, Paxata
15. Data Integration is the standard
• For 25+ years, Data Integration has been an essential tool for IT
• Pros
• Mature, robust
• Deep integrations to enterprise standards
• Security and governance controls
• Server-based: scalable, centralized
• Cons
• IT users only
• Assumes minimal data quality
• Mature for enterprise sources
• Less mature for cloud, 3rd party apps, Hadoop, NoSQL
• Complex, expensive
16. Data Prep prioritizes speed, ease of use
• Newer entrants, architected for modern resources
• Pros
• User experience works for both IT, Business
• Data-centric model vs. metadata-centric model
• Support for Hadoop, NoSQL, Cloud, machine learning
• Can leverage Hadoop and/or cloud for processing, storage
• Faster time to value
• Cons
• Less mature tech stack
• Small vendors, limited ecosystem of integrations and skills
• Security integrations less comprehensive
• Assumes governance, authority, lineage handled elsewhere
• Still need IT on board and coordinating process
29. Category | Good Fit | Primary User | Model | Scalability
ETL Tools | Static, predictable integrations between enterprise tech | IT | Data Pipeline, metadata-based | Single server
BI Tools | "Last mile" data prep | Business | Embedded | Desktop
Trifacta, Paxata | Scalable, collaborative data prep for business users | Business | Spreadsheet, sample-based | Hadoop cluster
Custom Scripts | Maximum flexibility | IT | Data Pipeline, metadata-based | Single server
Alteryx, Datawatch | Building BI extracts, easier to use than ETL | IT | Data Pipeline, metadata-based | Desktop (single server optional)
SAS Data Loader | IT users | IT | Data Pipeline, metadata-based | Single server
Tamr | Human-aided ML for data cleansing | Business | Spreadsheet, sample-based | Single server
30. Important questions to ask
• Usability – knowing data is more important than knowing tech
• Collaboration – essential feature for business users
• Data sources – ODBC for NoSQL, cloud, Hadoop not enough
• License model – will influence how you adopt the tool
• Governance – solving problems or creating new ones?
• Complexity – how many moving parts in your end-to-end analytical process
• Vendor viability – crowded market of small players
• Ecosystem – no technology is an island
31. Market predictions
• BI tools build integrated capabilities
• But customers want one solution for all tools
• ETL vendors try to become “business friendly”
• Legacy technology stack is an impediment, not an enabler
• Hadoop vendors acquire emerging data prep players
• What about data outside of Hadoop?
• Opportunity for new approach
• Truly self service for the business (no IT required)
• Works with all data sources (relational, cloud, NoSQL, Hadoop)
• Works with all analytical tools (BI, SQL, R, Python, Spark)
• Integrates all layers of the analytical stack
BI assumes a single relational database, but…
• Data lives in non-relational technologies
• Data is fragmented across many systems
• Massive scale and velocity
Data is the business, and…
• Era of impatient smartphone natives
• Rise of self-service BI
• Accelerating time to market
Because of the complexity of modern data and the increasing demands for it, IT gets crushed in the middle:
• Slow or non-responsive IT
• “Shadow Analytics”
• Data governance risk
• Elusive data engineers
• Immature software
• Competing strategic initiatives
Here’s the problem everyone is trying to solve today.
You have consumers of data with their favorite tools. BI products like Tableau, PowerBI, Qlik, as well as data science tools like Python, R, Spark, and SQL.
Then you have all your data in a mix of relational, NoSQL, and Hadoop systems, plus cloud storage like S3.
So how are you going to get the data to the people asking for it?
Here’s how everyone tries to solve it:
First you move the data out of the operational systems into a staging area, which might be Hadoop or one of the cloud file systems like S3 or Azure Blob Store.
You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
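To make that concrete, here is a minimal sketch of the kind of hand-written staging job described above. The table, bucket, and connection details are hypothetical; the point is that the column list and query are baked into the script, which is exactly what makes it fragile when the source changes.

```python
import csv
import io

import boto3
import psycopg2

SOURCE_DSN = "host=ops-db dbname=sales user=etl"  # hypothetical operational database
STAGING_BUCKET = "acme-staging"                   # hypothetical S3 bucket
COLUMNS = ["order_id", "customer_id", "amount", "created_at"]  # hard-coded schema


def extract_orders_to_s3(day: str) -> None:
    """Dump one day of orders from the source system to the staging area as CSV."""
    conn = psycopg2.connect(SOURCE_DSN)
    try:
        with conn.cursor() as cur:
            # The column list and WHERE clause are baked in: any change to the
            # source table means editing and redeploying this script.
            cur.execute(
                f"SELECT {', '.join(COLUMNS)} FROM orders "
                "WHERE created_at::date = %s",
                (day,),
            )
            buf = io.StringIO()
            writer = csv.writer(buf)
            writer.writerow(COLUMNS)
            writer.writerows(cur)
    finally:
        conn.close()

    # Land the extract in the staging bucket for the next pipeline stage.
    boto3.client("s3").put_object(
        Bucket=STAGING_BUCKET,
        Key=f"staging/orders/{day}.csv",
        Body=buf.getvalue().encode("utf-8"),
    )
```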
Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts.
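That second hop is typically yet another one-off script. Here is a sketch of what it often looks like, assuming a Redshift target and the hypothetical staging bucket and table names from the previous example.

```python
import psycopg2

WAREHOUSE_DSN = "host=dw.example.com port=5439 dbname=analytics user=loader"  # hypothetical
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"                     # hypothetical


def load_orders(day: str) -> None:
    """Load one day's staged CSV from S3 into the warehouse via Redshift COPY."""
    conn = psycopg2.connect(WAREHOUSE_DSN)
    try:
        with conn.cursor() as cur:
            # The COPY statement is coupled to the staging layout produced by
            # the previous script; if that layout changes, this breaks too.
            cur.execute(
                "COPY warehouse.orders "
                f"FROM 's3://acme-staging/staging/orders/{day}.csv' "
                f"IAM_ROLE '{IAM_ROLE}' CSV IGNOREHEADER 1"
            )
        conn.commit()
    finally:
        conn.close()
```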
But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts.
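The aggregation layer means a third script per rollup. A sketch of one such job, assuming a hypothetical daily_sales rollup table in the same warehouse:

```python
import psycopg2

WAREHOUSE_DSN = "host=dw.example.com port=5439 dbname=analytics user=loader"  # hypothetical


def rebuild_daily_sales(day: str) -> None:
    """Recompute one day's slice of a rollup table that BI dashboards query."""
    conn = psycopg2.connect(WAREHOUSE_DSN)
    try:
        with conn.cursor() as cur:
            # Delete-and-reinsert keeps the rollup idempotent for reruns,
            # but it is one more job to schedule, monitor, and keep in sync.
            cur.execute(
                "DELETE FROM warehouse.daily_sales WHERE sale_date = %(day)s",
                {"day": day},
            )
            cur.execute(
                "INSERT INTO warehouse.daily_sales (sale_date, customer_id, total_amount) "
                "SELECT created_at::date, customer_id, SUM(amount) "
                "FROM warehouse.orders "
                "WHERE created_at::date = %(day)s "
                "GROUP BY 1, 2",
                {"day": day},
            )
        conn.commit()
    finally:
        conn.close()
```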
In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change.
But worst of all, you're left with a dynamic where, every time a consumer wants a new piece of data, they open a ticket with IT, and IT begins an engineering project to build another set of pipelines over several weeks or months.