Data comes in many shapes and sizes, and every company struggles to transform, validate, and enrich it for multiple purposes. The problem is as old as data itself, and the market offers an overwhelming number of options. In this presentation we look at the problem and at the key options from vendors in the market today. Dremio is a new approach that eliminates the need for standalone data prep tools.
5. The demands for data are growing rapidly
Increasing demands:
• Reporting
• New products
• Forecasting
• Threat detection
• BI
• Machine learning
• Segmenting
• Fraud prevention
7. Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
SQL
8. Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
Data Warehouse
• $$$
• High overhead
• Proprietary lock-in
SQL
9. Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
Data Warehouse
• $$$
• High overhead
• Proprietary lock-in
Cubes, BI Extracts & Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
SQL
14. Data Integration v. Data Prep
                | Data Integration                 | Data Prep
Primary user    | IT                               | Business Analyst
User works from | Metadata                         | Data samples
Prioritizes     | Governance, security             | Ease of use, time to insight
Sample vendors  | Informatica, IBM, SAS, SQL tools | Alteryx, Trifacta, Paxata
15. Data Integration is the standard
• For 25+ years, Data Integration has been an essential tool for IT
• Pros
• Mature, robust
• Deep integrations to enterprise standards
• Security and governance controls
• Server-based: scalable, centralized
• Cons
• IT users only
• Assumes minimal data quality
• Mature for enterprise sources
• Less mature for cloud, 3rd party apps, Hadoop, NoSQL
• Complex, expensive
16. Data Prep prioritizes speed, ease of use
• Newer entrants, architected for modern resources
• Pros
• User experience works for both IT, Business
• Data-centric model vs. metadata-centric model
• Support for Hadoop, NoSQL, Cloud, machine learning
• Can leverage Hadoop and/or cloud for processing, storage
• Faster time to value
• Cons
• Less mature tech stack
• Small vendors, limited ecosystem of integrations and skills
• Security integrations less comprehensive
• Assumes governance, authority, lineage handled elsewhere
• Still need IT on board and coordinating process
29. Category | Good Fit | Primary User | Model | Scalability
ETL Tools | Static, predictable integrations between enterprise tech | IT | Data Pipeline, metadata-based | Single server
BI Tools | "Last mile" data prep | Business | Embedded | Desktop
Trifacta, Paxata | Scalable, collaborative data prep for business users | Business | Spreadsheet, sample-based | Hadoop cluster
Custom Scripts | Maximum flexibility | IT | Data Pipeline, metadata-based | Single server
Alteryx, Datawatch | Building BI extracts, easier to use than ETL | IT | Data Pipeline, metadata-based | Desktop (single server optional)
SAS Data Loader | IT users | IT | Data Pipeline, metadata-based | Single server
Tamr | Human-aided ML for data cleansing | Business | Spreadsheet, sample-based | Single server
30. Important questions to ask
• Usability – knowing data is more important than knowing tech
• Collaboration – essential feature for business users
• Data sources – ODBC for NoSQL, cloud, Hadoop not enough
• License model – will influence how you adopt the tool
• Governance – solving problems or creating new ones?
• Complexity – how many moving parts in your end-to-end analytical process
• Vendor viability – crowded market of small players
• Ecosystem – no technology is an island
31. Market predictions
• BI tools build integrated capabilities
• But customers want one solution for all tools
• ETL vendors try to become “business friendly”
• Legacy technology stack is an impediment, not an enabler
• Hadoop vendors acquire emerging data prep players
• What about data outside of Hadoop?
• Opportunity for new approach
• Truly self service for the business (no IT required)
• Works with all data sources (relational, cloud, NoSQL, Hadoop)
• Works with all analytical tools (BI, SQL, R, Python, Spark)
• Integrates all layers of the analytical stack
BI assumes a single relational database, but…
• Data lives in non-relational technologies
• Data is fragmented across many systems
• Massive scale and velocity
Data is the business, and…
• Era of impatient smartphone natives
• Rise of self-service BI
• Accelerating time to market
Because of the complexity of modern data and the increasing demands for it, IT gets crushed in the middle:
• Slow or non-responsive IT
• “Shadow Analytics”
• Data governance risk
• Elusive data engineers
• Immature software
• Competing strategic initiatives
Here’s the problem everyone is trying to solve today.
You have consumers of data with their favorite tools. BI products like Tableau, PowerBI, Qlik, as well as data science tools like Python, R, Spark, and SQL.
Then you have all your data in a mix of relational, NoSQL, and Hadoop systems, plus cloud storage like S3.
So how are you going to get the data to the people asking for it?
Here’s how everyone tries to solve it:
First you move the data out of the operational systems into a staging area, which might be Hadoop or one of the cloud file systems like S3 or Azure Blob Store.
You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
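To make that concrete, here is a minimal sketch of the kind of hand-written staging job described above. The table, bucket, and connection details are hypothetical; the point is that the column list and query are baked into the script, which is exactly what makes it fragile when the source changes.

```python
import csv
import io

import boto3
import psycopg2

SOURCE_DSN = "host=ops-db dbname=sales user=etl"  # hypothetical operational database
STAGING_BUCKET = "acme-staging"                   # hypothetical S3 bucket
COLUMNS = ["order_id", "customer_id", "amount", "created_at"]  # hard-coded schema


def extract_orders_to_s3(day: str) -> None:
    """Dump one day of orders from the source system to the staging area as CSV."""
    conn = psycopg2.connect(SOURCE_DSN)
    try:
        with conn.cursor() as cur:
            # The column list and WHERE clause are baked in: any change to the
            # source table means editing and redeploying this script.
            cur.execute(
                f"SELECT {', '.join(COLUMNS)} FROM orders "
                "WHERE created_at::date = %s",
                (day,),
            )
            buf = io.StringIO()
            writer = csv.writer(buf)
            writer.writerow(COLUMNS)
            writer.writerows(cur)
    finally:
        conn.close()

    # Land the extract in the staging bucket for the next pipeline stage.
    boto3.client("s3").put_object(
        Bucket=STAGING_BUCKET,
        Key=f"staging/orders/{day}.csv",
        Body=buf.getvalue().encode("utf-8"),
    )
```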
Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts.
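That second hop is typically yet another one-off script. Here is a sketch of what it often looks like, assuming a Redshift target and the hypothetical staging bucket and table names from the previous example.

```python
import psycopg2

WAREHOUSE_DSN = "host=dw.example.com port=5439 dbname=analytics user=loader"  # hypothetical
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"                     # hypothetical


def load_orders(day: str) -> None:
    """Load one day's staged CSV from S3 into the warehouse via Redshift COPY."""
    conn = psycopg2.connect(WAREHOUSE_DSN)
    try:
        with conn.cursor() as cur:
            # The COPY statement is coupled to the staging layout produced by
            # the previous script; if that layout changes, this breaks too.
            cur.execute(
                "COPY warehouse.orders "
                f"FROM 's3://acme-staging/staging/orders/{day}.csv' "
                f"IAM_ROLE '{IAM_ROLE}' CSV IGNOREHEADER 1"
            )
        conn.commit()
    finally:
        conn.close()
```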
But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts.
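The aggregation layer means a third script per rollup. A sketch of one such job, assuming a hypothetical daily_sales rollup table in the same warehouse:

```python
import psycopg2

WAREHOUSE_DSN = "host=dw.example.com port=5439 dbname=analytics user=loader"  # hypothetical


def rebuild_daily_sales(day: str) -> None:
    """Recompute one day's slice of a rollup table that BI dashboards query."""
    conn = psycopg2.connect(WAREHOUSE_DSN)
    try:
        with conn.cursor() as cur:
            # Delete-and-reinsert keeps the rollup idempotent for reruns,
            # but it is one more job to schedule, monitor, and keep in sync.
            cur.execute(
                "DELETE FROM warehouse.daily_sales WHERE sale_date = %(day)s",
                {"day": day},
            )
            cur.execute(
                "INSERT INTO warehouse.daily_sales (sale_date, customer_id, total_amount) "
                "SELECT created_at::date, customer_id, SUM(amount) "
                "FROM warehouse.orders "
                "WHERE created_at::date = %(day)s "
                "GROUP BY 1, 2",
                {"day": day},
            )
        conn.commit()
    finally:
        conn.close()
```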
In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change.
But worst of all, you're left with a dynamic where, every time a consumer wants a new piece of data, they open a ticket with IT, and IT begins an engineering project to build another set of pipelines over several weeks or months.