© 2014 MapR Technologies
SQL on Hadoop 
The haves and have-nots
A touch of history
A more complete view
But FORTRAN kills 
what it touches 
The four main threads:
Fortran
Lisp
Algol
COBOL
Lisp: math model, REPL
Algol: scoping, data structures, OO
COBOL: “English” like
These were the seeds: 
readable as text 
math model 
interactivity 
scoping
And then, 
punctuation reinvented 
strong typing added
And structured programming
struct, objects, methods
(Simula, Pascal, C)
Late compilation
(Lisp, Java)
So we come to Java, Python, 
Scala, JavaScript and other
current languages
But think back to the seeds 
SQL sprouted, 
and then continued alone 
added vectorizers, radical 
optimizer, declarative 
SQL retained: 
COBOL’s data types, 
strict, fixed typing, 
limited scoping, 
almost no recursion
But what do we need now? 
scalability 
data agility 
interactivity
Data is doubling in size every two years
IDC estimates that in 2020, there will be 44 zettabytes of data in the world
2011: 1.8 zettabytes
2013: 4.4 zettabytes
2020: 44 zettabytes
Source: IDC Digital Universe
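As a rough check on the "doubling every two years" claim, the three IDC figures quoted on the slide can be turned into an implied doubling time. This is only a sketch: it assumes steady exponential growth between each pair of data points, and the function name `doubling_time_years` is made up for illustration.

```python
import math

# Figures quoted on the slide (IDC Digital Universe), in zettabytes.
DATA_ZB = {2011: 1.8, 2013: 4.4, 2020: 44.0}


def doubling_time_years(start_year, end_year, sizes=DATA_ZB):
    """Years needed to double, assuming steady exponential growth
    between the two data points (an assumption, not an IDC claim)."""
    growth = sizes[end_year] / sizes[start_year]
    return (end_year - start_year) * math.log(2) / math.log(growth)


if __name__ == "__main__":
    for start, end in [(2011, 2013), (2013, 2020)]:
        years = doubling_time_years(start, end)
        print(f"{start}-{end}: data doubles about every {years:.1f} years")
```

Under that assumption the 2013 to 2020 projection works out to a doubling time of about 2.1 years, which is consistent with the slide's headline; the 2011 to 2013 figures imply somewhat faster growth, about 1.6 years per doubling.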
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, structured vs. unstructured data, 1980 to 2020]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Unstructured Data is Ubiquitous 
Social Media 
Messages 
Audio 
Sensors 
Mobile Data 
Email 
Clickstream
360° Customer View: 5 PB of customer data
Largest Biometric Database in the World: 1.2B people
So where do we stand?
Scalable? Hive
Interactive? Impala, Drill
Data agile? Drill
The Future of Hadoop: 
Data agility, scalability, interactivity
Distance to Data
Existing approaches require a middleman (IT)
MapReduce: Business (analysts, developers) -> “Plumbing” development -> Data
Hive and other SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data
Real-World Data Modeling and Transformations 
Distance to Data
Existing approaches require a middleman (IT)
MapReduce: Business (analysts, developers) -> “Plumbing” development -> Data
Hive and other SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data
Data Agility: Business (analysts, developers) -> Data
Why Improve Distance to Data?
1. Improve time to value
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
2. Reduce the burden on IT
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
40+ contributors
150+ years of experience building databases and distributed systems
Evolution Towards Self-Service Data Exploration
Traditional BI w/ RDBMS: data modeling and transformation IT-driven; data visualization IT-driven
Self-Service BI w/ RDBMS: data modeling and transformation IT-driven; data visualization self-service
SQL-on-Hadoop: data modeling and transformation IT-driven; data visualization self-service
Self-Service Data Exploration: data modeling and transformation not needed; data visualization self-service
Zero-day analytics
(1) Self-Describing Data is Ubiquitous 
Flat files in DFS 
• Complex data (Thrift, Avro, protobuf) 
• Columnar data (Parquet, ORC) 
• Loosely defined (JSON) 
• Traditional files (CSV, TSV) 
Data stored in NoSQL stores 
• Relational-like (rows, columns) 
• Sparse data (NoSQL maps) 
• Embedded blobs (JSON) 
• Document stores (nested objects) 
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
(2) Drill’s Data Model is Flexible
[Chart: formats placed along two flexibility axes, fixed schema to schema-less and flat to complex: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
(3) Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
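To make "schema on the fly" concrete, here is a minimal Python sketch that reads the two self-describing records from the earlier slides and derives the set of fields and types from the data alone, with no schema declared in advance. It illustrates the idea only and is not how Drill implements discovery; the helper names `field_types` and `discover_schema` are hypothetical.

```python
import json

# The two example records from the slides, written as valid JSON lines.
RECORDS = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]


def field_types(value, prefix=""):
    """Map each dotted field path in one decoded record to its Python type name."""
    if isinstance(value, dict):
        found = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            found.update(field_types(child, path))
        return found
    return {prefix: type(value).__name__}


def discover_schema(lines):
    """Union the fields seen across all records; nothing is declared up front."""
    schema = {}
    for line in lines:
        for path, type_name in field_types(json.loads(line)).items():
            schema.setdefault(path, set()).add(type_name)
    return schema


if __name__ == "__main__":
    for path, types in sorted(discover_schema(RECORDS).items()):
        print(path, sorted(types))
```

Note that `district` appears only in the first record and `preschool` only in the second; a reader that discovers structure as it goes can still expose both, whereas a fixed schema would have to be altered first.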
Quick Tour 
Self-Service Data Exploration with Apache Drill 
Zero to Results in 2 Minutes (3 Commands)
Install:
$ tar xzf apache-drill.tar.gz
Launch shell (embedded mode):
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local
Query:
0: jdbc:drill:zk=local>
SELECT count(*) AS incidents, columns[1] AS category
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
GROUP BY columns[1]
ORDER BY incidents DESC;
Results:
+------------+-----------------+
| incidents  | category        |
+------------+-----------------+
| 8372       | LARCENY/THEFT   |
| 4247       | OTHER OFFENSES  |
| 3765       | NON-CRIMINAL    |
| 2502       | ASSAULT         |
...
35 rows selected (0.847 seconds)
Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
A storage engine instance
- DFS
- HBase
- Hive Metastore/HCatalog
A workspace
- Sub-directory
- Hive database
A table
- pathnames
- HBase table
- Hive table
Query Directory Trees 
# Query file: How many errors per level in Jan 2014? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` 
GROUP BY errorLevel; 
# Query directory sub-tree: How many errors per level? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs` 
GROUP BY errorLevel; 
# Query some partitions: How many errors per level by month from 2012? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs` 
WHERE dirs[1] >= 2012 
GROUP BY errorLevel, dirs[2];
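The third query treats directory names as queryable values. The Python sketch below mimics that idea outside Drill: it splits each file path under the queried root into directory levels, skips whole directories whose year is too old, and groups the remaining rows by error level and month. The file contents are invented sample data, and the mapping of `dirs[1]` to the year level and `dirs[2]` to the month level is an assumption read off the slide's query, not a statement about Drill's actual column naming.

```python
from collections import Counter
from pathlib import PurePosixPath

ROOT = PurePosixPath("/AppServerLogs")

# Hypothetical sample data: file path -> errorLevel of each row in that file.
FILES = {
    "/AppServerLogs/2011/Dec/part0001.parquet": [1, 3],
    "/AppServerLogs/2012/Jan/part0001.parquet": [2, 2, 5],
    "/AppServerLogs/2014/Jan/part0001.parquet": [3],
}


def errors_per_level_by_month(files, min_year):
    """Count rows per (errorLevel, month), reading only directories >= min_year."""
    counts = Counter()
    for path, error_levels in files.items():
        # First two levels below the root: assumed to be year, then month.
        year, month = PurePosixPath(path).relative_to(ROOT).parts[:2]
        if int(year) < min_year:
            continue  # the whole directory is skipped; its rows are never read
        for level in error_levels:
            counts[(level, month)] += 1
    return dict(counts)


if __name__ == "__main__":
    print(errors_per_level_by_month(FILES, min_year=2012))
```

As in the slide's query, grouping is by month name only, so January 2012 and January 2014 fall into the same group.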
Works with HBase and Embedded Blobs 
# Query an HBase table directly (no schemas) 
SELECT cf1.month, cf1.year 
FROM hbase.table1; 
# Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
SELECT profile.name, count(profile.children) 
FROM ( 
SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile 
FROM hbase.users 
)
Combine Data Sources on the Fly 
# Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work 
FROM dfs.logs.`/data/logs` logs, 
dfs.users.`/profiles.json` users 
WHERE logs.uid = users.id AND 
logs.errorLevel > 5; 
# Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
SELECT users.name, count(*) as tweetCount 
FROM hive.social.tweets tweets, 
hbase.users users 
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8') 
GROUP BY tweets.userId, users.name;
Summary 
• Enable rapid data exploration and application development while 
reducing the burden on IT 
• Apache Drill is here now 
– Email tdunning@mapr.com or tshiran@mapr.com 
• Get involved 
– Download and play: http://incubator.apache.org/drill/ 
– Ask questions: drill-user@incubator.apache.org 
– Contribute: http://github.com/apache/incubator-drill/
Thank You 
Ted Dunning, Chief Application Architect 
@mapr maprtech 
tdunning@mapr.com 
tdunning@apache.org 
MapRTechnologies 
maprtech 
mapr-technologies

 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 

Big Data Everywhere Chicago: SQL on Hadoop

  • 1. © 2014 MapR Technologies. SQL on Hadoop: The haves and have-nots
  • 2. A touch of history
  • 3. (no text on slide)
  • 4. A more complete view
  • 5. (no text on slide)
  • 6. The four main threads: Fortran, Lisp, Algol, COBOL. But FORTRAN kills what it touches.
  • 7. Lisp: math model, REPL. Algol: scoping, data structures, OO. COBOL: “English”-like.
  • 8. These were the seeds: readable as text, math model, interactivity, scoping
  • 9. And then: punctuation reinvented, strong typing added
  • 10. And structured programming: struct, objects, methods (Simula, Pascal, C). Late compilation (Lisp, Java).
  • 11. So we come to Java, Python, Scala, JavaScript and other current languages
  • 12. But think back to the seeds: SQL sprouted, and then continued alone; added vectorizers, radical optimizer, declarative
  • 13. SQL retained: COBOL’s data types, strict, fixed typing, limited scoping, almost no recursion
  • 14. But what do we need now? Scalability, data agility, interactivity
  • 15. Data is doubling in size every two years
  • 16. IDC estimates that in 2020, there will be 44 zettabytes of data in the world. 2011: 1.8 zettabytes; 2013: 4.4 zettabytes; 2020: 44 zettabytes. Source: IDC Digital Universe
  • 17. Total data stored, 1980 to 2020, structured vs. unstructured: unstructured data will account for more than 80% of the data collected by organizations. Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
  • 18. Unstructured Data is Ubiquitous: social media, messages, audio, sensors, mobile data, email, clickstream
  • 19. 360° Customer View: 5 PB customer data
  • 20. Largest Biometric Database in the World: 1.2B people
  • 21. So where do we stand? Scalable? Hive. Interactive? Impala, Drill. Data agile? Drill.
  • 22. The Future of Hadoop: Data agility, scalability, interactivity
  • 23. Distance to Data: existing approaches require a middleman (IT)
       MapReduce: business (analysts, developers) -> “plumbing” development -> data
       Hive and other SQL-on-Hadoop: business (analysts, developers) -> modeling and transformations -> data
  • 24. Real-World Data Modeling and Transformations
  • 25. (no text on slide)
  • 26. Distance to Data: existing approaches require a middleman (IT)
       MapReduce: business (analysts, developers) -> “plumbing” development -> data
       Hive and other SQL-on-Hadoop: business (analysts, developers) -> modeling and transformations -> data
       Data agility: business (analysts, developers) -> data
  • 27. Why Improve Distance to Data? Improve time to value; reduce the burden on IT
       • Enable rapid data exploration and application development
       • IT should provide a valuable service without “getting in the way”
       • Can’t add DBAs to keep up with the exponential data growth
       • Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
  • 28. Apache Drill: 40+ contributors, 150+ years of experience building databases and distributed systems
       • Pioneering Data Agility for Hadoop
       • Apache open source project
       • Scale-out execution engine for low-latency queries
       • Unified SQL-based API for analytics & operational applications
  • 29. Evolution Towards Self-Service Data Exploration
       Traditional BI w/ RDBMS: data modeling and transformation IT-driven; data visualization IT-driven
       Self-Service BI w/ RDBMS: data modeling and transformation IT-driven; data visualization self-service
       SQL-on-Hadoop: data modeling and transformation IT-driven; data visualization self-service
       Self-Service Data Exploration (zero-day analytics): data modeling and transformation not needed; data visualization self-service
  • 30. (1) Self-Describing Data is Ubiquitous
       Flat files in DFS:
       • Complex data (Thrift, Avro, protobuf)
       • Columnar data (Parquet, ORC)
       • Loosely defined (JSON)
       • Traditional files (CSV, TSV)
       Data stored in NoSQL stores:
       • Relational-like (rows, columns)
       • Sparse data (NoSQL maps)
       • Embedded blobs (JSON)
       • Document stores (nested objects)
       Example records:
         { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
         { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 31. (2) Drill’s Data Model is Flexible
       Axes of flexibility: fixed schema vs. schema-less; flat vs. complex
       Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
       RDBMS/SQL-on-Hadoop table (columns Name, Gender, Age):
         Michael, M, 6
         Jennifer, F, 3
       Apache Drill table:
         { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
         { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 32. (3) Drill Supports Schema Discovery On-The-Fly
       Schema declared in advance (schema on write, schema before read):
       • Fixed schema
       • Leverage schema in centralized repository (Hive Metastore)
       Schema discovered on-the-fly (schema on the fly):
       • Fixed schema, evolving schema or schema-less
       • Leverage schema in centralized repository or self-describing data
  • 33. Quick Tour: Self-Service Data Exploration with Apache Drill
  • 34. (no recoverable text on slide)
  • 35. Zero to Results in 2 Minutes (3 Commands)
       Install:
         $ tar xzf apache-drill.tar.gz
       Launch shell (embedded mode):
         $ apache-drill/bin/sqlline -u jdbc:drill:zk=local
       Query:
         0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category
                                 FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
                                 GROUP BY columns[1]
                                 ORDER BY incidents DESC;
       Results:
         +------------+-----------------+
         | incidents  | category        |
         +------------+-----------------+
         | 8372       | LARCENY/THEFT   |
         | 4247       | OTHER OFFENSES  |
         | 3765       | NON-CRIMINAL    |
         | 2502       | ASSAULT         |
         ...
         35 rows selected (0.847 seconds)
  • 36. Data Source is in the Query
         SELECT timestamp, message
         FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
         WHERE errorLevel > 2
       The three parts of the FROM clause:
       • dfs1: a storage engine instance (DFS, HBase, Hive Metastore/HCatalog)
       • logs: a workspace (sub-directory, Hive database)
       • `AppServerLogs/2014/Jan/p001.parquet`: a table (pathnames, HBase table, Hive table)
  • 37. Query Directory Trees
         -- Query file: How many errors per level in Jan 2014?
         SELECT errorLevel, count(*)
         FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
         GROUP BY errorLevel;

         -- Query directory sub-tree: How many errors per level?
         SELECT errorLevel, count(*)
         FROM dfs.logs.`/AppServerLogs`
         GROUP BY errorLevel;

         -- Query some partitions: How many errors per level by month from 2012?
         SELECT errorLevel, count(*)
         FROM dfs.logs.`/AppServerLogs`
         WHERE dirs[1] >= 2012
         GROUP BY errorLevel, dirs[2];
  • 38. Works with HBase and Embedded Blobs
         -- Query an HBase table directly (no schemas)
         SELECT cf1.month, cf1.year
         FROM hbase.table1;

         -- Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
         SELECT profile.name, count(profile.children)
         FROM (
           SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
           FROM hbase.users
         )
  • 39. Combine Data Sources on the Fly
         -- Join log directory with JSON file (user profiles) to identify the name and
         -- email address for anyone associated with an error message.
         SELECT DISTINCT users.name, users.emails.work
         FROM dfs.logs.`/data/logs` logs,
              dfs.users.`/profiles.json` users
         WHERE logs.uid = users.id
           AND logs.errorLevel > 5;

         -- Join a Hive table and an HBase table (without Hive metadata) to determine
         -- the number of tweets per user
         SELECT users.name, count(*) AS tweetCount
         FROM hive.social.tweets tweets,
              hbase.users users
         WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
         GROUP BY tweets.userId;
  • 40. Summary
       • Enable rapid data exploration and application development while reducing the burden on IT
       • Apache Drill is here now
         – Email tdunning@mapr.com or tshiran@mapr.com
       • Get involved
         – Download and play: http://incubator.apache.org/drill/
         – Ask questions: drill-user@incubator.apache.org
         – Contribute: http://github.com/apache/incubator-drill/
  • 41. Thank You. Ted Dunning, Chief Application Architect. tdunning@mapr.com, tdunning@apache.org. MapR social accounts: @mapr, maprtech, MapRTechnologies, mapr-technologies
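
The "query some partitions" example on slide 37 filters and groups on directory names rather than on columns stored in the files. The idea can be sketched in plain Python as below. This is a minimal illustration, not Drill's implementation: the sample records, the `dirs` helper, and the zero-based indexing (in which `dirs[1]` is the year and `dirs[2]` the month) are assumptions chosen to match the paths shown on the slide.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical (file path, errorLevel) pairs standing in for log rows.
RECORDS = [
    ("/AppServerLogs/2011/Dec/part0001.parquet", 3),
    ("/AppServerLogs/2012/Jan/part0001.parquet", 3),
    ("/AppServerLogs/2012/Jan/part0002.parquet", 5),
    ("/AppServerLogs/2014/Jan/part0001.parquet", 3),
]


def dirs(path):
    """Return the directory components of a path, without root or file name."""
    return list(PurePosixPath(path).parts[1:-1])


def errors_per_level_by_month(records, min_year):
    """Count rows per (errorLevel, month) for directories at or after min_year."""
    counts = Counter()
    for path, level in records:
        d = dirs(path)
        if int(d[1]) >= min_year:      # plays the role of: WHERE dirs[1] >= 2012
            counts[(level, d[2])] += 1  # plays the role of: GROUP BY errorLevel, dirs[2]
    return dict(counts)


result = errors_per_level_by_month(RECORDS, 2012)
print(result)
```

Because the filter only looks at the path, files under the 2011 directory never need to be opened, which is the practical benefit of laying data out in a directory tree.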

Editor's Notes

  1. Have someone introduce me. Thank audience (tie to morning activities), sponsors, HP, etc. We’re here because this is the biggest thing that has happened to Hadoop…
  2. Here at the conference we’re talking about data science. But before we can appreciate the changes happening in data science, we must first talk about data. Data is doubling every two years. The fast-growing volume, variety and velocity of data is overwhelming traditional systems and approaches. A revolutionary approach is required to leverage this data. And with this new technology, data science as we know it is undergoing tremendous change.
  3. To give you a sense of the data volumes that we’re talking about, I’ve included this chart that shows why a revolutionary approach is needed. You can see the amount of data growing from 1.8 zettabytes in 2011 to 44 zettabytes in 2020. To put this into perspective, a large data warehouse contains terabytes of data. A zettabyte is 1 billion terabytes. Numbers in the chart are from two IDC reports (sponsored by EMC): http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  4. What is the source of this data growth? While structured data growth has been relatively modest, the growth in unstructured data has been exponential. Source of statistic: http://link.springer.com/chapter/10.1007/978-3-642-39146-0_2
  5. sensor data, social media, clickstream, genomic data, location information, video files, etc.
  6. 19
  7. 20
  8. Many organizations now want to unlock the data in Hadoop and make it accessible to a broader audience within their organizations. That’s easier said than done. While we’ve largely solved the infrastructure scalability challenge, the massive volume, variety and velocity of this data introduces serious challenges on the human side, such as how to prepare all that data and make it available to users, how to make operational data available in real time for analytics, etc. We need better technology to empower users to take advantage of these massive volumes of data. Past: enable organizations to capture the data. Future: enable organizations to more easily extract value from all this captured data. What does the future of Hadoop look like? Outline: the problem (I’m sure many of you have experienced this, just like the quotes); why we want to solve it; here’s what we’re doing about it.
  9. One of the challenges with Hadoop as well as traditional data management tools is the business user’s “distance from the data”. The dependency on IT (or additional development) increases time to value and reduces agility. It also creates a burden on IT at a time when IT is already overworked. The red arrows in this illustration can represent significant backlogs and delays (often many months). Many of you are likely having to spend a lot of time on plumbing development and data preparation. How many have had to do this? (show of hands)
  10. “Data modeling and transformations” may seem easy, but when you look at a real-world environment, you could have thousands of data sets.
  11. Opportunity
  12. This is the opportunity. The audience should feel like this is their chance to become heroes by bringing this to their companies. They have to feel (be emotional) about the problem at this point.
  13. IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.). The so-what needs to be conveyed: why does it matter that it’s not needed? 6 months -> 3 months -> 3 months -> day zero. So imagine now what you can get… Data agility is needed for business agility. >>> Stand still during slide, move in at the punchline (why does this matter to YOU)
  14. Need an example or analogy to explain self-describing data.
  15. All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON documents (with additional types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before. If you consider the four data models shown in the 2x2, all of them can be represented by the complex, schema-less model (JSON) because it is the most flexible. The flat, fixed-schema model, by contrast, cannot represent any of the other three. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be made available to queries.
  16. TODO: Add Impala and Splunk logos
  17. What I want you to see now is how easy it is to ….
  18. Is there something from Israel?
  19. With other technologies you have to do this, then this, then this, …
  20. Key takeaways. Core message: we are revolutionizing Hadoop. Call to action: get involved, and enjoy the conference, as we have great speakers. If doing Q&A, set boundaries (time: how much time we have; topic: what questions can I answer about this revolution) and have a back-pocket question (someone asked me this morning) -
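
Note 15 argues that a collection of JSON-like documents can hold records of differing structure, while a flat, fixed-schema table cannot. A minimal Python sketch of that idea follows, using the two example records from slide 30. It is an illustration under stated assumptions, not how Drill discovers schemas: the `discover_fields` helper and the dotted-path notation are hypothetical, and the slide's unquoted keys and values are quoted here so that they parse as valid JSON.

```python
import json

# The two records from slide 30, quoted so they are valid JSON.
DOCS = [
    '{"name": {"first": "Michael", "last": "Smith"},'
    ' "hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"},'
    ' "hobbies": ["sing"], "preschool": "CCLC"}',
]


def discover_fields(value, prefix=""):
    """Return the set of dotted field paths found in one decoded record."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            paths |= discover_fields(child, f"{prefix}{key}.")
    elif isinstance(value, list):
        paths.add(prefix.rstrip(".") + "[]")  # repeated field
    else:
        paths.add(prefix.rstrip("."))         # scalar leaf
    return paths


records = [json.loads(doc) for doc in DOCS]
schemas = [discover_fields(record) for record in records]

# Fields shared by both records, and fields unique to each one.
print(sorted(schemas[0] & schemas[1]))
print(sorted(schemas[0] - schemas[1]), sorted(schemas[1] - schemas[0]))

# A flat row from the slide 31 table is just a document with no nesting.
flat_row = {"Name": "Michael", "Gender": "M", "Age": 6}
print(sorted(discover_fields(flat_row)))
```

The flat row passes through the same function unchanged, whereas squeezing the nested records into fixed Name/Gender/Age-style columns would require deciding in advance what to do with `hobbies`, `district`, and `preschool`. That asymmetry is the point the note makes.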