Have someone introduce me.
Thank audience (tie to morning activities), sponsors, HP, etc.
We’re here because this is the biggest thing that has happened to Hadoop…
Here at the conference we’re talking about data science. But before we can appreciate the changes happening in data science, we must first talk about data. Data is doubling every two years. The fast-growing volume, variety, and velocity of data is overwhelming traditional systems and approaches. A revolutionary approach is required to leverage this data, and with this new technology, data science as we know it is undergoing tremendous change.
To give you a sense of the data volumes we’re talking about, I’ve included this chart showing why a revolutionary approach is needed. You can see data growing from 1.8 zettabytes to 44 zettabytes in under a decade. To put this into perspective: a large data warehouse contains terabytes of data, and a zettabyte is 1 billion terabytes.
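If it helps to sanity-check the scale before presenting, here is a quick back-of-the-envelope in Python (the 100 TB warehouse size is an assumed illustration, not a figure from the slide):

```python
# Back-of-the-envelope check on the slide's IDC numbers.
start_zb, end_zb = 1.8, 44.0   # zettabytes, per the chart
tb_per_zb = 1e9                # 1 zettabyte = 1 billion terabytes

growth_factor = end_zb / start_zb
print(f"growth: ~{growth_factor:.0f}x")

# An assumed "large" 100 TB data warehouse as a fraction of 44 ZB:
warehouse_tb = 100
fraction = warehouse_tb / (end_zb * tb_per_zb)
print(f"one 100 TB warehouse is {fraction:.1e} of the digital universe")
```

That is roughly a 24x increase, and even a very large warehouse is a vanishingly small slice of the total.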
Numbers in the chart are from two IDC reports (sponsored by EMC):
http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf
http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
What is the source of this data growth? While structured data growth has been relatively modest, the growth in unstructured data has been exponential.
Source of statistic: http://link.springer.com/chapter/10.1007/978-3-642-39146-0_2
sensor data, social media, clickstream, genomic data, location information, video files, etc.
Many organizations now want to unlock the data in Hadoop and make it accessible to a broader audience within their organizations. That’s easier said than done. While we’ve largely solved the infrastructure scalability challenge, the massive volume, variety, and velocity of this data introduces serious challenges on the human side: how to prepare all that data and make it available to users, how to make operational data available for real-time analytics, and so on. We need better technology to empower users to take advantage of these massive volumes of data.
Past: Enable organizations to capture the data.
Future: Enable organizations to more easily extract value from all this captured data.
What does the future of Hadoop look like?
The problem
I’m sure many of you have experienced this (just like the quotes)
Why we want to solve it
Here’s what we’re doing about it
One of the challenges with Hadoop as well as traditional data management tools is the business user’s “distance from the data”.
The dependency on IT (or additional development) increases time to value and reduces agility. It also creates a burden on IT at a time when IT is already overworked. The red arrows in this illustration can represent significant backlogs and delays (often many months).
Many of you likely spend a lot of time on plumbing development and data preparation. How many of you have had to do this? (show of hands)
“Data modeling and transformations” may seem easy, but when you look at a real-world environment, you could have thousands of data sets.
Opportunity
This is the opportunity.
The audience should feel like this is their chance to become heroes by bringing this to their companies.
They have to feel (be emotional) about the problem at this point.
IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.)
The so-what needs to be conveyed: why it matters that this work is no longer needed.
6 months -> 3 months -> 3 months -> day zero
So imagine now what you can get…
Data Agility is needed for Business Agility
>>> Stand still during slide, move in at the punchline (why does this matter to YOU)
Need an example or analogy to explain self-describing data.
All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like structures with rows and columns: every record has the same structure, and there is no support for nested data or repeating fields. Drill instead views a table conceptually as a collection of JSON documents (with additional data types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.
If you consider the four data models shown in the 2x2, all of them can be represented by the complex, no-schema model (JSON) because it is the most flexible. No other data model, however, can be represented by the flat, fixed-schema model. Therefore, with any SQL engine except Drill, the data has to be transformed before it is available to queries.
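To make the 2x2 concrete, here is a minimal sketch (Python, with made-up example records) of why the document model subsumes the flat model but not the other way around:

```python
# Flat, fixed-schema rows: every record has the same columns.
flat_rows = [
    {"id": 1, "name": "alice", "age": 34},
    {"id": 2, "name": "bob", "age": 41},
]

# Complex, schema-less documents (hypothetical data): each record
# may have different fields, nested objects, or repeating fields.
documents = [
    {"id": 1, "name": "alice", "age": 34},                       # flat record
    {"id": 2, "name": "bob", "emails": ["b@x.com", "b@y.com"]},  # repeating field
    {"id": 3, "address": {"city": "Haifa", "zip": "31000"}},     # nested record
]

# Every flat row is already a valid document: the flexible model
# subsumes the rigid one with no transformation.
assert all(isinstance(r, dict) for r in flat_rows)

# The reverse fails: these documents do not share one column set,
# so they cannot land in a single fixed-schema table without an
# up-front transformation step.
column_sets = {frozenset(d) for d in documents}
print(len(column_sets))  # prints 3 -- three distinct structures
```

The point for the audience: a flat-table engine forces that transformation step on every data set before the first query; a document-model engine does not.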
TODO: Add Impala and Splunk logos
What I want you to see now is how easy it is to ….
Is there something from Israel?
With other technologies you have to do this, then this, then this, …
Key takeaways
Core message – We are revolutionizing Hadoop
Call to action – get involved, and enjoy the conference as we have great speakers
If doing Q&A, set boundaries (time - how much time we have, topic – what questions can I answer about this revolution), back pocket question (someone asked me this morning)