1. Drilling into Data with Apache Drill
Tomer Shiran, Apache Drill Founder and PMC Member
Jacques Nadeau, Apache Drill PMC Chair
2. Tomer Shiran
tshiran@apache.org
@tshiran
Drill founder and PMC Member
MapR VP Product
Jacques Nadeau
jnadeau@apache.org
@intjesus
Drill PMC Chair (VP, Apache Drill)
3. Apache Drill
• Open source SQL query engine for non-relational datastores
– JSON document model
– Columnar
• Key advantages:
– Query any non-relational datastore
– No overhead (creating and maintaining schemas, transforming data, …)
– Treat your data like a table even when it’s not
– Keep using the BI tools you love
– Scales from one laptop to 1000s of servers
– Great performance and scalability
4. Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational database
environment on top of Hadoop, Drill instead enables a SQL language interface to
data in numerous formats, without requiring a formal schema to be declared. This
enables plug-and-play discovery over a huge universe of data without
prerequisites and preparation. So while Drill uses SQL, and can connect to
Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might
be SQL-on-Everything, with very low setup requirements.”
Andrew Brust
5. Drill Integrates With What You Have
Any Non-Relational Datastore
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores
Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau
– Qlik
– MicroStrategy
– TIBCO Spotfire
– Excel
• Command line (Drill shell)
• Web and mobile apps
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …
6. Achieving “End-to-End Performance”
Execute fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up
Iterate fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle
7. JSON Model, Columnar Speed
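[Figure: 2x2 grid of data models. One axis runs from fixed schema to schema-less, the other from flat to complex. Formats placed on the grid: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV.]
RDBMS/SQL-on-Hadoop table (flat, fixed schema):
+----------+--------+-----+
| Name     | Gender | Age |
+----------+--------+-----+
| Michael  | M      | 6   |
| Jennifer | F      | 3   |
+----------+--------+-----+
Apache Drill table (complex, schema-less):
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}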
8. Apache Drill Provides the Best of Both Worlds
Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore
Even When Your Data Doesn’t
• Path-based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and Relational Operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
9. Why? To Support the Changing Data Organization
Data Dev Circa 2000
1. Developer comes up with requirements
2. DBA defines tables
3. DBA defines indices
4. DBA defines FK relationships
5. Developer stores data
6. BI builds reports
7. Analyst views reports
8. DBA adds materialized views
Data Today
1. Developer builds app, defines schema, stores data
2. Analyst queries data
3. Data engineer fixes performance problems or fills functionality gaps
11. Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
[Diagram: a Drillbit is a single process (daemon or CLI).]
12. Data Lake, More Like Data Maelstrom
[Diagram: data scattered across clustered servers (an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster running mongod, a Cassandra cluster) and desktops (Windows, Mac).]
13. Run Drillbits Wherever; Whatever Your Data
[Diagram: the same clusters and desktops as on the previous slide, now with Drillbits running on the HDFS, HBase, mongod, and Cassandra nodes and on the Windows and Mac desktops.]
14. Connect to Any Drillbit with ODBC, JDBC, C, Java, REST
1. User connects to Drillbit
2. That Drillbit becomes Foreman
– Foreman generates execution plan
– Cost-based query optimization & locality
3. Execution fragments are farmed out to other Drillbits
4. Drillbits exchange data as necessary to guarantee relational algebra
5. Results are returned to user through Foreman Drillbit
[Diagram: a user connected to one Drillbit, which acts as foreman, alongside another Drillbit.]
17. Run Drill in Embedded Mode (drill-embedded)
$ tar xf apache-drill-1.0.0.tar.gz
$ cd apache-drill-1.0.0
$ bin/drill-embedded
> SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;
+----------------+----------------------------------+---------------+-------+
| yelping_since | votes | review_count | name |
+----------------+----------------------------------+---------------+-------+
| 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee |
+----------------+----------------------------------+---------------+-------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode
• Web UI is available at localhost:8047
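The same query can also be submitted over REST rather than through the shell. The following is a minimal sketch, assuming the REST interface listens on the same port as the Web UI and accepts a POST to `/query.json` with `queryType` and `query` fields (check this against the REST documentation for your Drill version). The helper names `build_query_request` and `run_query` are illustrative, not part of Drill.

```python
import json
import urllib.request


def build_query_request(sql, host="localhost", port=8047):
    """Build (but do not send) an HTTP POST carrying one SQL statement."""
    # Assumed request shape: {"queryType": "SQL", "query": "<statement>"}
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, host="localhost", port=8047):
    """Send the query to a running Drillbit and return the decoded JSON reply."""
    with urllib.request.urlopen(build_query_request(sql, host, port)) as resp:
        return json.load(resp)


# Only the request is built here, so this runs without a Drillbit.
request = build_query_request(
    "SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1"
)
print(request.get_method(), request.full_url)
```

With a Drillbit running in embedded mode, `run_query(...)` would send the same request. It is not called here, so the sketch needs no server.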
26. Are There More 5-Star or 1-Star Reviews?
> SELECT stars, count(*)
FROM dfs.root.`/Users/tshiran/yelp/review.json`
GROUP BY stars ORDER BY stars;
+--------+---------+
| stars | EXPR$1 |
+--------+---------+
| 1 | 110772 |
| 2 | 102737 |
| 3 | 163761 |
| 4 | 342143 |
| 5 | 406045 |
+--------+---------+
5 rows selected (3.739 seconds)
27. Using Storage Plugins and Workspaces
> SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json`
LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
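In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and yelp/review.json is the path relative to the workspace.
+----------------+-----------+----------------------------+
| Storage Plugin | Workspace | Table                      |
+----------------+-----------+----------------------------+
| dfs            | Path      | Path relative to workspace |
| mongo          | Database  | Collection                 |
| hive           | Database  | Table                      |
| hbase          | Namespace | Table                      |
+----------------+-----------+----------------------------+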
28. Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
FROM mongo.yelp.users
GROUP BY name
ORDER BY users DESC LIMIT 10;
+------------+------------+
| name | users |
+------------+------------+
| David | 2453 |
| John | 2378 |
| Michael | 2322 |
| Chris | 2202 |
| Mike | 2037 |
| Jennifer | 1867 |
| Jessica | 1463 |
| Jason | 1457 |
| Michelle | 1439 |
| Brian | 1436 |
+------------+------------+
29. Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
FROM dfs.demo.`/yelp/business.json`
GROUP BY state, city
ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state | city | businesses |
+------------+------------+-------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+-------------+
33. Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Saturday.`open` < '22:00' AND
b.hours.Saturday.`close` > '22:00'
LIMIT 2;
+------------------------------+------------------------------------------------+
| name | hours |
+------------------------------+------------------------------------------------+
| Chang Jiang Chinese Kitchen | {"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Saturday":{"close":"23:00","open":"11:00"}} |
+------------------------------+------------------------------------------------+
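The WHERE clause above compares times as strings. That works because zero-padded 24-hour 'HH:MM' strings sort in the same order as the times they denote. A minimal Python sketch of the same predicate, using the two rows from the result (the function name `is_open` is illustrative):

```python
def is_open(hours, day, at):
    """Return True if the business is open on `day` at time `at` ('HH:MM')."""
    slot = hours.get(day)
    if slot is None:  # no hours recorded for that day
        return False
    # Zero-padded 24-hour strings compare in the same order as the times.
    return slot["open"] < at < slot["close"]


# The two rows from the result above.
businesses = [
    {"name": "Chang Jiang Chinese Kitchen",
     "hours": {"Saturday": {"close": "22:30", "open": "11:00"}}},
    {"name": "Grand China Restaurant",
     "hours": {"Saturday": {"close": "23:00", "open": "11:00"}}},
]

open_now = [b["name"] for b in businesses
            if is_open(b["hours"], "Saturday", "22:00")]
print(open_now)
```

One consequence of this comparison, in the sketch and in the query alike: a place whose closing time falls after midnight (for example a close of '02:00') does not satisfy close > '22:00' and is left out.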
34. It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, b.hours.Friday AS friday, categories
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| name | friday | categories |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| Olives | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
35. Flatten Repeated Values
> SELECT name, categories
FROM dfs.demo.`yelp/business.json` LIMIT 3;
+-----------------------------+-------------------------------------------+
| name | categories |
+-----------------------------+-------------------------------------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+-----------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.demo.`yelp/business.json` LIMIT 5;
+-----------------------------+-------------------------+
| name | categories |
+-----------------------------+-------------------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+-----------------------------+-------------------------+
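To make the row expansion explicit, here is a small Python sketch of what FLATTEN does to these three records: each input row yields one output row per array element, with the other columns repeated. This only illustrates the semantics on the rows shown above. It is not how Drill implements the operator, and the helper name `flatten` is ours.

```python
def flatten(rows, field):
    """Yield one copy of each row per element of row[field] (a list)."""
    for row in rows:
        for element in row[field]:
            # Copy the row, replacing the array with a single element.
            yield {**row, field: element}


# The three rows from the first query above.
rows = [
    {"name": "Eric Goldberg, MD",
     "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant",
     "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(rows, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```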
36. Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
FROM (SELECT name, FLATTEN(categories) AS category
FROM dfs.demo.`yelp/business.json`) c
GROUP BY category ORDER BY businesses DESC;
+-----------------------------------+-------------+
| category | businesses |
+-----------------------------------+-------------+
| Restaurants | 14303 |
| Shopping | 6428 |
…
| Australian | 1 |
| Boat Dealers | 1 |
| Firewood | 1 |
+-----------------------------------+-------------+
715 rows selected (3.439 seconds)
> SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and
REPEATED_CONTAINS(categories, 'Australian');
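+-------------------+-------------------------------------------------------------------------+
| name              | categories                                                              |
+-------------------+-------------------------------------------------------------------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+-------------------+-------------------------------------------------------------------------+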
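The two queries on this slide combine three steps: flatten the array, group and count, then filter on array membership. A hedged Python sketch of the same logic follows. It uses only the three business rows shown on the FLATTEN slide, so the counts differ from the full dataset. The membership test is a plain exact match and does not cover any pattern matching REPEATED_CONTAINS may support.

```python
from collections import Counter

# Three rows taken from the FLATTEN slide, not the full business.json.
businesses = [
    {"name": "Eric Goldberg, MD",
     "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant",
     "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

# FLATTEN + GROUP BY category + count(*): count every array element.
counts = Counter(c for b in businesses for c in b["categories"])

# ORDER BY businesses DESC: most_common() sorts by descending count.
ranked = counts.most_common()

# REPEATED_CONTAINS(categories, 'Restaurants') as exact membership.
restaurants = [b["name"] for b in businesses
               if "Restaurants" in b["categories"]]

print(ranked[0])
print(restaurants)
```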
38. Create a View for Name-Gender Mapping
> CREATE VIEW dfs.tmp.`names` AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name | gender |
+------------+------------+
| John | Male |
+------------+------------+
names.csv: columns[0] holds the name and columns[4] holds the gender.
39. Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY u.name, n.gender
ORDER BY number DESC LIMIT 10;
+------------+------------+------------+
| name | gender | number |
+------------+------------+------------+
| David | Male | 2453 |
| John | Male | 2378 |
| Michael | Male | 2322 |
| Chris | Unknown | 2202 |
| Mike | Male | 2037 |
| Jennifer | Female | 1867 |
| Jessica | Female | 1463 |
| Jason | Male | 1457 |
| Michelle | Female | 1439 |
| Brian | Male | 1436 |
+------------+------------+------------+
40. Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY n.gender;
+------------+------------+------------+
| gender | users | stars |
+------------+------------+------------+
| Female | 103684 | 3.77 |
| Male | 97430 | 3.696 |
| Unknown | 18409 | 3.727 |
+------------+------------+------------+
41. Who Writes Longer Reviews – Men or Women?
> SELECT n.gender, round(avg(length(r.text))) AS review_length
FROM dfs.demo.`yelp/review.json` r,
mongo.yelp.users u,
dfs.tmp.names n
WHERE u.name = n.name AND r.user_id = u.user_id
GROUP BY n.gender;
+------------+---------------+
| gender | review_length |
+------------+---------------+
| Male | 665 |
| Female | 730 |
| Unknown | 711 |
+------------+---------------+
It takes a 3-way join to find out…
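The shape of that 3-way join can be sketched in Python. The rows below are hypothetical toy data invented for illustration, not taken from the Yelp dataset. They stand in for review.json, the MongoDB users collection, and the names view. The sketch joins reviews to users on user_id, users to names on name, then averages review length per gender.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the three sources.
reviews = [
    {"user_id": "u1", "text": "Great food"},
    {"user_id": "u2", "text": "Too loud, would not return"},
    {"user_id": "u1", "text": "Fine"},
]
users = [
    {"user_id": "u1", "name": "John"},
    {"user_id": "u2", "name": "Jennifer"},
]
names = [
    {"name": "John", "gender": "Male"},
    {"name": "Jennifer", "gender": "Female"},
]

# Lookup tables for the two join keys (a simple in-memory hash join).
user_by_id = {u["user_id"]: u for u in users}
gender_by_name = {n["name"]: n["gender"] for n in names}

lengths = defaultdict(list)
for r in reviews:
    user = user_by_id.get(r["user_id"])
    if user is None or user["name"] not in gender_by_name:
        continue  # inner join: drop rows with no match on either key
    lengths[gender_by_name[user["name"]]].append(len(r["text"]))

# GROUP BY gender with round(avg(length(text))).
review_length = {g: round(sum(v) / len(v)) for g, v in lengths.items()}
print(review_length)
```

In Drill the three inputs live in different systems (a JSON file, MongoDB, and a view over a CSV file), and the single SQL statement above replaces this hand-written joining code.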
42. Thank You!
• Download at drill.apache.org
• Get in touch:
• tshiran@apache.org
• jnadeau@apache.org
• Ask questions:
• user@drill.apache.org
• Tweet: @ApacheDrill
Editor's notes
All other SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON documents (with additional types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.
If you consider the four data models shown in the 2x2, all of them can be represented by the complex, schema-less model (JSON) because it is the most flexible. However, none of the other models can be represented by the flat, fixed-schema model. Therefore, with any SQL engine except Drill, the data has to be transformed before it can be queried.
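The asymmetry described in these notes can be shown with a short Python sketch, using the flat table and the two nested records from slide 7. Turning flat rows into documents is a direct mapping. Going the other way means choosing a fixed column list and transforming nested or repeated values. Here they are serialized to strings, which is just one possible choice made for illustration.

```python
import json

# The flat table from slide 7: every row has the same three columns.
header = ["Name", "Gender", "Age"]
table = [["Michael", "M", 6], ["Jennifer", "F", 3]]

# Flat -> documents: each row becomes one object, nothing is lost.
documents = [dict(zip(header, row)) for row in table]

# The nested, schema-less records from slide 7.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

# Documents -> flat: the column list is the union of all keys, missing
# fields become None, and nested or repeated values are serialized.
columns = sorted({key for record in records for key in record})
flat = [
    [json.dumps(r[c]) if isinstance(r.get(c), (dict, list)) else r.get(c)
     for c in columns]
    for r in records
]

print(documents)
print(columns)
for row in flat:
    print(row)
```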