Drilling into Data with Apache Drill
Tomer Shiran, Apache Drill Founder and PMC Member
Jacques Nadeau, Apache Drill PMC Chair
Tomer Shiran: tshiran@apache.org, @tshiran, Drill founder and PMC Member, MapR VP Product
Jacques Nadeau: jnadeau@apache.org, @intjesus, Drill PMC Chair (VP, Apache Drill)
Apache Drill
• Open source SQL query engine for non-relational datastores
– JSON document model
– Columnar
• Key advantages:
– Query any non-relational datastore
– No overhead (creating and maintaining schemas, transforming data, …)
– Treat your data like a table even when it’s not
– Keep using the BI tools you love
– Scales from one laptop to 1000s of servers
– Great performance and scalability
Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational database
environment on top of Hadoop, Drill instead enables a SQL language interface to
data in numerous formats, without requiring a formal schema to be declared. This
enables plug-and-play discovery over a huge universe of data without
prerequisites and preparation. So while Drill uses SQL, and can connect to
Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might
be SQL-on-Everything, with very low setup requirements.”
Andrew Brust
Any Non-Relational Datastore
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google
Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores
Any Client
• Multiple interfaces: ODBC, JDBC, REST, C,
Java
• BI tools
– Tableau
– Qlik
– MicroStrategy
– TIBCO Spotfire
– Excel
• Command line (Drill shell)
• Web and mobile apps
– Many JSON-powered chart libraries (see
D3.js)
• SAS, R, …
Drill Integrates With What You Have
Achieving “End-to-End Performance”
Execute fast
• Standard SQL
• Read data fast
• Leverage columnar
encodings and execution
• Execute operations
quickly
• Scale out, not up
Iterate fast
• Work without prep
• Decentralize data
management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle
JSON Model, Columnar Speed
(Chart: data stores and formats placed along two axes, fixed schema vs. schema-less and flat vs. complex. Labels shown: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV.)
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{
"name": {
"first": "Michael",
"last": "Smith"
},
"hobbies": ["ski", "soccer"],
"district": "Los Altos"
}
{
"name": {
"first": "Jennifer",
"last": "Gates"
},
"hobbies": ["sing"],
"preschool": "CCLC"
}
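The two records above do not share the same fields, yet they can still be read as one table. A minimal Python sketch of that idea follows; it is a toy illustration only, not how Drill discovers schemas internally, and the helper names `discover_columns` and `as_rows` are made up for this example.

```python
import json

# The two records from the slide, written as valid JSON.
docs = [
    '{"name": {"first": "Michael", "last": "Smith"},'
    ' "hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"},'
    ' "hobbies": ["sing"], "preschool": "CCLC"}',
]


def discover_columns(records):
    """Collect top-level field names in first-seen order while reading."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


def as_rows(records, columns):
    """Project each record onto the columns; absent fields become None."""
    return [[record.get(col) for col in columns] for record in records]


records = [json.loads(doc) for doc in docs]
columns = discover_columns(records)
rows = as_rows(records, columns)
```

Each record keeps its nested `name` map and `hobbies` array, and a field that a record lacks simply shows up empty in that row.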
Apache Drill Provides the Best of Both Worlds
Acts Like a Database
• ANSI SQL: SELECT, FROM,
WHERE, JOIN, HAVING, ORDER
BY, WITH, CTAS, ALL, EXISTS,
ANY, IN, SOME
• VarChar, Int, BigInt, Decimal,
VarBinary, Timestamp, Float,
Double, etc.
• Subqueries, scalar subqueries,
partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore
Even When Your Data Doesn’t
• Path based queries and
wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and
Relational Operators
– FLATTEN, kvgen, convert_from,
convert_to, repeated_count, etc
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
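The wildcard query `select * from /revenue/*/q2` in the list above selects several directories at once. The Python sketch below shows which paths such a pattern would pick, assuming `*` stands for exactly one directory level; that assumption, the helper `path_matches`, and the directory names are all invented for illustration and should be checked against Drill's actual behavior.

```python
from fnmatch import fnmatchcase


def path_matches(path, pattern):
    """Match segment by segment so that `*` never crosses a `/` boundary."""
    path_parts = path.strip("/").split("/")
    pattern_parts = pattern.strip("/").split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(p, g) for p, g in zip(path_parts, pattern_parts))


# Hypothetical directory layout, invented for this example.
directories = [
    "/revenue/2013/q2",
    "/revenue/2014/q1",
    "/revenue/2014/q2",
    "/revenue/2014/emea/q2",
]

selected = [d for d in directories if path_matches(d, "/revenue/*/q2")]
```

Here `selected` holds the two `q2` directories that sit exactly one level below `/revenue`.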
Why? To Support the Changing Data Organization
Data Dev Circa 2000
1. Developer comes up with
requirements
2. DBA defines tables
3. DBA defines indices
4. DBA defines FK relationships
5. Developer stores data
6. BI builds reports
7. Analyst views reports
8. DBA adds materialized views
Data Today
1. Developer builds app, defines
schema, stores data
2. Analyst queries data
3. Data engineer fixes
performance problems or fills
functionality gaps
HOW DOES IT WORK?
Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring
knowledge as it reads
• Built to leverage large amounts of
memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
Drillbit
Single process
(daemon or CLI)
Data Lake, More Like Data Maelstrom
(Diagram: clustered servers, shown as an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster, and a Cassandra cluster, next to Windows and Mac desktops.)
Run Drillbits Wherever; Whatever Your Data
(Diagram: the same clusters and desktops, now with Drillbits running alongside the HDFS, HBase, mongod, and Cassandra nodes and on both desktops.)
Connect to Any Drillbit with ODBC, JDBC, C, Java, REST
1. User connects to Drillbit
2. That Drillbit becomes Foreman
– Foreman generates execution plan
– Cost-based query optimization &
locality
3. Execution fragments are farmed
to other Drillbits
4. Drillbits exchange data as
necessary to guarantee relational
algebra
5. Results are returned to user
through Foreman Drillbit
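The five steps above follow a scatter-gather pattern. The Python sketch below is a toy model of that pattern for a `COUNT(*) ... GROUP BY stars` query. It is not Drill's planner or execution engine: the round-robin split, the function names, and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def plan_fragments(rows, drillbit_names):
    """Foreman step: assign input rows round-robin, one fragment per Drillbit."""
    fragments = {name: [] for name in drillbit_names}
    for i, row in enumerate(rows):
        fragments[drillbit_names[i % len(drillbit_names)]].append(row)
    return fragments


def run_fragment(local_rows):
    """Worker step: partial COUNT(*) GROUP BY stars over the local rows only."""
    return Counter(row["stars"] for row in local_rows)


def foreman_query(rows, drillbit_names):
    """Plan, farm out, merge the partial counts, and return them ordered by key."""
    fragments = plan_fragments(rows, drillbit_names)
    partials = [run_fragment(local_rows) for local_rows in fragments.values()]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return sorted(merged.items())


# Made-up review rows, invented for this example.
reviews = [{"stars": s} for s in [5, 4, 5, 1, 3, 5, 4]]
result = foreman_query(reviews, ["drillbit-1", "drillbit-2", "drillbit-3"])
```

Partial aggregation on each worker keeps the data exchanged between processes small, because only per-group counts travel back to the foreman.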
(Diagram: User, Drillbit (foreman), and the other Drillbits.)
ANALYZING YELP DATA
1. DOWNLOAD AND INSTALL DRILL
Run Drill in Embedded Mode (drill-embedded)
$ tar xf apache-drill-1.0.0.tar.gz
$ cd apache-drill-1.0.0
$ bin/drill-embedded
> SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;
+----------------+----------------------------------+---------------+-------+
| yelping_since | votes | review_count | name |
+----------------+----------------------------------+---------------+-------+
| 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee |
+----------------+----------------------------------+---------------+-------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode
• Web UI is available at localhost:8047
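The port that serves the Web UI also serves Drill's REST interface. The Python sketch below builds a query request for it, on the understanding that the REST endpoint is `/query.json` and accepts a JSON body with `queryType` and `query` fields; verify this against the documentation for your Drill version. The helper names are hypothetical, and `run_query` needs a running Drillbit.

```python
import json
import urllib.request

# Assumes an embedded Drillbit on the local machine (port taken from the slide).
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying one SQL statement."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the request and decode the JSON response (requires a live Drillbit)."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response)


request = build_query_request(
    "SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1"
)
```

Building the request separately from sending it makes the payload easy to inspect before anything touches the network.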
Review the Query Profile in the Web UI
(localhost:8047)
Run Drill in Distributed Mode
$ zkServer start # ZooKeeper maintains the list of drillbits in the cluster
$ bin/drillbit.sh start # conf/drill-override.conf includes cluster name and ZK nodes
$ bin/drill-conf # or bin/drill-localhost to skip ZK lookup
> SELECT stars, count(*)
FROM dfs.root.`/Users/tshiran/yelp/review.json`
GROUP BY stars ORDER BY stars;
+--------+---------+
| stars | EXPR$1 |
+--------+---------+
| 1 | 110772 |
| 2 | 102737 |
| 3 | 163761 |
| 4 | 342143 |
| 5 | 406045 |
+--------+---------+
5 rows selected (3.739 seconds)
2. CONFIGURE DATASTORES (STORAGE PLUGINS)
Enable MongoDB Storage Plugin
Define Workspaces in the File Storage Plugin
3. EXPLORE THE DATA
The Data: Files
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
The Data: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin (empty)
local 0.078GB
yelp 0.453GB
> use yelp
> db.users.findOne()
{
"_id" : ObjectId("54566cdf3237149de181a92a"),
"yelping_since" : "2012-02",
"votes" : {
"funny" : 1,
"useful" : 5,
"cool" : 0
},
"review_count" : 6,
"name" : "Lee",
"user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
"friends" : [ ]
}
Are There More 5-Star or 1-Star Reviews?
> SELECT stars, count(*)
FROM dfs.root.`/Users/tshiran/yelp/review.json`
GROUP BY stars ORDER BY stars;
+--------+---------+
| stars | EXPR$1 |
+--------+---------+
| 1 | 110772 |
| 2 | 102737 |
| 3 | 163761 |
| 4 | 342143 |
| 5 | 406045 |
+--------+---------+
5 rows selected (3.739 seconds)
Using Storage Plugins and Workspaces
> SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json`
LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and yelp/review.json is the path relative to the workspace.
Storage Plugin   Workspace   Table
dfs              Path        Path relative to workspace
mongo            Database    Collection
hive             Database    Table
hbase            Namespace   Table
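The naming scheme above can be made concrete by splitting a qualified name into its three parts. The Python sketch below does this with a regular expression that treats backtick-quoted text as a single part. It only illustrates the naming convention; `split_name` is a hypothetical helper, not part of Drill.

```python
import re

# A part is either backtick-quoted text or a run of characters without dots.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")


def split_name(qualified):
    """Split a dotted name into parts, keeping backtick-quoted text whole."""
    return [quoted or bare for quoted, bare in _PART.findall(qualified)]


plugin, workspace, table = split_name("dfs.demo.`yelp/review.json`")
mongo_parts = split_name("mongo.yelp.users")
```

The backticks matter because the file path itself contains a dot, which would otherwise be read as a separator.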
Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
FROM mongo.yelp.users
GROUP BY name
ORDER BY users DESC LIMIT 10;
+------------+------------+
| name | users |
+------------+------------+
| David | 2453 |
| John | 2378 |
| Michael | 2322 |
| Chris | 2202 |
| Mike | 2037 |
| Jennifer | 1867 |
| Jessica | 1463 |
| Jason | 1457 |
| Michelle | 1439 |
| Brian | 1436 |
+------------+------------+
Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
FROM dfs.demo.`/yelp/business.json`
GROUP BY state, city
ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state | city | businesses |
+------------+------------+-------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+-------------+
3. EXPLORING COMPLEX DATA
business.json (1)
{
"business_id": "4bEjOyTaDG24SY5TxsaUNQ",
"full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
"hours": {
"Monday": {"close": "23:00", "open": "07:00"},
"Tuesday": {"close": "23:00", "open": "07:00"},
"Friday": {"close": "00:00", "open": "07:00"},
"Wednesday": {"close": "23:00", "open": "07:00"},
"Thursday": {"close": "23:00", "open": "07:00"},
"Sunday": {"close": "23:00", "open": "07:00"},
"Saturday": {"close": "00:00", "open": "07:00"}
},
"open": true,
"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
"city": "Las Vegas",
"review_count": 4084,
"name": "Mon Ami Gabi",
"neighborhoods": ["The Strip"],
"longitude": -115.172588519464,
business.json (2)
"state": "NV",
"stars": 4.0,
"attributes": {
"Alcohol": "full_bar",
"Noise Level": "average",
"Has TV": false,
"Attire": "casual",
"Ambience": {
"romantic": true,
"intimate": false,
"touristy": false,
"hipster": false,
"classy": true,
"trendy": false,
"casual": false
},
"Good For": {"dessert": false, "latenight": false, "lunch": false,
"dinner": true, "breakfast": false, "brunch": false}
}
}
Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Saturday.`open` < '22:00' AND
b.hours.Saturday.`close` > '22:00'
LIMIT 2;
+------------------------------+------------------------------------------------+
| name | hours |
+------------------------------+------------------------------------------------+
| Chang Jiang Chinese Kitchen | {"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Saturday":{"close":"23:00","open":"11:00"}} |
+------------------------------+------------------------------------------------+
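The query above compares times as strings. That works because zero-padded `HH:MM` values sort in clock order, but it has a blind spot. The Python sketch below mirrors the WHERE clause using hours taken from this deck; `open_at` is a hypothetical helper, and treating a missing day as "not open" is an assumption.

```python
def open_at(hours, day, now):
    """Mirror the WHERE clause: open < now AND close > now, compared as strings."""
    slot = hours.get(day)
    if slot is None:
        return False  # assumption: no entry for the day means not open
    return slot["open"] < now and slot["close"] > now


# Saturday hours as they appear in the deck.
grand_china = {"Saturday": {"close": "23:00", "open": "11:00"}}
mon_ami_gabi = {"Saturday": {"close": "00:00", "open": "07:00"}}

grand_china_open = open_at(grand_china, "Saturday", "22:00")
mon_ami_gabi_open = open_at(mon_ami_gabi, "Saturday", "22:00")
```

Grand China Restaurant passes the test. Mon Ami Gabi, which closes at midnight, does not, because the string "00:00" sorts before "22:00". A query that should include places closing at or after midnight would need to handle that case separately.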
It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, b.hours.Friday AS friday, categories
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| name | friday | categories |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| Olives | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
Flatten Repeated Values
> SELECT name, categories
FROM dfs.demo.`yelp/business.json` LIMIT 3;
+-----------------------------+-------------------------------------------+
| name | categories |
+-----------------------------+-------------------------------------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+-----------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.demo.`yelp/business.json` LIMIT 5;
+-----------------------------+-------------------------+
| name | categories |
+-----------------------------+-------------------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+-----------------------------+-------------------------+
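The effect of FLATTEN on the three rows above can be reproduced in a few lines of Python. This is a sketch of the semantics shown in the output, not Drill's implementation, and the function name `flatten` is chosen only for this example.

```python
from collections import Counter


def flatten(rows, column):
    """Emit one output row per element of the repeated column, copying other fields."""
    out = []
    for row in rows:
        for element in row[column]:
            flat = dict(row)
            flat[column] = element
            out.append(flat)
    return out


# The three businesses shown in the first result set.
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat_rows = flatten(businesses, "categories")

# Once flattened, each element can be grouped and counted like a plain column.
counts = Counter(row["categories"] for row in flat_rows)
```

Three input rows become five output rows, one per category, which matches the second result set.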
Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
FROM (SELECT name, FLATTEN(categories) AS category
FROM dfs.demo.`yelp/business.json`) c
GROUP BY category ORDER BY businesses DESC;
+-----------------------------------+-------------+
| category | businesses |
+-----------------------------------+-------------+
| Restaurants | 14303 |
| Shopping | 6428 |
…
| Australian | 1 |
| Boat Dealers | 1 |
| Firewood | 1 |
+-----------------------------------+-------------+
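715 rows selected (3.439 seconds)
> SELECT name, categories FROM dfs.demo.`yelp/business.json`
WHERE REPEATED_CONTAINS(categories, 'Australian');
+-------------------+-------------------------------------------------------------------------+
| name              | categories                                                              |
+-------------------+-------------------------------------------------------------------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+-------------------+-------------------------------------------------------------------------+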
4. LEVERAGING VIEWS
Create a View for Name-Gender Mapping
> CREATE VIEW dfs.tmp.`names` AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name | gender |
+------------+------------+
| John | Male |
+------------+------------+
(Screenshot of names.csv: columns[0] holds the name and columns[4] holds the gender.)
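The view gives names to positional CSV fields without copying any data. The Python sketch below imitates that with a generator over CSV text. The sample rows are invented, including the placeholder values in the middle columns, since the deck does not show the file's contents; only the John/Male pairing comes from the result above.

```python
import csv
import io

# Invented sample; the real names.csv is not reproduced in the deck.
SAMPLE = "John,a,b,c,Male\nJennifer,a,b,c,Female\n"


def names_view(text):
    """Lazily project columns[0] and columns[4] as name and gender."""
    for columns in csv.reader(io.StringIO(text)):
        yield {"name": columns[0], "gender": columns[4]}


# Equivalent in spirit to: SELECT * FROM names WHERE name = 'John'
johns = [row for row in names_view(SAMPLE) if row["name"] == "John"]
```

Like a view, the generator stores no rows of its own; it re-reads the source each time it is queried.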
Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY u.name, n.gender
ORDER BY number DESC LIMIT 10;
+------------+------------+------------+
| name | gender | number |
+------------+------------+------------+
| David | Male | 2453 |
| John | Male | 2378 |
| Michael | Male | 2322 |
| Chris | Unknown | 2202 |
| Mike | Male | 2037 |
| Jennifer | Female | 1867 |
| Jessica | Female | 1463 |
| Jason | Male | 1457 |
| Michelle | Female | 1439 |
| Brian | Male | 1436 |
+------------+------------+------------+
Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY n.gender;
+------------+------------+------------+
| gender | users | stars |
+------------+------------+------------+
| Female | 103684 | 3.77 |
| Male | 97430 | 3.696 |
| Unknown | 18409 | 3.727 |
+------------+------------+------------+
Who Writes Longer Reviews – Men or Women?
> SELECT n.gender, round(avg(length(r.text))) AS review_length
FROM dfs.demo.`yelp/review.json` r,
mongo.yelp.users u,
dfs.tmp.names n
WHERE u.name = n.name AND r.user_id = u.user_id
GROUP BY n.gender;
+------------+---------------+
| gender | review_length |
+------------+---------------+
| Male | 665 |
| Female | 730 |
| Unknown | 711 |
+------------+---------------+
It takes a 3-way join to find out…
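The three-way join can be pictured as two lookups followed by a grouped average. The Python sketch below works on invented toy rows and assumes each name and each user_id appears only once in its source, so dictionary lookups behave like the joins; a real SQL join would multiply rows on duplicates.

```python
from collections import defaultdict


def avg_review_length_by_gender(reviews, users, names):
    """Join reviews to users to names, then average len(text) per gender."""
    gender_by_name = {n["name"]: n["gender"] for n in names}
    name_by_user = {u["user_id"]: u["name"] for u in users}
    lengths = defaultdict(list)
    for review in reviews:
        name = name_by_user.get(review["user_id"])
        gender = gender_by_name.get(name)
        if gender is not None:  # inner join: drop reviews with no match
            lengths[gender].append(len(review["text"]))
    return {g: round(sum(v) / len(v)) for g, v in lengths.items()}


# Toy rows, invented for this example.
names = [{"name": "John", "gender": "Male"}, {"name": "Jennifer", "gender": "Female"}]
users = [{"user_id": "u1", "name": "John"}, {"user_id": "u2", "name": "Jennifer"}]
reviews = [
    {"user_id": "u1", "text": "good"},
    {"user_id": "u1", "text": "great!"},
    {"user_id": "u2", "text": "excellent"},
    {"user_id": "u3", "text": "x"},  # no matching user, so it is dropped
]

averages = avg_review_length_by_gender(reviews, users, names)
```

In the deck's query the three inputs live in different systems: a JSON file, a MongoDB collection, and a view over a CSV file.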
Thank You!
• Download at drill.apache.org
• Get in touch:
• tshiran@apache.org
• jnadeau@apache.org
• Ask questions:
• user@drill.apache.org
• Tweet: @ApacheDrill

Contenu connexe

Tendances

Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillMapR Technologies
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Gera Shegalov
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Vince Gonzalez
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillDataWorks Summit
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseNag Arvind Gudiseva
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill WorkshopCharles Givre
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionDataWorks Summit/Hadoop Summit
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperDataWorks Summit
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillCharles Givre
 

Tendances (20)

Apache Drill
Apache DrillApache Drill
Apache Drill
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Apache drill
Apache drillApache drill
Apache drill
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBase
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application Adoption
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache Drill
 

En vedette

Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Charles Givre
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1Charles Givre
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?Charles Givre
 
Merlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science EnvironmentMerlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science EnvironmentCharles Givre
 
Km 65 tahun 2002
Km 65 tahun 2002Km 65 tahun 2002
Km 65 tahun 2002Bp Nafri
 
RAPIM 2011
RAPIM 2011RAPIM 2011
RAPIM 2011Bp Nafri
 
Apache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo realApache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo realAndrés Mauricio Palacios
 
What Does Your Smart Car Know About You? Strata London 2016
What Does Your Smart Car Know About You?  Strata London 2016What Does Your Smart Car Know About You?  Strata London 2016
What Does Your Smart Car Know About You? Strata London 2016Charles Givre
 
RAKORNIS 2010
RAKORNIS 2010RAKORNIS 2010
RAKORNIS 2010Bp Nafri
 
Pristine Advisers Presentation
Pristine Advisers PresentationPristine Advisers Presentation
Pristine Advisers PresentationPattyBaronowski
 
KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPAL
KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPALKELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPAL
KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPALBeny Jackson Maliota
 
The Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer InterviewsThe Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer InterviewsGood Funnel
 

En vedette (17)

Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?Strata NYC 2015 What does your smart device know about you?
Strata NYC 2015 What does your smart device know about you?
 
Merlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science EnvironmentMerlin: The Ultimate Data Science Environment
Merlin: The Ultimate Data Science Environment
 
Km 65 tahun 2002
Km 65 tahun 2002Km 65 tahun 2002
Km 65 tahun 2002
 
RAPIM 2011
RAPIM 2011RAPIM 2011
RAPIM 2011
 
Narkoba
NarkobaNarkoba
Narkoba
 
PSCO
PSCOPSCO
PSCO
 
Apache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo realApache Storm - Minando redes sociales y medios en tiempo real
Apache Storm - Minando redes sociales y medios en tiempo real
 
What Does Your Smart Car Know About You? Strata London 2016
What Does Your Smart Car Know About You?  Strata London 2016What Does Your Smart Car Know About You?  Strata London 2016
What Does Your Smart Car Know About You? Strata London 2016
 
RAKORNIS 2010
RAKORNIS 2010RAKORNIS 2010
RAKORNIS 2010
 
Pristine Advisers Presentation
Pristine Advisers PresentationPristine Advisers Presentation
Pristine Advisers Presentation
 
ISPS Code
ISPS CodeISPS Code
ISPS Code
 
KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPAL
KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPALKELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPAL
KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPAL
 
The Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer InterviewsThe Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer Interviews
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 

Similaire à Drilling into Data with Apache Drill

Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillDataWorks Summit
 
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019Dave Stokes
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...MongoDB
 
What and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual GrallWhat and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual Gralldistributed matters
 
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMRVancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMRAllice Shandler
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...MongoDB
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseHenk van der Valk
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For HadoopCloudera, Inc.
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevAltinity Ltd
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Redis+Spark Structured Streaming: Roshan Kumar
Redis+Spark Structured Streaming: Roshan KumarRedis+Spark Structured Streaming: Roshan Kumar
Redis+Spark Structured Streaming: Roshan KumarRedis Labs
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewRiccardo Zamana
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...Amazon Web Services
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 

Drilling into Data with Apache Drill

  • 1. Drilling into Data with Apache Drill Tomer Shiran, Apache Drill Founder and PMC Member Jacques Nadeau, Apache Drill PMC Chair
  • 2. Tomer Shiran Jacques Nadeau tshiran@apache.org jnadeau@apache.org @tshiran @intjesus Drill founder and PMC Member MapR VP Product Drill PMC Chair (VP, Apache Drill)
  • 3. Apache Drill • Open source SQL query engine for non-relational datastores – JSON document model – Columnar • Key advantages: – Query any non-relational datastore – No overhead (creating and maintaining schemas, transforming data, …) – Treat your data like a table even when it’s not – Keep using the BI tools you love – Scales from one laptop to 1000s of servers – Great performance and scalability
  • 4. Omni-SQL (“SQL-on-Everything”) Drill: Omni-SQL Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements. Andrew Brust, “ ”
  • 5. Any Non-Relational Datastore • File systems – Traditional: Local files and NAS – Hadoop: HDFS and MapR-FS – Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage • NoSQL databases – MongoDB – HBase – MapR-DB – Hive • And you can add new datastores Any Client • Multiple interfaces: ODBC, JDBC, REST, C, Java • BI tools – Tableau – Qlik – MicroStrategy – TIBCO Spotfire – Excel • Command line (Drill shell) • Web and mobile apps – Many JSON-powered chart libraries (see D3.js) • SAS, R, … Drill Integrates With What You Have
  • 6. Achieving “End-to-End Performance” Execute fast • Standard SQL • Read data fast • Leverage columnar encodings and execution • Execute operations quickly • Scale out, not up Iterate fast • Work without prep • Decentralize data management • In-situ security • Explore + query • Access multiple sources • Avoid the ETL rinse cycle
  • 7. JSON Model, Columnar Speed
      [Figure: 2x2 grid of data models. One axis runs from fixed schema to schema-less, the other from flat to complex. Formats shown: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV.]
      RDBMS/SQL-on-Hadoop table:
      Name     | Gender | Age
      Michael  | M      | 6
      Jennifer | F      | 3
      Apache Drill table:
      { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
      { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 8. Apache Drill Provides the Best of Both Worlds Acts Like a Database • ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME • VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc. • Subqueries, scalar subqueries, partition pruning, CTE • Data warehouse offload • Tableau, ODBC, JDBC • TPC-H & TPC-DS-like workloads • Supports Hive SerDes • Supports Hive UDFs • Supports Hive Metastore Even When Your Data Doesn’t • Path based queries and wildcards – select * from /my/logs/ – select * from /revenue/*/q2 • Modern data types – Map, Array, Any • Complex Functions and Relational Operators – FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc • JSON Sensor analytics • Complex data analysis • Alternative DSLs
  • 9. Why? To Support the Changing Data Organization Data Dev Circa 2000 1. Developer comes up with requirements 2. DBA defines tables 3. DBA defines indices 4. DBA defines FK relationships 5. Developer stores data 6. BI builds reports 7. Analyst views reports 8. DBA adds materialized views Data Today 1. Developer builds app, defines schema, stores data 2. Analyst queries data 3. Data engineer fixes performance problems or fills functionality gaps
  • 10. HOW DOES IT WORK?
  • 11. Everything Starts With a Drillbit… • High performance query executor • In-memory columnar execution • Directly interacts with data, acquiring knowledge as it reads • Built to leverage large amounts of memory • Networked or not • Exposes ODBC, JDBC, REST • Built-in Web UI and CLI • Extensible Drillbit Single process (daemon or CLI)
  • 12. Data Lake, More Like Data Maelstrom [Diagram: clustered servers (HBase & HDFS cluster, HDFS cluster, MongoDB cluster, Cassandra cluster) and desktops (Windows desktop, Mac desktop).]
  • 13. Run Drillbits Wherever; Whatever Your Data [Diagram: the systems from the previous slide (HDFS, HBase, mongod, and Cassandra nodes, plus the Windows and Mac desktops), now shown with Drillbits.]
  • 14. Connect to Any Drillbit with ODBC, JDBC, C, Java, REST
      1. User connects to Drillbit
      2. That Drillbit becomes Foreman
         – Foreman generates execution plan
         – Cost-based query optimization & locality
      3. Execution fragments are farmed to other Drillbits
      4. Drillbits exchange data as necessary to guarantee relational algebra
      5. Results are returned to user through Foreman
      [Diagram: User, Drillbit (foreman), Drillbits]
  • 16. 1. DOWNLOAD AND INSTALL DRILL
  • 17. Run Drill in Embedded Mode (drill-embedded)
      $ tar xf apache-drill-1.0.0.tar.gz
      $ cd apache-drill-1.0.0
      $ bin/drill-embedded
      > SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;
      +----------------+----------------------------------+---------------+-------+
      | yelping_since  | votes                            | review_count  | name  |
      +----------------+----------------------------------+---------------+-------+
      | 2012-02        | {"funny":1,"useful":5,"cool":0}  | 6             | Lee   |
      +----------------+----------------------------------+---------------+-------+
      • drillbit (Drill daemon) starts automatically in embedded mode
      • No ZooKeeper in embedded mode
      • Web UI is available at localhost:8047
  • 18. Review the Query Profile in the Web UI (localhost:8047)
  • 19. Run Drill in Distributed Mode
      $ zkServer start         # ZooKeeper maintains the list of drillbits in the cluster
      $ bin/drillbit.sh start  # conf/drill-override.conf includes cluster name and ZK nodes
      $ bin/drill-conf         # or bin/drill-localhost to skip ZK lookup
      > SELECT stars, count(*)
        FROM dfs.root.`/Users/tshiran/yelp/review.json`
        GROUP BY stars
        ORDER BY stars;
      +--------+---------+
      | stars  | EXPR$1  |
      +--------+---------+
      | 1      | 110772  |
      | 2      | 102737  |
      | 3      | 163761  |
      | 4      | 342143  |
      | 5      | 406045  |
      +--------+---------+
      5 rows selected (3.739 seconds)
  • 22. Define Workspaces in the File Storage Plugin
  • 24. The Data: Files
      {
        "votes": {"funny": 0, "useful": 2, "cool": 1},
        "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
        "review_id": "15SdjuK7DmYqUAj6rjGowg",
        "stars": 5,
        "date": "2007-05-17",
        "text": "dr. goldberg offers everything ...",
        "type": "review",
        "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
      }
  • 25. The Data: MongoDB Collections
      $ mongo
      MongoDB shell version: 2.6.5
      > show databases;
      admin  (empty)
      local  0.078GB
      yelp   0.453GB
      > use yelp
      > db.users.findOne()
      {
        "_id" : ObjectId("54566cdf3237149de181a92a"),
        "yelping_since" : "2012-02",
        "votes" : { "funny" : 1, "useful" : 5, "cool" : 0 },
        "review_count" : 6,
        "name" : "Lee",
        "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
        "friends" : [ ]
      }
  • 26. Are There More 5-Star or 1-Star Reviews?
      > SELECT stars, count(*)
        FROM dfs.root.`/Users/tshiran/yelp/review.json`
        GROUP BY stars
        ORDER BY stars;
      +--------+---------+
      | stars  | EXPR$1  |
      +--------+---------+
      | 1      | 110772  |
      | 2      | 102737  |
      | 3      | 163761  |
      | 4      | 342143  |
      | 5      | 406045  |
      +--------+---------+
      5 rows selected (3.739 seconds)
  • 27. Using Storage Plugins and Workspaces
      > SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json` LIMIT 1;
      > SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
      > SELECT * FROM mongo.yelp.users LIMIT 1;
      > USE mongo.yelp;
      > SELECT * FROM users LIMIT 1;
      [Callouts on the query: storage plugin, workspace, path relative to workspace]
      Storage Plugin | Workspace | Table
      dfs            | Path      | Path relative to workspace
      mongo          | Database  | Collection
      hive           | Database  | Table
      hbase          | Namespace | Table
  • 28. Most Common User Names (MongoDB)
      > SELECT name, count(*) AS users
        FROM mongo.yelp.users
        GROUP BY name
        ORDER BY users DESC
        LIMIT 10;
      +------------+------------+
      | name       | users      |
      +------------+------------+
      | David      | 2453       |
      | John       | 2378       |
      | Michael    | 2322       |
      | Chris      | 2202       |
      | Mike       | 2037       |
      | Jennifer   | 1867       |
      | Jessica    | 1463       |
      | Jason      | 1457       |
      | Michelle   | 1439       |
      | Brian      | 1436       |
      +------------+------------+
  • 29. Cities with the Most Businesses
      > SELECT state, city, count(*) AS businesses
        FROM dfs.demo.`/yelp/business.json`
        GROUP BY state, city
        ORDER BY businesses DESC
        LIMIT 10;
      +------------+------------+-------------+
      | state      | city       | businesses  |
      +------------+------------+-------------+
      | NV         | Las Vegas  | 12021       |
      | AZ         | Phoenix    | 7499        |
      | AZ         | Scottsdale | 3605        |
      | EDH        | Edinburgh  | 2804        |
      | AZ         | Mesa       | 2041        |
      | AZ         | Tempe      | 2025        |
      | NV         | Henderson  | 1914        |
      | AZ         | Chandler   | 1637        |
      | WI         | Madison    | 1630        |
      | AZ         | Glendale   | 1196        |
      +------------+------------+-------------+
  • 31. business.json (1)
      {
        "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
        "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
        "hours": {
          "Monday": {"close": "23:00", "open": "07:00"},
          "Tuesday": {"close": "23:00", "open": "07:00"},
          "Friday": {"close": "00:00", "open": "07:00"},
          "Wednesday": {"close": "23:00", "open": "07:00"},
          "Thursday": {"close": "23:00", "open": "07:00"},
          "Sunday": {"close": "23:00", "open": "07:00"},
          "Saturday": {"close": "00:00", "open": "07:00"}
        },
        "open": true,
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
        "city": "Las Vegas",
        "review_count": 4084,
        "name": "Mon Ami Gabi",
        "neighborhoods": ["The Strip"],
        "longitude": -115.172588519464,
  • 32. business.json (2)
        "state": "NV",
        "stars": 4.0,
        "attributes": {
          "Alcohol": "full_bar",
          "Noise Level": "average",
          "Has TV": false,
          "Attire": "casual",
          "Ambience": {
            "romantic": true, "intimate": false, "touristy": false, "hipster": false,
            "classy": true, "trendy": false, "casual": false
          },
          "Good For": {"dessert": false, "latenight": false, "lunch": false,
                       "dinner": true, "breakfast": false, "brunch": false},
        }
      }
  • 33. Which Places Are Open Right Now (22:00)?
      > SELECT name, b.hours
        FROM dfs.demo.`yelp/business.json` b
        WHERE b.hours.Saturday.`open` < '22:00'
          AND b.hours.Saturday.`close` > '22:00'
        LIMIT 2;
      +------------------------------+------------------------------------------------+
      | name                         | hours                                          |
      +------------------------------+------------------------------------------------+
      | Chang Jiang Chinese Kitchen  | {"Saturday":{"close":"22:30","open":"11:00"}}  |
      | Grand China Restaurant       | {"Saturday":{"close":"23:00","open":"11:00"}}  |
      +------------------------------+------------------------------------------------+
  • 34. It’s 10pm in Vegas and I Want Good Hummus!
      > SELECT name, b.hours.Friday AS friday, categories
        FROM dfs.demo.`yelp/business.json` b
        WHERE b.hours.Friday.`open` < '22:00'
          AND b.hours.Friday.`close` > '22:00'
          AND REPEATED_CONTAINS(categories, 'Mediterranean')
          AND city = 'Las Vegas'
        ORDER BY stars DESC
        LIMIT 2;
      +--------------------------------+-----------------------------------+--------------------------------------------------------------+
      | name                           | friday                            | categories                                                   |
      +--------------------------------+-----------------------------------+--------------------------------------------------------------+
      | Olives                         | {"close":"22:30","open":"11:00"}  | ["Mediterranean","Restaurants"]                              |
      | Marrakech Moroccan Restaurant  | {"close":"23:00","open":"17:30"}  | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"]  |
      +--------------------------------+-----------------------------------+--------------------------------------------------------------+
  • 35. Flatten Repeated Values
      > SELECT name, categories
        FROM dfs.demo.`yelp/business.json`
        LIMIT 3;
      +-----------------------------+-------------------------------------------+
      | name                        | categories                                |
      +-----------------------------+-------------------------------------------+
      | Eric Goldberg, MD           | ["Doctors","Health & Medical"]            |
      | Pine Cone Restaurant        | ["Restaurants"]                           |
      | Deforest Family Restaurant  | ["American (Traditional)","Restaurants"]  |
      +-----------------------------+-------------------------------------------+
      > SELECT name, FLATTEN(categories) AS categories
        FROM dfs.demo.`yelp/business.json`
        LIMIT 5;
      +-----------------------------+-------------------------+
      | name                        | categories              |
      +-----------------------------+-------------------------+
      | Eric Goldberg, MD           | Doctors                 |
      | Eric Goldberg, MD           | Health & Medical        |
      | Pine Cone Restaurant        | Restaurants             |
      | Deforest Family Restaurant  | American (Traditional)  |
      | Deforest Family Restaurant  | Restaurants             |
      +-----------------------------+-------------------------+
  • 36. Most and Least Common Business Categories
      > SELECT category, count(*) AS businesses
        FROM (SELECT name, FLATTEN(categories) AS category
              FROM dfs.demo.`yelp/business.json`) c
        GROUP BY category
        ORDER BY businesses DESC;
      +-----------------------------------+-------------+
      | category                          | businesses  |
      +-----------------------------------+-------------+
      | Restaurants                       | 14303       |
      | Shopping                          | 6428        |
      …
      | Australian                        | 1           |
      | Boat Dealers                      | 1           |
      | Firewood                          | 1           |
      +-----------------------------------+-------------+
      715 rows selected (3.439 seconds)
      > SELECT name, categories
        FROM dfs.demo.`yelp/business.json`
        WHERE true and REPEATED_CONTAINS(categories, 'Australian');
      +------+------------+
      | name | categories |
      +------+------------+
      | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
      +------+------------+
  • 38. Create a View for Name-Gender Mapping
      > CREATE VIEW dfs.tmp.`names` AS
        SELECT columns[0] AS name, columns[4] AS gender
        FROM dfs.demo.`names.csv`;
      > USE dfs.tmp;
      > CREATE VIEW names1 AS
        SELECT columns[0] AS name, columns[4] AS gender
        FROM dfs.demo.`names.csv`;
      > SELECT * FROM dfs.tmp.names WHERE name = 'John';
      +------------+------------+
      | name       | gender     |
      +------------+------------+
      | John       | Male       |
      +------------+------------+
      [Figure: names.csv, with callouts marking columns[0] and columns[4]]
  • 39. Most Common Names (and their Genders) on Yelp
      > SELECT u.name, n.gender, count(*) AS number
        FROM mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name
        GROUP BY u.name, n.gender
        ORDER BY number DESC
        LIMIT 10;
      +------------+------------+------------+
      | name       | gender     | number     |
      +------------+------------+------------+
      | David      | Male       | 2453       |
      | John       | Male       | 2378       |
      | Michael    | Male       | 2322       |
      | Chris      | Unknown    | 2202       |
      | Mike       | Male       | 2037       |
      | Jennifer   | Female     | 1867       |
      | Jessica    | Female     | 1463       |
      | Jason      | Male       | 1457       |
      | Michelle   | Female     | 1439       |
      | Brian      | Male       | 1436       |
      +------------+------------+------------+
  • 40. Who Rates Higher – Men or Women?
      > SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
        FROM mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name
        GROUP BY n.gender;
      +------------+------------+------------+
      | gender     | users      | stars      |
      +------------+------------+------------+
      | Female     | 103684     | 3.77       |
      | Male       | 97430      | 3.696      |
      | Unknown    | 18409      | 3.727      |
      +------------+------------+------------+
  • 41. Who Writes Longer Reviews – Men or Women?
      > SELECT n.gender, round(avg(length(r.text))) AS review_length
        FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name AND r.user_id = u.user_id
        GROUP BY n.gender;
      +------------+---------------+
      | gender     | review_length |
      +------------+---------------+
      | Male       | 665           |
      | Female     | 730           |
      | Unknown    | 711           |
      +------------+---------------+
      It takes a 3-way join to find out…
  • 42. Thank You! • Download at drill.apache.org • Get in touch: • tshiran@apache.org • jnadeau@apache.org • Ask questions: • user@drill.apache.org • Tweet: @ApacheDrill
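
The FLATTEN queries on slides 35 and 36 can be read as a two-step computation: emit one row per array element, then group and count. The sketch below mimics that in plain Python on the three sample rows from slide 35. It does not call Drill, and the helper names `flatten` and `count_by` are hypothetical. It also assumes that a row whose array is empty or missing produces no output rows, which is an assumption made here rather than something the slides state.

```python
from collections import Counter

# Sample rows copied from slide 35 (name, categories).
businesses = [
    {"name": "Eric Goldberg, MD",
     "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant",
     "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]


def flatten(rows, field):
    """Yield one copy of each row per element of the repeated `field`.

    Assumption: rows with an empty or missing array yield nothing.
    """
    for row in rows:
        for value in row.get(field, []):
            out = dict(row)      # shallow copy so the input row is untouched
            out[field] = value   # replace the array with a single element
            yield out


def count_by(rows, field):
    """Count rows per distinct value of `field` (GROUP BY field, count(*))."""
    return Counter(row[field] for row in rows)


flat = list(flatten(businesses, "categories"))
counts = count_by(flat, "categories")

if __name__ == "__main__":
    for row in flat:
        print(row["name"], "|", row["categories"])
    # most_common() orders by descending count, like ORDER BY businesses DESC
    for category, n in counts.most_common():
        print(category, n)
```

The three input rows become five flattened rows, matching the second result table on slide 35, and "Restaurants" is the only category that appears twice in this sample.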

Editor's notes

  1. Other SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON documents (with additional types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before. Of the four data models shown in the 2x2, the complex, schema-less model (JSON) is the most flexible, so it can represent all the others. The flat, fixed-schema model cannot represent any of the other three. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be queried.
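
The asymmetry described in the note above can be illustrated with the two records from slide 7. In this Python sketch, a flat row is already a valid document and needs no work, while nested documents need a transformation step before they fit a fixed set of columns. The helpers `dotted_columns` and `to_fixed_schema` and the dotted-path column naming are hypothetical choices made for illustration, not anything Drill or another engine prescribes.

```python
# Flat, fixed-schema rows (left-hand table on slide 7).
flat_rows = [
    {"Name": "Michael", "Gender": "M", "Age": 6},
    {"Name": "Jennifer", "Gender": "F", "Age": 3},
]

# Complex, schema-less documents (right-hand table on slide 7).
documents = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def dotted_columns(doc, prefix=""):
    """Transform a nested document into a flat dict keyed by dotted paths.

    Illustrative only: list elements are assumed to be scalars.
    """
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(dotted_columns(value, prefix=path + "."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                flat[f"{path}[{i}]"] = item
        else:
            flat[path] = value
    return flat


def to_fixed_schema(docs):
    """Build one fixed column list covering every document, padding with None."""
    rows = [dotted_columns(d) for d in docs]
    columns = sorted({c for r in rows for c in r})
    return columns, [[r.get(c) for c in columns] for r in rows]


# Flat rows are already documents: no transformation is required.
as_documents = flat_rows

# Nested documents must be transformed to fit a flat, fixed schema.
columns, table = to_fixed_schema(documents)

if __name__ == "__main__":
    print(columns)
    for row in table:
        print(row)
```

The resulting table needs six columns for two records and fills the gaps with `None`, and the column list would change whenever a new field or a longer array appears.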