SlideShare une entreprise Scribd logo
1  sur  25
Drill / SQL / Optiq
     Julian Hyde
 Apache Drill User Group
     2013-03-13
SQL
SQL: Pros & cons
Fact:

  SQL is older than Macaulay Culkin

Less interesting but more relevant:

  Can be written by (lots of) humans

  Can be written by machines

  Requires query optimization

  Allows query optimization

  Based on “flat” relations and basic relational
  operations
Quick intro to Optiq
Introducing Optiq
Framework
Derived from LucidDB
Minimal query mediator:

    No storage

    No runtime

    No metadata

    Query planning engine

    Core operators & rewrite rules

    Optional SQL parser/validator
SELECT p.“product_name”, COUNT(*) AS c
Expression tree                        FROM “splunk”.”splunk” AS s
                                         JOIN “mysql”.”products” AS p
                                         ON s.”product_id” = p.”product_id”
                                       WHERE s.“action” = 'purchase'
                                       GROUP BY p.”product_name”
                                       ORDER BY c DESC
 Splunk
 Table: splunk
                                                    Key: product_name
                     Key: product_id                Agg: count
                                       Condition:                       Key: c DESC
                                         action =
                                       'purchase'
 scan
                          join
MySQL                                  filter          group            sort
    scan
                 Table: products
SELECT p.“product_name”, COUNT(*) AS c
Expression tree                      FROM “splunk”.”splunk” AS s
                                       JOIN “mysql”.”products” AS p
(optimized)                            ON s.”product_id” = p.”product_id”
                                     WHERE s.“action” = 'purchase'
                                     GROUP BY p.”product_name”
                                     ORDER BY c DESC
             Splunk
                        Condition:
 Table: splunk
                          action =
                        'purchase'                     Key: product_name
                                                       Agg: count
                                                                           Key: c DESC

                                     Key: product_id
 scan                   filter

MySQL
                                     join                 group            sort
   scan
                 Table: products
Metadata SPI

    interface Table
    −   RelDataType getRowType()

    interface TableFunction
    −   List<Parameter> getParameters()
    −   Table apply(List arguments)
    −   e.g. ViewTableFunction

    interface Schema
     − Map<String, List<TableFunction>>
       getTableFunctions()
Operators and rules

    Rule: interface RelOptRule

    Operator: interface RelNode

    Core operators: TableAccess, Project,
    Filter, Join, Aggregate, Order, Union,
    Intersect, Minus, Values

    Some rules: MergeFilterRule,
    PushAggregateThroughUnionRule,
    RemoveCorrelationForScalarProjectRule
    + 100 more
Planning algorithm

    Start with a logical plan and a set of
    rewrite rules

    Form a list of rewrite rules that are
    applicable (based on pattern-matching)

    Fire the rule that is likely to do the most
    good

    Rule generates an expression that is
    equivalent (and hopefully cheaper)

    Queue up new rule matches

    Repeat until cheap enough
Concepts

    Cost

    Equivalence sets

    Calling convention

    Logical vs Physical

    Traits

    Implementation
Outside the kernel
                                      JDBC client

    SQL
    parser/validator
                                    JDBC server

    JDBC driver            Optional SQL parser /          Metadata
                                       validator            SPI

    SQL function             Core
                                        Query            Pluggable
    library (validation               3
                                       planner
                                        rd
                                              3
                                              rd
                                                           rules

    + code-               Pluggable  party party
    generation)       3 party
                            rd        ops ops
                                                    3rd party
                        data                          data

    Lingual
    (Cascading
    adapter)

    Splunk adapter

    Drill adapter
Optiq roadmap

    Building blocks for analytic DB:
    −   In-memory tables in a distributed cache
    −   Materialized views
    −   Partitioned tables

    Faster planning

    Easier rule development

    ODBC driver

    Adapters for XXX, YYY
Applying Optiq to Drill
  1. Enhance SQL
 2. Query translation
Drill vs Traditional SQL

    SQL:
    −   Flat data
    −   Schema up front

    Drill:
    −   Nested data (list & map)
    −   No schema

    We'd like to write:
    −   SELECT name, toppings[2] FROM donuts
        WHERE ppu > 0.6

    Solution: ARRAY, MAP, ANY types
ARRAY & MAP SQL types

  ARRAY is like java.util.List

  MAP is like java.util.LinkedHashMap
Examples:

  VALUES ARRAY ['a', 'b', 'c']

  VALUES MAP ['Washington', 1, 'Obama', 44]

  SELECT name, address[1], address[2], state
  FROM Employee

  SELECT * FROM Donuts WHERE
  CAST(donuts['ppu'] AS DOUBLE) > 0.6
ANY SQL type

    ANY means “type to be determined at runtime”

    Validator narrows down possible type based
    on operators used

    Similar to converting Java's type system into
    JavaScript's. (Not easy.)
Sugaring the donut
Query:

  SELECT c['ppu'], c['toppings'][1] FROM
  Donuts
Additional syntactic sugar:

  c.x means c['x']
So:

  CREATE TABLE Donuts(c ANY)

  SELECT c.ppu, c.toppings[1] FROM
  Donuts
Better:
UNNEST
Employees nested inside departments:

  CREATE TYPE employee (empno INT, name
  VARCHAR(30));

  CREATE TABLE dept (deptno INT, name
  VARCHAR(30),
   employees EMPLOYEE ARRAY);
Unnest:

  SELECT d.deptno, d.name, e.empno, e.name
  FROM department AS d
   CROSS JOIN UNNEST(d.employees) AS e
SQL standard provides other operations on
  collections:
Applying Optiq to Drill
   1. Enhance SQL
2. Query translation
Query translation
SQL:

  select d['name'] as name, d['xx'] as xx
  from (
    select _MAP['donuts'] as d from donuts)
  where cast(d['ppu'] as double) > 0.6

Drill:

  { head: { … },
      storage: { … },
      query: [ {
         op: “sequence”, do: [
           { op: “scan”, … selection: { path:
Planner log
Original rel:
AbstractConverter(subset=[rel#14:Subset#3.AR
 RAY], convention=[ARRAY])
 ProjectRel(subset=[rel#10:Subset#3.NONE],
 NAME=[ITEM($0, 'name')], XX=[ITEM($0,
 'xx')])
  FilterRel(subset=[rel#8:Subset#2.NONE],
 condition=[>(CAST(ITEM($0, 'ppu')):DOUBLE
 NOT NULL, 0.6)])
    ProjectRel(subset=[rel#6:Subset#1.NONE],
 D=[ITEM($0, 'donuts')])
     DrillScan(subset=[rel#4:Subset#0.DRILL],
Next

    Translate join, aggregate, sort, set ops

    Operator overloading with ANY

    Mondrian on Drill
Thank you!


https://github.com/julianhyde/share/tree/master/slides
https://github.com/julianhyde/incubator-drill
https://github.com/julianhyde/optiq
http://incubator.apache.org/drill

@julianhyde

Contenu connexe

Tendances

Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Julian Hyde
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Julian Hyde
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Josh Elser
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can helpChristian Tzolov
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?Julian Hyde
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Optiq: a SQL front-end for everything
Optiq: a SQL front-end for everythingOptiq: a SQL front-end for everything
Optiq: a SQL front-end for everythingJulian Hyde
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
 

Tendances (20)

Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 
Optiq: a SQL front-end for everything
Optiq: a SQL front-end for everythingOptiq: a SQL front-end for everything
Optiq: a SQL front-end for everything
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
 

Similaire à Drill / SQL / Optiq

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Odtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for youOdtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for youLuc Bors
 
PerlApp2Postgresql (2)
PerlApp2Postgresql (2)PerlApp2Postgresql (2)
PerlApp2Postgresql (2)Jerome Eteve
 
Meetup cassandra sfo_jdbc
Meetup cassandra sfo_jdbcMeetup cassandra sfo_jdbc
Meetup cassandra sfo_jdbczznate
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQLSatoshi Nagayasu
 
Introduction to NoSQL Database
Introduction to NoSQL DatabaseIntroduction to NoSQL Database
Introduction to NoSQL DatabaseMohammad Alghanem
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016DataStax
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphScyllaDB
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit
 
Nko workshop - node js & nosql
Nko workshop - node js & nosqlNko workshop - node js & nosql
Nko workshop - node js & nosqlSimon Su
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit
 
Living the Nomadic life - Nic Jackson
Living the Nomadic life - Nic JacksonLiving the Nomadic life - Nic Jackson
Living the Nomadic life - Nic JacksonParis Container Day
 

Similaire à Drill / SQL / Optiq (20)

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Odtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for youOdtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for you
 
PerlApp2Postgresql (2)
PerlApp2Postgresql (2)PerlApp2Postgresql (2)
PerlApp2Postgresql (2)
 
Meetup cassandra sfo_jdbc
Meetup cassandra sfo_jdbcMeetup cassandra sfo_jdbc
Meetup cassandra sfo_jdbc
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 
Introduction to NoSQL Database
Introduction to NoSQL DatabaseIntroduction to NoSQL Database
Introduction to NoSQL Database
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraph
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael Nitschinger
 
Nko workshop - node js & nosql
Nko workshop - node js & nosqlNko workshop - node js & nosql
Nko workshop - node js & nosql
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Living the Nomadic life - Nic Jackson
Living the Nomadic life - Nic JacksonLiving the Nomadic life - Nic Jackson
Living the Nomadic life - Nic Jackson
 

Plus de Julian Hyde

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteJulian Hyde
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Julian Hyde
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQLJulian Hyde
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're IncubatingJulian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesJulian Hyde
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databasesJulian Hyde
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and FastJulian Hyde
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache CalciteJulian Hyde
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 

Plus de Julian Hyde (20)

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 

Dernier

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Dernier (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Drill / SQL / Optiq

  • 1. Drill / SQL / Optiq Julian Hyde Apache Drill User Group 2013-03-13
  • 2.
  • 3. SQL
  • 4. SQL: Pros & cons Fact:  SQL is older than Macaulay Culkin Less interesting but more relevant:  Can be written by (lots of) humans  Can be written by machines  Requires query optimization  Allows query optimization  Based on “flat” relations and basic relational operations
  • 6. Introducing Optiq Framework Derived from LucidDB Minimal query mediator:  No storage  No runtime  No metadata  Query planning engine  Core operators & rewrite rules  Optional SQL parser/validator
  • 7. SELECT p.“product_name”, COUNT(*) AS c Expression tree FROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id” WHERE s.“action” = 'purchase' GROUP BY p.”product_name” ORDER BY c DESC Splunk Table: splunk Key: product_name Key: product_id Agg: count Condition: Key: c DESC action = 'purchase' scan join MySQL filter group sort scan Table: products
  • 8. SELECT p.“product_name”, COUNT(*) AS c Expression tree FROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p (optimized) ON s.”product_id” = p.”product_id” WHERE s.“action” = 'purchase' GROUP BY p.”product_name” ORDER BY c DESC Splunk Condition: Table: splunk action = 'purchase' Key: product_name Agg: count Key: c DESC Key: product_id scan filter MySQL join group sort scan Table: products
  • 9. Metadata SPI  interface Table − RelDataType getRowType()  interface TableFunction − List<Parameter> getParameters() − Table apply(List arguments) − e.g. ViewTableFunction  interface Schema − Map<String, List<TableFunction>> getTableFunctions()
  • 10. Operators and rules  Rule: interface RelOptRule  Operator: interface RelNode  Core operators: TableAccess, Project, Filter, Join, Aggregate, Order, Union, Intersect, Minus, Values  Some rules: MergeFilterRule, PushAggregateThroughUnionRule, RemoveCorrelationForScalarProjectRule + 100 more
  • 11. Planning algorithm  Start with a logical plan and a set of rewrite rules  Form a list of rewrite rules that are applicable (based on pattern-matching)  Fire the rule that is likely to do the most good  Rule generates an expression that is equivalent (and hopefully cheaper)  Queue up new rule matches  Repeat until cheap enough
  • 12. Concepts  Cost  Equivalence sets  Calling convention  Logical vs Physical  Traits  Implementation
  • 13. Outside the kernel JDBC client  SQL parser/validator JDBC server  JDBC driver Optional SQL parser / Metadata validator SPI  SQL function Core Query Pluggable library (validation 3 planner rd 3 rd rules + code- Pluggable party party generation) 3 party rd ops ops 3rd party data data  Lingual (Cascading adapter)  Splunk adapter  Drill adapter
  • 14. Optiq roadmap  Building blocks for analytic DB: − In-memory tables in a distributed cache − Materialized views − Partitioned tables  Faster planning  Easier rule development  ODBC driver  Adapters for XXX, YYY
  • 15. Applying Optiq to Drill 1. Enhance SQL 2. Query translation
  • 16. Drill vs Traditional SQL  SQL: − Flat data − Schema up front  Drill: − Nested data (list & map) − No schema  We'd like to write: − SELECT name, toppings[2] FROM donuts WHERE ppu > 0.6  Solution: ARRAY, MAP, ANY types
  • 17. ARRAY & MAP SQL types  ARRAY is like java.util.List  MAP is like java.util.LinkedHashMap Examples:  VALUES ARRAY ['a', 'b', 'c']  VALUES MAP ['Washington', 1, 'Obama', 44]  SELECT name, address[1], address[2], state FROM Employee  SELECT * FROM Donuts WHERE CAST(donuts['ppu'] AS DOUBLE) > 0.6
  • 18. ANY SQL type  ANY means “type to be determined at runtime”  Validator narrows down possible type based on operators used  Similar to converting Java's type system into JavaScript's. (Not easy.)
  • 19. Sugaring the donut Query:  SELECT c['ppu'], c['toppings'][1] FROM Donuts Additional syntactic sugar:  c.x means c['x'] So:  CREATE TABLE Donuts(c ANY)  SELECT c.ppu, c.toppings[1] FROM Donuts Better:
  • 20. UNNEST Employees nested inside departments:  CREATE TYPE employee (empno INT, name VARCHAR(30));  CREATE TABLE dept (deptno INT, name VARCHAR(30), employees EMPLOYEE ARRAY); Unnest:  SELECT d.deptno, d.name, e.empno, e.name FROM department AS d CROSS JOIN UNNEST(d.employees) AS e SQL standard provides other operations on collections:
  • 21. Applying Optiq to Drill 1. Enhance SQL 2. Query translation
  • 22. Query translation SQL:  select d['name'] as name, d['xx'] as xx from ( select _MAP['donuts'] as d from donuts) where cast(d['ppu'] as double) > 0.6 Drill:  { head: { … }, storage: { … }, query: [ { op: “sequence”, do: [ { op: “scan”, … selection: { path:
  • 23. Planner log Original rel: AbstractConverter(subset=[rel#14:Subset#3.AR RAY], convention=[ARRAY]) ProjectRel(subset=[rel#10:Subset#3.NONE], NAME=[ITEM($0, 'name')], XX=[ITEM($0, 'xx')]) FilterRel(subset=[rel#8:Subset#2.NONE], condition=[>(CAST(ITEM($0, 'ppu')):DOUBLE NOT NULL, 0.6)]) ProjectRel(subset=[rel#6:Subset#1.NONE], D=[ITEM($0, 'donuts')]) DrillScan(subset=[rel#4:Subset#0.DRILL],
  • 24. Next  Translate join, aggregate, sort, set ops  Operator overloading with ANY  Mondrian on Drill

Notes de l'éditeur

  1. It&apos;s much more efficient if we psuh filters and aggregations to Splunk. But the user writing SQL shouldn&apos;t have to worry about that. This is not about processing data. This is about processing expressions. Reformulating the question. The question is the parse tree of a query. The parse tree is a data flow. In Splunk, a data flow looks like a pipeline of Linux commands. SQL systems have pipelines too (sometimes they are dataflow trees) built up of the basic relational operators. Think of the SQL SELECT, WHERE, JOIN, GROUP BY, ORDER BY clauses.
  2. It&apos;s much more efficient if we psuh filters and aggregations to Splunk. But the user writing SQL shouldn&apos;t have to worry about that. This is not about processing data. This is about processing expressions. Reformulating the question. The question is the parse tree of a query. The parse tree is a data flow. In Splunk, a data flow looks like a pipeline of Linux commands. SQL systems have pipelines too (sometimes they are dataflow trees) built up of the basic relational operators. Think of the SQL SELECT, WHERE, JOIN, GROUP BY, ORDER BY clauses.