© Hortonworks Inc. 2013.
Hive for Analytic Workloads
Alan Gates (@alanfgates)
Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to drive the next generation of Hive
Goals:
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Hive 0.13, April 2014:
• Hive on Apache Tez
• SQL standard authorization
• Permanent UDFs
• Vectorized Processing
Stinger Highlights
• 13 months
• 145 separate contributors
– from 44 separate entities
• 3 Hive releases, 0.11, 0.12, and 0.13
• 392,000 lines of new Java code
Now this is not the end.
It is not even the
beginning of the end.
But it is, perhaps, the
end of the beginning.
-Winston Churchill
Hive 0.13 Performance
• TPC Benchmark™ DS (TPC-DS) is a decision support benchmark that models queries and data maintenance. It evaluates decision support systems that examine large volumes of data to answer real-world business questions.
• Test: 50 SQL queries on Hive 0.13
• Test Environment
– Driven by the Hive Testbench: https://github.com/cartershanklin/hive-testbench
– Nodes: 20 nodes, 256 GB per node, of which only 48 GB per node was used for Hive
– Drives: 6x 4TB WDC WD4000FYYZ-0 drives per node
– Interconnect: 10GB
– Processors: 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz, for a total of 16 CPU cores per machine
– Scale: 30K (30 TB total data)
Benchmark Results
Queries were modified so that the partition key duplicates the join key, making it easier for the optimizer to choose which partitions to scan.
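One way to read the note on this slide is sketched below in plain Python. It is not Hive code: the table names, the `date_id` key, and the helper `partition_by` are all hypothetical, and the sketch assumes the fact table is partitioned on the same key it is joined on. Under that assumption, a filter on the dimension table yields a set of key values that directly names the partitions worth scanning.

```python
from collections import defaultdict

def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key` (hypothetical helper)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)

# Hypothetical fact and dimension tables; date_id is both join key and partition key.
sales = [{"date_id": d, "amount": a}
         for d, a in [(1, 10), (1, 20), (2, 30), (3, 40), (3, 50)]]
dates = [{"date_id": 1, "year": 2013},
         {"date_id": 2, "year": 2014},
         {"date_id": 3, "year": 2014}]

sales_partitions = partition_by(sales, "date_id")

# The dimension filter produces join-key values, which are also partition names.
wanted_ids = {d["date_id"] for d in dates if d["year"] == 2014}

# Only the matching partitions are read; partition 1 is never touched.
scanned = [pid for pid in sales_partitions if pid in wanted_ids]
total = sum(row["amount"] for pid in scanned for row in sales_partitions[pid])
```

Here two of the three partitions are scanned. A real optimizer has to derive the partition list from the query plan rather than from an in-memory set, which is the part the modified queries make easier.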
SQL Semantics
SQL semantics by release:
Hive 0.10 & before:
• SELECT, JOIN, WHERE, GROUP BY, HAVING, ORDER BY, UNION, ROLLUP/CUBE, subqueries in FROM
Hive 0.11:
• Windowing functions (RANK, ROW_NUMBER) and OVER clause
Hive 0.13:
• Subqueries with IN, EXISTS in WHERE and HAVING
• Common table expressions (WITH clause)
• Join condition in WHERE
• CREATE FUNCTION (stored on cluster)
Next Steps:
• Temporary tables
• Subqueries with equality and inequality operators
• Full UNION support
• Set operators, EXCEPT and INTERSECT
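To make the difference between the two windowing functions named for Hive 0.11 concrete, here is a small pure-Python sketch of what RANK and ROW_NUMBER compute over a partition. It is an illustration of the standard SQL semantics, not Hive's implementation. The function `window_rank` and the sample `employees` rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def window_rank(rows, partition_key, order_key):
    """Annotate rows with ROW_NUMBER and RANK per partition, ordered descending.

    Roughly what this standard SQL expresses:
      SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC),
                RANK()       OVER (PARTITION BY dept ORDER BY salary DESC)
      FROM employees
    """
    out = []
    ordered = sorted(rows, key=lambda r: (r[partition_key], -r[order_key]))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        rank, prev = 0, None
        for row_number, row in enumerate(group, start=1):
            # RANK repeats for ties and then skips; ROW_NUMBER always increments.
            if row[order_key] != prev:
                rank, prev = row_number, row[order_key]
            out.append({**row, "row_number": row_number, "rank": rank})
    return out

employees = [
    {"dept": "eng", "name": "ana", "salary": 120},
    {"dept": "eng", "name": "bo", "salary": 120},
    {"dept": "eng", "name": "cy", "salary": 100},
    {"dept": "ops", "name": "di", "salary": 90},
]
ranked = window_rank(employees, "dept", "salary")
```

In the `eng` partition the two tied salaries both get rank 1 and the next row gets rank 3, while the row numbers run 1, 2, 3.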
Security
Security by release:
Hive 0.12 & before:
• StorageBasedAuthorizationProvider, maps file level security
• secure, based on HDFS security
• coarse grained, no column or row level security
• default, all advisory
• everyone has grant permissions
Hive 0.13: SQL standard security for tables, views, and databases
• GRANT/REVOKE
• ROLEs
• Column and row level permissions via views
Next Steps:
• Integration with XA Secure
• Extend to cover execution of functions
Data Type Conformance
Available data types by release:
Hive 0.10 & before: Integer types, floating types, string, array, map, struct, timestamp, binary
Hive 0.11: decimal (default precision and scale only)
Hive 0.12: date, varchar
Hive 0.13: char, user defined precision and scale for decimal
Read and Write, ACID
Write capabilities and ACID compliance by release:
Hive 0.12 & before:
• INSERT and INSERT OVERWRITE available
• Locking available, requires ZooKeeper for durability
• No ACID
Hive 0.13:
• ACID compliant ingestion of data from streaming sources such as Flume and Storm
• Snapshot isolation for readers
Next Steps:
• Addition of INSERT … VALUES, UPDATE, DELETE
• Multi-statement transactions: BEGIN, COMMIT, ROLLBACK
• Integration with HCatalog
Owen and I have a talk on this at 5:30 today.
Optimizer
Optimizer by release:
Hive 0.11 & before: Rule-based optimizer
• Mostly simple rules such as push filter below join
Hive 0.12: Correlation optimizer
• Where possible combine related execution into single job
Next Steps:
• Use Optiq for cost based optimization
• Join ordering and operator selection using statistics and cost estimates
• Expand statistics calculated and used in planning
Julian has a talk on this at 4:35 today.
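The "push filter below join" rule mentioned for the early optimizer can be sketched in a few lines of Python. This is an illustration of the general rewrite, not Hive's planner. The `hash_join` helper and the `orders` and `customers` tables are hypothetical.

```python
def hash_join(left, right, key):
    """Inner join two lists of dicts on `key` using a hash index on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

orders = [{"cust_id": i % 4, "amount": 10 * i} for i in range(8)]
customers = [{"cust_id": i, "region": "EU" if i % 2 == 0 else "US"} for i in range(4)]

# Plan A: join everything, then filter the joined rows.
joined_all = hash_join(orders, customers, "cust_id")
plan_a = [row for row in joined_all if row["region"] == "EU"]

# Plan B: push the filter below the join, so the join sees fewer input rows.
eu_customers = [c for c in customers if c["region"] == "EU"]
plan_b = hash_join(orders, eu_customers, "cust_id")
```

Both plans return the same rows, but Plan A materializes eight joined rows before discarding half of them, while Plan B only ever produces the four that survive. A rule like this needs no statistics, which is what separates it from the cost based work listed under Next Steps.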
MapReduce is dead,
Long live Hadoop
Tez Talks:
• A New Chapter in Hadoop Data Processing, today 12:05
• Hive on Apache Tez: Benchmarked at Yahoo! Scale, today 12:05
• Hive + Tez: A Performance Deep Dive, today 2:35
ORC File Format
• Columnar format for complex data types
• Built into Hive from 0.11
• Support for Pig via OrcLoader/OrcStorer
• Support for MapReduce via HCat
• Two levels of compression
– Lightweight type-specific and generic
• Built in indexes
– Every 10,000 rows with position information
– Min, Max, Sum, Count of each column
– Supports seek to row number
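The built in indexes described above can be pictured with a toy model in Python. This is a sketch of the idea only, not the ORC file layout or reader API. The names `build_index` and `scan_equals` are hypothetical, and a plain list stands in for one column.

```python
ROWS_PER_GROUP = 10_000  # index interval quoted on the slide

def build_index(values, stride=ROWS_PER_GROUP):
    """Record start row, min, max, sum and count for each group of `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        group = values[start:start + stride]
        index.append({"start": start, "min": min(group), "max": max(group),
                      "sum": sum(group), "count": len(group)})
    return index

def scan_equals(values, index, target, stride=ROWS_PER_GROUP):
    """Return matching row numbers, reading only groups whose [min, max] can hold target."""
    hits, groups_read = [], 0
    for entry in index:
        if target < entry["min"] or target > entry["max"]:
            continue  # the statistics rule this group out, so seek past it
        groups_read += 1
        start = entry["start"]
        for offset, value in enumerate(values[start:start + stride]):
            if value == target:
                hits.append(start + offset)
    return hits, groups_read

column = list(range(50_000))  # sorted toy data, so min/max ranges do not overlap
index = build_index(column)
hits, groups_read = scan_equals(column, index, 31_337)
```

On this sorted column only one of the five row groups is read. How much is skipped in practice depends heavily on how the data is ordered: on unsorted data the min/max ranges overlap and far fewer groups can be ruled out.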
ORC File Format
• Hive 0.12
– Predicate Push Down
– Improved run length encoding
– Adaptive string dictionaries
– Padding stripes to HDFS block boundaries
• Hive 0.13
– Stripe-based Input Splits
– Input Split elimination
– Vectorized Reader
– Customized Pig Load and Store functions
– ACID support
• Next Steps
– Faster writes
– Integer dictionaries
– Better block buffering
Vectorized Query Execution
• Designed for Modern Processor Architectures
– Avoid branching in the inner loop.
– Make the most use of L1 and L2 cache.
• How It Works
– Process records in batches of 1,000 rows
– Generate code from templates to minimize branching.
• What It Gives
– 30x improvement in rows processed per second.
– Initial prototype: 100M rows/sec on laptop
• In Hive 0.13, initial (map) tasks vectorized
• Current work: vectorize shuffle and reduce tasks
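The batch-at-a-time shape described above can be sketched in Python. This shows only the data flow (column batches plus a list of selected positions), not the CPU-level benefits, which an interpreted sketch cannot reproduce. All names here (`batches`, `filter_gt`, `sum_selected`) are hypothetical and are not Hive classes.

```python
BATCH_SIZE = 1000  # batch size quoted on the slide

def batches(columns, batch_size=BATCH_SIZE):
    """Yield column-oriented batches: a dict mapping column name to a slice of values."""
    row_count = len(next(iter(columns.values())))
    for start in range(0, row_count, batch_size):
        yield {name: col[start:start + batch_size] for name, col in columns.items()}

def filter_gt(batch, column, threshold):
    """Return the positions within the batch where column > threshold."""
    col = batch[column]
    return [i for i in range(len(col)) if col[i] > threshold]

def sum_selected(batch, column, selected):
    """Aggregate one column over the selected positions only."""
    col = batch[column]
    return sum(col[i] for i in selected)

# Toy table stored column by column: 2,500 rows, so three batches.
table = {"price": list(range(2500)), "qty": [2] * 2500}

total = 0
batch_count = 0
for batch in batches(table):
    batch_count += 1
    selected = filter_gt(batch, "price", 2000)   # operator 1: filter
    total += sum_selected(batch, "qty", selected)  # operator 2: aggregate
```

Each operator runs a tight loop over one column of a whole batch instead of being invoked once per row, and rows rejected by the filter are never copied, only left out of the selection list.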
Try it Yourself
• Apache Hive 0.13
– http://hive.apache.org/downloads.html
• Download and play with HDP-2.1
– http://hortonworks.com/products/hortonworks-sandbox/ for use on your laptop
– http://hortonworks.com/hdp/ for use on your cluster
Thank You!
@alanfgates
@hortonworks

From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Hive Optimizations and New Features in 0.11-0.13

  • 1. © Hortonworks Inc. 2013. Hive for Analytic Workloads. Alan Gates (@alanfgates)
  • 2. Stinger Project (announced February 2013): Batch AND Interactive SQL-IN-Hadoop
      Stinger Initiative: a broad, community-based effort to drive the next generation of Hive
      Goals:
      – Speed: improve Hive query performance by 100X to allow for interactive query times (seconds)
      – Scale: the only SQL interface to Hadoop designed for queries that scale from TB to PB
      – SQL: support the broadest range of SQL semantics for analytic applications running against Hadoop
      …all IN Hadoop
      Hive 0.11, May 2013:
      – Base Optimizations
      – SQL Analytic Functions
      – ORCFile, Modern File Format
      Hive 0.12, October 2013:
      – VARCHAR, DATE Types
      – ORCFile predicate pushdown
      – Advanced Optimizations
      – Performance Boosts via YARN
      Hive 0.13, April 2014:
      – Hive on Apache Tez
      – SQL standard authorization
      – Permanent UDFs
      – Vectorized Processing
  • 3. Stinger Highlights
      – 13 months
      – 145 separate contributors, from 44 separate entities
      – 3 Hive releases: 0.11, 0.12, and 0.13
      – 392,000 lines of new Java code
  • 4. "Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning." - Winston Churchill
  • 5. Hive 0.13 Performance
      – The TPC Benchmark™ DS is a decision support benchmark that models queries and data maintenance. It evaluates decision support systems that examine large volumes of data to answer real-world business questions.
      – Test: 50 SQL queries on Hive 0.13
      – Test environment:
          – Driven by the Hive Testbench: https://github.com/cartershanklin/hive-testbench
          – Nodes: 20 nodes, 256 GB per node, of which only 48 GB per node was used for Hive
          – Drives: 6x 4TB WDC WD4000FYYZ-0 drives per node
          – Interconnect: 10GB
          – Processors: 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz, for a total of 16 CPU cores per machine
          – Scale: 30K (30 TB total data)
  • 6. Benchmark Results. Queries were modified to have a partition key that duplicates the join key, making it easier for the optimizer to choose which partitions to scan.
  • 7. Benchmark Results. Queries were modified to have a partition key that duplicates the join key, making it easier for the optimizer to choose which partitions to scan.
  • 8. SQL Semantics, by release
      – Hive 0.10 & before: SELECT, JOIN, WHERE, GROUP BY, HAVING, ORDER BY, UNION, ROLLUP/CUBE, subqueries in FROM
      – Hive 0.11: Windowing functions (RANK, ROW_NUMBER) and OVER clause
      – Hive 0.13:
          – Subqueries with IN, EXISTS in WHERE and HAVING
          – Common table expressions (WITH clause)
          – Join condition in WHERE
          – CREATE FUNCTION (stored on cluster)
      – Next steps:
          – Temporary tables
          – Subqueries with equality and inequality operators
          – Full UNION support
          – Set operators, EXCEPT and INTERSECT
  • 9. Security, by release
      – Hive 0.12 & before:
          – StorageBasedAuthorizationProvider, maps file level security
          – secure, based on HDFS security
          – coarse grained, no column or row level security
          – default, all advisory
          – everyone has grant permissions
      – Hive 0.13: SQL standard security for tables, views, and databases
          – GRANT/REVOKE
          – ROLEs
          – Column and row level permissions via views
      – Next steps:
          – Integration with XA Secure
          – Extend to cover execution of functions
  • 10. Data Type Conformance: available data types, by release
      – Hive 0.10 & before: integer types, floating types, string, array, map, struct, timestamp, binary
      – Hive 0.11: decimal (default precision and scale only)
      – Hive 0.12: date, varchar
      – Hive 0.13: char, user defined precision and scale for decimal
  • 11. Read and Write, ACID: write capabilities and ACID compliance, by release
      – Hive 0.12 & before:
          – INSERT and INSERT OVERWRITE available
          – Locking available, requires ZooKeeper for durability
          – No ACID
      – Hive 0.13:
          – ACID compliant ingestion of data from streaming sources such as Flume and Storm
          – Snapshot isolation for readers
      – Next steps:
          – Addition of INSERT … VALUES, UPDATE, DELETE
          – Multi-statement transactions: BEGIN, COMMIT, ROLLBACK
          – Integration with HCatalog
      Owen and I have a talk on this at 5:30 today.
  • 12. Optimizer, by release
      – Hive 0.11 & before: rules based optimizer
          – Mostly simple rules such as push filter below join
      – Hive 0.12: correlation optimizer
          – Where possible, combine related execution into a single job
      – Next steps:
          – Use Optiq for cost based optimization
          – Join ordering and operator selection using statistics and cost estimates
          – Expand statistics calculated and used in planning
      Julian has a talk on this at 4:35 today.
  • 13. MapReduce is dead, Long live Hadoop
  • 14. MapReduce is dead, Long live Hadoop
      Tez talks:
      – A New Chapter in Hadoop Data Processing, today 12:05
      – Hive on Apache Tez: Benchmarked at Yahoo! Scale, today 12:05
      – Hive + Tez: A Performance Deep Dive, today 2:35
  • 15. ORC File Format
      – Columnar format for complex data types
      – Built into Hive from 0.11
      – Support for Pig via OrcLoader/OrcStorer
      – Support for MapReduce via HCat
      – Two levels of compression: lightweight type-specific and generic
      – Built in indexes:
          – Every 10,000 rows with position information
          – Min, Max, Sum, Count of each column
          – Supports seek to row number
  • 16. ORC File Format, by release
      – Hive 0.12:
          – Predicate Push Down
          – Improved run length encoding
          – Adaptive string dictionaries
          – Padding stripes to HDFS block boundaries
      – Hive 0.13:
          – Stripe-based Input Splits
          – Input Split elimination
          – Vectorized Reader
          – Customized Pig Load and Store functions
          – ACID support
      – Next steps:
          – Faster writes
          – Integer dictionaries
          – Better block buffering
  • 17. Vectorized Query Execution
      – Designed for modern processor architectures:
          – Avoid branching in the inner loop
          – Make the most use of L1 and L2 cache
      – How it works:
          – Process records in batches of 1,000 rows
          – Generate code from templates to minimize branching
      – What it gives:
          – 30x improvement in rows processed per second
          – Initial prototype: 100M rows/sec on laptop
      – In Hive 0.13, initial (map) tasks vectorized
      – Current work: vectorize shuffle and reduce tasks
  • 18. Try it Yourself
      – Apache Hive 0.13: http://hive.apache.org/downloads.html
      – Download and play with HDP-2.1:
          – http://hortonworks.com/products/hortonworks-sandbox/ for use on your laptop
          – http://hortonworks.com/hdp/ for use on your cluster
  • 19. Thank You! @alanfgates @hortonworks
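
The ORC slides mention built-in indexes (an entry every 10,000 rows holding position information plus min, max, sum, and count per column) and, from Hive 0.12, predicate push down. A minimal Python sketch of how such an index could let a reader skip row groups is shown here. It is an illustration only, not ORC's actual on-disk layout or reader code; `IndexEntry`, `build_index`, and `scan_equals` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

ROWS_PER_INDEX_ENTRY = 10_000  # index stride quoted on the slide


@dataclass
class IndexEntry:
    """Hypothetical per-row-group statistics for one integer column."""
    first_row: int  # position information: where the row group starts
    count: int
    minimum: int
    maximum: int
    total: int


def build_index(column, stride=ROWS_PER_INDEX_ENTRY):
    """Compute one IndexEntry for every `stride` rows of the column."""
    entries = []
    for start in range(0, len(column), stride):
        group = column[start:start + stride]
        entries.append(
            IndexEntry(start, len(group), min(group), max(group), sum(group))
        )
    return entries


def scan_equals(column, index, value):
    """Evaluate `column == value`, skipping row groups the index rules out.

    Returns the matching row numbers and how many row groups were read.
    """
    matches, groups_read = [], 0
    for entry in index:
        if value < entry.minimum or value > entry.maximum:
            continue  # the predicate cannot match anywhere in this group
        groups_read += 1
        for row in range(entry.first_row, entry.first_row + entry.count):
            if column[row] == value:
                matches.append(row)
    return matches, groups_read


# Sorted data, so each row group covers a disjoint value range.
column = list(range(35_000))
index = build_index(column)
matches, groups_read = scan_equals(column, index, 12_345)
```

With sorted input, only one of the four row groups has to be read for this predicate. On unsorted data the min/max ranges overlap and fewer groups can be skipped, so the benefit depends on how the data is laid out.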

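The vectorized query execution slide describes processing records in batches of 1,000 rows with tight inner loops. The following Python sketch contrasts a row-at-a-time loop with a batch-oriented one that stores each column in its own array and passes a selection vector between operators. It shows only the structure of the technique: Hive's implementation is in Java, Python will not reproduce the quoted speedups, and the selection-vector design and all function names here are assumptions made for illustration.

```python
from array import array

BATCH_SIZE = 1000  # batch size quoted on the slide


def row_at_a_time(rows, threshold):
    """Baseline: filter and aggregate one (price, qty) row at a time."""
    total = 0
    for price, qty in rows:
        if price > threshold:
            total += price * qty
    return total


def make_batches(rows, batch_size=BATCH_SIZE):
    """Split rows into column-oriented batches, one array per column."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        prices = array("q", (r[0] for r in chunk))
        qtys = array("q", (r[1] for r in chunk))
        yield prices, qtys


def filter_greater(prices, threshold):
    """Filter operator: return a selection vector of qualifying positions."""
    return [i for i, p in enumerate(prices) if p > threshold]


def sum_product(prices, qtys, selected):
    """Aggregate operator: loop over the selected positions only."""
    return sum(prices[i] * qtys[i] for i in selected)


def vectorized(rows, threshold):
    """Run the filter and the aggregate once per batch, not once per row."""
    total = 0
    for prices, qtys in make_batches(rows):
        selected = filter_greater(prices, threshold)
        total += sum_product(prices, qtys, selected)
    return total


rows = [(i % 100, i % 7) for i in range(2500)]
assert vectorized(rows, 50) == row_at_a_time(rows, 50)
```

Each operator touches one column of one batch at a time, which is what keeps the working set small enough for the L1 and L2 caches in a compiled implementation.
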
Editor's notes

  1. 21 – 29 sec, scans one day of the items table
  2. 93 – fact-to-fact left outer join over a year's data, finished in around an hour; 13 – full-year 6-way star join