Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive on Spark is Blazing Fast… Or Is It?
Carter Shanklin and Mostafa Mokhtar
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Why SQL on Hadoop? Solving for Scale.
Hadoop is great for cost, but MapReduce is too difficult.
SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer.
I’m deleting important data because it’s too expensive to store it.
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
SQL at Facebook: Emergence of Apache Hive
Developed Hive to address traditional RDBMS limitations.
300+ PB of data under management(1).
600+ TB of data loaded daily.
60,000+ Hive queries per day(2).
More than 1,000 users per day.
Initial Apache release in April 2009.
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive Classic: Strengths and Challenges
+ Familiar SQL Interface
+ Economical Processing of Petabytes
- Hive Classic tied to MapReduce, leading to latency
Traditional SQL Workloads Needed Higher Performance!
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Need for Speed: The Stinger Initiative
Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
Launched: February 2013; Delivered: April 2014.
Delivered in 100% Apache Open Source.
SQL Engine (Vectorized SQL Engine) + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Stinger Phase 3: TPC-DS Benchmark at 30 Terabyte Scale
Sample of 50 queries from TPC-DS at 30 terabyte scale.
Average 52x Query Speedup, Maximum 160x Query Speedup.
Total benchmark time decreased from 7.8 days to 9.3 hours.(3)
Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive + Stinger at Yahoo
Scale: Around 1 million Hive jobs run every month.
Performance: Total benchmark time from 8.1 hours to 1.3 hours at 10TB scale.
Performance: Up to 82x faster.(4)
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Stinger at Spotify
Speed: Query 25 TB of compressed data in 10 minutes across 690 nodes (MapReduce too slow to complete).
Efficiency: 16x less HDFS read when using ORCFile versus Avro.(5)
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ORCFile at Facebook
Compression: Saved more than 1,400 servers' worth of storage.
Compression: Compression ratio increased from 5x to 8x globally.
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive on Tez: Conclusion
Hive on Tez delivers fast batch and interactive SQL today.
But users need more speed!
Scale: Proven at petabyte scale.
SQL: The most comprehensive open-source SQL on Hadoop.
Speed: More than 90 Hortonworks customers use Hive-on-Tez today for fast SQL.
Hortonworks Customer Support metrics as of Feb/2015
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Next Stop: Stinger.next and Sub-Second SQL
The emergence of LLAP and Hive-on-Spark brings sub-second SQL within reach.
What does it take to get Hive to sub-second?
Does Hive-on-Spark get us there?
Page 12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Performance Today and the Sub-Second Future
Hive on Tez, Hive on Spark, Hive on MapReduce & Spark-SQL
Page 13 © Hortonworks Inc. 2014
Query processing in Hadoop
SQL support: HiveQL, Spark-SQL
SQL engine: Hive engine, Spark-SQL engine
Distributed execution engine: MapReduce, Tez, Spark
Columnar storage: ORC File, Parquet File
Cache: block cache, Linux cache
[Diagram highlight: what is covered today in terms of performance]
Page 15 © Hortonworks Inc. 2014
Performance comparison : Test bed
Software:
Component | Version
Hive | 1.2.0
Tez | 0.5.2
Spark | 1.2.0
Hadoop | 2.6.0
Hardware
20 physical nodes, each with:
● 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for total of 16 CPU cores/machine
● Hyper-threading enabled
● 256GB RAM per node
● 6x 4TB WDC WD4000FYYZ-0 drives per node
● 10 Gigabit interconnect between the nodes
Note: Based on the YARN Node Manager’s Memory Resource setting used below, only 128 GB of RAM per node
was dedicated to query processing.
Performance benchmarks:
Execution engine | Primitives on 30TB scale factor | TPC-DS queries on 30TB scale factor | TPC-DS queries on 200GB scale factor
Spark | X | X | X
Tez | X | X | X
MapReduce | X | |
Spark-SQL | X | X | X
Page 16 © Hortonworks Inc. 2014
Performance comparison : Configurations
Hive on Tez
● 128GB of memory allocated
● 16 out of 32 logical processors allocated
● hive.execution.engine = tez
● hive.auto.convert.join.noconditionaltask.size = 600MB
● Vectorization enabled
● CBO enabled
● Fetch column stats enabled
Other settings
● hive.prewarm.numcontainers = 317
● hive.tez.auto.reducer.parallelism = true
Hive on Spark
● 128GB of memory allocated
● 16 out of 32 logical processors allocated
● hive.execution.engine = spark
● Configuration parameters followed the recommendations from the Hive on Spark wiki (http://tinyurl.com/pk2ju8e), which also had CBO, vectorization and fetch column stats enabled
● spark.master = yarn-master
Spark settings
● spark.shuffle.memoryFraction = 0.5
● spark.storage.memoryFraction = 0.1
● spark.shuffle.consolidateFiles = true
● spark.serializer = org.apache.spark.serializer.KryoSerializer
Spark-SQL
● 128GB of memory allocated
● 16 out of 32 logical processors allocated
● spark.shuffle.memoryFraction = 0.5
● spark.storage.memoryFraction = 0.1
● spark.shuffle.consolidateFiles = true
● spark.serializer = org.apache.spark.serializer.KryoSerializer
● spark.sql.shuffle.partitions = 1009
● spark-sql --master yarn-client
● driver-memory 8g
● Default GC configuration
● spark.sql.codegen was not enabled, as it caused most queries to fail
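The Hive on Tez settings listed above could be applied per session as `set` statements. The sketch below is only an illustration: the helper `to_set_statements` is hypothetical, and the property names chosen for "Vectorization enabled", "CBO enabled" and "Fetch column stats enabled" are assumptions, since the slide does not spell them out. The 600MB join threshold is written in bytes.

```python
MB = 1024 * 1024

# Values taken from the slide; the three boolean property names are assumed mappings.
HIVE_ON_TEZ_SETTINGS = {
    "hive.execution.engine": "tez",
    "hive.auto.convert.join.noconditionaltask.size": 600 * MB,  # 600MB in bytes
    "hive.vectorized.execution.enabled": "true",  # assumed name for "Vectorization enabled"
    "hive.cbo.enable": "true",                    # assumed name for "CBO enabled"
    "hive.stats.fetch.column.stats": "true",      # assumed name for "Fetch column stats enabled"
    "hive.prewarm.numcontainers": 317,
    "hive.tez.auto.reducer.parallelism": "true",
}


def to_set_statements(settings):
    """Render a settings mapping as Hive session 'set key=value;' lines."""
    return [f"set {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    for line in to_set_statements(HIVE_ON_TEZ_SETTINGS):
        print(line)
```

The printed lines can be pasted at the top of a query script so that every run uses the same session configuration.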
Page 17 © Hortonworks Inc. 2014
Performance comparison : TPC-DS 200GB
● Warm timings reported; cold queries on Spark are significantly slower
● Hive on Tez using ORC format
● Hive on Spark using Parquet format
● Spark-SQL using Parquet format
Chart totals: Hive on Tez 1,118; Hive on Spark 1,982; Spark-SQL 1,235
Hive on Tez is 77% faster than Hive on Spark and 10% faster than Spark-SQL.
Spark-SQL is 60% faster than Hive on Spark.
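The percentages on this slide are consistent with computing the slower engine's extra time relative to the faster engine's total. The sketch below assumes that definition and assumes the three chart values (1,118, 1,982, 1,235) belong to Hive on Tez, Hive on Spark and Spark-SQL respectively; the function name `percent_faster` is illustrative, not something from the deck.

```python
def percent_faster(baseline, other):
    """Extra time 'other' needs relative to 'baseline', as a whole percentage."""
    return round(100 * (other - baseline) / baseline)


# Assumed mapping of the chart totals to engines.
totals = {"Hive on Tez": 1118, "Hive on Spark": 1982, "Spark-SQL": 1235}

if __name__ == "__main__":
    print(percent_faster(totals["Hive on Tez"], totals["Hive on Spark"]))
    print(percent_faster(totals["Hive on Tez"], totals["Spark-SQL"]))
    print(percent_faster(totals["Spark-SQL"], totals["Hive on Spark"]))
```

Under these assumptions the same formula also explains the "Tez Gain %" columns in the primitive benchmarks that follow.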
Page 19 © Hortonworks Inc. 2014
Performance comparison : TPC-DS 200GB summary
Even simple queries don’t run in sub-second time.
Page 22 © Hortonworks Inc. 2014
Performance comparison : TPC-DS 200GB
● 200GB Scale factor, un-partitioned schema
● 45 unmodified queries from TPC-DS
● ORC format compression ratio 3.4x
● Parquet format compression ratio of 2.8x
Page 23 © Hortonworks Inc. 2014
Performance comparison : TPC-DS 30TB
Workload config
● 30 TB scale factor
● ORC table format
● Fact tables partitioned on *_date_sk
● Explicit partition filters were used for Hive on Spark and Spark-SQL (but not for Hive on Tez)
● 20 of the previously used queries were used; warm query timings reported
Highlights from 30TB TPC-DS test
● Hive on Tez outperforms Hive on Spark and Spark-SQL by up to 18x
● Hive on Spark completed 15 of the 20 queries; the remaining 5 errored out or were stuck in GC and got cancelled
● Spark-SQL completed 7 of the 20 queries; the remaining 13 either failed within a couple of minutes or errored out after running for hours
● Spark-SQL performance is negatively affected by inefficient query plans, as it lacks a sophisticated query optimizer
Page 24 © Hortonworks Inc. 2014
Performance comparison : TPC-DS 30TB
Chart totals: Hive on Tez 1,828; Hive on Spark 10,098
For large data sets, Hive on Tez is ~5x faster than Hive on Spark.
Page 26 © Hortonworks Inc. 2014
Performance comparison : TPC-DS 30TB continued
[Chart callout: failed Spark-SQL queries]
Page 28 © Hortonworks Inc. 2014
Performance comparison : TPC-DS 30TB Q17
[Chart callout: Hive on Tez query ends here]
Page 30 © Hortonworks Inc. 2014
Why didn’t Spark take Hive to sub-second?
● Hive is CPU bound for most operations, especially after the introduction of columnar file formats (do more with less)
● Spark consumes more CPU, disk and network IO than Tez for relatively large datasets
● Hive on Spark spends a lot of time translating from RDDs to Hive’s “Row Containers”
Chart callouts (Tez relative to Spark): 2x less disk IO, 4x less network IO, 6x less CPU
Page 32 © Hortonworks Inc. 2014
I don’t believe what you just said!!!
Show me some queries I can understand...
Simple queries to understand complex systems
Execution engine Primitives
Page 33 © Hortonworks Inc. 2014
Performance comparison : What are those primitives?
Group | Test case | Comment
ETL | Create table as select * | Insert 8 billion rows, 570 GB of data
ETL | Create table as select with group by | Group by and insert 8 billion rows, 570 GB of data
ETL | Create table as select with group by on all columns followed by cluster by | Group by, cluster by and insert 8 billion rows, 570 GB of data
Group by | Group by on primary key | Group by 25 billion distinct keys
Group by | Group by on column with low NDV* | Group by 82 billion rows with 8K distinct keys
Map join | store_sales x item | Map join 28 billion x 462K
Map join | store_sales x item x store | Map join 28 billion x 462K x 1.7K
Map join | store_sales x item x store x customer_demographics | Map join 28 billion x 462K x 1.7K x 1.9 million
Shuffle join | Shuffle join | Shuffle join 8.6 billion x 706 million rows
Shuffle join | Shuffle join + group by on primary key | Shuffle join 8.6 billion x 706 million rows followed by group by on 675 million rows
* NDV: number of distinct values
Page 34 © Hortonworks Inc. 2014
Performance comparison : CTAS
Create table test_table as select * from store_returns;
Plan operators: Table Scan (store_returns, 8 billion rows); Table Insert (8 billion rows)
Execution engine | Elapsed time (seconds) | Tez gain %
Hive on Tez | 316 |
Hive on Spark | 351 | 11%
Hive on MapReduce | 494 | 56%
Spark-SQL | 418 | 32%
Tez is 11% faster than Spark, 56% faster than MapReduce and 32% faster than Spark-SQL.
Page 36 © Hortonworks Inc. 2014
Performance comparison : CTAS with group by
Create table test_table as select * from store_returns group by *;
Plan operators: Table Scan (store_returns, 8 billion rows); Shuffle on all columns (8 billion rows); Group by on all columns (7 billion rows); Table Insert (4 billion rows)
This time, the execution engine must prepare, shuffle and aggregate data.
Execution engine | Elapsed time (seconds) | Tez gain %
Hive on Tez | 630 |
Hive on Spark | 1,608 | 155%
Hive on MapReduce | 840 | 33%
Spark-SQL | 1,202 | 91%
Tez is 155% faster than Spark, 33% faster than MapReduce and 91% faster than Spark-SQL.
Page 39 © Hortonworks Inc. 2014
Performance comparison : Select + group by on PK
select count(*) rowcount from store_sales group by ss_item_sk, ss_ticket_number having rowcount > 100000000
Plan operators: Table Scan (25 billion rows); Shuffle (25 billion rows); Group by (25 billion rows); Filter operator (25 billion rows); Select (0 rows qualify)
Group by performed on all 25 billion distinct keys.
Execution engine | Elapsed time (seconds) | Tez gain %
Hive on Tez | 457 |
Hive on Spark | 2,966 | 550%
Hive on MapReduce | 893 | 96%
Spark-SQL | 862 | 89%
Tez is 550% faster than Spark, 96% faster than MapReduce and 89% faster than Spark-SQL.
Page 42 © Hortonworks Inc. 2014
Performance comparison : Select + group by on low NDV
select sum(ss_list_price) from store_sales group by ss_sold_date_sk having sum(ss_list_price) = 1
Plan operators: Table Scan (85 billion rows); Group by (85 billion rows); Filter operator (8K rows); Select (0 rows qualify)
Execution engine | Elapsed time (seconds) | Tez gain %
Hive on Tez | 51 |
Hive on Spark | 56 | 10%
Hive on MapReduce | 290 | 465%
Spark-SQL | 164 | 221%
Hive on Tez and Hive on Spark outperform Spark-SQL.
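A group by on a column with few distinct values can shrink its input early: each task can keep one running aggregate per key, so only a handful of partial results per task need to be merged. The sketch below illustrates that general technique (partial aggregation followed by a merge) on toy data. It is not taken from any of the benchmarked engines, and the function names and sample rows are made up for the example.

```python
from collections import defaultdict


def partial_sums(rows):
    """Task side: collapse one input split to a single running sum per key."""
    sums = defaultdict(float)
    for key, value in rows:
        sums[key] += value
    return dict(sums)


def merge_partials(partials):
    """Merge side: combine the per-split sums into final totals."""
    totals = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)


# Toy (ss_sold_date_sk, ss_list_price) rows spread over two input splits.
splits = [
    [(1, 10.0), (2, 5.0), (1, 2.5)],
    [(2, 1.0), (1, 4.0)],
]
partials = [partial_sums(split) for split in splits]
totals = merge_partials(partials)

if __name__ == "__main__":
    print(totals)
```

With only two distinct keys, the five input rows become four partial results before the merge; at 8K keys over billions of rows the reduction is far larger.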
Page 44 © Hortonworks Inc. 2014
Performance comparison : Map join with 1, 2 & 3 tables
select count(*) from store_sales, item, store, customer_demographics where i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and ss_cdemo_sk = cd_demo_sk
Plan operators: Table Scan (store_sales, 28 billion rows); Table Scan (item, 472K rows); Table Scan (store, 1.7K rows); Table Scan (customer_demographics, 1.9 million rows); three Map joins (27 billion rows each)
Execution engine | Map join #1 (seconds) | Map join #2 (seconds) | Map join #3 (seconds) | Tez gain % join #1 | Tez gain % join #2 | Tez gain % join #3
Hive on Tez | 108 | 145 | 232 | | |
Hive on Spark | 106 | 142 | 289 | -2% | -2% | 25%
Hive on MapReduce | 247 | 280 | 800 | 129% | 93% | 245%
Spark-SQL | 86 | 117 | 166 | -20% | -20% | -28%
Spark-SQL is faster than Hive on Tez and Hive on Spark for map joins.
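In a map join, the small table is loaded into an in-memory hash table in every task, and the large table is streamed past it without a shuffle. The sketch below shows that idea for a `count(*)` over a single join; the function name `map_join_count` and the toy rows are illustrative assumptions, and real engines add memory limits and table-size checks that are omitted here.

```python
from collections import Counter


def map_join_count(fact_rows, dim_rows, fact_key, dim_key):
    """Count joined rows by probing an in-memory table built from the small side."""
    # Build side: number of small-table rows per join key.
    build = Counter(row[dim_key] for row in dim_rows)
    # Probe side: stream the large table; a missing key contributes 0 matches.
    return sum(build[row[fact_key]] for row in fact_rows)


# Toy stand-ins for store_sales (large) and item (small).
store_sales = [{"ss_item_sk": 1}, {"ss_item_sk": 2}, {"ss_item_sk": 1}, {"ss_item_sk": 9}]
item = [{"i_item_sk": 1}, {"i_item_sk": 2}]

if __name__ == "__main__":
    print(map_join_count(store_sales, item, "ss_item_sk", "i_item_sk"))
```

Joining against further small tables, as in the two- and three-table cases, amounts to probing one more in-memory table per streamed row.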
Page 46 © Hortonworks Inc. 2014
Performance comparison : Shuffle join + group by
● select count(*) from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number
● select count(*) rowcount from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number group by ss_item_sk, ss_ticket_number having rowcount > 1
Plan operators: Table Scan (8.6 billion rows); Table Scan (6 million rows); Shuffle Join (9 billion rows); Group by (675 million rows); Filter (675 million rows); Select (0 rows)
Execution engine | Shuffle join (seconds) | Shuffle join + group by (seconds) | Tez gain % (shuffle join) | Tez gain % (shuffle join + group by)
Hive on Tez | 400 | 453 | |
Hive on Spark | 1,078 | 1,120 | 170% | 147%
Hive on MapReduce | 756 | 826 | 89% | 82%
Spark-SQL | 1,835 | 1,884 | 359% | 316%
Shuffle join: Tez is 170% faster than Spark, 89% faster than MapReduce and 359% faster than Spark-SQL.
Shuffle join + group by: Tez is 147% faster than Spark, 82% faster than MapReduce and 316% faster than Spark-SQL.
Why are shuffles so slow for Hive on Spark and Spark-SQL?
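When neither input fits in memory, a shuffle join repartitions both tables by the join key so that matching rows land in the same partition, then joins each partition independently. The sketch below shows that general pattern for a `count(*)` on a two-column key. The function names and toy rows are assumptions made for illustration, and it runs in one process, whereas real engines move the partitions across the network.

```python
from collections import defaultdict


def shuffle(rows, key_fn, num_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key_fn(row)) % num_partitions].append(row)
    return partitions


def join_partition_count(left, right, key_fn):
    """Hash-join one pair of co-partitioned inputs and count the matches."""
    build = defaultdict(int)
    for row in right:
        build[key_fn(row)] += 1
    return sum(build.get(key_fn(row), 0) for row in left)


def shuffle_join_count(left, right, key_fn, num_partitions=4):
    """Repartition both sides by key, then join partition by partition."""
    left_parts = shuffle(left, key_fn, num_partitions)
    right_parts = shuffle(right, key_fn, num_partitions)
    return sum(
        join_partition_count(l, r, key_fn)
        for l, r in zip(left_parts, right_parts)
    )


# Toy (item_sk, ticket_number) rows standing in for store_sales and store_returns.
sales = [(1, 100), (1, 101), (2, 100), (3, 300)]
returns = [(1, 100), (3, 300), (4, 400)]

if __name__ == "__main__":
    print(shuffle_join_count(sales, returns, key_fn=lambda row: row))
```

Every row of both inputs is moved once in the `shuffle` step, which is why the cost of the shuffle implementation dominates this primitive.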
Page 49 © Hortonworks Inc. 2014
Performance comparison : Shuffle join cluster CPU utilization
[Chart callouts: Hive on Tez query ends here; Hive on Spark query ends here]
Page 52 © Hortonworks Inc. 2014
Performance comparison : Primitive results summary
Page 53 © Hortonworks Inc. 2014
Performance comparison : Performance summary
Engines compared: Hive Tez, Spark-SQL, Hive on Spark, MapReduce
+ Short running query
+ ETL
+ Large joins and aggregates
Slower than Spark-SQL in Map joins
High GC
Instability
SQL support limited compared to Hive
Lack of sophisticated query optimizer
+ Efficient resource utilization
+ Map join performance
Large Joins
+ Outperforms Spark-SQL in large join
Slower than Tez for large joins and aggregates
High GC
+ Promising initial release
Page 54 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Solving Hive’s Top Performance Challenges
Page55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Hive: Modern Architecture
SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
Engine: SQL Engines (Row Engine, Vector Engine)
Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark)
Cache: Block Cache, Linux Cache, Vector Cache; LLAP (Persistent Server)
Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
Legend: Historical, Current, In Development
Page56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Hive: Getting to Sub-Second Improvement
● LLAP: Persistent servers cache vectors and start queries instantly. Pluggable integrations with Tez or Spark.
● Vectorized Hash Join solves CPU boundedness for Hive on Tez or on Spark.
● Improved metadata catalog allows instant query planning and optimization for any engine.
Page59 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Hive’s Sub-Second Future
Sub-Second Hive = Metadata (Fast, Scalable Metadata Catalog) + Persistent Server (LLAP) + SQL Engine (Vectorized Hash Join) + Choice of Execution Engines (Tez or Spark)
Page60 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Questions?
Interested? Stop by the Hortonworks booth to learn more
Page61 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Endnotes
(1) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
(2) https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
(3) http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/
(4) http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
(5) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014

Contenu connexe

Tendances

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 

Tendances (20)

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Hive on spark is blazing fast or is it final

  • 1. Hive on Spark is Blazing Fast… Or Is It? Carter Shanklin and Mostafa Mokhtar
  • 2. Why SQL on Hadoop? Solving for Scale. "Hadoop is great for cost, but MapReduce is too difficult." "SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer." "I’m deleting important data because it’s too expensive to store it."
  • 3. SQL at Facebook: Emergence of Apache Hive. Developed Hive to address traditional RDBMS limitations. 300+ PB of data under management (1). 600+ TB of data loaded daily. 60,000+ Hive queries per day (2). More than 1,000 users per day. Initial Apache release in April 2009.
  • 4. Hive Classic: Strengths and Challenges. Strengths (+): familiar SQL interface; economical processing of petabytes. Challenge: Hive Classic is tied to MapReduce, leading to latency. Traditional SQL workloads needed higher performance!
  • 5. Need for Speed: The Stinger Initiative. Stinger: an open roadmap to improve Apache Hive’s performance 100x. Launched: February 2013; Delivered: April 2014. Delivered in 100% Apache open source. Vectorized SQL Engine + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X.
  • 6. Stinger Phase 3: TPC-DS Benchmark at 30 Terabyte Scale. Sample of 50 queries from TPC-DS at 30 terabyte scale. Average 52x query speedup, maximum 160x query speedup. Total benchmark time decreased from 7.8 days to 9.3 hours. (3) Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.
  • 7. Hive + Stinger at Yahoo. Scale: around 1 million Hive jobs run every month. Performance: total benchmark time cut from 8.1 hours to 1.3 hours at 10TB scale. Performance: up to 82x faster.(4)
  • 8. Stinger at Spotify. Speed: query 25 TB of compressed data in 10 minutes across 690 nodes (MapReduce too slow to complete). Efficiency: 16x less HDFS read when using ORCFile versus Avro.(5)
  • 9. ORCFile at Facebook. Compression: saved more than 1,400 servers’ worth of storage. Compression: compression ratio increased from 5x to 8x globally.
  • 10. Hive on Tez: Conclusion. Hive on Tez delivers fast batch and interactive SQL today. But users need more speed! Scale: proven at petabyte scale. SQL: the most comprehensive open-source SQL on Hadoop. Speed: more than 90 Hortonworks customers use Hive-on-Tez today for fast SQL (Hortonworks Customer Support metrics as of Feb/2015).
  • 11. Next Stop: Stinger.next and Sub-Second SQL. The emergence of LLAP and Hive-on-Spark brings sub-second within reach. What does it take to get Hive to sub-second? Does Hive-on-Spark get us there?
  • 12. Performance Today and the Sub-Second Future. Hive on Tez, Hive on Spark, Hive on MapReduce & Spark-SQL
  • 13. Page 13 © Hortonworks Inc. 2014 Query processing in Hadoop. Diagram layers: SQL support (HiveQL); SQL engine (Hive engine, Spark-SQL); distributed execution engine (Tez, MapReduce, Spark); columnar storage (ORC File, Parquet File); cache (block cache, Linux cache).
  • 14. Query processing in Hadoop: what is covered today in terms of performance. Diagram layers: SQL support (HiveQL); SQL engine (Hive engine, Spark-SQL); distributed execution engine (Tez, MapReduce, Spark); columnar storage (ORC File, Parquet File); cache (block cache, Linux cache).
  • 15. Performance comparison: Test bed. Software (component, version): Hive 1.2.0; Tez 0.5.2; Spark 1.2.0; Hadoop 2.6.0. Hardware: 20 physical nodes, each with: 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for a total of 16 CPU cores/machine; Hyper-threading enabled; 256GB RAM per node; 6x 4TB WDC WD4000FYYZ-0 drives per node; 10 Gigabit interconnect between the nodes. Note: Based on the YARN Node Manager’s Memory Resource setting used below, only 128 GB of RAM per node was dedicated to query processing. Performance benchmarks (columns: Primitives on 30TB scale factor, TPC-DS queries on 30TB scale factor, TPC-DS queries on 200GB scale factor): Spark X X X; Tez X X X; Map Reduce X; Spark-SQL X X X.
  • 16. Performance comparison: Configurations. Hive on Tez: 128GB of memory allocated; 16 out of 32 logical processors allocated; hive.execution.engine = tez; hive.auto.convert.join.noconditionaltask.size = 600MB; vectorization enabled; CBO enabled; fetch column stats enabled. Other settings: hive.prewarm.numcontainers = 317; hive.tez.auto.reducer.parallelism = true. Hive on Spark: 128GB of memory allocated; 16 out of 32 logical processors allocated; hive.execution.engine=spark; configuration parameters followed the recommendation from the Hive on Spark wiki (http://tinyurl.com/pk2ju8e), which also had CBO, vectorization, fetch column stats, etc. enabled; spark.master=yarn-master. Spark settings: spark.shuffle.memoryFraction = 0.5; spark.storage.memoryFraction = 0.1; spark.shuffle.consolidateFiles = true; spark.serializer = org.apache.spark.serializer.KryoSerializer. Spark-SQL: 128GB of memory allocated; 16 out of 32 logical processors allocated; spark.shuffle.memoryFraction = 0.5; spark.storage.memoryFraction = 0.1; spark.shuffle.consolidateFiles = true; spark.serializer = org.apache.spark.serializer.KryoSerializer; spark.sql.shuffle.partitions = 1009; spark-sql --master yarn-client; driver-memory 8g; default GC configuration. spark.sql.codegen was not enabled as it caused most queries to fail.
  • 17. Performance comparison: TPC-DS 200GB. Warm timings reported; cold queries on Spark are significantly slower. Hive on Tez using ORC format. Hive on Spark using Parquet format. Spark-SQL using Parquet format. Chart values: 1,118; 1,982; 1,235.
  • 18. Performance comparison: TPC-DS 200GB continued. Warm timings reported; cold queries on Spark are significantly slower. Hive on Tez using ORC format. Hive on Spark using Parquet format. Spark-SQL using Parquet format. Chart values: 1,118; 1,982; 1,235. Hive on Tez is 77% faster than Hive on Spark and 10% faster than Spark-SQL. Spark-SQL is 60% faster than Hive on Spark.
  • 19. Performance comparison: TPC-DS 200GB summary
  • 20. Performance comparison: TPC-DS 200GB summary. Even simple queries don’t run in sub-second.
  • 21. Performance comparison: TPC-DS 200GB summary. Even simple queries don’t run in sub-second.
  • 22. Performance comparison: TPC-DS 200GB. 200GB scale factor, un-partitioned schema. 45 unmodified queries from TPC-DS. ORC format compression ratio of 3.4x. Parquet format compression ratio of 2.8x.
  • 23. Performance comparison: TPC-DS 30TB. Workload config: 30 TB scale factor; ORC table format; fact tables partitioned on *_date_sk; explicit partition filters were used for Hive on Spark and Spark-SQL (but not for Hive-on-Tez); 20 of the previously used queries were used, warm query timings reported. Highlights from the 30TB TPC-DS test: Hive on Tez outperforms Hive on Spark and Spark-SQL by up to 18x; Hive on Spark completed 15 of the 20 queries, the remaining 5 errored out or were stuck in GC and got cancelled; Spark-SQL completed 7 of the 20, the remaining 13 either failed within a couple of minutes or errored out after running for hours; Spark-SQL performance is negatively affected by inefficient query plans as it lacks a query optimizer.
  • 24. Performance comparison: TPC-DS 30TB. Chart values: 1,828; 10,098.
  • 25. Performance comparison: TPC-DS 30TB. Chart values: 1,828; 10,098. For a large data set, Hive on Tez is ~5x faster than Hive on Spark.
  • 26. Performance comparison: TPC-DS 30TB continued
  • 27. Performance comparison: TPC-DS 30TB continued. Failed Spark-SQL queries.
  • 28. Performance comparison: TPC-DS 30TB Q17
  • 29. Performance comparison: TPC-DS 30TB Q17. Chart annotation: Hive on Tez query ends here.
  • 30. Why didn’t Spark take Hive to sub-second? Hive is CPU bound for most operations, especially after the introduction of columnar file formats (do more with less). Spark consumes more CPU, disk and network IO than Tez. Hive on Spark spends a lot of time translating from RDDs to Hive’s “Row Containers”.
  • 31. Why didn’t Spark take Hive to sub-second? Hive is CPU bound for most operations, especially after the introduction of columnar file formats (do more with less). Spark consumes more CPU, disk and network IO than Tez for relatively large datasets. Hive on Spark spends a lot of time translating from RDDs to Hive’s “Row Containers”. Chart callouts: 6x less CPU; 2x less disk IO; 4x less network IO.
  • 32. I don’t believe what you just said!!! Show me some queries I can understand... Execution engine primitives: simple queries to understand complex systems.
  • 33. Performance comparison: What are those primitives? (Group: test case, comment.) ETL: Create table as select * (insert 8 billion rows, 570 GB of data); Create table as select with group by (group by and insert 8 billion rows, 570 GB of data); Create table as with group by on all columns followed by cluster by (group by, cluster by and insert 8 billion rows, 570 GB of data). Group by: Group by on primary key (group by 25 billion distinct keys); Group by on column with low NDV* (group by 82 billion rows with 8K distinct keys). Map join: store_sales x item (map join 28 billion x 462K); store_sales x item x store (map join 28 billion x 462K x 1.7K); store_sales x item x store x customer_demographics (map join 28 billion x 462K x 1.7K x 1.9 million). Shuffle join: Shuffle join (shuffle join 8.6 billion x 706 million rows); Shuffle join + group by on primary key (shuffle join 8.6 billion x 706 million rows followed by group by on 675 million rows). *NDV: number of distinct values.
  • 34. Performance comparison: CTAS. Query: create table test_table as select * from store_returns; Plan operators: Table Scan store_returns (8 billion rows); Table Insert (8 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 316; Hive on Spark 351 (11%); Hive on MapReduce 494 (56%); Spark-SQL 418 (32%).
  • 35. Performance comparison: CTAS. Query: create table test_table as select * from store_returns; Plan operators: Table Scan store_returns (8 billion rows); Table Insert (8 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 316; Hive on Spark 351 (11%); Hive on MapReduce 494 (56%); Spark-SQL 418 (32%). Tez is 11% faster than Spark, 56% faster than MapReduce and 32% faster than Spark-SQL.
  • 36. Performance comparison: CTAS with group by. Query: create table test_table as select * from store_returns group by *; Plan operators: Table Insert (4 billion rows); Shuffle on all columns (8 billion rows); Group by on all columns (7 billion rows); Table Scan store_returns (8 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 630; Hive on Spark 1,608 (155%); Hive on MapReduce 840 (33%); Spark-SQL 1,202 (91%).
  • 37. Performance comparison: CTAS with group by. Query: create table test_table as select * from store_returns group by *; Plan operators: Table Insert (4 billion rows); Shuffle on all columns (8 billion rows); Group by on all columns (7 billion rows); Table Scan store_returns (8 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 630; Hive on Spark 1,608 (155%); Hive on MapReduce 840 (33%); Spark-SQL 1,202 (91%). This time, the execution engine must prepare, shuffle and aggregate data.
  • 38. Performance comparison: CTAS with group by. Query: create table test_table as select * from store_returns group by *; Plan operators: Table Insert (4 billion rows); Shuffle on all columns (8 billion rows); Group by on all columns (7 billion rows); Table Scan store_returns (8 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 630; Hive on Spark 1,608 (155%); Hive on MapReduce 840 (33%); Spark-SQL 1,202 (91%). Tez is 155% faster than Spark, 33% faster than MapReduce and 91% faster than Spark-SQL.
  • 39. Performance comparison: Select + group by on PK. Query: select count(*) rowcount from store_sales group by ss_item_sk, ss_ticket_number having rowcount > 100000000. Plan operators: Select (0 rows qualify); Shuffle (25 billion rows); Group by (25 billion rows); Table Scan (25 billion rows); Filter operator (25 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 457; Hive on Spark 2,966 (550%); Hive on MapReduce 893 (96%); Spark-SQL 862 (89%).
  • 40. Performance comparison: Select + group by on PK. Query: select count(*) rowcount from store_sales group by ss_item_sk, ss_ticket_number having rowcount > 100000000. Plan operators: Select (0 rows qualify); Shuffle (25 billion rows); Group by (25 billion rows); Table Scan (25 billion rows); Filter operator (25 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 457; Hive on Spark 2,966 (550%); Hive on MapReduce 893 (96%); Spark-SQL 862 (89%). Group-by performed on all 25 billion distinct keys.
  • 41. Performance comparison: Select + group by on PK. Query: select count(*) rowcount from store_sales group by ss_item_sk, ss_ticket_number having rowcount > 100000000. Plan operators: Select (0 rows qualify); Shuffle (25 billion rows); Group by (25 billion rows); Table Scan (25 billion rows); Filter operator (25 billion rows). Elapsed time in seconds (Tez gain %): Hive on Tez 457; Hive on Spark 2,966 (550%); Hive on MapReduce 893 (96%); Spark-SQL 862 (89%). Tez is 550% faster than Spark, 96% faster than MapReduce and 89% faster than Spark-SQL.
  • 42. Performance comparison: Select + group by on low NDV. Query: select sum(ss_list_price) from store_sales group by ss_sold_date_sk having sum(ss_list_price) = 1. Plan operators: Select (0 rows qualify); Group by (85 billion rows); Table Scan (85 billion rows); Filter operator (8K rows). Elapsed time in seconds (Tez gain %): Hive on Tez 51; Hive on Spark 56 (10%); Hive on MapReduce 290 (465%); Spark-SQL 164 (221%).
  • 43. Performance comparison: Select + group by on low NDV. Query: select sum(ss_list_price) from store_sales group by ss_sold_date_sk having sum(ss_list_price) = 1. Plan operators: Select (0 rows qualify); Group by (85 billion rows); Table Scan (85 billion rows); Filter operator (8K rows). Elapsed time in seconds (Tez gain %): Hive on Tez 51; Hive on Spark 56 (10%); Hive on MapReduce 290 (465%); Spark-SQL 164 (221%). Hive on Tez and Hive on Spark outperform Spark-SQL.
  • 44. Performance comparison: Map join with 1, 2 & 3 tables. Query: select count(*) from store_sales, item, store, customer_demographics where i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and ss_cdemo_sk = cd_demo_sk. Plan operators: Table Scan store_sales (28 billion rows); Table Scan item (472K rows); Table Scan store (1.7K rows); Table Scan customer_demographics (1.9 million rows); three map joins (27 billion rows each). Results for map join #1 / #2 / #3, with Tez gain % for each join: Hive on Tez 108 / 145 / 232; Hive on Spark 106 / 142 / 289 (98% / 98% / 125%); Hive on MapReduce 247 / 280 / 800 (228% / 193% / 345%); Spark-SQL 86 / 117 / 166 (-20% / -20% / -28%).
  • 45. Performance comparison: Map join with 1, 2 & 3 tables. Query: select count(*) from store_sales, item, store, customer_demographics where i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and ss_cdemo_sk = cd_demo_sk. Plan operators: Table Scan store_sales (28 billion rows); Table Scan item (472K rows); Table Scan store (1.7K rows); Table Scan customer_demographics (1.9 million rows); three map joins (27 billion rows each). Results for map join #1 / #2 / #3, with Tez gain % for each join: Hive on Tez 108 / 145 / 232; Hive on Spark 106 / 142 / 289 (98% / 98% / 125%); Hive on MapReduce 247 / 280 / 800 (228% / 193% / 345%); Spark-SQL 86 / 117 / 166 (-20% / -20% / -28%). Spark-SQL is faster than Hive on Tez and Hive on Spark for map joins.
  • 46. Performance comparison: Shuffle join + group by. Query 1: select count(*) from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number. Query 2: select count(*) rowcount from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number group by ss_item_sk, ss_ticket_number having rowcount > 1. Plan operators: Shuffle Join (9 billion rows); Group by (675 million rows); Table Scan (8.6 billion rows); Table Scan (6 million rows); Select (0 rows); Filter (675 million rows). Results for shuffle join / shuffle join + group by, with Tez gain %: Hive on Tez 400 / 453; Hive on Spark 1,078 / 1,120 (170% / 147%); Hive on MapReduce 756 / 826 (89% / 82%); Spark-SQL 1,835 / 1,884 (359% / 316%).
  • 47. Performance comparison: Shuffle join + group by. Query 1: select count(*) from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number. Query 2: select count(*) rowcount from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number group by ss_item_sk, ss_ticket_number having rowcount > 1. Plan operators: Shuffle Join (9 billion rows); Group by (675 million rows); Table Scan (8.6 billion rows); Table Scan (6 million rows); Select (0 rows); Filter (675 million rows). Results for shuffle join / shuffle join + group by, with Tez gain %: Hive on Tez 400 / 453; Hive on Spark 1,078 / 1,120 (170% / 147%); Hive on MapReduce 756 / 826 (89% / 82%); Spark-SQL 1,835 / 1,884 (359% / 316%). Shuffle join: Tez is 170% faster than Spark, 89% faster than MapReduce and 359% faster than Spark-SQL. Shuffle join + group by: Tez is 147% faster than Spark, 82% faster than MapReduce and 316% faster than Spark-SQL.
  • 48. Performance comparison: Shuffle join + group by. Query 1: select count(*) from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number. Query 2: select count(*) rowcount from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number group by ss_item_sk, ss_ticket_number having rowcount > 1. Plan operators: Shuffle Join (9 billion rows); Group by (675 million rows); Table Scan (8.6 billion rows); Table Scan (6 million rows); Select (0 rows); Filter (675 million rows). Results for shuffle join / shuffle join + group by, with Tez gain %: Hive on Tez 400 / 453; Hive on Spark 1,078 / 1,120 (170% / 147%); Hive on MapReduce 756 / 826 (89% / 82%); Spark-SQL 1,835 / 1,884 (359% / 316%). Why are shuffles so slow for Hive on Spark and Spark-SQL?
  • 49. Performance comparison: Shuffle join cluster CPU utilization
  • 50. Performance comparison: Shuffle join cluster CPU utilization. Chart annotation: Hive on Tez query ends here.
  • 51. Performance comparison: Shuffle join cluster CPU utilization. Chart annotation: Hive on Spark query ends here.
  • 52. Performance comparison: Primitive results summary
  • 53. Performance comparison: Performance summary. Engine labels: Hive, Tez, Spark-SQL, Hive on Spark, MapReduce. Points in slide order, with (+) marking a strength: Short running query (+); ETL (+); Large joins and aggregates (+); Slower than Spark-SQL in map joins; High GC; Instability; SQL support limited compared to Hive; Lack of sophisticated query optimizer; Efficient resource utilization (+); Map join performance (+); Large joins; Outperforms Spark-SQL in large join (+); Slower than Tez for large joins and aggregates; High GC; Promising initial release (+).
  • 54. Solving Hive’s Top Performance Challenges
  • 55. Apache Hive: Modern Architecture. Diagram layers: SQL (SQL support: SQL:2011, Optimizer, HCatalog, HiveServer2); Engine (SQL engines: Row Engine, Vector Engine); Distributed Execution (Hadoop 1: MapReduce; Hadoop 2: Tez, Spark); Cache (Block Cache, Linux Cache, Vector Cache); LLAP Persistent Server; Storage (Columnar Storage: ORCFile, Parquet; Unstructured Data: JSON, CSV, Text, Avro, Custom Weblog). Legend: Historical, Current, In Development.
  • 56. Apache Hive: Getting to Sub-Second. Improvement: LLAP. Persistent servers cache vectors and start queries instantly. Pluggable integrations with Tez or Spark. Diagram layers: SQL (SQL support: SQL:2011, Optimizer, HCatalog, HiveServer2); Engine (SQL engines: Row Engine, Vector Engine); Distributed Execution (Hadoop 1: MapReduce; Hadoop 2: Tez, Spark); Cache (Block Cache, Linux Cache, Vector Cache); LLAP Persistent Server; Storage (Columnar Storage: ORCFile, Parquet; Unstructured Data: JSON, CSV, Text, Avro, Custom Weblog). Legend: Historical, Current, In Development.
  • 57. Apache Hive: Getting to Sub-Second. Improvement: Vectorized Hash Join solves CPU boundedness for Hive on Tez or on Spark. Diagram layers: SQL (SQL support: SQL:2011, Optimizer, HCatalog, HiveServer2); Engine (SQL engines: Row Engine, Vector Engine); Distributed Execution (Hadoop 1: MapReduce; Hadoop 2: Tez, Spark); Cache (Block Cache, Linux Cache, Vector Cache); LLAP Persistent Server; Storage (Columnar Storage: ORCFile, Parquet; Unstructured Data: JSON, CSV, Text, Avro, Custom Weblog). Legend: Historical, Current, In Development.
  • 58. Apache Hive: Getting to Sub-Second. Improvement: an improved metadata catalog allows instant query planning and optimization for any engine. Diagram layers: SQL (SQL support: SQL:2011, Optimizer, HCatalog, HiveServer2); Engine (SQL engines: Row Engine, Vector Engine); Distributed Execution (Hadoop 1: MapReduce; Hadoop 2: Tez, Spark); Cache (Block Cache, Linux Cache, Vector Cache); LLAP Persistent Server; Storage (Columnar Storage: ORCFile, Parquet; Unstructured Data: JSON, CSV, Text, Avro, Custom Weblog). Legend: Historical, Current, In Development.
  • 59. Apache Hive’s Sub-Second Future. Fast, Scalable Metadata Catalog (Metadata) + LLAP (Persistent Server) + Vectorized Hash Join (SQL Engine) + Tez or Spark (Choice of Execution Engines) = Sub-Second Hive.
  • 60. Questions? Interested? Stop by the Hortonworks booth to learn more.
  • 61. Endnotes: (1) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ (2) https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920 (3) http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/ (4) http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn (5) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
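
The "Tez Gain %" columns in the primitive benchmarks (slides 34 to 48) appear to express how much longer each engine took than Hive on Tez: (engine time / Tez time - 1) x 100. The slides never state the formula, so this reading is an assumption. A minimal Python sketch that reproduces the CTAS figures under that assumption follows; the function name `tez_gain_pct` is hypothetical and used only for illustration.

```python
def tez_gain_pct(tez_seconds: float, other_seconds: float) -> int:
    """Percent by which another engine's elapsed time exceeds Hive on Tez's.

    Assumed formula (not stated on the slides): (other / tez - 1) * 100,
    rounded to the nearest whole percent.
    """
    if tez_seconds <= 0:
        raise ValueError("tez_seconds must be positive")
    return round((other_seconds / tez_seconds - 1) * 100)


# Elapsed times in seconds, copied from the CTAS slide
# (create table test_table as select * from store_returns).
ctas_seconds = {
    "Hive on Tez": 316,
    "Hive on Spark": 351,
    "Hive on MapReduce": 494,
    "Spark-SQL": 418,
}

# Gain of Hive on Tez over every other engine in the table.
ctas_gain = {
    engine: tez_gain_pct(ctas_seconds["Hive on Tez"], seconds)
    for engine, seconds in ctas_seconds.items()
    if engine != "Hive on Tez"
}

for engine, gain in ctas_gain.items():
    print(f"Tez is {gain}% faster than {engine}")
```

The same function reproduces the CTAS-with-group-by column (155%, 33%, 91%). A few other slide values differ by a point or more (for example, 2,966 s against 457 s gives 549%, where the slide shows 550%), presumably because the published times are rounded.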