Hive Data Organization Tips for Performance


This deck covers techniques for organizing data in Hive for performance:
- Partitioning: avoid high-cardinality partition columns such as customer ID or VIN; aim for an average partition of at least 1 GB and fewer than 1000 partitions per query.
- Bucketing: choose the number and size of buckets so that you avoid too many small files and skewed data distribution.
- Denormalization and JOINs: weigh denormalization against JOIN techniques such as broadcast joins.
- Types: store data in its natural types (numbers as numbers, not strings) so that predicate push-down works.
- Temporary tables: use temporary tables and in-memory storage for queries that reorganize data or need a distinct slice.


Page1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive: Data Organization for Performance
Gopal Vijayaraghavan

In this episode
All BigData problems are primarily lookup problems
All Lookup problems are really Storage problems
All Storage problems turn into ETL problems
ETL problems are all about the Data
Data navigation?
Data organization?
Data ingestion?
It’s Big?

Good idea: Do things that scale!
There are many problems like this, but this one is mine

Partitions
If you have a database on cars and you partition on VIN#, you get one tiny partition per car
If you have a database on sales and you partition on customer_id, you get one partition per customer
Rule of thumb: the average partition is >= 1 GB and the total # of partitions per query is < 1000
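
The rule of thumb on this slide is simple arithmetic, so it can be checked before a table is created. The following is a minimal sketch in Python; the function name `check_partitioning`, the example table sizes, and the reading of "1Gb" as 1 GiB are assumptions made for illustration, not part of the original deck.

```python
GIB = 1024 ** 3  # assumption: the slide's "1Gb" is read as one gibibyte


def check_partitioning(total_bytes, n_partitions, partitions_per_query,
                       min_avg_bytes=GIB, max_per_query=1000):
    """Return a list of rule-of-thumb violations (an empty list means OK)."""
    problems = []
    avg_bytes = total_bytes / n_partitions
    if avg_bytes < min_avg_bytes:
        problems.append(
            f"average partition is {avg_bytes / GIB:.4f} GiB, below 1 GiB")
    if partitions_per_query >= max_per_query:
        problems.append(
            f"a query touches {partitions_per_query} partitions, not < {max_per_query}")
    return problems


# Hypothetical 2 TiB sales table: partitioned by day over 3 years,
# versus partitioned by customer_id for 5 million customers.
by_day = check_partitioning(2 * 1024 * GIB, 3 * 365, 90)
by_customer = check_partitioning(2 * 1024 * GIB, 5_000_000, 5_000_000)

print("by day:", by_day)
print("by customer_id:", by_customer)
```

With these made-up numbers the daily layout passes both checks, while the per-customer layout fails both: partitions are far too small and a full scan touches millions of them.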

Buckets
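If you have more files than rows, you’ve definitely got bucketing wrong
“clustered by” != “cluster by”: the first is table DDL that declares buckets, the second is a query clause (shorthand for distribute by + sort by on the same columns)
Bucketing on a skewed column slows down ETL a *lot* (for no win)
If you have partitions, sort-merge bucket-mapjoin can be slower than a shuffle!!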

Buckets - II
Histograms!
select explode(histogram_numeric(
    hash(<col>) % <n-bucket>, <n-bucket>
)) as h from <table>;
The Curse of 31 & the last byte
If you have buckets & partitions, always remember to ETL with
set hive.optimize.sort.dynamic.partition=true;
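
The histogram idea and the "Curse of 31" can be reproduced outside Hive. The sketch below is in Python and rests on two assumptions that the slide does not spell out: that string keys are hashed with a Java-style multiply-by-31 hash over their bytes, and that the bucket is the non-negative hash modulo the bucket count. Under those assumptions, with 31 buckets every earlier byte is multiplied by a power of 31 and drops out of the modulo, so (for keys short enough that the 32-bit hash has not wrapped) only the last byte decides the bucket. The helper names are hypothetical.

```python
from collections import Counter


def hash31(key: str) -> int:
    """Multiply-by-31 hash over UTF-8 bytes, wrapped to 32 bits (assumed scheme)."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h


def bucket_of(key: str, n_buckets: int) -> int:
    """Assumed bucket rule: clear the sign bit, then take the modulo."""
    return (hash31(key) & 0x7FFFFFFF) % n_buckets


def bucket_histogram(keys, n_buckets):
    """Count how many keys land in each bucket."""
    return Counter(bucket_of(k, n_buckets) for k in keys)


# 100 distinct short keys that all end in the same byte, "0".
keys = [f"{i:03d}0" for i in range(100)]

hist_31 = bucket_histogram(keys, 31)
hist_32 = bucket_histogram(keys, 32)

print("31 buckets:", dict(hist_31))
print("32 buckets:", len(hist_32), "buckets used")
```

With 31 buckets all 100 keys collapse into a single bucket, which is the kind of skew the histogram query is meant to expose before the ETL runs.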

Denormalization
Denormalization can turn a compute problem into an IO/lookup problem.
But if you then optimize that with compression, you get a compute problem again.
If you think JOINs are bad, you probably haven’t moved out of MapReduce.
Broadcast joins are good & dynamically partitioned broadcast joins can scale that ~1000x
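
For readers unfamiliar with the term, a broadcast join ships the small table to every task, builds an in-memory hash table from it, and streams the large table past that table without shuffling it. The Python sketch below illustrates only that idea on in-memory lists; it is not Hive's implementation, the function name and sample rows are hypothetical, and it assumes the join key is unique in the small table.

```python
def broadcast_hash_join(fact_rows, dim_rows, key):
    """Inner join: hash the small (dimension) side, stream the large (fact) side."""
    # Build phase: the small table fits in memory, so index it by the join key.
    lookup = {row[key]: row for row in dim_rows}
    # Probe phase: one dictionary lookup per fact row, no sorting or shuffling.
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is not None:
            yield {**dim, **fact}


stores = [
    {"store_id": 1, "city": "Oslo"},
    {"store_id": 2, "city": "Lima"},
]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 5},
    {"store_id": 3, "amount": 7},  # no matching store, dropped by the inner join
]

joined = list(broadcast_hash_join(sales, stores, "store_id"))
for row in joined:
    print(row)
```

The cost is one pass over each input plus memory for the small side, which is why the approach stops working once the "small" table no longer fits in memory.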

Indexes?
Indexes in Hive barely help in a columnar world; incremental rebuild isn’t really there
ORC maintains internal bloom filter indexes (PARQUET-41 too)
You can store your indexes as ORC files, if you want, so that you can have an index in your index, to speed up indexes
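
A bloom filter answers "might this block contain the value?" with no false negatives, so a reader can skip any block whose filter says no. The Python sketch below shows the general technique only; it is not ORC's on-disk format or hash function, and the class, the block names, and the sizing (1024 bits, 3 hashes) are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter backed by a Python int used as a bit array."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical blocks of rows, each with its own filter on one column.
blocks = {"block-0": ["alice", "bob"], "block-1": ["carol", "dave"]}
filters = {}
for name, values in blocks.items():
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    filters[name] = bloom

blocks_to_read = [name for name, bloom in filters.items()
                  if bloom.might_contain("carol")]
print(blocks_to_read)
```

Only the blocks in `blocks_to_read` need to be decompressed and scanned for the lookup; an occasional false positive costs an unnecessary read but never a missed row.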

Schema & Predicate Push Down
Never store a number as a string, because guess what: “11” < “9” and “11.0” != “11”. Transform, then load
Predicate push-down cannot fight the type system (♫ … breaking blocks in the hot sun ♫)
Applying UDFs to the data column is always a bad idea for fast filtering.
If you need case-insensitive lookups, always store as UPPER/lower.
If you need LIKE “%.twimg.com”, store like DNS does “.com.twimg…”
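
Both pitfalls on this slide are easy to reproduce in any language that compares strings lexicographically. The Python sketch below shows the string-versus-number ordering problem, and then how storing host names with their labels reversed turns a suffix match into a prefix match that sorts together. The helper `reversed_host` and the sample host names are hypothetical, and this variant omits the leading dot shown on the slide.

```python
def reversed_host(host: str) -> str:
    """Reverse the dot-separated labels: 'pbs.twimg.com' -> 'com.twimg.pbs'."""
    return ".".join(reversed(host.split(".")))


# Numbers stored as strings compare character by character.
as_strings = sorted(["9", "11", "100"])
as_numbers = sorted([9, 11, 100])
print(as_strings)               # lexicographic order
print(as_numbers)               # numeric order
print("11.0" == "11")           # different strings
print(float("11.0") == float("11"))  # same number

# Suffix lookup (LIKE '%.twimg.com') becomes a prefix lookup on the reversed form.
hosts = ["pbs.twimg.com", "twimg.com", "nottwimg.com", "example.org"]
stored = sorted(reversed_host(h) for h in hosts)
matches = [s for s in stored if s.startswith("com.twimg.")]
print(stored)
print(matches)
```

Because the reversed values sort next to each other, a range or prefix filter can be checked against min/max statistics, which a leading-wildcard LIKE cannot use.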

Temporary tables
In-memory temp-tables
set hive.exec.temporary.table.storage=memory;
Easiest way to reorganize data temporarily or to produce a “distinct slice”
“create temporary table if not exists stored as orc as select …”
Can be used for pagination queries to good effect, for display tools

Complex types & Nesting
There’s pretty much no advantage to using structs: they’re nearly columns, without any of the good stuff
Maps: not so bad, but handle with care
Maps are way better than 4000 columns, most of them null
Arrays: ignore mostly (JDBC!!)

Schema Evolution
Add columns, never remove them
Schemas are per-partition
Remember, partitions don’t change their schema after they’re created
All new inserts use the new schema
After a schema update, inserting data into old partitions is a recipe for disaster
Type changes for a column also complicate things (except for simple widening like Int -> BigInt or Float -> Double)

Questions?
