Hive & HBase For Transaction Processing
1. Hive & HBase For
Transaction Processing
Page 1
Alan Gates
@alanfgates
2. Agenda
• Our goal
– Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store
that can be used for analytics and transaction processing
• But before we get to that we need to consider
– Some things happening in Hive
– Some things happening in Phoenix
4. A Brief History of Hive
• Initial goal was to make it easy to execute MapReduce using a familiar
language: SQL
– Most queries took minutes or hours
– Primarily used for batch ETL jobs
• Since 0.11 much has been done to support interactive and ad hoc queries
– Many new features focused on improving performance: ORC and Parquet, Tez and
Spark, vectorization
– As of Hive 0.14 (November 2014) TPC-DS query 3 (star-join, group, order, limit) using
ORC, Tez, and vectorization finishes in 9s for 200GB scale and 32s for 30TB scale.
– Still have ~2-5 second minimum for all queries
• Ongoing performance work with goal of reaching sub-second response time
– Continued investment in vectorization
– LLAP
– Using Apache HBase for metastore
LLAP = Live Long And Process
5. LLAP: Why?
• It is hard to be both fast and flexible with Tez
– When a SQL session starts, a Tez AM is spun up (first-query cost)
– For subsequent queries Tez containers can be
– pre-allocated – fast but not flexible
– allocated and released for each query – flexible but start-up cost for every query
• No caching of data between queries
– Even if data is in the OS cache, much of the IO cost is deserialization/vector
marshaling, which is not shared
6. LLAP: What?
• LLAP is a node-resident daemon process
– Low latency by reducing setup cost
– Multi-threaded engine that runs smaller tasks for a query,
including reads, filters, and some joins
– Regular Tez tasks still used for larger shuffles and other operators
• LLAP has an in-memory columnar data cache
– High-throughput IO using an async IO elevator with a dedicated
thread and core per disk
– Low latency by providing data from in-memory (off heap)
cache instead of going to HDFS
– Store data in columnar format for vectorization irrespective
of underlying file type
– Security enforced across queries and users
• Uses YARN for resource management
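The caching idea above can be sketched in a few lines. This is a minimal illustration, not Hive's actual implementation: rows read from storage are transposed once into column vectors and cached per (file, column), so a repeat read skips the simulated HDFS access and deserialization entirely. The names (`ColumnarCache`, `fake_hdfs_read`) are invented for the sketch.

```python
# Minimal sketch of an in-memory columnar cache in the spirit of LLAP's:
# rows are transposed into column vectors on first read and cached, so
# later queries are served from memory regardless of the on-disk format.

class ColumnarCache:
    def __init__(self):
        self._cache = {}          # (file_id, column) -> list of values
        self.misses = 0

    def read_column(self, file_id, column, load_rows):
        """Return a column vector, loading and transposing rows on a miss."""
        key = (file_id, column)
        if key not in self._cache:
            self.misses += 1
            rows = load_rows(file_id)             # simulated HDFS read
            for col in rows[0]:                   # cache every column once loaded
                self._cache[(file_id, col)] = [r[col] for r in rows]
        return self._cache[key]

def fake_hdfs_read(file_id):
    # Stand-in for reading row-oriented data (e.g. text files) off HDFS.
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 32}]

cache = ColumnarCache()
first = cache.read_column("f1", "amount", fake_hdfs_read)   # cold: goes to "HDFS"
second = cache.read_column("f1", "amount", fake_hdfs_read)  # warm: from cache
print(first, cache.misses)   # [10, 32] 1
```

The second read costs one dictionary lookup; that difference is the "no deserialization/vector marshaling on repeat access" point from the previous slide.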
[Diagram: a node running the LLAP process, whose in-memory columnar cache serves a query fragment; the LLAP process runs a task for the query and reads from HDFS]
7. LLAP: What?
[Diagram: a Hive query spanning multiple nodes, each running an LLAP process with an in-memory columnar cache over HDFS, running read tasks for the query. Caption: LLAP process runs on multiple nodes, accelerating Tez tasks]
8. LLAP: Is and Is Not
• It is not MPP
– Data not shuffled between LLAP nodes (except in limited cases)
• It is not a replacement for Tez or Spark
– Configured engine still used to launch tasks for post-shuffle operations (e.g. hash
joins, distributed aggregations, etc.)
• It is not required; users can still use Hive without installing LLAP
daemons
• It is a Map server, or a set of standing map tasks
• It is currently under development on the llap branch
• Hope to merge to master and do an alpha release next month
11. HBase Metastore: Why?
> 700 metastore queries to plan
TPC-DS query 27!!!
12. HBase Metastore: Why?
• Object-relational mapping is an impedance mismatch
• The need to work across different DBs limits tuning opportunities
• No caching of catalog objects or stats in HiveServer2 or Hive metastore
• Hadoop nodes cannot contact RDBMS directly due to scale issues
– Forces all planning to be done up front
– Limits caching opportunities
• Solution: use HBase
– Can store objects directly, no need to normalize
– Already scales, performs, etc.
– Can store additional data not stored today due to RDBMS capacity limitations
– Can access the metadata from the cluster (e.g. LLAP, Tez AM)
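The "store objects directly" bullet is the crux, and it can be sketched concretely. This is a hypothetical key/value layout for illustration, not Hive's actual HBase schema: instead of splitting a table object across normalized RDBMS tables (TBLS, COLUMNS_V2, SDS, ...), the whole object is serialized under one row key, so a fetch is a single point lookup with no joins.

```python
# Sketch of the denormalization argument: a table object stored whole
# under one key in a key/value store (a dict standing in for HBase),
# versus reassembling it from several normalized relational tables.
import json

store = {}  # stand-in for an HBase table: row key -> serialized object

def put_table(db, table, obj):
    store[f"{db}.{table}"] = json.dumps(obj)

def get_table(db, table):
    # One point lookup returns the complete object; an ORM over an
    # RDBMS would need joins across TBLS, COLUMNS_V2, SDS, etc.
    return json.loads(store[f"{db}.{table}"])

put_table("default", "sales", {
    "columns": [{"name": "id", "type": "bigint"},
                {"name": "amount", "type": "double"}],
    "location": "/warehouse/sales",
})
print(get_table("default", "sales")["location"])   # /warehouse/sales
```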
13. But...
• HBase does not have transactions – the
metastore needs them
– Tephra, Omid 2 (Yahoo), and others are working on this
• HBase is hard to administer and install
– Yes, we will need to improve this
– We will also need embedded option for test/POC
setups to keep HBase from becoming barrier to
adoption
• Basically any work we need to do to HBase
for this is good since it benefits all HBase
users
14. HBase Metastore: How
• HBaseStore, a new implementation of RawStore that stores data in
HBase
• Not default, users still free to use RDBMS
• Less than 10 tables in HBase
– DBS, TBLS, PARTITIONS, ... – basically one for each object type
– Common partition data factored out to significantly reduce size
• Layout highly optimized for SELECT and DML queries, longer
operations moved into DDL (e.g. grant)
• Extensive caching
– Of catalog objects for the length of a query
– Of aggregated stats across queries and users
• Ongoing work in the hbase-metastore branch
• Will alpha release together with LLAP soon
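The "common partition data factored out" bullet can be illustrated with a small sketch. This is an assumed, simplified layout, not the branch's real schema: most partitions of a table share an identical storage descriptor, so the descriptor is stored once under its content hash and each partition row just references that hash.

```python
# Sketch of factoring out shared partition data: N partitions that share
# a storage descriptor store it once, keyed by hash, instead of N times.
import hashlib
import json

descriptors = {}   # descriptor hash -> shared storage descriptor
partitions = {}    # partition row key -> {"values": ..., "sd_hash": ...}

def add_partition(table, part_value, sd):
    blob = json.dumps(sd, sort_keys=True)          # canonical serialization
    h = hashlib.sha1(blob.encode()).hexdigest()
    descriptors[h] = sd                            # written once, shared by many
    partitions[f"{table}/{part_value}"] = {"values": part_value, "sd_hash": h}

sd = {"format": "ORC", "serde": "OrcSerde", "location_prefix": "/warehouse/sales"}
for day in ("2015-04-01", "2015-04-02", "2015-04-03"):
    add_partition("sales", day, sd)

print(len(partitions), len(descriptors))   # 3 1
```

Three partitions, one stored descriptor; at the scale of tables with tens of thousands of partitions, that factoring is where the size reduction comes from.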
15. Agenda
• Our goal
– Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store
that can be used for analytics and transaction processing
• But before we get to that we need to consider
– Some things happening in Hive
– Some things happening in Phoenix
16. Apache Phoenix: Putting SQL Back in NoSQL
• SQL layer on top of HBase
• Originally oriented toward transaction processing
• Moving to add more analytics type operators
– Adding multiple join implementations
– Requests for OLAP functions (PHOENIX-154)
• Working on adding transactions (PHOENIX-1674)
• Moving to Apache Calcite for optimization (PHOENIX-1488)
17. Agenda
• Our goal
– Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store
that can be used for analytics and transaction processing
• But before we get to that we need to consider
– Some things happening in Hive
– Some things happening in Phoenix
18. What If?
• We could share one O/JDBC driver?
• We could share one SQL dialect?
• Phoenix could leverage extensive analytics
functionality in Hive without re-inventing it
• Users could access their transactional and
analytics data in single SQL operations?
19. How?
• Insight #1: LLAP is a storage plus operations
server for Hive; we can swap it out for other
implementations
• Insight #2: Tez and Spark can do post-shuffle
operations (hash join, etc.) with LLAP or HBase
• Insight #3: Calcite (used by both Hive and
Phoenix) is built specifically to integrate
disparate data storage systems
20. Vision
• User picks the storage location for a table at create
time (LLAP or HBase)
• Transactions more efficient in HBase tables but
work in both
• Analytics more efficient in LLAP tables but work
in both
• Queries that require shuffle use Tez or Spark for
post shuffle operators
[Diagram: queries arrive at a JDBC server; Calcite used for planning, Phoenix used for execution; tables live in HBase and LLAP across nodes, backed by HDFS]
21. Hurdles
• Need to integrate types/data representation
• Need to integrate transaction management
• Work to do in Calcite to optimize transactional queries well