ORC 2015
t3rmin4t0r
1.
ORC: 2015
Gopal Vijayaraghavan
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
2.
ORC – Optimized Row-Columnar File
● Columnar Storage
● Row-groups & Fixed splits
● Protobuf Metadata Storage
● Type-safe Vectorization
● Hive ACID transactions
● Single SerDe for Format
3.
Need for Speed: The Stinger Initiative
Stinger: An open roadmap to improve Apache Hive's performance 100x.
● Launched: February 2013; Delivered: April 2014
● Delivered in 100% Apache Open Source
Diagram: Vectorized SQL Engine + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100X
4.
ORC at Facebook [1]
● Saved more than 1,400 servers' worth of storage.
● Compression ratio increased from 5x to 8x globally.
5.
ORC at Spotify [2]
● IO: 16x less HDFS read when using ORC versus Avro.(5)
● CPU: 32x less CPU when using ORC versus Avro.(5)
6.
ORC: Today
What is Optimized about ORC?
7.
ORC – Optimized Row-Columnar File
● Columnar Storage
● Row-groups & Stripe splits
● Protobuf Metadata Storage
● Type-safe Vectorization
● Hive ACID transactions
● Single SerDe for Format
8.
Columnar Storage
Storage Performance
● Compress each column differently
● Detect & compress common sub-sequences
● Auto-increment ids
● String Enums
● Large Integers (uid scale)
● Unique strings (UUIDs)
Read Performance
● Column projection
● Columnar deserializers
● Data locality
Write Throughput
● Stats auto-gather
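The two ideas on this slide, per-column compression and column projection, can be illustrated with a small sketch. It is not ORC's on-disk format: JSON plus zlib stand in for ORC's real encodings, and the function names `write_columnar` and `read_columns` are hypothetical.

```python
import json
import zlib


def write_columnar(rows, columns):
    """Store each column as its own compressed stream (toy stand-in for ORC streams)."""
    return {
        name: zlib.compress(json.dumps([row[name] for row in rows]).encode("utf-8"))
        for name in columns
    }


def read_columns(stored, wanted):
    """Column projection: decompress only the streams the query asks for."""
    return {
        name: json.loads(zlib.decompress(stored[name]).decode("utf-8"))
        for name in wanted
    }


# An auto-increment id column and an enum-like string column.
rows = [{"id": i, "status": "OK" if i % 10 else "FAILED"} for i in range(1000)]
stored = write_columnar(rows, ["id", "status"])

# Only the "status" stream is touched; the "id" stream stays compressed.
projected = read_columns(stored, ["status"])
```

Because every column lives in its own stream, a reader that needs one column never pays to decompress the others, and each stream can be compressed in whatever way suits its values.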
9.
Row-groups & Stripe splits
Split Parallelism
● Effective parallelism
● No seeks to find boundaries
● No splits with zero data
● Decompress fixed chunks
Stripes
● Single unsplittable chunk
● Will reside in 1 HDFS block entirely
● Is self-contained for all read ops
10.
A Single SerDe for all ORC Files
A Single Writer
● No mismatch of serialization
● Forward compatibility
Readers
● Multiple reader implementations
● Allows for vector readers
● And row-mode readers
● Similar loop – good JIT hit-rate
11.
Protobuf Metadata Storage
Standardized Metadata
● Readers are easier to write
● Metadata readers are auto-generated
Metadata Forward Compatibility
● Protobuf Optional fields
Statistics Storage in Metadata
● Standard serialization for stats
● Allows for PPD (predicate push-down) into the IO layer
12.
Type-safe Vectorization
Schema on Write
● Write ORC Structs with types
● SerDe & InputFormat
Read Performance
● Data is read with few copies
● Primitive types are fast
● Primitives are also unboxed
● Predicates are typed too
13.
ORC: ETL Improvements
Always more new data
14.
ORC (Zlib): Compress Differently
Different Zlib algorithms for encoding
● Z_FILTERED
● Z_DEFAULT
● Z_BEST_SPEED
● Z_DEFAULT_COMPRESSION
In detail
● Compress IS_NULL bitsets lightly
● Compress Integers differently from Doubles
● Compress string dictionaries differently
● Allow for user choice
Chart: ETL for TPC-H LineItem (scale 1 TB), Time Taken
● ORC (old zlib): 674
● ORC SNAPPY: 389
● ORC (new zlib): 433
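A minimal sketch of choosing a zlib level and strategy per stream kind, using Python's standard `zlib` module (which names the default strategy `Z_DEFAULT_STRATEGY`). The `STREAM_SETTINGS` mapping is an illustrative assumption, not the mapping the ORC writer actually uses.

```python
import zlib

# Hypothetical per-stream settings: (compression level, deflate strategy).
STREAM_SETTINGS = {
    "is_null_bitset": (zlib.Z_BEST_SPEED, zlib.Z_DEFAULT_STRATEGY),  # compress lightly
    "integers": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_FILTERED),
    "string_dictionary": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
}


def compress_stream(kind: str, data: bytes) -> bytes:
    """Deflate one stream with the level and strategy configured for its kind."""
    level, strategy = STREAM_SETTINGS[kind]
    compressor = zlib.compressobj(
        level, zlib.DEFLATED, zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, strategy
    )
    return compressor.compress(data) + compressor.flush()


payload = bytes(range(256)) * 64
compressed = {kind: compress_stream(kind, payload) for kind in STREAM_SETTINGS}
```

Every variant produces a standard zlib stream, so a reader can decompress it without knowing which strategy the writer picked; the choice only trades write speed against size.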
15.
ORC (Zlib): Compress Differently
Different Zlib algorithms for encoding
● Z_FILTERED
● Z_DEFAULT
● Z_BEST_SPEED
● Z_DEFAULT_COMPRESSION
In detail
● Compress IS_NULL bitsets lightly
● Compress Integers differently from Doubles
● Compress string dictionaries differently
● Allow for user choice
Chart: Data Sizes for TPC-H Lineitem (scale 1 TB), Size on Disk
● ORC (old zlib): 178.5
● ORC SNAPPY: 225.1
● ORC (new zlib): 172.2
16.
Using JDK8 SIMD: Integer Writers
Integer encodings
● Base + Delta
● Run-length
● Direct
Trade-off for Size/Speed
● Use fixed bit-width loops
● Snap to nearest bit-width
Chart: ORC Write Integer Performance (smaller is better). X-axis: Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64); Y-axis: Mean Time (ms). Series: hive 0.13 bitpacking, hive 1.0 bitpacking (new).
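A rough sketch of "Base + Delta" encoding and of snapping to a fixed bit width. The width table is taken from the chart's x-axis and the real encoder's table may differ; the function names are hypothetical and values are assumed non-negative.

```python
# Widths as they appear on the chart's x-axis (an assumption about the encoder).
FIXED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)


def snap_bit_width(value: int) -> int:
    """Round the bits needed for a non-negative value up to the nearest fixed width."""
    needed = max(1, value.bit_length())
    return next(width for width in FIXED_WIDTHS if width >= needed)


def base_delta_encode(values):
    """Keep the first value as the base and store successive differences."""
    base = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return base, deltas


def base_delta_decode(base, deltas):
    """Rebuild the original sequence by accumulating the deltas onto the base."""
    out = [base]
    for delta in deltas:
        out.append(out[-1] + delta)
    return out


base, deltas = base_delta_encode([100, 101, 103, 106])
width = snap_bit_width(max(deltas))  # bits used to pack each delta
```

Snapping wastes a few bits per value, but it lets the packing loop be written once per fixed width, which is the size-for-speed trade-off the slide describes.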
17.
Double Writers
● JVM is big-endian
● x86 is little-endian
● Special handling of NaN
Chart: ORC Write Double Performance (smaller is better), Mean Time (ms) by Double Write Mode
● old: 273.331
● buffered + BE: 247.634
● buffered + LE: 231.741
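The byte-order point can be made concrete with Python's `struct` module. This only shows how the same doubles look in big-endian and little-endian form; it does not reproduce the ORC writer's buffering, and the function names are hypothetical.

```python
import struct


def write_doubles_be(values):
    """Serialize doubles big-endian, the byte order Java streams use by default."""
    return struct.pack(">%dd" % len(values), *values)


def write_doubles_le(values):
    """Serialize doubles little-endian, the native byte order on x86."""
    return struct.pack("<%dd" % len(values), *values)


def read_doubles_le(data):
    """Decode a little-endian buffer of 8-byte doubles."""
    return list(struct.unpack("<%dd" % (len(data) // 8), data))


big = write_doubles_be([1.0])
little = write_doubles_le([1.0])
```

The two buffers hold the same eight bytes in opposite order, so writing in the CPU's native order avoids a byte swap per value.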
18.
ORC: Scale compression buffers
Large Columns vs More Columns
● Adjust when >1000 columns
Trade offs
● Compression
● Low memory use
More additions
● Dynamically partitioned insert
Chart: File Size (MB) vs Compression Buffer Size (KB), series ZLIB and SNAPPY
● Buffer size (KB): 8, 16, 32, 64, 128, 256
● Upper series (MB): 269.4, 263.3, 258.5, 258.4, 258.4, 258.4
● Lower series (MB): 184.8, 183.5, 182.2, 180.1, 178.3, 177.4
19.
ORC: Streaming Ingest + ACID
● Broken pattern: Partitions for Atomicity
● Isolation & Consistency on retries
● Transactions are pluggable (txn.manager)
● Cache/Replication friendly (base + deltas)
20.
ORC: LLAP and Sub-second
ORC – Pushing for Sub-second
21.
ORC: Row Indexes
Min-Max pruning
● Evaluate on statistics
Bloom filters
● Better String filters
● Filter a random distribution
LLAP Future
● Row-level vector SARGs
Chart: Rows Read (log scale) for the query "from tpch_1000.lineitem where l_orderkey = 1212000001;"
● No Indexes: 5,999,989,709
● Min-Max Indexes: 540,000
● Bloomfilter Indexes: 10,000
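To show how the two index types combine, here is a toy row-group index: a min/max check first, then a small Bloom filter. The class name, bit-array size, hash count, and the SHA-256-based hashing are all assumptions chosen for the sketch, not ORC's actual index layout.

```python
import hashlib


class RowGroupIndex:
    """Toy per-row-group index: min/max statistics plus a small Bloom filter."""

    def __init__(self, values, bits=1024, hashes=3):
        self.minimum = min(values)
        self.maximum = max(values)
        self.bits = bits
        self.hashes = hashes
        self.bloom = 0  # bit array kept in a single Python int
        for value in values:
            for pos in self._positions(value):
                self.bloom |= 1 << pos

    def _positions(self, value):
        # Derive `hashes` bit positions from seeded SHA-256 digests.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def might_contain(self, value):
        # Min-max pruning: a value outside the range cannot be present.
        if value < self.minimum or value > self.maximum:
            return False
        # Bloom filter: false positives are possible, false negatives are not.
        return all((self.bloom >> pos) & 1 for pos in self._positions(value))


def row_groups_to_read(indexes, key):
    """Return the positions of row groups that cannot be ruled out for `key`."""
    return [i for i, index in enumerate(indexes) if index.might_contain(key)]


indexes = [RowGroupIndex(range(0, 100, 2)), RowGroupIndex(range(100, 200))]
```

Min-max statistics work well when the column is sorted or clustered; the Bloom filter helps when values are randomly distributed and every row group's range covers the key.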
22.
ORC: Row Indexes
Min-Max pruning
● Evaluate on statistics
Bloom filters
● Better String filters
● Filter a random distribution
LLAP Future
● Row-level vector SARGs
Chart: Time Taken in seconds (smaller is better) for the query "* from tpch_1000.lineitem where l_orderkey=1212000001;"
● No Indexes: 74
● Min-Max Indexes: 4.5
● Bloomfilter Indexes: 1.34
23.
ORC: JDK8 SIMD Readers
Integer encodings
● Base + Delta
● Run-length
● Direct
Trade-off for Size/Speed
● Use fixed bit-width loops
● Snap to nearest bit-width
Chart: ORC Read Integer Performance. X-axis: Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64); Y-axis: Mean Time (ms). Series: hive 0.13 unpacking, hive-1.0 unpacking (new).
24.
ORC: Vectorization + SIMD
Advantage of a Single SerDe
● Primitive types, allocation-free tight inner loops
● JDK8 has auto-vectorization
Vectorized Early Filter
● Vectors can be filtered early in ORC
● StringDictionary can be used to binary-search
Vectorized SIMD Join
● Performance for single key joins
JIT output for vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94):
0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3 ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
0x00007f13d2e6afc4: vmovdqu %ymm2,0x10(%rdx,%rax,8)
0x00007f13d2e6afca: vaddpd %ymm1,%ymm3,%ymm2
0x00007f13d2e6afce: vmovdqu %ymm2,0x30(%rdx,%r10,8) ;*dastore vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
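The "StringDictionary can be used to binary-search" bullet can be sketched as follows, assuming a sorted dictionary and an integer id per row. The function names are hypothetical and the sketch does not use Hive's vectorization classes.

```python
from bisect import bisect_left


def dictionary_encode(strings):
    """Replace each string by its position in a sorted dictionary of distinct values."""
    dictionary = sorted(set(strings))
    ids = [bisect_left(dictionary, s) for s in strings]
    return dictionary, ids


def filter_equals(dictionary, ids, literal):
    """Early filter: binary-search the literal once, then compare integer ids only."""
    pos = bisect_left(dictionary, literal)
    if pos == len(dictionary) or dictionary[pos] != literal:
        return []  # literal is absent, so the whole batch is filtered out
    return [row for row, code in enumerate(ids) if code == pos]


dictionary, ids = dictionary_encode(["RAIL", "AIR", "SHIP", "AIR", "TRUCK"])
selected = filter_equals(dictionary, ids, "AIR")
```

The string comparison happens a logarithmic number of times against the dictionary; the per-row work is an integer comparison, which is the kind of tight loop a JIT can vectorize.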
25.
ORC: Split Strategies + Tez Grouping
Amdahl's Law
● As fast as the slowest task
● Slice work thinly, but not too thin
Split-generation vs Execution time
● ETL
● BI
● Hybrid
Split-grouping & estimation
● ColumnarSplit size
● Group by estimate, not file size
● Bucket pruning
(Figure label: Slow split)
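"Group by estimate, not file size" can be sketched as: estimate what a task will actually read (only the projected columns), then pack splits into groups up to a target. The per-column byte counts, the target, and both function names are hypothetical; Tez's real grouping logic is more involved.

```python
def estimate_read_bytes(column_bytes, projected):
    """Estimate a split's cost as the bytes of the projected columns only."""
    return sum(column_bytes[name] for name in projected)


def group_splits(estimated_bytes, target_bytes):
    """Greedily pack consecutive splits into groups of roughly target_bytes each."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(estimated_bytes):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Three splits whose files are large mostly because of a wide "comment" column.
splits = [{"id": 40, "comment": 400}] * 3
estimates = [estimate_read_bytes(split, ["id"]) for split in splits]
groups = group_splits(estimates, target_bytes=100)
```

Grouping by raw file size would give each of these splits its own task, even though a query that reads only `id` touches a small fraction of each file.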
26.
ORC: LLAP
● JIT Performance for short queries
● Row-group level caching
● Asynchronous IO Elevator
● Multi-threaded Column Vector processing
27.
ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
28.
Questions?
Interested? Stop by the Hortonworks booth to learn more.
29.
Endnotes
[1] https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
[2] http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014