The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics
DataWorks Summit/Hadoop Summit
1.
© 2016 Dremio Corporation @DremioHQ
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics
Julien Le Dem
Principal Architect, Dremio
VP Apache Parquet, Apache Arrow PMC
2.
Julien Le Dem @J_
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Incubator, Pig, Parquet
3.
Agenda
• Benefits of columnar representation
  – Immutable on disk (Apache Parquet)
  – Mutable on disk (Apache Kudu)
  – In memory (Apache Arrow)
• Community-driven standard
• Interoperability and ecosystem
4.
Benefits of Columnar formats
@EmrgencyKittens
5.
Columnar layout
[Figure: logical table representation, row layout, column layout]
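The difference between the two layouts in the figure can be sketched in a few lines. This is only an illustration: the three-column table and the function names `row_layout` and `column_layout` are assumptions, not something taken from the slides.

```python
# Hypothetical three-column table (columns a, b, c) used only for illustration.
rows = [
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2"),
    ("a3", "b3", "c3"),
]


def row_layout(table):
    """Serialize row by row: a1 b1 c1 a2 b2 c2 ..."""
    return [value for row in table for value in row]


def column_layout(table):
    """Serialize column by column: a1 a2 a3 b1 b2 b3 ..."""
    return [value for column in zip(*table) for value in column]


print(row_layout(rows))
print(column_layout(rows))
```

In the column layout, values of the same column (and therefore the same type) sit next to each other, which is what makes type-aware encoding and column pruning possible.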
6.
Mutable or Immutable Storage
• Different trade-offs
  – Immutable (Parquet):
    • Higher write throughput (no random modification after completion).
    • Easy to share, replicate, and access concurrently.
    • Modifications require a rewrite of the dataset.
    • No operational overhead (no extra service, just your file system).
  – Mutable (Kudu):
    • More flexible trade-off between update speed and read speed.
    • Low latency for short accesses (primary key indexes and quorum replication).
    • Database-like semantics (initially single-row ACID).
    • Needs to be managed (new daemon).
7.
On Disk and in Memory
• Different trade-offs
  – On disk: storage.
    • Accessed by multiple queries.
    • Priority to I/O reduction (but still needs good CPU throughput).
    • Mostly streaming access.
  – In memory: transient.
    • Specific to one query execution.
    • Priority to CPU throughput (but still needs good I/O).
    • Streaming and random access.
8.
Parquet on disk columnar format
9.
Parquet on disk columnar format
• Nested data structures
• Compact format:
  – Type-aware encodings
  – Better compression
• Optimized I/O:
  – Projection push down (column pruning)
  – Predicate push down (filters based on stats)
10.
Access only the data you need
[Figure: a table with columns a, b, c and rows 1 to 5, shown three times: Columnar + Statistics = Read only the data you need!]
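A rough sketch of how column pruning and statistics-based filtering combine. This is not the Parquet reader API: the `row_groups` structure, the `scan` function and its parameters are hypothetical, and a real file stores its statistics in metadata rather than computing them at read time.

```python
# Hypothetical row groups: each stores its columns separately.
row_groups = [
    {"columns": {"a": [1, 2, 3], "b": [10, 20, 30], "c": ["x", "y", "z"]}},
    {"columns": {"a": [4, 5, 6], "b": [40, 50, 60], "c": ["u", "v", "w"]}},
]

# Per-column min/max statistics, computed here only to keep the sketch short.
for group in row_groups:
    group["stats"] = {
        name: (min(values), max(values))
        for name, values in group["columns"].items()
    }


def scan(groups, projection, column, lower_bound):
    """Return the projected columns of rows where `column` >= lower_bound."""
    result = []
    groups_read = 0
    for group in groups:
        _, column_max = group["stats"][column]
        if column_max < lower_bound:
            continue  # predicate push down: the whole group is skipped
        groups_read += 1
        filter_values = group["columns"][column]
        # projection push down: only the requested columns are touched
        projected = [group["columns"][name] for name in projection]
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                result.append(tuple(col[i] for col in projected))
    return result, groups_read


rows_out, groups_read = scan(row_groups, projection=["b"], column="a", lower_bound=5)
print(rows_out, groups_read)
```

Here column `c` is never read, and the first row group is skipped entirely because its maximum value of `a` is below the bound.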
11.
Parquet nested representation
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url
Columns:
  docid
  links.backward
  links.forward
  name.language.code
  name.language.country
  name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
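The mapping from the nested schema to the list of columns can be sketched as a walk over the schema tree that emits one dotted path per leaf field. The dict encoding of the schema and the `leaf_columns` helper are assumptions made for this illustration; the repetition and definition levels that the Dremel encoding also stores per column are left out.

```python
# The Document schema from the slide, written as a nested dict.
# None marks a leaf (primitive) field.
schema = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {
        "Language": {"Code": None, "Country": None},
        "Url": None,
    },
}


def leaf_columns(node, prefix=()):
    """Yield one dotted, lower-cased path per leaf field."""
    for field, child in node.items():
        path = prefix + (field.lower(),)
        if child is None:
            yield ".".join(path)
        else:
            yield from leaf_columns(child, path)


columns = list(leaf_columns(schema))
print(columns)
```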
12.
Kudu data representation
13.
Kudu Tablets
• Typed columns
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk: columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  – Allows “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
  – No per-row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on the number of recent updates
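The MVCC idea (updates tagged with a timestamp rather than applied in place) can be illustrated with a toy versioned cell. This is a sketch of the general technique, not Kudu's implementation; the `VersionedCell` class and its method names are hypothetical.

```python
import bisect


class VersionedCell:
    """One cell whose updates are kept with a timestamp, never overwritten."""

    def __init__(self):
        self._timestamps = []  # kept sorted ascending
        self._values = []

    def write(self, timestamp, value):
        index = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(index, timestamp)
        self._values.insert(index, value)

    def read_as_of(self, timestamp):
        """Return the latest value written at or before `timestamp`."""
        index = bisect.bisect_right(self._timestamps, timestamp)
        if index == 0:
            return None  # nothing was visible yet at that time
        return self._values[index - 1]


cell = VersionedCell()
cell.write(10, "v1")
cell.write(20, "v2")
cell.write(30, "v3")
print(cell.read_as_of(25))
```

A read "as of" an older timestamp simply ignores later versions, which also hints at why rows with many recent updates cost more to read: more versions have to be considered.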
14.
Kudu
• High throughput for big scans (columnar storage and replication)
  – Goal: within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
  – Goal: 1 ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  – SQL query
  – “NoSQL” style scan/insert/update (Java client)
15.
LSM vs Kudu
• LSM: Log Structured Merge (Cassandra, HBase, etc.)
  – Inserts and updates all go to an in-memory map (MemStore) and are later flushed to on-disk files (HFile/SSTable)
  – Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  – Shares some traits (memstores, compactions)
  – More complex
  – Slower writes in exchange for faster reads (especially scans)
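The LSM write and read paths described above can be sketched with a toy store. This is a simplification for illustration only: `TinyLSM`, its `flush_threshold` parameter and the in-memory lists standing in for on-disk files are all assumptions, and there is no compaction.

```python
class TinyLSM:
    """Toy LSM store: an in-memory map plus immutable flushed 'files'."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}
        self.files = []  # oldest first; each is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Inserts and updates both land in memory; no lookup of old data.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Reads check the newest data first, then every flushed file.
        if key in self.memstore:
            return self.memstore[key]
        for flushed in reversed(self.files):
            for k, v in flushed:
                if k == key:
                    return v
        return None

    def scan(self):
        # On-the-fly merge: newer sources override older ones.
        merged = {}
        for flushed in self.files:
            merged.update(flushed)
        merged.update(self.memstore)
        return sorted(merged.items())


store = TinyLSM()
store.put("a", 1)
store.put("b", 2)  # reaches the threshold and triggers a flush
store.put("a", 3)  # update goes to memory; the old value stays in the file
print(store.scan())
```

The update to `a` never touches the flushed file, which is why writes are cheap, while every scan has to merge all sources.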
16.
Kudu trade-offs: write
• Batch inserts are slower than Parquet
  – Extra bloom filter lookup per insert
• Random updates are slower than HBase
  – HBase model allows random updates without incurring a disk seek
  – Kudu requires a key lookup before update, bloom lookup before insert
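To show why a bloom lookup before each insert costs something but is cheaper than a full key lookup, here is a minimal sketch. It assumes the lookup is there to check whether the primary key already exists; the `BloomFilter` and `KeyedStore` classes, their sizes and hash choice are hypothetical and unrelated to Kudu's actual code.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class KeyedStore:
    """Hypothetical store that enforces unique primary keys."""

    def __init__(self):
        self.bloom = BloomFilter()
        self.rows = {}
        self.key_lookups = 0  # counts the expensive exact lookups

    def insert(self, key, row):
        if self.bloom.might_contain(key):
            self.key_lookups += 1  # only now pay for the real key lookup
            if key in self.rows:
                raise KeyError(f"duplicate primary key: {key}")
        self.bloom.add(key)
        self.rows[key] = row


store = KeyedStore()
store.insert("k1", {"v": 1})
store.insert("k2", {"v": 2})
```

An append-only file format has no such uniqueness check, which is one way to read the "slower than Parquet" bullet.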
17.
Kudu trade-offs: read
• Scan speed is close to Parquet and faster than HBase
  – Columnar on disk like Parquet
  – Only one DiskRowSet contains updates for a given row. Fewer file lookups than HBase (but more than Parquet).
• Single-row reads may be slower than HBase (and both are faster than Parquet)
  – Columnar design is optimized for scans
  – Future: may introduce “column groups” for applications where single-row access is more important
  – Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)
18.
Kudu is…
– NOT a SQL database
  • “Bring Your Own SQL”
– NOT a filesystem
  • Data must have tabular structure
– NOT an in-memory database
  • Very fast for memory-sized workloads, but can operate on larger data too
19.
Arrow in memory columnar format
20.
Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
• Interoperable
21.
Arrow in-memory columnar format
• Nested data structures
• Maximize CPU throughput
  – Pipelining
  – SIMD
  – Cache locality
• Scatter/gather I/O
22.
CPU pipeline
23.
Minimize CPU cache misses
A cache miss costs from 10 to hundreds of cycles, depending on the cache level.
24.
Focus on CPU Efficiency
[Figure: Traditional Memory Buffer vs. Arrow Memory Buffer]
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
  – With minimal structure overhead
• Operate directly on columnar compressed data
25.
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: ['555-111-1111', '555-222-2222']
}, {
  name: 'Jack',
  age: 37,
  phones: ['555-333-3333']
}]
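As an illustration of what "columnar" means for this nested example, the sketch below shreds the two `persons` records into one flat column per field, with the variable-length `phones` list stored as an offsets column plus a values column. The function names `to_columns` and `phones_of` and the column keys are hypothetical, and plain Python lists stand in for real memory buffers.

```python
persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]


def to_columns(rows):
    """Turn a list of row dicts into one flat column per field."""
    phone_offsets = [0]   # row i owns phone_values[offsets[i]:offsets[i + 1]]
    phone_values = []
    for row in rows:
        phone_values.extend(row["phones"])
        phone_offsets.append(len(phone_values))
    return {
        "name": [row["name"] for row in rows],
        "age": [row["age"] for row in rows],
        "phones.offsets": phone_offsets,
        "phones.values": phone_values,
    }


def phones_of(columns, i):
    """Recover the phone list of row i from the offsets and values columns."""
    start = columns["phones.offsets"][i]
    end = columns["phones.offsets"][i + 1]
    return columns["phones.values"][start:end]


columns = to_columns(persons)
print(columns["age"])               # all ages sit next to each other
print(columns["phones.offsets"])
print(phones_of(columns, 1))
```

A scan over `age` now touches a single contiguous column instead of hopping between records, which is the property the cache-locality and SIMD slides rely on.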
26.
Java: Memory Management
• Chunk-based managed allocator
  – Built on top of Netty’s jemalloc implementation
• Create a tree of allocators
  – Limit and transfer semantics across allocators
  – Leak detection and location accounting
• Wrap native memory from other applications
27.
Arrow RPC & IPC
28.
Common Message Pattern
• Schema Negotiation
  – Logical description of structure
  – Identification of dictionary-encoded nodes
• Dictionary Batch
  – Dictionary ID, values
• Record Batch
  – Batches of records up to 64K
  – Leaf nodes up to 2B values
[Diagram: Schema Negotiation, Dictionary Batch, Record Batch ×3; labels: 1..N Batches, 0..N Batches]
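The message sequence above (schema first, then dictionaries, then record batches that refer to dictionary entries by index) can be sketched as follows. This only mirrors the ordering on the slide; the tuple-based message shapes and the names `dictionary_encode` and `message_stream` are assumptions, not the Arrow wire format.

```python
def dictionary_encode(batches):
    """Replace repeated values by integer ids into a shared dictionary."""
    dictionary = []   # id -> value
    index_of = {}     # value -> id
    encoded = []
    for batch in batches:
        ids = []
        for value in batch:
            if value not in index_of:
                index_of[value] = len(dictionary)
                dictionary.append(value)
            ids.append(index_of[value])
        encoded.append(ids)
    return dictionary, encoded


def message_stream(schema, batches):
    """Yield messages in the order: schema, dictionary batch, record batches."""
    dictionary, encoded = dictionary_encode(batches)
    yield ("schema", schema)
    yield ("dictionary_batch", {"id": 0, "values": dictionary})
    for ids in encoded:
        yield ("record_batch", ids)


city_batches = [["NY", "SF", "NY"], ["SF", "LA"]]
messages = list(message_stream({"city": "dictionary<string>"}, city_batches))
for kind, payload in messages:
    print(kind, payload)
```

The receiver needs the dictionary before it can interpret any record batch, which is why the dictionary batch precedes the record batches in the stream.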
29.
Record Batch Construction
[Diagram: Schema Negotiation, Dictionary Batch, Record Batch ×3. Vectors shown inside a record batch: data header (describes offsets into data), name (bitmap), name (offset), name (data), age (bitmap), age (data), phones (bitmap), phones (list offset), phones (offset), phones (data)]
Example record: { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }
• Each box (vector) is contiguous memory
• The entire record batch is contiguous on the wire
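A minimal sketch of how the three buffers of a string vector such as `name` (bitmap, offset, data) might be laid out as contiguous bytes. The helper names are hypothetical, and the choices of little-endian 32-bit offsets and least-significant-bit-first validity bits are assumptions made for this sketch rather than something stated on the slide.

```python
import struct


def build_string_vector(values):
    """Return (validity bitmap, int32 offsets, UTF-8 data) for a list of str or None."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)   # mark slot i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))            # null slots get a zero-length range
    packed_offsets = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(bitmap), packed_offsets, bytes(data)


def read_string(bitmap, offsets, data, i):
    """Read slot i directly from the buffers without decoding the other slots."""
    if not bitmap[i // 8] & (1 << (i % 8)):
        return None
    start, end = struct.unpack_from("<2i", offsets, 4 * i)
    return data[start:end].decode("utf-8")


bitmap, offsets, data = build_string_vector(["Joe", None, "Jack"])
print(bitmap, offsets, data)
print(read_string(bitmap, offsets, data, 2))
```

Because each buffer is a flat byte string, a batch built this way can be written to a socket or a shared memory region as-is, which is the point of "contiguous on the wire".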
30.
Moving Data Between Systems
RPC
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored I/O
  – Scatter/gather reads/writes against socket
IPC
• Alpha implementation using memory-mapped files
  – Moving data between Python and Drill
• Working on shared allocation approach
  – Shared reference counting and well-defined ownership semantics
31.
Shared Need => Open Source Opportunity
“We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions”
– Impala Team
“A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work”
– Spark Team
“Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models and allows efficient processing of it without need to flatten or materialize.”
– Drill Team
32.
Community-Driven Standard
33.
An open source standard
• Parquet: common need for on-disk columnar
• Arrow: common need for in-memory columnar
• Arrow builds on the success of Parquet
• Benefits:
  – Share the effort
  – Create an ecosystem
• Standard from the start
34.
The Apache Arrow Project
• New top-level Apache Software Foundation project
  – Announced Feb 17, 2016
• Focused on columnar in-memory analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
  – A significant % of the world’s data will be processed through Arrow!
[Projects shown: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R]
35.
Interoperability and Ecosystem
36.
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. Parquet-to-Arrow reader)
37.
Language Bindings
Parquet
• Target languages
  – Java
  – CPP (underway)
  – Python & Pandas (underway)
• Engines integration:
  – Faster to list those who don’t support it
Arrow
• Target languages
  – Java (beta)
  – CPP (underway)
  – Python & Pandas (underway)
  – R
  – Julia
• Initial focus
  – Read a structure
  – Write a structure
  – Manage memory
Kudu
• Target languages
  – Java
  – CPP
• Engines integration:
  – MapReduce
  – Spark
  – Impala
  – Drill
38.
Example data exchanges:
39.
RPC: Query execution
The memory representation is sent over the wire. No serialization overhead.
[Diagram: Parquet files → Scanners (projection push down: read only a and b) → Partial Agg → Shuffle (Arrow batches) → Agg → Result]
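The pipeline in this diagram can be sketched as plain functions. The slide does not show the query, so the sketch assumes something like SELECT a, SUM(b) GROUP BY a; the function names, the dict-of-lists batches standing in for Arrow batches, and the sample data are all hypothetical.

```python
from collections import defaultdict


def scan(table, columns):
    """Projection push down: materialize only the requested columns."""
    return {name: table[name] for name in columns}


def partial_agg(batch):
    """Per-scanner partial SUM(b) grouped by a."""
    sums = defaultdict(int)
    for key, value in zip(batch["a"], batch["b"]):
        sums[key] += value
    return dict(sums)


def shuffle(partials, num_partitions):
    """Route each group key to exactly one downstream aggregator."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for partial in partials:
        for key, subtotal in partial.items():
            partitions[hash(key) % num_partitions][key].append(subtotal)
    return partitions


def final_agg(partition):
    """Combine the partial sums that arrived for each key."""
    return {key: sum(subtotals) for key, subtotals in partition.items()}


# Two hypothetical files; column c is never read thanks to the projection.
files = [
    {"a": ["x", "y", "x"], "b": [1, 2, 3], "c": ["unused"] * 3},
    {"a": ["y", "z"], "b": [10, 20], "c": ["unused"] * 2},
]
partials = [partial_agg(scan(f, ["a", "b"])) for f in files]
result = {}
for partition in shuffle(partials, num_partitions=2):
    result.update(final_agg(partition))
print(sorted(result.items()))
```

In the diagram the batches crossing the shuffle are Arrow batches, so the partial results would move between nodes in their in-memory form rather than being re-encoded.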
40.
RPC: future Arrow-based interchange
The memory representation is sent over the wire. No serialization overhead.
[Diagram: Tablets (Mem, Disk) → Scanners (projection/predicate push down) → Arrow batches → Operators (SQL execution)]
41.
IPC: Python with Spark or Drill
[Diagram: SQL engine (SQL Operator 1, SQL Operator 2) and Python process (user defined function); arrows labeled "reads"]
42.
What’s Next
• Parquet-Arrow conversion for Python & C++
• Arrow IPC implementation
• Kudu-Arrow integration
• Apache {Spark, Drill} to Arrow integration
  – Faster UDFs, storage interfaces
• Support for integration with Intel’s Persistent Memory library via Apache Mnemonic
43.
Get Involved
• Join the community
  – dev@{arrow,parquet,kudu.incubator}.apache.org
  – Slack:
    • https://apachearrowslackin.herokuapp.com/
    • https://getkudu-slack.herokuapp.com/
  – http://{arrow,parquet,kudu}.apache.org
  – Follow @Apache{Parquet,Arrow,Kudu}