Machine learning pipelines are a hot topic at the moment. Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of running machine learning models in production.
3. >12M monthly downloads & growing exponentially
● Arrow powers dozens of open source & commercial technologies
● 10+ programming languages supported: Java, C, C++, Python, R, JavaScript, C#, Ruby, Rust, Go, …
Arrow’s adoption provides numerous benefits:
● 300+ developers contributing
● Broad architecture (CPU/GPU/FPGA), OS and language support
● Awareness & OSS thought leadership
Arrow has become the industry standard for in-memory data.
4. What is Arrow?
What is it?
● A specification that outlines the in-memory binary layout of data
● A set of libraries and tools
● A set of standards to make analytical data transportable
● A representation for efficient analytical processing on CPUs and GPUs
What isn’t it?
● It’s not an installable system
● It’s not a memory grid or in-memory cache
● It’s not designed for streaming or other single-record operations (e.g. transactions)
5. Arrow In-Memory Columnar Format
● Shredded nested data structures
● Randomly accessible
● Maximizes CPU throughput
○ Pipelining
○ SIMD
○ Cache locality
● Scatter/gather I/O
[Diagram: row-oriented “traditional” memory layout vs. Arrow’s columnar memory layout]
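The row-vs-columnar contrast above can be sketched in plain Python. This is an illustration only: real Arrow buffers are raw, contiguous, padded memory regions, not Python objects, but the access pattern is the same idea.

```python
# Illustrative sketch: row-oriented vs. columnar (Arrow-style) layout.
from array import array

# Row-oriented: each record's fields are stored together.
rows = [(1, 10.0, "a"), (2, 20.0, "b"), (3, 30.0, "c")]

# Columnar: each field lives in its own contiguous typed buffer.
ids = array("q", [1, 2, 3])             # int64 buffer
vals = array("d", [10.0, 20.0, 30.0])   # float64 buffer
tags = ["a", "b", "c"]

# Scanning one column touches a single contiguous buffer, which is
# what enables SIMD, pipelining, and cache locality on real hardware.
col_total = sum(vals)

# The row layout yields the same answer but strides across full records.
row_total = sum(r[1] for r in rows)
```

Both sums agree, but the columnar scan is the layout a vectorized engine can actually exploit.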
6. Example Arrow Building Blocks
Gandiva
● LLVM-based JIT compilation for execution of arbitrary expressions against Arrow data structures
Feather
● Fast ephemeral format for movement of data between R/Python
Arrow Flight
● RPC library for efficient interchange of data between processes
Parquet
● Read and write Arrow quickly to/from Parquet; the C++ library builds directly on Arrow.
8. Arrow Flight
● High-performance wire protocol
● Focused on bulk transfer for analytics
● Full delivery of the Arrow interoperability promise
● Cross-platform
● Built for parallelism
● Designed for security
9. Arrow Data Paradigm: Streams of Batches
● Primary communication:
○ A stream of Arrow record batches
○ Bulk transfer targeting efficient movement
○ Effectively peer-to-peer
● Specific methods:
○ Put Stream: client sends a stream to the server
○ Get Stream: server sends a stream to the client
○ Both initiated by the client
[Diagram: client/server exchange. Put: the client sends Header, Data…, End to the server. Get: the client sends a Descriptor and the server streams back Header, Data…, End.]
10. Endpoint: Retrieved with Ticket
Big datasets require parallel streams.
[Diagram: one flight served as multiple parallel streams from Host 1 and Host 2]
● Parallel consumption and locality awareness
○ A flight is composed of streams
○ Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
○ Systems can take advantage of location information to improve data locality
● Flights have two reference systems:
○ Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
○ Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
● Support for stream listing
○ ListFlights (Criteria)
○ GetFlightInfo (FlightDescriptor)