Machine learning pipelines are a hot topic at the moment. Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of running machine learning models in production.
3. >12M monthly downloads & growing exponentially
● Arrow powers dozens of open source & commercial technologies
● 10+ programming languages supported: Java, C, C++, Python, R, JavaScript, C#, Ruby, Rust, Go, …
Arrow’s adoption provides numerous benefits:
● 300+ developers contributing
● Broad architecture (CPU/GPU/FPGA), OS and language support
● Awareness & OSS thought leadership
Arrow has become the industry standard for in-memory data.
4. What is Arrow?
What is it?
● A specification that outlines the in-memory binary layout of data
● A set of libraries and tools
● A set of standards to make analytical data transportable
● A representation for efficient analytical processing on CPUs and GPUs
What isn’t it?
● It’s not an installable system
● It’s not a memory grid or in-memory cache
● It’s not designed for streaming or other single-record operations (e.g. transactions)
5. Arrow In-Memory Columnar Format
● Shredded nested data structures
● Randomly accessible
● Maximizes CPU throughput
○ Pipelining
○ SIMD
○ Cache locality
● Scatter/gather I/O
[Diagram: row-oriented “traditional” memory layout vs. Arrow’s columnar memory layout]
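The row-vs-columnar contrast above can be sketched in plain Python. This is an illustration only: real Arrow buffers are raw, contiguous, padded memory regions, not Python objects, but the access pattern is the same idea.

```python
# Illustrative sketch: row-oriented vs. columnar (Arrow-style) layout.
from array import array

# Row-oriented: each record's fields are stored together.
rows = [(1, 10.0, "a"), (2, 20.0, "b"), (3, 30.0, "c")]

# Columnar: each field lives in its own contiguous typed buffer.
ids = array("q", [1, 2, 3])             # int64 buffer
vals = array("d", [10.0, 20.0, 30.0])   # float64 buffer
tags = ["a", "b", "c"]

# Scanning one column touches a single contiguous buffer, which is
# what enables SIMD, pipelining, and cache locality on real hardware.
col_total = sum(vals)

# The row layout yields the same answer but strides across full records.
row_total = sum(r[1] for r in rows)
```

Both sums agree, but the columnar scan is the layout a vectorized engine can actually exploit.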
6. Example Arrow Building Blocks
Gandiva
● LLVM-based JIT compilation for execution of arbitrary expressions against Arrow data structures
Feather
● Fast ephemeral format for movement of data between R/Python
Arrow Flight
● RPC library for efficient interchange of data between processes
Parquet
● Read and write Arrow quickly to/from Parquet; the C++ library builds directly on Arrow.
8. Arrow Flight
● High-performance wire protocol
● Focused on bulk transfer for analytics
● Full delivery of the Arrow interoperability promise
● Cross-platform
● Built for parallelism
● Designed for security
9. Arrow Data Paradigm: Streams of Batches
● Primary communication:
○ A stream of Arrow record batches
○ Bulk transfer targeting efficient movement
○ Effectively peer-to-peer
● Specific methods:
○ Put Stream: client sends a stream to the server
○ Get Stream: server sends a stream to the client
○ Both initiated by the client
[Diagram: client/server exchange. Put: the client sends Header, Data…, End to the server. Get: the client sends a Descriptor and the server streams back Header, Data…, End.]
10. Endpoint: Retrieved with Ticket
Big datasets require parallel streams.
[Diagram: one flight served as multiple parallel streams from Host 1 and Host 2]
● Parallel consumption and locality awareness
○ A flight is composed of streams
○ Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
○ Systems can take advantage of location information to improve data locality
● Flights have two reference systems:
○ Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
○ Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
● Support for stream listing
○ ListFlights (Criteria)
○ GetFlightInfo (FlightDescriptor)