SlideShare une entreprise Scribd logo
1  sur  44
Télécharger pour lire hors ligne
The Future Roadmap for the
Composable Data Stack
Wes McKinney
Data Council Austin 2024
⌵
⌵
Agenda
● My background
● Posit and Me
● Composable data systems: an overview
● Active growth areas in 2024
● Opportunities and Predictions
⌵
⌵
Me
Principal
Architect
General Partner
Co-founder,
Advisor
Ibis
Creator Co-creator
Creator Advisor,
Investor
Author
⌵
⌵
Voltron Data (2021 — )
● Unlocking the potential of GPU
accelerated analytics for large
scale workloads
● Enterprise support for Apache
Arrow
● Open source partnerships
(Meta, Snowflake, others)
● $115M raised, 130 headcount
and growing
⌵
⌵
⌵
⌵
Posit PBC
● Founded 2009, originally as RStudio
● 300-person, remote-first company
● Open source software for code-first data science and
technical communication
● Certified B corp, no plans to go public or be acquired
● Designing for long-term resiliency, creating a 100-year
company
⌵
⌵
How Posit Makes Money
● Making open source work in the enterprise
● Products for managing the data science lifecycle
○ Posit Workbench: Managed IDEs and Jupyter
Notebooks with enterprise-grade security & compliance
○ Posit Connect: Publish and share interactive data apps,
notebooks, and reports
○ Posit Package Manager: securely manage deploy open
source and proprietary Python and R packages
⌵
⌵
My History with Posit
● 2016: Hadley Wickham and I make Feather file format
● 2018: I partner with RStudio to create non-profit Ursa Labs
for Arrow development
● 2020: Ursa team spins out into startup, becomes Voltron
Data in 2021
● 2022: RStudio becomes Posit, embracing polyglot future
● 2023: I rejoin Posit as a principal architect
⌵
⌵
https://ursalabs.org/blog/announcing-ursalabs/
The reality is that Hadley and I think the “language wars” are stupid when the
real problem we are solving is human user interface design for data analysis.
… The programming languages are our medium for crafting accessible and
productive tools. It has long been a frustration of mine that it isn’t easier to
share code and systems between R and Python.
…
I found that we share a passion for the long-term vision of empowering data
scientists and building a positive relationship with the open source user community.
Critically, RStudio has avoided the “startup trap” and managed to build a sustainable
business while still investing the vast majority of its engineering resources in open
source development.
⌵
⌵
https://ursalabs.org/blog/announcing-ursalabs/
The reality is that Hadley and I think the “language wars” are stupid when the real
problem we are solving is human user interface design for data analysis. … The
programming languages are our medium for crafting accessible and productive tools.
It has long been a frustration of mine that it isn’t easier to share code and systems
between R and Python.
…
I found that we share a passion for the long-term vision of empowering data
scientists and building a positive relationship with the open source user
community. Critically, RStudio has avoided the “startup trap” and managed to
build a sustainable business while still investing the vast majority of its
engineering resources in open source development.
⌵
⌵
Creative Solutions for Funding OSS Work
Sponsors
⌵
⌵
VLDB 2023
⌵
⌵
What is a “Composable Data System”?
● Builds with open standards and protocols
● Designed around modularity, reuse, and interoperability
with other systems that use shared interfaces
● Resists “vertical integration”, builds in a virtual cycle with
relevant open source ecosystem projects
⌵
⌵
“We envision that by decomposing data management systems
into a more modular stack of reusable components, the
development of new engines can be streamlined, while reducing
maintenance costs and ultimately providing a more consistent
user experience. By clearly outlining APIs and encapsulating
responsibilities, data management software could more easily be
adapted, for example, to leverage novel devices and accelerators, as
the underlying hardware evolves. By relying on a modular stack that
reuses execution engine and language frontend, data systems code
could provide a more consistent experience and semantics to users,
from transactional to analytic systems, from stream processing to
machine learning workloads."
Pedrera et al. 2023
⌵
⌵
“We envision that by decomposing data management systems into a
more modular stack of reusable components, the development of new
engines can be streamlined, while reducing maintenance costs and
ultimately providing a more consistent user experience. By clearly
outlining APIs and encapsulating responsibilities, data
management software could more easily be adapted, for example,
to leverage novel devices and accelerators, as the underlying
hardware evolves. By relying on a modular stack that reuses
execution engine and language frontend, data systems code could
provide a more consistent experience and semantics to users, from
transactional to analytic systems, from stream processing to machine
learning workloads."
Pedrera et al. 2023
⌵
⌵
“We envision that by decomposing data management systems into a
more modular stack of reusable components, the development of new
engines can be streamlined, while reducing maintenance costs and
ultimately providing a more consistent user experience. By clearly
outlining APIs and encapsulating responsibilities, data management
software could more easily be adapted, for example, to leverage novel
devices and accelerators, as the underlying hardware evolves. By
relying on a modular stack that reuses execution engine and
language frontend, data systems code could provide a more
consistent experience and semantics to users, from transactional
to analytic systems, from stream processing to machine learning
workloads."
Pedrera et al. 2023
⌵
⌵
Why now?
● 1st-Gen Big Data / Hadoop Era: 2006 - 2013
○ MapReduce popularizes disaggregated storage + compute
● 2nd-Gen 2013 - 2021
○ Vendors shift from proprietary software to delivery of services
○ Open source standards emerge and are popularized
○ Rapid progress in storage, networking, computing performance
● 3rd-Gen 2021 - … ?
○ Open standards for composability become widely accepted
○ Next-gen components emerge, existing systems start retrofitting with
composable pieces
⌵
⌵
Why now?
“...We foresee that composability is soon to cause
another major disruption to how data management
systems are designed. We foresee that monolithic
systems will become obsolete, and give space to a
new composable era for data management.”
Pedrera et al. 2023
⌵
⌵
NYC R Conference 2015
https://www.youtube.com/watch?v=stlxbC7uIzM&t=264s
⌵
⌵
McSherry, Isard, Murray 2015
⌵
⌵
“The published work on big data systems… detail[s]
their systems’ impressive scalability, [but] few directly
evaluate their absolute performance against
reasonable benchmarks. To what degree are these
systems truly improving performance, as opposed to
parallelizing overheads that they themselves
introduce?”
- McSherry, Isard, Murray 2015
⌵
⌵
Glimpse of the Composable Landscape
Engines
Substrait
ADBC, Flight
Storage
Protocols
Query Interface
Optimization
…?
⌵
⌵
Missing pieces
● Distributed computing (Spark, Dask, Ray, “Serverless”)
● ETL modeling (dbt and others)
● Workflow / Infra Orchestration (Airflow, Dagster, Flyte, …)
● Development Environments
● Metadata Catalogues
⌵
⌵
2016: A Cross-Language Fast
In-Memory Data Format
⌵
⌵
https://www.slideshare.net/wesm/data-science-without-borders-jupytercon-2017
2017
⌵
⌵
2017
⌵
⌵
Databases Equipped for the Composable Era
● ADBC : Arrow DataBase Connectivity
● FlightSQL: An Arrow-native wire protocol for SQL systems
⌵
⌵
⌵
⌵
2018
⌵
⌵
Layers of the Composable Cake
⌵
⌵
Modular Execution Engines
● Apache DataFusion (Rust)
● DuckDB (C++)
● Velox (C++)
● Theseus (C++ / CUDA)
⌵
⌵
Connecting Execution with Frontend + Optimizer
● Intermediate Representation (IR) for Queries
Substrait
https://substrait.io/
⌵
⌵
Modular Acceleration Projects
● Prestissimo (Velox in Presto)
● Apache DataFusion Comet (DataFusion in Spark)
● Apache Gluten (incubating) (Velox in Spark)
⌵
⌵
A Multi-Engine Data Stack?
● Why be locked into one full-stack execution engine (like
Spark)?
● Cost/performance/latency varies greatly across workload
shapes and sizes
● You can do a lot with DuckDB, but actual big data does still
exist
HT to Julien Hurault ! https://juhache.substack.com/p/multi-engine-data-stack-v0
⌵
⌵
⌵
⌵
Barriers to a multi-engine data stack
● SQL dialects are non-portable
and feature a wide spectrum of
supported features
○ sqlglot is helping with this!
● Deciding which engine-to-use is
non-trivial
● Diverse orchestration and
infrastructure requirements
⌵
⌵
CIDR 2021
⌵
⌵
⌵
⌵
Enter the Bird
● An effort to harmonize the best of modern SQL with
Pythonic fluent data frames
● Fixing a long list of shortcomings in the pandas API
● Leverage the benefits of a modern PL when creating
complex analytical SQL queries
● Bringing portability (> 20 backends supported) to provide a
unified Python API for a multi-engine data stack
ibis-project.org
⌵
⌵
https://ibis-project.org/posts/bigquery-arrays/
⌵
⌵
Other Ibis Notes
● Created in 2015 at the same time as Arrow
● Strong focus on deep SQL support
● Facilitates dev to prod with few code changes
● Integration with many third party libraries (streamlit, altair,
others)
● Helps automating the drudgery of tedious multi-step DDL
workflows in databases
⌵
⌵
Why this matters https://voltrondata.com/theseus
⌵
⌵
A Future Multi-Engine Stack
● An execution engine tailored for different data scales
○ <= 1TB: DuckDB and Friends
○ 1 - 10TB: Spark, Dask, Ray, etc.
○ > 10TB: Hardware-accelerated Processing (e.g. Theseus)
● A portable language front end (Ibis, Malloy, PRQL, or a
standard transpilable SQL)
● Arrow-native API and wire transport
Closing Thoughts

Contenu connexe

Tendances

(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and ProtobufGuido Schmutz
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David AndersonVerverica
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.Yousef Fadila
 
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jDatabricks
 
美团数据平台之Kafka应用实践和优化
美团数据平台之Kafka应用实践和优化美团数据平台之Kafka应用实践和优化
美团数据平台之Kafka应用实践和优化confluent
 
Asynchronous Programming at Netflix
Asynchronous Programming at NetflixAsynchronous Programming at Netflix
Asynchronous Programming at NetflixC4Media
 
Machine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowMachine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowDatabricks
 
Intro to Cypher
Intro to CypherIntro to Cypher
Intro to CypherNeo4j
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchDatabricks
 
Big data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartBig data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartIMC Institute
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonTimothy Spann
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ NetflixMichelle Ufford
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
 

Tendances (20)

(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.
 
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4jTransforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
 
美团数据平台之Kafka应用实践和优化
美团数据平台之Kafka应用实践和优化美团数据平台之Kafka应用实践和优化
美团数据平台之Kafka应用实践和优化
 
Asynchronous Programming at Netflix
Asynchronous Programming at NetflixAsynchronous Programming at Netflix
Asynchronous Programming at Netflix
 
Machine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowMachine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflow
 
Intro to Cypher
Intro to CypherIntro to Cypher
Intro to Cypher
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
Big data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartBig data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera Quickstart
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ Netflix
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 

Similaire à The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024

The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data ScienceDataWorks Summit
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET Journal
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric TeamsData Con LA
 
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...phdAssistance1
 
Coding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - PhdassistanceCoding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - PhdassistancephdAssistance1
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4Ferdin Joe John Joseph PhD
 
Making advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMaking advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMárton Kodok
 
MSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BIMSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BIOrlando Mariano
 
Consulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSConsulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSVictor M Torres
 
SKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategiesSKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategiesSemantic Web Company
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Mobcoder
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014KDZ - Zentrum für Verwaltungsforschung
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution vty
 
DDDP 2019 - Brown to Green
DDDP 2019  - Brown to GreenDDDP 2019  - Brown to Green
DDDP 2019 - Brown to GreenJohn Archer
 
IRJET- Cost Effective Workflow Scheduling in Bigdata
IRJET-  	  Cost Effective Workflow Scheduling in BigdataIRJET-  	  Cost Effective Workflow Scheduling in Bigdata
IRJET- Cost Effective Workflow Scheduling in BigdataIRJET Journal
 
It Consulting & Services - Black Basil Technologies
It Consulting & Services  - Black Basil TechnologiesIt Consulting & Services  - Black Basil Technologies
It Consulting & Services - Black Basil TechnologiesBlack Basil Technologies
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceFerdin Joe John Joseph PhD
 

Similaire à The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024 (20)

The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
 
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
 
Coding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - PhdassistanceCoding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - Phdassistance
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4
 
Making advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMaking advanced analytics accessible to more companies
Making advanced analytics accessible to more companies
 
MSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BIMSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BI
 
Consulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSConsulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCS
 
SKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategiesSKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategies
 
Abhishek jaiswal
Abhishek jaiswalAbhishek jaiswal
Abhishek jaiswal
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution
 
DDDP 2019 - Brown to Green
DDDP 2019  - Brown to GreenDDDP 2019  - Brown to Green
DDDP 2019 - Brown to Green
 
IRJET- Cost Effective Workflow Scheduling in Bigdata
IRJET-  	  Cost Effective Workflow Scheduling in BigdataIRJET-  	  Cost Effective Workflow Scheduling in Bigdata
IRJET- Cost Effective Workflow Scheduling in Bigdata
 
It Consulting & Services - Black Basil Technologies
It Consulting & Services  - Black Basil TechnologiesIt Consulting & Services  - Black Basil Technologies
It Consulting & Services - Black Basil Technologies
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 

Plus de Wes McKinney

Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 

Plus de Wes McKinney (20)

Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 

Dernier

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024

  • 1. The Future Roadmap for the Composable Data Stack Wes McKinney Data Council Austin 2024
  • 2. ⌵ ⌵ Agenda ● My background ● Posit and Me ● Composable data systems: an overview ● Active growth areas in 2024 ● Opportunities and Predictions
  • 4. ⌵ ⌵ Voltron Data (2021 — ) ● Unlocking the potential of GPU accelerated analytics for large scale workloads ● Enterprise support for Apache Arrow ● Open source partnerships (Meta, Snowflake, others) ● $115M raised, 130 headcount and growing
  • 6. ⌵ ⌵ Posit PBC ● Founded 2009, originally as RStudio ● 300-person, remote-first company ● Open source software for code-first data science and technical communication ● Certified B corp, no plans to go public or be acquired ● Designing for long-term resiliency, creating a 100-year company
  • 7. ⌵ ⌵ How Posit Makes Money ● Making open source work in the enterprise ● Products for managing the data science lifecycle ○ Posit Workbench: Managed IDEs and Jupyter Notebooks with enterprise-grade security & compliance ○ Posit Connect: Publish and share interactive data apps, notebooks, and reports ○ Posit Package Manager: securely manage deploy open source and proprietary Python and R packages
  • 8. ⌵ ⌵ My History with Posit ● 2016: Hadley Wickham and I make Feather file format ● 2018: I partner with RStudio to create non-profit Ursa Labs for Arrow development ● 2020: Ursa team spins out into startup, becomes Voltron Data in 2021 ● 2022: RStudio becomes Posit, embracing polyglot future ● 2023: I rejoin Posit as a principal architect
  • 9. ⌵ ⌵ https://ursalabs.org/blog/announcing-ursalabs/ The reality is that Hadley and I think the “language wars” are stupid when the real problem we are solving is human user interface design for data analysis. … The programming languages are our medium for crafting accessible and productive tools. It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. … I found that we share a passion for the long-term vision of empowering data scientists and building a positive relationship with the open source user community. Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development.
  • 10. ⌵ ⌵ https://ursalabs.org/blog/announcing-ursalabs/ The reality is that Hadley and I think the “language wars” are stupid when the real problem we are solving is human user interface design for data analysis. … The programming languages are our medium for crafting accessible and productive tools. It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. … I found that we share a passion for the long-term vision of empowering data scientists and building a positive relationship with the open source user community. Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development.
  • 11. ⌵ ⌵ Creative Solutions for Funding OSS Work Sponsors
  • 13. ⌵ ⌵ What is a “Composable Data System”? ● Builds with open standards and protocols ● Designed around modularity, reuse, and interoperability with other systems that use shared interfaces ● Resists “vertical integration”, builds in a virtual cycle with relevant open source ecosystem projects
  • 14. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 15. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 16. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 17. ⌵ ⌵ Why now? ● 1st-Gen Big Data / Hadoop Era: 2006 - 2013 ○ MapReduce popularizes disaggregated storage + compute ● 2nd-Gen 2013 - 2021 ○ Vendors shift from proprietary software to delivery of services ○ Open source standards emerge and are popularized ○ Rapid progress in storage, networking, computing performance ● 3rd-Gen 2021 - … ? ○ Open standards for composability become widely accepted ○ Next-gen components emerge, existing systems start retrofitting with composable pieces
  • 18. ⌵ ⌵ Why now? “...We foresee that composability is soon to cause another major disruption to how data management systems are designed. We foresee that monolithic systems will become obsolete, and give space to a new composable era for data management.” Pedrera et al. 2023
  • 19. ⌵ ⌵ NYC R Conference 2015 https://www.youtube.com/watch?v=stlxbC7uIzM&t=264s
  • 21. ⌵ ⌵ “The published work on big data systems… detail[s] their systems’ impressive scalability, [but] few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?” - McSherry, Isard, Murray 2015
  • 22. ⌵ ⌵ Glimpse of the Composable Landscape Engines Substrait ADBC, Flight Storage Protocols Query Interface Optimization …?
  • 23. ⌵ ⌵ Missing pieces ● Distributed computing (Spark, Dask, Ray, “Serverless”) ● ETL modeling (dbt and others) ● Workflow / Infra Orchestration (Airflow, Dagster, Flyte, …) ● Development Environments ● Metadata Catalogues
  • 24. ⌵ ⌵ 2016: A Cross-Language Fast In-Memory Data Format
  • 27. ⌵ ⌵ Databases Equipped for the Composable Era ● ADBC : Arrow DataBase Connectivity ● FlightSQL: An Arrow-native wire protocol for SQL systems
  • 30. ⌵ ⌵ Layers of the Composable Cake
  • 31. ⌵ ⌵ Modular Execution Engines ● Apache DataFusion (Rust) ● DuckDB (C++) ● Velox (C++) ● Theseus (C++ / CUDA)
  • 32. ⌵ ⌵ Connecting Execution with Frontend + Optimizer ● Intermediate Representation (IR) for Queries Substrait https://substrait.io/
  • 33. ⌵ ⌵ Modular Acceleration Projects ● Prestissimo (Velox in Presto) ● Apache DataFusion Comet (DataFusion in Spark) ● Apache Gluten (incubating) (Velox in Spark)
  • 34. ⌵ ⌵ A Multi-Engine Data Stack? ● Why be locked into one full-stack execution engine (like Spark)? ● Cost/performance/latency varies greatly across workload shapes and sizes ● You can do a lot with DuckDB, but actual big data does still exist HT to Julien Hurault ! https://juhache.substack.com/p/multi-engine-data-stack-v0
  • 36. ⌵ ⌵ Barriers to a multi-engine data stack ● SQL dialects are non-portable and feature a wide spectrum of supported features ○ sqlglot is helping with this! ● Deciding which engine-to-use is non-trivial ● Diverse orchestration and infrastructure requirements
  • 39. ⌵ ⌵ Enter the Bird ● An effort to harmonize the best of modern SQL with Pythonic fluent data frames ● Fixing a long list of shortcomings in the pandas API ● Leverage the benefits of a modern PL when creating complex analytical SQL queries ● Bringing portability (> 20 backends supported) to provide a unified Python API for a multi-engine data stack ibis-project.org
  • 41. ⌵ ⌵ Other Ibis Notes ● Created in 2015 at the same time as Arrow ● Strong focus on deep SQL support ● Facilitates dev to prod with few code changes ● Integration with many third party libraries (streamlit, altair, others) ● Helps automating the drudgery of tedious multi-step DDL workflows in databases
  • 42. ⌵ ⌵ Why this matters https://voltrondata.com/theseus
  • 43. ⌵ ⌵ A Future Multi-Engine Stack ● An execution engine tailored for different data scales ○ <= 1TB: DuckDB and Friends ○ 1 - 10TB: Spark, Dask, Ray, etc. ○ > 10TB: Hardware-accelerated Processing (e.g. Theseus) ● A portable language front end (Ibis, Malloy, PRQL, or a standard transpilable SQL) ● Arrow-native API and wire transport