Scaling and Unifying SciKit Learn and Apache Spark Pipelines


Pipelines have become ubiquitous as the need to string multiple functions together into applications has grown. Common pipeline abstractions such as “fit” and “transform” are shared even across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.


Scaling and Unifying Scikit Learn and Spark Pipelines using Ray
Raghu Ganti
Principal Research Staff Member
IBM T J Watson Research Center
Team (IBM & Red Hat): Michael Behrendt, Linsong Chu, Carlos Costa, Erik Erlandson, Mudhakar Srivatsa

So many pipelines…
And many more…

Ray.IO
§ Can we do pipelines on Ray?
§ Can we scale popular AI/ML pipelines on Ray?
§ Can we unify scikit learn and Spark pipelines?

Current pipeline API
• Focus on scikit learn and Spark pipelines
• Scikit learn missing scaling; Spark focus on data parallel scaling
[Figure: Transform maps X to X’. Fit takes X and y and produces a fitted model.]
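
The fit/transform contract on this slide can be sketched in a few lines. This is a minimal pure-Python illustration that mimics the scikit-learn convention rather than using scikit-learn itself. The classes `StandardScaler1D` and `MeanRegressor` are hypothetical toy stand-ins invented for this sketch.

```python
from statistics import mean, pstdev


class StandardScaler1D:
    """Toy transformer (hypothetical): centers and scales a list of numbers."""

    def fit(self, X, y=None):
        # fit learns state from the data and returns the fitted object.
        self.mean_ = mean(X)
        self.scale_ = pstdev(X) or 1.0  # avoid dividing by zero on constant input
        return self

    def transform(self, X):
        # transform maps X to X' using the state learned in fit.
        return [(x - self.mean_) / self.scale_ for x in X]


class MeanRegressor:
    """Toy estimator (hypothetical): fit takes X and y and yields a fitted model."""

    def fit(self, X, y):
        self.prediction_ = mean(y)
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


X = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

X_prime = StandardScaler1D().fit(X).transform(X)  # X -> X'
fitted_model = MeanRegressor().fit(X_prime, y)    # (X', y) -> fitted model
```

Both steps share the same two-method shape, which is what lets them be chained into a pipeline.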

Scaling Pipelines: I/O as List of Objects
[Figure: Transform maps [X1, X2, … XN] to [X1’, X2’, …, XN’]. Fit takes [X1, X2, … XN] and [y1, y2, … yN] and produces fitted models [FM1, FM2, … FMN].]
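
One way to read the list-based I/O is that fit and transform are applied element-wise, so each list element can be processed in parallel. The sketch below assumes that reading and uses a standard-library thread pool purely as a stand-in for Ray. On Ray, each call would presumably be a task declared with `ray.remote` and collected with `ray.get`. The functions `fit_one` and `transform_one` are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_one(X, y):
    # Hypothetical unit of compute: the "model" is just the mean of y.
    return {"n_samples": len(X), "mean_y": sum(y) / len(y)}


def transform_one(X):
    # Hypothetical transform: double every value.
    return [2 * x for x in X]


def fit_many(Xs, ys):
    # [X1..XN], [y1..yN] -> [FM1..FMN]; pool.map preserves input order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fit_one, Xs, ys))


def transform_many(Xs):
    # [X1..XN] -> [X1'..XN']
    with ThreadPoolExecutor() as pool:
        return list(pool.map(transform_one, Xs))


Xs = [[1, 2], [3, 4, 5]]
ys = [[1.0, 3.0], [2.0, 4.0, 6.0]]

fitted_models = fit_many(Xs, ys)
Xs_prime = transform_many(Xs)
```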

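Scaling Pipelines: AND/OR Graphs
[Figure: An And node takes inputs X1, X2, …, XN and produces outputs X1’, X2’, …, XM’. An Or node passes a single input X to Step1, Step2, …, StepN, each of which produces its own X’.]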
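
The two node types can be sketched as plain functions. This is an interpretation of the figure, not the speakers' implementation: `or_node` fans a single input out to every alternative step, and `and_node` hands all N inputs to one user-supplied function that returns M outputs. Both helper names and the three example steps are assumptions made for illustration.

```python
def or_node(steps, X):
    """Fan out: feed the same input X to every alternative step."""
    return [step(X) for step in steps]


def and_node(fn, inputs):
    """Join: hand all N inputs to one function, which returns M outputs."""
    return fn(inputs)


# Hypothetical alternative steps, each mapping X to its own X'.
steps = [
    lambda X: [x + 1 for x in X],
    lambda X: [x * 2 for x in X],
    lambda X: [x ** 2 for x in X],
]

# Or node: one input, three branch outputs.
branches = or_node(steps, [1, 2, 3])

# And node: an arbitrary lambda that sums the N = 3 branches
# position by position and returns M = 1 output.
merged = and_node(lambda xs: [[sum(col) for col in zip(*xs)]], branches)
```

Because the branches of an Or node do not depend on each other, they are natural candidates for parallel execution.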

Key Features
• Function as unit of compute
▪ Python function as unit of compute
▪ Intuitive for data scientist
▪ Follows transformer APIs
• List of objects as I/O
▪ MPI-style scaling
▪ Object references as I/O for unit of compute
▪ Sharing of objects using Plasma store
▪ Enables zero-copy object sharing
• Cross environment
▪ Scikit learn typically in Python
▪ Ray.IO with RayDP enables efficient data exchange
• AND/OR Graphs
▪ Enriched DAGs from plain pipelines
▪ OR nodes for fan-out expressions
▪ AND nodes for arbitrary lambdas
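
The idea of passing object references rather than the objects themselves can be illustrated with a toy store. The `ObjectStore` class below is a hypothetical, in-process stand-in. It does not reproduce Plasma or cross-process zero-copy sharing, and only shows the calling pattern that `ray.put` and `ray.get` provide on Ray.

```python
import itertools


class ObjectStore:
    """Toy in-process stand-in for a shared object store (hypothetical)."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count()

    def put(self, obj):
        # Store the object once and hand back a small reference to it.
        ref = next(self._ids)
        self._objects[ref] = obj
        return ref

    def get(self, ref):
        # Resolve a reference to the stored object without copying it.
        return self._objects[ref]


store = ObjectStore()


def scale(ref, factor):
    # Unit of compute: takes a reference as input, returns a reference.
    data = store.get(ref)
    return store.put([factor * x for x in data])


x_ref = store.put([1, 2, 3])
refs = [scale(x_ref, factor) for factor in (1, 10)]
results = [store.get(r) for r in refs]

# Every consumer of x_ref sees the same stored object, not a copy.
same_object = store.get(x_ref) is store.get(x_ref)
```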

Illustrative Example
[Figure: Sample pipeline with the steps Preprocess, Random Forest, Gradient Boost, and Decision Tree, shown as a Scikit learn Pipeline and as Our Pipeline.]
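
The slide image is not recoverable here, so the following is only one plausible reading of the comparison: with linear pipelines, each model needs its own pipeline and repeats the preprocessing, while a graph with a fan-out can run Preprocess once and share its output. All functions below are hypothetical placeholders, not real models.

```python
calls = {"preprocess": 0}


def preprocess(X):
    # Count invocations so the two styles can be compared.
    calls["preprocess"] += 1
    return [x / 10 for x in X]


# Placeholder "models" that return a label and a dummy score.
def random_forest(X):
    return ("random_forest", sum(X))


def gradient_boost(X):
    return ("gradient_boost", max(X))


def decision_tree(X):
    return ("decision_tree", min(X))


models = [random_forest, gradient_boost, decision_tree]
X = [10, 20, 30]

# Linear style: one pipeline per model, so preprocessing runs three times.
linear_results = [model(preprocess(X)) for model in models]
linear_calls = calls["preprocess"]

# Graph style: preprocess once, then fan out to the three models.
calls["preprocess"] = 0
X_prime = preprocess(X)
graph_results = [model(X_prime) for model in models]
graph_calls = calls["preprocess"]
```

Both styles produce the same results in this sketch, and the graph style does so with a single preprocessing call.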

Pipelines Galore…
                     Airflow     Kubeflow    Scikit learn      Spark Pipeline         Our pipeline
Task parallelism     ✓           ✓           ✗                 ✓                      ✓
Data parallelism     ✗           ✗           ✗                 ✓                      ✓
And/Or Graphs        ✓           ✓           ✗                 ✗                      ✓
Computational unit   Container   Container   Python function   Python/Java function   Python/Java function
Mutability of DAG    ✗           ✗           ✓                 ✓                      ✓

What to expect?
• Execution strategies based on graph traversals
• Early stopping criteria
• Mutability of execution pipelines
• Current status: Proposal discussion with Ray and OSS community
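
As a rough illustration of what traversal-based execution with an early stopping criterion could look like, the sketch below orders a small DAG with the standard library's `graphlib.TopologicalSorter` and stops once a caller-supplied predicate fires. The graph, the `run` helper, and the stopping rule are all assumptions made for this example, and the talk does not specify the actual strategies.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline graph: each node maps to the set of its predecessors.
graph = {
    "preprocess": set(),
    "random_forest": {"preprocess"},
    "gradient_boost": {"preprocess"},
    "decision_tree": {"preprocess"},
}


def execution_order(graph):
    # A topological order guarantees every node runs after its predecessors.
    return list(TopologicalSorter(graph).static_order())


def run(graph, should_stop):
    # Visit nodes in dependency order and stop as soon as the criterion fires.
    executed = []
    for node in execution_order(graph):
        executed.append(node)  # a real executor would run the node here
        if should_stop(node):
            break
    return executed


order = execution_order(graph)

# Hypothetical criterion: stop after the first model node has run.
stopped = run(graph, should_stop=lambda node: node != "preprocess")
```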

Q&A
Contacts:
Raghu Ganti (rganti@us.ibm.com)
Michael Behrendt (michaelbehrendt@de.ibm.com)
Linsong Chu (lchu@us.ibm.com)
Carlos Costa (chcost@us.ibm.com)
Erik Erlandson (eerlands@redhat.com)
Mudhakar Srivatsa (msrivats@us.ibm.com)


