Scaling Data Pipelines with Apache Spark on Kubernetes on Google Cloud

Rajesh Thallam, Machine Learning Specialist, Google
Sougata Biswas, Data Analytics Specialist, Google
May 2021
Outline
1. Why Spark on Kubernetes?
2. Spark on Kubernetes on Google Cloud
3. Things to Know
4. Use Cases / Implementation Patterns
5. Wrap up
Why Spark on Kubernetes?
Unique benefits of orchestrating Spark jobs on Kubernetes compared to other cluster managers (YARN and Mesos):
● Optimize Costs: utilize existing Kubernetes infrastructure to run data engineering or ML workloads alongside other applications, without maintaining separate big data infrastructure
● Portability: containerizing Spark applications makes it possible to run them both on-prem and in the cloud
● Isolation: packaging job dependencies in containers is a great way to isolate workloads, allowing teams to scale independently
● Faster Scaling: scaling containers is much faster than scaling VMs (virtual machines)
Comparing Cluster Managers
Apache Hadoop YARN vs Kubernetes for Apache Spark

Apache Hadoop YARN
● First cluster manager, available since the inception of Apache Spark
● Battle tested
● General purpose scheduler for big data applications
● Runs on clusters of VMs or physical machines (e.g. on-prem Hadoop clusters)
● Option to run: spark-submit to YARN

Kubernetes (k8s)
● Supported as a resource manager starting with Spark 2.3 (experimental) and GA with Spark 3.1.1
● Not yet at feature parity with YARN
● General purpose scheduler for any containerized apps
● Spark runs as containers on a k8s cluster; faster scaling in and out
● Options to run: spark-submit or the Spark k8s operator
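For reference, a plain spark-submit against a Kubernetes cluster manager (without any managed service on top) looks roughly like the sketch below; the API server endpoint, container image, and example jar path are illustrative placeholders.

# minimal spark-submit directly against a Kubernetes cluster manager
# (endpoint, image, and jar path are illustrative)
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar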
Spark on Kubernetes on Google Cloud
Cloud Dataproc
Combining the best of open source and cloud, simplifying Hadoop & Spark workloads on the cloud

Features of Dataproc
● Managed Clusters: 90s cluster spin-up, autoscaling, autozone placement
● Managed Jobs: Spark on GKE, workflow templates, Airflow operators
● Secure: enterprise security, encryption, access control
● Cost Effective: only pay for what you use
● Built-in support for Hadoop & Spark
● Managed hardware and configuration
● Simplified version management
● Flexible job configuration
Kubernetes
OS for your compute fleet

● Manage applications, not machines
○ Manages container clusters
○ Inspired and informed by Google’s experiences
○ Supports multiple cloud and bare-metal environments
○ Supports multiple container runtimes
● Features similar to an OS for a host
○ Scheduling workload
○ Finding the right host to fit your workload
○ Monitoring health of the workload
○ Scaling it up and down as needed
○ Moving it around as needed
Google Kubernetes Engine (GKE)
Secured and fully managed Kubernetes service
GKE, Kubernetes-as-a-service
[Diagram: kubectl and gcloud clients talk to the GKE control plane, which manages the cluster's worker nodes]
● Turn-key solution to Kubernetes
○ Provision a cluster in minutes
○ Industry-leading automation
○ Scales to an industry-leading 15k worker nodes
○ Reliable and available
○ Deep GCP integration
● Generally Available since August 2015
○ 99.5% or 99.95% SLA on Kubernetes APIs
○ $0.10 per cluster/hour + infrastructure cost
○ Supports GCE sole-tenant nodes and reservations
Dataproc on GKE BETA
Run Spark jobs on GKE clusters with Dataproc Jobs API
● Simple way of executing Spark jobs on GKE clusters
● Single API to run Spark jobs on Dataproc as well as GKE
● Extensible with a custom Docker image for the Spark job
● Enterprise security controls out of the box
● Ease of logging and monitoring with Cloud Logging and Cloud Monitoring
[Diagram: a user creates a cluster and submits a job to Dataproc; Dataproc allocates resources on GKE, which runs the Spark job]
[Diagram: the Dataproc agent runs on a GKE node and performs spark-submit via the Dataproc API against the Kubernetes master (API server, scheduler); it handles job scheduling & monitoring while the driver Pod and executor Pods, spread across nodes 1..n, run inside Google Kubernetes Engine (GKE)]
Dataproc on GKE - How Does It Work?
Submit Spark jobs to a running GKE cluster from the Dataproc Jobs API
● The Dataproc agent runs as a container inside GKE, communicating with the GKE scheduler using the spark-kubernetes operator
● Users submit jobs using the Dataproc Jobs API, while job execution happens inside the GKE cluster
● The Spark driver and executors run on different Pods inside separate namespaces within the GKE cluster
● Driver and executor logs are sent to the Google Cloud Logging service
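Because the driver and executors are ordinary Pods, they can also be inspected directly with kubectl; a sketch, assuming the demo's namespace spark-on-gke:

# fetch credentials and list the Spark pods in the cluster's namespace
gcloud container clusters get-credentials "${GKE_CLUSTER}" --zone "${GCE_ZONE}"
kubectl get pods --namespace spark-on-gke
# tail driver logs directly from Kubernetes
kubectl logs --namespace spark-on-gke --selector=spark-role=driver --follow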
How is Dataproc on GKE different from alternatives?
Comparing against Spark Submit and Spark Operator for Kubernetes
● Easy to get started with the familiar Dataproc API
● Easy to set up and manage: no need to install the Spark Kubernetes operator or set up monitoring and logging separately
● Built-in security features with the Dataproc API: access control, auditing, encryption and more
● Inherent benefits of the managed services - Dataproc and GKE
Demo
Spark on GKE using Dataproc Jobs API
Step 1: Setup a GKE Cluster

# set up environment variables
GCE_REGION=us-west2              # GCP region
GCE_ZONE=us-west2-a              # GCP zone
GKE_CLUSTER=spark-on-gke         # GKE cluster name
DATAPROC_CLUSTER=dataproc-gke    # Dataproc cluster name
VERSION=1.4.27-beta              # Dataproc image version
BUCKET=my-project-spark-on-k8s   # GCS bucket

# create GKE cluster with autoscaling enabled
gcloud container clusters create "${GKE_CLUSTER}" \
  --scopes=cloud-platform \
  --workload-metadata=GCE_METADATA \
  --machine-type=n1-standard-4 \
  --zone="${GCE_ZONE}" \
  --enable-autoscaling --min-nodes 1 --max-nodes 10

# add the Kubernetes Engine Admin role to
# service-projectid@dataproc-accounts.iam.gserviceaccount.com
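One way to grant that role, as a sketch (PROJECT_ID and PROJECT_NUMBER are placeholders not defined above; roles/container.admin is the Kubernetes Engine Admin role):

# grant Kubernetes Engine Admin to the Dataproc service agent (sketch)
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:service-${PROJECT_NUMBER}@dataproc-accounts.iam.gserviceaccount.com" \
  --role="roles/container.admin"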
Step 2: Create and Register Dataproc to GKE

# create Dataproc cluster and register it with GKE under a K8s namespace
gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
  --gke-cluster="${GKE_CLUSTER}" \
  --region="${GCE_REGION}" \
  --zone="${GCE_ZONE}" \
  --image-version="${VERSION}" \
  --bucket="${BUCKET}" \
  --gke-cluster-namespace="spark-on-gke"
Step 3: Spark Job Execution

# run a sample PySpark job using the Dataproc API
# to read a table in BigQuery and generate word counts
gcloud dataproc jobs submit pyspark bq-word-count.py \
  --cluster=${DATAPROC_CLUSTER} \
  --region=${GCE_REGION} \
  --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
  --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar
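The job can also be tracked from the CLI; a sketch using the job ID returned by the submit command (the job ID itself is a placeholder):

# list jobs on the cluster and wait on a specific job's completion
gcloud dataproc jobs list --region=${GCE_REGION} --cluster=${DATAPROC_CLUSTER}
gcloud dataproc jobs wait <job-id> --region=${GCE_REGION}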
Step 4a: Monitoring - GKE & Cloud Logging

# Spark driver logs from Google Cloud Logging
resource.type="k8s_container"
resource.labels.cluster_name="spark-on-gke"
resource.labels.namespace_name="spark-on-gke"
resource.labels.container_name="spark-kubernetes-driver"

# Spark executor logs from Google Cloud Logging
resource.type="k8s_container"
resource.labels.cluster_name="spark-on-gke"
resource.labels.namespace_name="spark-on-gke"
resource.labels.container_name="executor"
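The same filter works from the command line via gcloud logging read; a sketch:

# fetch recent driver log entries from the CLI
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.namespace_name="spark-on-gke" AND resource.labels.container_name="spark-kubernetes-driver"' \
  --limit=50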
Step 4b: Monitoring with Spark Web UI

# TCP port forwarding to the driver pod to view the Spark UI
gcloud container clusters get-credentials "${GKE_CLUSTER}" \
  --zone "${GCE_ZONE}" \
  --project "${PROJECT_ID}" && \
kubectl port-forward \
  --namespace "${GKE_NAMESPACE}" \
  $(kubectl get pod --namespace "${GKE_NAMESPACE}" \
    --selector="spark-role=driver,sparkoperator.k8s.io/app-name=dataproc-app_name" \
    --output jsonpath='{.items[0].metadata.name}') \
  8080:4040
# then open http://localhost:8080 in a browser
Dataproc with Apache Spark on GKE
Things to Know
Autoscaling Spark Jobs
Automatically resize node pools of GKE cluster based on the workload demands
# create GKE cluster with autoscaling enabled
gcloud container clusters create "${GKE_CLUSTER}" \
  --scopes=cloud-platform \
  --workload-metadata=GCE_METADATA \
  --machine-type n1-standard-2 \
  --zone="${GCE_ZONE}" \
  --num-nodes 2 \
  --enable-autoscaling --min-nodes 1 --max-nodes 10

# create Dataproc cluster on GKE
gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
  --gke-cluster="${GKE_CLUSTER}" \
  --region="${GCE_REGION}" \
  --zone="${GCE_ZONE}" \
  --image-version="${VERSION}" \
  --bucket="${BUCKET}"
● The Dataproc Autoscaler is not supported with Dataproc on GKE
● Instead, enable autoscaling on the GKE cluster's node pool
● Specify a minimum and maximum size for the GKE cluster's node pool, and the rest is automatic (a sketch for updating an existing cluster follows this list)
● You can combine the GKE Cluster Autoscaler with Horizontal/Vertical Pod Autoscaling
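Autoscaling can also be turned on after the fact for an existing cluster; a sketch, assuming the default node pool:

# enable autoscaling on an existing cluster's node pool
gcloud container clusters update "${GKE_CLUSTER}" \
  --zone="${GCE_ZONE}" \
  --node-pool=default-pool \
  --enable-autoscaling --min-nodes=1 --max-nodes=10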
Shuffle in Spark on Kubernetes
Writes shuffle data to scratch space, local volumes, or Persistent Volume Claims

● Shuffle is the data exchange between different stages in a Spark job
● Shuffle is expensive, and its performance depends on disk IOPS and network throughput between the nodes
● Spark supports writing shuffle data to Persistent Volume Claims, local volumes, or scratch space (a PVC sketch follows the configs below)
● Local SSDs are more performant than Persistent Disks, but they are transient; disk IOPS and throughput improve as disk size increases
● An external shuffle service is not available today

# create GKE cluster or a node pool with local SSDs
gcloud container clusters create "${GKE_CLUSTER}" \
  ... \
  --local-ssd-count ${NUMBER_OF_DISKS}

# config YAML to use local SSD as scratch space
spec:
  volumes:
    - name: "spark-local-dir-1"
      hostPath:
        path: "/tmp/spark-local-dir"
  executor:
    volumeMounts:
      - name: "spark-local-dir-1"
        mountPath: "/tmp/spark-local-dir"

# spark job conf to override scratch space
spark.local.dir=/tmp/spark-local-dir/
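For the Persistent Volume Claim option, Spark 3.1+ can request on-demand PVCs for executors; a hedged sketch of the relevant Spark properties (the volume name spark-local-dir-1 and the storage class are assumptions, not values from this talk):

# sketch: on-demand PVCs as executor shuffle/scratch space (Spark 3.1+)
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=standard-rwo
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/tmp/spark-local-dir
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false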
Dynamic Resource Allocation *
Dynamically adjust the resources Spark application occupies based on the workload
# spark job conf to enable dynamic allocation
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
● When enabled, Spark dynamically adjusts resources based on workload demand
● An external shuffle service is not available in Spark on Kubernetes (work in progress)
● Instead, soft dynamic resource allocation is available in Spark 3.0, where the driver tracks shuffle files and evicts only executors not storing active shuffle files
● Dynamic allocation is a cost optimization technique: a cost vs. latency trade-off
● To improve latency, consider over-provisioning the GKE cluster: fine-tune Horizontal Pod Autoscaling or configure pause Pods (see the sketch after the footnote below)
* Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
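As referenced in the list above, a sketch of submitting a Spark 3.x job with soft dynamic allocation bounds (the min/max executor values are illustrative):

# sketch: soft dynamic allocation with shuffle tracking (Spark 3.x)
gcloud dataproc jobs submit pyspark foo.py \
  --cluster="${DATAPROC_CLUSTER}" \
  --region="${GCE_REGION}" \
  --properties="spark.dynamicAllocation.enabled=true,spark.dynamicAllocation.shuffleTracking.enabled=true,spark.dynamicAllocation.minExecutors=2,spark.dynamicAllocation.maxExecutors=20"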
Running Spark Jobs on Preemptible VMs (PVMs) on GKE
Reduce the cost of running Spark jobs without sacrificing predictability

# create GKE cluster with preemptible VMs
gcloud container clusters create "${GKE_CLUSTER}" \
  --preemptible

# or create a GKE node pool with preemptible VMs
gcloud container node-pools create "${GKE_NODE_POOL}" \
  --preemptible \
  --cluster "${GKE_CLUSTER}"

# submit Dataproc job to the node pool with preemptible VMs
gcloud dataproc jobs submit pyspark foo.py \
  --cluster="${DATAPROC_CLUSTER}" \
  --region="${GCE_REGION}" \
  --properties="spark.kubernetes.node.selector.cloud.google.com/gke-nodepool=${GKE_NODE_POOL}"
● PVMs are excess Compute Engine capacity that lasts for a maximum of 24 hours, with no availability guarantees
● Best suited for running batch or fault-tolerant jobs
● Much cheaper than standard VMs, so running Spark on GKE with PVMs reduces the cost of deployment. But:
○ PVMs can shut down inadvertently, and rescheduling Pods to a new node may add latency
○ Shuffle files held by executors that were shut down will be recomputed, adding latency
Create Dataproc Cluster on GKE with Custom Image
Bring your own image or extend the default Dataproc image

● When creating a Dataproc cluster on GKE, the default Dataproc Docker image is used, based on the image version specified
● You can bring your own image, or extend the default image, as the container image for the Spark application (a build-and-push sketch follows the command below)
● Create a Dataproc cluster with a custom image when you need to include your own packages or applications

# submit Dataproc job with a custom container image
gcloud dataproc jobs submit pyspark foo.py \
  --cluster="${DATAPROC_CLUSTER}" \
  --region="${GCE_REGION}" \
  --properties=spark.kubernetes.container.image="gcr.io/${PROJECT_ID}/my-spark-image"
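A sketch of building and pushing such a custom image (the image name and the use of a local Dockerfile are assumptions; Cloud Build is an alternative shown for completeness):

# build and push a custom Spark image from a local Dockerfile
docker build -t "gcr.io/${PROJECT_ID}/my-spark-image" .
docker push "gcr.io/${PROJECT_ID}/my-spark-image"
# or build remotely with Cloud Build
gcloud builds submit --tag "gcr.io/${PROJECT_ID}/my-spark-image"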
Integrating with Google Cloud Storage (GCS) and BigQuery (BQ)
Use Spark BigQuery Connector and Google Cloud Storage connector for better performance
# submit Dataproc job using BigQuery as source/sink
gcloud dataproc jobs submit pyspark bq-word-count.py \
  --cluster=${DATAPROC_CLUSTER} \
  --region=${GCE_REGION} \
  --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
  --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar
● Built-in Cloud Storage connector in the default Dataproc image
● Add the Spark BigQuery connector as a dependency; it uses the BigQuery Storage API to stream data directly from BigQuery via gRPC, without using GCS as an intermediary (see the jar note below)
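One detail worth noting: the connector jar should match the Spark/Scala build of the cluster. The _2.11 jar above pairs with Spark 2.4; for a Spark 3.x (Scala 2.12) cluster the equivalent flag would presumably be:

# Scala 2.12 build of the BigQuery connector, for Spark 3.x clusters
--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar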
Dataproc with Apache Spark on GKE - Things to Know at a Glance
● Autoscaling: automatically resize GKE cluster node pools based on workload demand
● Shuffle: writes to scratch space, local volumes, or Persistent Volume Claims
● Dynamic Allocation: dynamically adjust job resources based on the workload
● Preemptible VMs: reduce the cost of running Spark jobs without sacrificing predictability
● Custom Image: bring your own image or extend the default Dataproc image
● Integration with Google Cloud Services: built-in Cloud Storage connector, plus the Spark BigQuery connector as a dependency
Dataproc with Apache Spark on GKE
Use Cases / Architectural Patterns
Unified Infrastructure
[Diagram: a single Google Kubernetes Engine (GKE) cluster hosting Dataproc clusters on GKE (Apache Spark 2.4, Apache Spark 3.x) alongside Airflow, Kubeflow, and other workloads]
● Unify all of your processing, whether it is a data processing pipeline, a machine learning pipeline, a web application, or anything else
● By migrating Spark jobs to a single cluster manager, you can focus on modern cloud management in Kubernetes
● Leads to more efficient use of resources and provides a unified logging and management framework

Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
Cloud Composer
Managed Apache Airflow service to create, schedule, monitor and manage workflows

What is Cloud Composer?
● Author end-to-end workflows on GCP via triggers and integrations
● Enterprise security for your workflows through Google-managed credentials
● No need to think about managing the infrastructure after the initial config, done with a click
● Makes troubleshooting simple, with observability through Cloud Logging and Monitoring
[Diagram: Cloud Composer at the center for workflow orchestration, with public cloud integrations (Azure Blob Storage; AWS EMR, S3, EC2, Redshift; Databricks SubmitRunOperator), GCP integrations (BigQuery, Cloud Dataproc, Cloud Dataflow, Cloud Pub/Sub, Cloud AI Platform, Cloud Storage, Cloud Datastore), and on-prem integration]
Orchestrating Apache Spark Jobs from Cloud Composer
[Diagram: Cloud Composer triggers data processing on Dataproc on GKE, running inside Google Kubernetes Engine (GKE), with Cloud Storage, BigQuery, and any other data sources or targets on either side]
● Trigger a DAG from Composer to submit a job to the Dataproc cluster running on GKE (a CLI sketch for triggering a DAG follows this list)
● Save time by not creating and tearing down ephemeral Dataproc clusters
● One cluster manager to orchestrate and process jobs; better utilization of resources
● Optimized costs, plus better visibility and reliability
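A sketch of triggering a Composer DAG from the CLI; the environment name, location, and DAG ID are illustrative, and trigger_dag assumes an Airflow 1.x environment (the era of this talk):

# trigger a DAG in a Cloud Composer environment (Airflow 1.x CLI)
gcloud composer environments run my-composer-env \
  --location us-central1 \
  trigger_dag -- spark_on_gke_dag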
Machine Learning Lifecycle

[Diagram: the ML lifecycle as a loop across the four stages below; data flows into exploration, ML model code into training, the trained ML model into scoring, and model accuracy information feeds back]

● Data (DATA ENGINEER): ingestion, cleaning, storage
● Exploration & Model Prototyping (DATA SCIENTIST): explore data, test features + algorithms, build model prototypes, prototype on a SMALL or SAMPLED dataset
● Production Training & Evaluation (DATA SCIENTIST / ML ENGINEER): apply ML model code on large datasets, test performance and validate, train on a LARGE or FULL dataset
● Model Scoring & Inference (DATA / ML ENGINEER): operationalize data processing, deploy models to production
MLflow
Open Source platform to manage the ML lifecycle

Components of MLflow
● Tracking: record and query experiments - code, data, config and results
● Projects: package data science code to reproduce runs on any platform
● Models: deploy machine learning models in diverse serving environments
● Registry: store, annotate, discover, and manage models in a central repository
Unifying Machine Learning & Data Pipeline Deployments

[Diagram: Cloud Scheduler triggers Cloud Composer, which orchestrates data processing on Dataproc on GKE; data arrives from API connectors & data imports via Cloud Storage and from BigQuery; outputs land in a target bucket, Cloud Bigtable, and BigQuery; Kubeflow on the same Google Kubernetes Engine (GKE) cluster handles data science/ML notebooks, training, experimentation, and ML tracking, with artifacts stored in Cloud Storage and models served on AI Platform; security & integrations via Key Management Service, Secret Manager, and Cloud IAM]
Dataproc with Apache Spark on GKE
Wrapping up
Apache Spark on Kubernetes
Why Spark on Kubernetes?
● Do you have apps running on Kubernetes clusters? Are they underutilized?
● Is it painful to manage multiple cluster managers - YARN and Kubernetes?
● Is it difficult to manage Spark job dependencies and different Spark versions?
● Do you want the same benefits as apps running on Kubernetes - multitenancy, autoscaling, fine-grained access control?
Why Dataproc on GKE?
● Faster scaling with reliability
● Inherent benefits of managed infrastructure
● Enterprise security control
● Unified logging and monitoring
● Optimized costs due to effective resource sharing
Resources

Open Source Documentation
● Running Spark on Kubernetes - Spark Documentation
● Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes
● Code example used in the demo

Blog Posts & Solutions
● Make the most out of your Data Lake with Google Cloud
● Cloud Dataproc Spark Jobs on GKE: How to get started

Google Cloud Documentation
● Google Cloud Dataproc
● Google Kubernetes Engine (GKE)
● Google Cloud Composer
● Dataproc on Google Kubernetes Engine
Feedback
Your feedback is important to us. Don’t forget to rate and review the sessions.

Contenu connexe

Tendances

Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight OverviewJacques Nadeau
 
The State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiThe State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiSpark Summit
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 

Tendances (20)

Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
The State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiThe State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas Poggi
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 

Similaire à Scaling your Data Pipelines with Apache Spark on Kubernetes

18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on KubernetesAthens Big Data
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Faster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverFaster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverDatabricks
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarDatabricks
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Chris Fregly
 
Data Engineer's Lunch #76: Airflow and Google Dataproc
Data Engineer's Lunch #76: Airflow and Google DataprocData Engineer's Lunch #76: Airflow and Google Dataproc
Data Engineer's Lunch #76: Airflow and Google DataprocAnant Corporation
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesDatabricks
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On DemandBogdan Kyryliuk
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftLi Gao
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesYousun Jeong
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkAnu Shetty
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Session 4 GCCP.pptx
Session 4 GCCP.pptxSession 4 GCCP.pptx
Session 4 GCCP.pptxDSCIITPatna
 

Similaire à Scaling your Data Pipelines with Apache Spark on Kubernetes (20)

18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Faster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverFaster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-Jobserver
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Data Engineer's Lunch #76: Airflow and Google Dataproc
Data Engineer's Lunch #76: Airflow and Google DataprocData Engineer's Lunch #76: Airflow and Google Dataproc
Data Engineer's Lunch #76: Airflow and Google Dataproc
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at Lyft
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on Kubernetes
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Spark on Yarn @ Netflix
Spark on Yarn @ NetflixSpark on Yarn @ Netflix
Spark on Yarn @ Netflix
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Serverless Data Science
Serverless Data ScienceServerless Data Science
Serverless Data Science
 
Session 4 GCCP.pptx
Session 4 GCCP.pptxSession 4 GCCP.pptx
Session 4 GCCP.pptx
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Dernier

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 

Dernier (20)

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 

Scaling your Data Pipelines with Apache Spark on Kubernetes

  • 1. Scaling Data Pipelines with Apache Spark on Kubernetes on Google Cloud Rajesh Thallam Machine Learning Specialist Google Sougata Biswas Data Analytics Specialist Google May 2021
  • 2. Outline Spark on Kubernetes on Google Cloud Why Spark on Kubernetes? 1 2 4 Use Cases / Implementation Patterns Things to Know 3 5 Wrap up
  • 3. Why Spark on Kubernetes?
  • 4. Utilize existing Kubernetes infrastructure to run data engineering or ML workload along with other applications without maintaining separate big data infrastructure Containerization of spark applications gives ability to run the spark application on-prem and on cloud Packaging job dependencies in containers provides a great way to isolate workloads. Allowing teams to scale independently Scaling containers are much faster than VMs (Virtual Machines) Why Spark on Kubernetes? Unique benefits orchestrating Spark Jobs on Kubernetes compared to other cluster managers - YARN and Mesos Optimize Costs Portability Isolation Faster Scaling
  • 5. Proprietary + Confidential Comparing Cluster Managers Apache Hadoop YARN vs Kubernetes for Apache Spark Apache Hadoop YARN ● First cluster manager since inception of Apache Spark ● Battle tested ● General purpose scheduler for big data applications ● Runs on cluster of VMs or physical machines (e.g. on-prem Hadoop clusters) ● Option to run: spark-submit to YARN Kubernetes (k8s) ● Resource manager starting Spark 2.3 as experimental and GA with Spark 3.1.1 ● Not in feature parity with YARN ● General purpose scheduler for any containerized apps ● Runs as a container on k8s cluster. Faster scaling in and out. ● Option to run: spark-submit, spark k8s operator
  • 6. Spark on Kubernetes on Google Cloud
  • 7. Secure Enterprise security Encryption Access control Cost Effective Only pay for what you use Managed Jobs Spark on GKE Workflow Templates Airflow Operators Managed Clusters 90s cluster spin-up Autoscaling Autozone placement Cloud Dataproc Combining the best of open source and cloud and simplifying Hadoop & Spark workloads on Cloud Built-in support for Hadoop & Spark Managed hardware and configuration Simplified version management Flexible job configuration Features of Dataproc
  • 8. ● Manage applications, not machines ○ Manages container clusters ○ Inspired and informed by Google’s experiences ○ Supports multiple cloud and bare-metal environments ○ Supports multiple container runtimes ● Features similar to an OS for a host ○ Scheduling workload ○ Finding the right host to fit your workload ○ Monitoring health of the workload ○ Scaling it up and down as needed ○ Moving it around as needed Kubernetes OS for your compute fleet
  • 9. Google Kubernetes Engine (GKE) Secured and fully managed Kubernetes service GKE, Kubernetes-as-a-service Control Plane Nodes kubectl gcloud ● Turn-key solution to Kubernetes ○ Provision a cluster in minutes ○ Industry-leading automation ○ Scales to an industry-leading 15k worker nodes ○ Reliable and available ○ Deep GCP integration ● Generally Available since August, 2015 ○ 99.5% or 99.95% SLA on Kubernetes APIs ○ $0.10 per cluster/hour + infrastructure cost ○ Supports GCE sole-tenant nodes and reservations
  • 10. Dataproc on GKE BETA Run Spark jobs on GKE clusters with Dataproc Jobs API ● Simple way of executing Spark jobs on GKE clusters ● Single API to run Spark job on Dataproc as well as GKE ● Extensible with custom Docker image for Spark job ● Enterprise security control out-of-box ● Ease of logging and monitoring with cloud Logging and Monitoring Create Cluster Dataproc GKE Submit Job Allocate resources Run Spark Job
  • 11. Dataproc on GKE - How It Works
Submit Spark jobs to a running GKE cluster from the Dataproc Jobs API
[Diagram: a spark-submit call through the Dataproc API reaches the Dataproc agent running on a GKE node; the Kubernetes master (API server, scheduler) handles job scheduling & monitoring and places the driver Pod and executor Pods across nodes 1..n of the GKE cluster]
● The Dataproc agent runs as a container inside GKE, communicating with the GKE scheduler through the spark-kubernetes operator
● Users submit jobs with the Dataproc Jobs API, while job execution happens inside the GKE cluster
● The Spark driver and executors run on separate Pods inside dedicated namespaces within the GKE cluster
● Driver and executor logs are sent to the Google Cloud Logging service
  • 12. How is Dataproc on GKE different from the alternatives?
Comparing against spark-submit and the Spark Operator for Kubernetes
[Diagram: Dataproc receives Create Cluster and Submit Job calls, allocates resources on GKE, and runs the Spark job]
● Easy to get started with the familiar Dataproc API
● Easy to set up and manage - no need to install the Spark Kubernetes operator or set up monitoring and logging separately
● Built-in security features with the Dataproc API - access control, auditing, encryption and more
● Inherent benefits of managed services - Dataproc and GKE
  • 13. Demo Spark on GKE using Dataproc Jobs API
  • 14. Step 1: Set Up a GKE Cluster
# set up environment variables
GCE_REGION=us-west2                    # GCP region
GCE_ZONE=us-west2-a                    # GCP zone
GKE_CLUSTER=spark-on-gke               # GKE cluster name
DATAPROC_CLUSTER=dataproc-gke          # Dataproc cluster name
VERSION=1.4.27-beta                    # Dataproc image version
BUCKET=my-project-spark-on-k8s         # GCS bucket

# create GKE cluster with autoscaling enabled
gcloud container clusters create "${GKE_CLUSTER}" \
  --scopes=cloud-platform \
  --workload-metadata=GCE_METADATA \
  --machine-type=n1-standard-4 \
  --zone="${GCE_ZONE}" \
  --enable-autoscaling --min-nodes 1 --max-nodes 10

# add the Kubernetes Engine Admin role to
# service-projectid@dataproc-accounts.iam.gserviceaccount.com
  • 15. Step 2: Create a Dataproc Cluster and Register It with GKE
# create dataproc cluster and register it with GKE under a k8s namespace
gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
  --gke-cluster="${GKE_CLUSTER}" \
  --region="${GCE_REGION}" \
  --zone="${GCE_ZONE}" \
  --image-version="${VERSION}" \
  --bucket="${BUCKET}" \
  --gke-cluster-namespace="spark-on-gke"
  • 16. Step 3: Spark Job Execution
# run a sample PySpark job through the Dataproc Jobs API
# that reads a BigQuery table and generates word counts;
# a sketch of the script follows below
gcloud dataproc jobs submit pyspark bq-word-count.py \
  --cluster=${DATAPROC_CLUSTER} \
  --region=${GCE_REGION} \
  --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
  --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar
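The bq-word-count.py script itself isn't shown in the deck; below is a minimal sketch of what such a job might look like, assuming the public bigquery-public-data:samples.shakespeare table as input (the table, app name, and output are illustrative, not from the talk):

# bq-word-count.py - hypothetical sketch, not the deck's actual script
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bq-word-count").getOrCreate()

# read a BigQuery table through the spark-bigquery connector
# (supplied via --jars at submit time); table name is illustrative
words = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data:samples.shakespeare")
    .load()
)

# aggregate word counts across the whole corpus
counts = (
    words.groupBy("word")
    .agg(F.sum("word_count").alias("total"))
    .orderBy(F.desc("total"))
)

counts.show(20)
spark.stop()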
  • 17. Step 4a: Monitoring - GKE & Cloud Logging
# Spark driver logs from Google Cloud Logging
resource.type="k8s_container"
resource.labels.cluster_name="spark-on-gke"
resource.labels.namespace_name="spark-on-gke"
resource.labels.container_name="spark-kubernetes-driver"

# Spark executor logs from Google Cloud Logging
resource.type="k8s_container"
resource.labels.cluster_name="spark-on-gke"
resource.labels.namespace_name="spark-on-gke"
resource.labels.container_name="executor"
  • 18. Step 4b: Monitoring with the Spark Web UI
# TCP port forwarding to the driver pod to view the Spark UI
# (local port 8080 -> driver port 4040)
gcloud container clusters get-credentials "${GKE_CLUSTER}" \
  --zone "${GCE_ZONE}" --project "${PROJECT_ID}" && \
kubectl port-forward --namespace "${GKE_NAMESPACE}" \
  $(kubectl get pod --namespace "${GKE_NAMESPACE}" \
    --selector="spark-role=driver,sparkoperator.k8s.io/app-name=dataproc-app_name" \
    --output jsonpath='{.items[0].metadata.name}') \
  8080:4040
  • 19. Dataproc with Apache Spark on GKE Things to Know
  • 20. Autoscaling Spark Jobs
Automatically resize node pools of the GKE cluster based on workload demands
● The Dataproc Autoscaler is not supported with Dataproc on GKE
● Instead, enable autoscaling on the GKE cluster's node pool
● Specify a minimum and maximum size for the GKE cluster's node pool, and the rest is automatic
● You can combine the GKE Cluster Autoscaler with Horizontal/Vertical Pod Autoscaling

# create GKE cluster with autoscaling enabled
gcloud container clusters create "${GKE_CLUSTER}" \
  --scopes=cloud-platform \
  --workload-metadata=GCE_METADATA \
  --machine-type n1-standard-2 \
  --zone="${GCE_ZONE}" \
  --num-nodes 2 \
  --enable-autoscaling --min-nodes 1 --max-nodes 10

# create dataproc cluster on GKE
gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
  --gke-cluster="${GKE_CLUSTER}" \
  --region="${GCE_REGION}" \
  --zone="${GCE_ZONE}" \
  --image-version="${VERSION}" \
  --bucket="${BUCKET}"
  • 21. Shuffle in Spark on Kubernetes
Writes shuffle data to scratch space, local volumes, or Persistent Volume Claims
● Shuffle is the data exchange between different stages of a Spark job
● Shuffle is expensive, and its performance depends on disk IOPS and network throughput between the nodes
● Spark supports writing shuffle data to Persistent Volume Claims, local volumes, or scratch space
● Local SSDs are performant compared to Persistent Disks, but they are transient; disk IOPS and throughput improve as disk size increases
● An external shuffle service is not available today (Source)

# create GKE cluster or a node pool with local SSDs
gcloud container clusters create "${GKE_CLUSTER}" ... \
  --local-ssd-count ${NUMBER_OF_DISKS}

# config YAML to use local SSD as scratch space
spec:
  volumes:
    - name: "spark-local-dir-1"
      hostPath:
        path: "/tmp/spark-local-dir"
  executor:
    volumeMounts:
      - name: "spark-local-dir-1"
        mountPath: "/tmp/spark-local-dir"

# spark job conf to override scratch space
spark.local.dir=/tmp/spark-local-dir/
  • 22. Dynamic Resource Allocation *
Dynamically adjust the resources a Spark application occupies based on the workload
● When enabled, Spark dynamically adjusts resources based on workload demand
● An external shuffle service is not available in Spark on Kubernetes (work in progress)
● Instead, soft dynamic resource allocation is available in Spark 3.0, where the driver tracks shuffle files and evicts only executors that are not storing active shuffle files
● Dynamic allocation is a cost optimization technique - a cost vs. latency trade-off
● To improve latency, consider over-provisioning the GKE cluster - fine-tune Horizontal Pod Autoscaling or configure pause Pods
A sketch of these settings applied in code follows below.

# spark job conf to enable (soft) dynamic allocation
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true

* Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
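As a sketch, the same soft dynamic allocation settings can also be applied when building the Spark session in code; the executor bound and idle timeout below are illustrative values, not recommendations from the deck:

# hypothetical sketch: soft dynamic allocation (no external shuffle service on k8s)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("dynamic-allocation-demo")
    # let Spark grow and shrink the executor count with the workload
    .config("spark.dynamicAllocation.enabled", "true")
    # the driver tracks shuffle files and keeps executors holding active shuffle data
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # illustrative bounds: cap scale-out and release idle executors after 60s
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)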
  • 23. Running Spark Jobs on Preemptible VMs (PVMs) on GKE
Reduce the cost of running Spark jobs without sacrificing predictability
● PVMs are excess Compute Engine capacity; they last for a maximum of 24 hours and come with no availability guarantees
● Best suited for running batch or fault-tolerant jobs
● Much cheaper than standard VMs, so running Spark on GKE with PVMs reduces the cost of deployment. But:
○ PVMs can shut down inadvertently, and rescheduling Pods to a new node may add latency
○ Active shuffle files on executors that were shut down will be recomputed, adding latency

# create GKE cluster with preemptible VMs
gcloud container clusters create "${GKE_CLUSTER}" \
  --preemptible

# or create a GKE node pool with preemptible VMs
gcloud container node-pools create "${GKE_NODE_POOL}" \
  --preemptible \
  --cluster "${GKE_CLUSTER}"

# submit Dataproc job to the node pool with preemptible VMs
gcloud dataproc jobs submit pyspark foo.py \
  --cluster="${DATAPROC_CLUSTER}" \
  --region="${GCE_REGION}" \
  --properties="spark.kubernetes.node.selector.cloud.google.com/gke-nodepool=${GKE_NODE_POOL}"
  • 24. Create a Dataproc Cluster on GKE with a Custom Image
Bring your own image or extend the default Dataproc image
● When creating a Dataproc cluster on GKE, the default Dataproc Docker image is used, based on the image version specified
● You can bring your own image, or extend the default image, as the container image for the Spark application
● Create a Dataproc cluster with a custom image when you need to include your own packages or applications

# submit Dataproc job with a custom container image
gcloud dataproc jobs submit pyspark foo.py \
  --cluster="${DATAPROC_CLUSTER}" \
  --region="${GCE_REGION}" \
  --properties=spark.kubernetes.container.image="gcr.io/${PROJECT_ID}/my-spark-image"
  • 25. Integrating with Google Cloud Storage (GCS) and BigQuery (BQ)
Use the Spark BigQuery connector and the Cloud Storage connector for better performance
● The Cloud Storage connector is built into the default Dataproc image
● Add the Spark BigQuery connector as a dependency; it uses the BigQuery Storage API to stream data directly from BQ via gRPC, without using GCS as an intermediary (a write-path sketch follows below)

# submit Dataproc job that uses BigQuery as source/sink
gcloud dataproc jobs submit pyspark bq-word-count.py \
  --cluster=${DATAPROC_CLUSTER} \
  --region=${GCE_REGION} \
  --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
  --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar
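To round out the demo's read path, a hedged sketch of writing results back to BigQuery with the same connector; the dataset, table, and staging bucket names are placeholders, and the connector stages writes through a temporary GCS bucket:

# hypothetical sketch: writing a DataFrame to BigQuery via the spark-bigquery connector
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write-demo").getOrCreate()

df = spark.createDataFrame([("spark", 100), ("kubernetes", 80)], ["word", "total"])

(
    df.write.format("bigquery")
    # placeholder target table: <dataset>.<table> in the current project
    .option("table", "my_dataset.word_counts")
    # the connector stages data in GCS before loading it into BigQuery
    .option("temporaryGcsBucket", "my-project-spark-on-k8s")
    .mode("overwrite")
    .save()
)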
  • 26. Dataproc with Apache Spark on GKE - Things to Know at a Glance
● Autoscaling - automatically resize GKE cluster node pools based on workload demand
● Shuffle - writes to scratch space, local volumes, or Persistent Volume Claims
● Dynamic Allocation - dynamically adjust job resources based on the workload
● Preemptible VMs - reduce the cost of running Spark jobs without sacrificing predictability
● Custom Image - bring your own image or extend the default Dataproc image
● Integration with Google Cloud Services - built-in Cloud Storage connector; add the Spark BigQuery connector
  • 27. Dataproc with Apache Spark on GKE Use Cases / Architectural Patterns
  • 28. Unified Infrastructure
[Diagram: a single Google Kubernetes Engine (GKE) cluster running Dataproc clusters on GKE (Apache Spark 2.4, Apache Spark 3.x*) alongside Airflow, Kubeflow, and other workloads]
● Unify all of your processing - data processing pipelines, machine learning pipelines, web applications, or anything else
● By migrating Spark jobs to a single cluster manager, you can focus on modern cloud management in Kubernetes
● Leads to more efficient use of resources and provides a unified logging and management framework
* Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
  • 29. What is Cloud Composer?
A managed Apache Airflow service to create, schedule, monitor, and manage workflows
● Author end-to-end workflows on GCP via triggers and integrations
● Enterprise security for your workflows through Google-managed credentials
● No need to think about managing the infrastructure after the initial one-click configuration
● Makes troubleshooting simple, with observability through Cloud Logging and Monitoring
[Diagram: Cloud Composer workflow orchestration spanning GCP integrations (BigQuery, Cloud Dataproc, Cloud Dataflow, Cloud Pub/Sub, Cloud AI Platform, Cloud Storage, Cloud Datastore), public cloud integrations (Azure Blob Storage, AWS EMR, AWS S3, AWS EC2, AWS Redshift, Databricks SubmitRunOperator), and on-prem integrations]
  • 30. Orchestrating Apache Spark Jobs from Cloud Composer
[Diagram: Cloud Composer triggers a DAG that submits jobs to Dataproc on GKE for data processing, reading from and writing to Cloud Storage, BigQuery, or any other data sources and targets]
● Trigger a DAG from Composer to submit a job to the Dataproc cluster running on GKE
● Save time by not creating and tearing down ephemeral Dataproc clusters
● One cluster manager to orchestrate and process jobs - better utilization of resources
● Optimized costs, plus better visibility and reliability
A sketch of such a DAG is shown below.
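As an illustration (not from the deck), a minimal Airflow DAG that submits the demo's PySpark job through the Dataproc Jobs API might look like this; it assumes the apache-airflow-providers-google package, and the project, region, cluster, schedule, and GCS paths are placeholders:

# hypothetical sketch: Composer/Airflow DAG submitting a PySpark job to Dataproc on GKE
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": "my-project"},      # placeholder project
    "placement": {"cluster_name": "dataproc-gke"},  # Dataproc cluster registered with GKE
    "pyspark_job": {
        "main_python_file_uri": "gs://my-project-spark-on-k8s/bq-word-count.py",
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar"],
    },
}

with DAG(
    dag_id="spark_on_gke_word_count",
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    DataprocSubmitJobOperator(
        task_id="submit_bq_word_count",
        job=PYSPARK_JOB,
        region="us-west2",
        project_id="my-project",
    )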
  • 31. Machine Learning Lifecycle
[Diagram: Data -> Exploration & Model Prototyping -> Production Training & Evaluation -> Model Scoring & Inference, with ML model code, trained models, and model accuracy information flowing between stages]
● DATA ENGINEER - ingestion, cleaning, storage
● DATA SCIENTIST - explore data, test features and algorithms, build model prototypes; prototype on a SMALL or SAMPLED dataset
● DATA SCIENTIST / ML ENGINEER - apply ML model code on large datasets, test performance and validate, train on a LARGE or FULL dataset
● DATA / ML ENGINEER - operationalize data processing, deploy models to production
  • 32. MLflow
An open source platform to manage the ML lifecycle. Components of MLflow:
● Tracking - record and query experiments: code, data, config, and results
● Projects - package data science code to reproduce runs on any platform
● Models - deploy machine learning models in diverse serving environments
● Registry - store, annotate, discover, and manage models in a central repository
A sketch of the Tracking component in use follows below.
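As a hedged illustration of the Tracking component (not from the deck), logging an experiment run with the MLflow Tracking API; the tracking URI, experiment name, and parameter/metric names are placeholders:

# hypothetical sketch: recording params and metrics with MLflow Tracking
import mlflow

# placeholder: point at your MLflow tracking server (e.g. one running on GKE)
mlflow.set_tracking_uri("http://mlflow.example.internal:5000")
mlflow.set_experiment("spark-on-gke-demo")

with mlflow.start_run():
    mlflow.log_param("regularization", 0.01)       # illustrative hyperparameter
    mlflow.log_param("training_table", "my_dataset.features")
    mlflow.log_metric("rmse", 0.78)                # illustrative evaluation metrics
    mlflow.log_metric("r2", 0.64)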
  • 33. Unifying Machine Learning & Data Pipeline Deployments
[Diagram: Cloud Scheduler triggers Cloud Composer on GKE, which orchestrates Dataproc on GKE for data processing; data flows from API connectors and data imports in Cloud Storage and BigQuery sources to target buckets, Cloud Bigtable, BigQuery, and artifact storage in Cloud Storage; Kubeflow and AI Platform notebooks cover experimentation, training, and ML tracking; security and integrations via Cloud IAM, Key Management Service, and Secret Manager]
  • 34. Dataproc with Apache Spark on GKE Wrapping up
  • 35. Apache Spark on Kubernetes
Why Spark on Kubernetes?
● Do you have apps running on Kubernetes clusters? Are they underutilized?
● Is managing multiple cluster managers - YARN and Kubernetes - a pain?
● Do you have difficulties managing Spark job dependencies and different Spark versions?
● Do you want the same benefits as apps running on Kubernetes - multitenancy, autoscaling, fine-grained access control?
Why Dataproc on GKE?
● Faster scaling with reliability
● Inherent benefits of managed infrastructure
● Enterprise security controls
● Unified logging and monitoring
● Optimized costs due to effective resource sharing
  • 36. Resources
Open Source Documentation
● Running Spark on Kubernetes - Spark documentation
● Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes
● Code example used in the demo
Blog Posts & Solutions
● Make the most out of your Data Lake with Google Cloud
● Cloud Dataproc Spark Jobs on GKE: How to get started
Google Cloud Documentation
● Google Cloud Dataproc
● Google Kubernetes Engine (GKE)
● Google Cloud Composer
● Dataproc on Google Kubernetes Engine
  • 37. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.