Re-architecting
Data Platform with Kubernetes
SeungYong Oh, Devsisters Corp.
오 승 용, 데브시스터즈
● Mobile Game Publisher: including Cookie Run Series
● 10+ Million App Downloads, 2M+ Monthly Active Users
"Creating the Best Player Experiences through Excellent Content, Service, and Technology!"
"탁월한 기술, 서비스, 콘텐츠로 전 세계 고객에게 최고의 경험을 선사합니다."
Who am I?
SeungYong Oh (오 승 용)
● DevOps & Data Engineer at Devsisters
○ Also worked as a game server developer for Cookie Run series
● Kubernetes user since 2016
○ Adopted Kubernetes for the development/testing environment of Cookie Run: OvenBreak
■ NDC 2017: Kubernetes로 개발서버 간단히 찍어내기 (Easily Stamping Out Dev Servers with Kubernetes)
https://www.slideshare.net/seungyongoh3/ndc17-kubernetes
Kubernetes @ Devsisters
● 2014~2015: Proof-of-Concept level evaluation
● 2016: Dev/testing environment game infra for Cookie Run: OvenBreak
● 2017~: Dev/testing environment game infra platform
● 2019~: Production-level game infra platform
○ Cookie Run for Kakao, Hello Brave Cookies
● 2019~: Data Platform on Kubernetes
○ Still ongoing
Data Platform @ Devsisters
● Central logging infra with various data sources
Using those logs,
● Search
○ Search server/client logs for debugging or operational purposes
○ Search a specific user's action logs for customer service purposes
● Applications
○ A/B Testing, Machine Learning, ...
● And of course, Analytics!
Data Platform @ Devsisters
● Analytics: Analytics for Everyone!
○ Everyone?
■ Not limited to data scientists and analysts
■ CEO, Project Manager, Marketer, Game Designer, ...
○ KPI (Key Performance Indicator) Service: Active Users, Revenue, Retention, ...
○ Ad-hoc query/analytics environment
■ Fully programmable environment for Data Scientists / Engineers
■ SQL query / clickable interface for many users
Data Platform @ Devsisters
Analytics Infra / Service (architecture diagram)
● Log sources
○ Server Logs: k8s Pod / Docker / Host syslog
○ Game Client / Web / 3rd Party Logs (via custom SDK/API)
● Log collection & pre-processing: Apache Kafka / Kafka Streams
● (Near) Real-time Log Search & Analytics: Elastic Stack (ELK)
● Batch: Amazon S3 Data Lake, AWS Glue, Amazon Athena, Amazon QuickSight
$SOME_PROJECT_NAME supports K8S!
Many Big data & analytics related projects support Kubernetes in 2019!
● Hadoop, Spark, Airflow, Kubeflow, JupyterHub, Hue, Zeppelin, Superset, Kafka, Elastic Stack, etc.
Support?
● Case 1: (Official/Popular) Helm Chart
● Case 2: K8S Operator for the project
● Case 3: Cloud Native / Kubernetes-specific Integrations
But, does it mean we should move to K8S?
Many projects are stateful apps, especially in Big data / analytics
● Considerable risks / maintenance costs
○ Node Upgrade? Retirement? Reallocation is NOT easy
● Benefits?
○ Bin-packing? 🤔 (especially in public cloud)
○ Autoscaling? 🤔
○ Scheduling? 🤔
■ Advanced schedulers/resource managers already exist (e.g. Hadoop YARN)
Actually, there are some (great!) benefits!
Benefit #1: Deployment / Ops Made Easy
Example: Apache Airflow
Apache Airflow: Workflow Management tool
Source: https://github.com/GoogleCloudPlatform/airflow-operator/blob/master/docs/design.md
Quick Deployment with
● helm install stable/airflow
or
● Airflow operator
https://github.com/GoogleCloudPlatform/airflow-operator/
Benefit #2: Analytics - Easy and Efficient
Apache Spark @ Devsisters
A unified analytics engine for large-scale data processing
Use-cases of Spark @ Devsisters:
● Case #1: Store Logs (Batch)
● Case #2: Transform (Batch)
● Case #3: Ad-hoc Analytics (Data Scientist, Engineer, Marketer, PM, ...)
● Storage: Amazon S3 Data Lake: Raw; Amazon S3 Data Lake: Analytics (Snapshot, DW, DM, ...)
Apache Spark @ Devsisters
Self-service web interface for Spark cluster provisioning
Launches a Spark cluster (standalone mode)
● 1 master node with Jupyter Notebook
● multiple worker nodes
https://www.slideshare.net/JPark0426/ndc-2018-spark-flintrock-airflow
Apache Spark @ Devsisters
So, anyone can have their own Spark cluster with one click! Looks good?
● Many users (e.g. a Marketing Manager) only want to focus on data analysis!
○ r5.xlarge?
○ Spot? Bidding price?
○ Waiting a few minutes for Spark provisioning, with occasional failures
○ Must back up their files before terminating the cluster
● Not cost-effective
○ Idle resources
○ Users forget to terminate the cluster after use
Then share a centralized cluster? NO! (access control issues, etc.)
Spark on Kubernetes
Apache Spark - Native K8S Scheduler
(experimental)
https://spark.apache.org/docs/latest/running-on-kubernetes.html
K8S Operator for Apache Spark
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Jupyter + Spark on K8S
# Run the following inside the Jupyter pod
PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --allow-root' \
./bin/pyspark \
  --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/ \
  --name spark-$APP_NAME \
  --conf spark.executor.instances=$EXECUTOR_COUNT \
  --conf spark.kubernetes.container.image=<<custom built spark container>> \
  --conf spark.kubernetes.namespace=spark-test \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-test-server \
  --conf spark.driver.host=spark-driver \
  --conf spark.driver.port=5555 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.blockManager.port=5556 \
  --conf spark.kubernetes.executor.annotation.iam.amazonaws.com/role=<<AWS IAM role (KIAM)>>
● But users need a much easier interface!
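One way to hide the long command line behind an easier interface is to assemble it from a handful of parameters. The sketch below is illustrative only: the function name `build_pyspark_command` and its parameters are hypothetical, and the defaults simply mirror the values shown on the slide. It only builds the argument list; launching it is left to the caller.

```python
import shlex


def build_pyspark_command(app_name, executor_count, image, namespace,
                          service_account, api_host, api_port="443",
                          driver_host="spark-driver", driver_port=5555,
                          block_manager_port=5556, iam_role=None):
    """Return the pyspark argument list for client mode on Kubernetes.

    Hypothetical helper: the Spark property names are the ones used on the
    slide; everything else (function name, defaults) is an assumption.
    """
    conf = {
        "spark.executor.instances": str(executor_count),
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.driver.host": driver_host,
        "spark.driver.port": str(driver_port),
        "spark.driver.bindAddress": "0.0.0.0",
        "spark.driver.blockManager.port": str(block_manager_port),
    }
    if iam_role:
        # KIAM reads this pod annotation to pick the AWS IAM role.
        conf["spark.kubernetes.executor.annotation.iam.amazonaws.com/role"] = iam_role

    args = ["./bin/pyspark",
            "--master", f"k8s://https://{api_host}:{api_port}/",
            "--name", f"spark-{app_name}"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Example values are placeholders, not real cluster settings.
    command = build_pyspark_command(
        app_name="demo", executor_count=2, image="example/spark:latest",
        namespace="spark-test", service_account="spark-test-server",
        api_host="kubernetes.default.svc")
    print(shlex.join(command))
```

A notebook launcher could pass the returned list to `subprocess.run` together with the `PYSPARK_DRIVER_PYTHON` environment variables from the slide.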
JupyterHub + Spark on K8S
● Multi-user version of Jupyter Notebook, with native Kubernetes support
● Launches a notebook pod per user
○ Data is persisted (stored in a PersistentVolume)
● Let's integrate Spark on K8S with JupyterHub!
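As a rough sketch of what "a notebook pod per user with persisted data" could look like in JupyterHub configuration, the helper below collects KubeSpawner settings in a plain dictionary. The trait names (`image`, `service_account`, `storage_pvc_ensure`, and so on) are KubeSpawner options as far as I know them and should be checked against the installed version; the function name, the `SPARK_NAMESPACE` variable, and all values are assumptions, not the configuration used in the talk.

```python
def kubespawner_settings(namespace, image, service_account, home_size="10Gi"):
    """Return KubeSpawner trait values for one notebook pod per user.

    Hypothetical helper: in jupyterhub_config.py (where JupyterHub provides
    the config object `c`) the result could be applied with
        for name, value in kubespawner_settings(...).items():
            setattr(c.KubeSpawner, name, value)
    """
    pvc_template = "claim-{username}"  # expanded per user by KubeSpawner
    return {
        "namespace": namespace,
        "image": image,
        # Service account the notebook pod uses when it creates executor pods.
        "service_account": service_account,
        # Create a PersistentVolumeClaim per user so notebooks survive restarts.
        "storage_pvc_ensure": True,
        "storage_capacity": home_size,
        "pvc_name_template": pvc_template,
        "volumes": [{"name": "home",
                     "persistentVolumeClaim": {"claimName": pvc_template}}],
        "volume_mounts": [{"name": "home", "mountPath": "/home/jovyan"}],
        # Assumed variable that notebook code could read to pick the namespace.
        "environment": {"SPARK_NAMESPACE": namespace},
    }


if __name__ == "__main__":
    for name, value in kubespawner_settings(
            "spark-test", "example/notebook:latest", "spark-test-server").items():
        print(name, "=", value)
```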
JupyterHub + Spark on K8S
● Demo
But many users only need a SQL interface
● Spark has a distributed SQL engine
https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
○ Thrift JDBC/ODBC server (corresponds to HiveServer2)
● Integration with Apache Superset (Business Intelligence web app)
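Because the Thrift server speaks the HiveServer2 protocol, Superset can usually reach it through a `hive://` SQLAlchemy URI (this relies on the PyHive driver being installed in Superset, which is an assumption here). A minimal sketch of building such a URI follows; the function name and the service host name are hypothetical, and port 10000 is the HiveServer2 default.

```python
from urllib.parse import quote


def spark_thrift_sqlalchemy_uri(host, port=10000, database="default", username=None):
    """Build a hive:// SQLAlchemy URI pointing at a Spark Thrift server."""
    user_part = f"{quote(username, safe='')}@" if username else ""
    return f"hive://{user_part}{host}:{port}/{database}"


if __name__ == "__main__":
    # "spark-thrift-server" is a placeholder Kubernetes Service name.
    print(spark_thrift_sqlalchemy_uri("spark-thrift-server"))
```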
For applications?
● Example #1:
○ Send push messages to users who played the game for more than an hour yesterday
■ JDBC!
● Example #2:
○ Give special items to users who are likely to stop playing the game
■ spark-submit?
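For Example #1, the application only needs one SQL statement sent over JDBC. The sketch below builds such a statement; the table `play_sessions` and the columns `user_id`, `play_date`, and `play_seconds` are invented for illustration and are not from the talk.

```python
from datetime import date, timedelta


def push_target_query(today, min_play_seconds=3600):
    """Return SQL selecting users who played longer than the threshold yesterday.

    Table and column names are hypothetical placeholders.
    """
    yesterday = today - timedelta(days=1)
    return (
        "SELECT user_id\n"
        "FROM play_sessions\n"
        f"WHERE play_date = DATE '{yesterday.isoformat()}'\n"
        "GROUP BY user_id\n"
        f"HAVING SUM(play_seconds) > {int(min_play_seconds)}"
    )


if __name__ == "__main__":
    print(push_target_query(date.today()))
```

The resulting string could be executed through any JDBC/ODBC or HiveServer2 client connected to the Spark Thrift server; Example #2 involves model scoring, which is why the slide suggests a submitted Spark job instead.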
Benefit #3: Fine-tuning Access Controls
Still Stressful but Easier
● Web Apps: Use Auth Proxy Sidecar
● For apps that don't support RBAC:
○ Pod-level data access control (S3 in Devsisters' case)
■ In AWS, with KIAM or AWS IAM Authenticator
○ Or Node-level (with nodeSelector)
○ Pods/Deployments per group/role, plus the benefit of bin-packing
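The pod-level and node-level controls above can be pictured as fields on a pod manifest. The sketch below builds one as a plain dictionary: the `iam.amazonaws.com/role` annotation is the one KIAM reads (as on the earlier slide), while the function name, the `node-pool` label key, and the pod naming scheme are assumptions made for illustration.

```python
import json


def pod_manifest_for_group(group, image, iam_role, node_pool=None):
    """Return a Pod manifest whose data access is scoped to one user group."""
    spec = {"containers": [{"name": "app", "image": image}]}
    if node_pool:
        # Node-level isolation: schedule only onto nodes carrying this label.
        spec["nodeSelector"] = {"node-pool": node_pool}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"analytics-{group}",
            "labels": {"group": group},
            # Pod-level S3 access: KIAM hands this role's credentials to the pod.
            "annotations": {"iam.amazonaws.com/role": iam_role},
        },
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = pod_manifest_for_group(
        "marketing", "example/superset:latest", "marketing-s3-read")
    print(json.dumps(manifest, indent=2))
```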
Then, Let's move to Kubernetes Now..?!
Still in Early Stage
● Spark on Kubernetes
○ Doesn't support dynamic allocation / external shuffle service yet
■ You can't change the number of executors while the application is running
○ Pod templating: available in the 3.0 preview
■ Currently you can't even specify the service account of executor pods
● Many other projects are also in alpha stage,
requiring code-level modifications / forking
to meet a team's requirements
● If storage and compute resources are coupled, moving may be somewhat harder
But it's worth moving!
● A Data Engineer with no prior knowledge of Kubernetes
could implement Spark on K8s/JupyterHub/Superset customizations
in less than 1.5 months (estimated at more than 4 months in a non-Kubernetes environment)
● Kubernetes' abstraction allows users to focus on what they can do best

Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes

  • 1. Re-architecting Data Platform with Kubernetes SeungYong Oh, Devsisters Corp. 오 승 용, 데브시스터즈
  • 2.
    ● Mobile game publisher, including the Cookie Run series
    ● 10+ million app downloads, 2M+ monthly active users
    "Creating the Best Player Experiences through Excellent Content, Service, and Technology!"
    "We deliver the best experience to customers around the world through outstanding technology, service, and content."
  • 3. Who am I?
    SeungYong Oh (오 승 용)
    ● DevOps & Data Engineer at Devsisters
      ○ Also worked as a game server developer for the Cookie Run series
    ● Kubernetes user since 2016
      ○ Adopted Kubernetes for the development/testing environment of Cookie Run: OvenBreak
        ■ NDC 2017: "Kubernetes로 개발서버 간단히 찍어내기" ("Easily stamping out dev servers with Kubernetes") https://www.slideshare.net/seungyongoh3/ndc17-kubernetes
  • 4. Kubernetes @ Devsisters
    ● 2014~2015: Proof-of-concept level evaluation
    ● 2016: Dev/testing environment game infra for Cookie Run: OvenBreak
    ● 2017~: Dev/testing environment game infra platform
    ● 2019~: Production-level game infra platform
      ○ Cookie Run for Kakao, Hello Brave Cookies
    ● 2019~: Data Platform on Kubernetes
      ○ Still ongoing
  • 5. Data Platform @ Devsisters
    ● Central logging infra with various data sources
    Using those logs:
    ● Search
      ○ Search server/client logs for debugging or operational purposes
      ○ Search a specific user's action logs for customer service purposes
    ● Applications
      ○ A/B testing, machine learning, ..
    ● And of course, analytics!
  • 6. Data Platform @ Devsisters
    ● Analytics: Analytics for Everyone!
      ○ Everyone?
        ■ Not limited to data scientists and analysts
        ■ CEO, Project Manager, Marketer, Game Designer, ...
      ○ KPI (Key Performance Indicator) service: active users, revenue, retention, ..
      ○ Ad-hoc query/analytics environment
        ■ Fully programmable environment for data scientists / engineers
        ■ SQL query / clickable interface for many users
  • 7. Data Platform @ Devsisters (architecture diagram; labels only)
    ● Sources: Server logs (k8s Pod / Docker / Host syslog); Game client / Web / 3rd-party logs (via custom SDK/API)
    ● Log collection & pre-processing: Apache Kafka / Kafka Streams
    ● (Near) real-time log search & analytics: Elastic Stack (ELK)
    ● Batch: Amazon S3 Data Lake
    ● Analytics infra / service: AWS Glue, Amazon Athena, Amazon QuickSight
  • 8. $SOME_PROJECT_NAME supports K8S!
    Many big data & analytics projects support Kubernetes in 2019!
    ● Hadoop, Spark, Airflow, Kubeflow, JupyterHub, Hue, Zeppelin, Superset, Kafka, Elastic Stack, etc.
    Support?
    ● Case 1: (Official/popular) Helm chart
    ● Case 2: K8S operator for the project
    ● Case 3: Cloud-native / Kubernetes-specific integrations
  • 9. But does that mean we should move to K8S?
    Many projects are stateful apps, especially in big data / analytics
    ● Considerable risks / maintenance costs
      ○ Node upgrade? Retirement? Reallocation is NOT easy
    ● Benefits?
      ○ Bin-packing? 🤔 (especially in public cloud)
      ○ Autoscaling? 🤔
      ○ Scheduling? 🤔
        ■ Advanced schedulers / resource managers already exist (e.g. Hadoop YARN)
  • 10. Actually, there are some (great!) benefits!
  • 11. Benefit #1: Deployment / Ops Made Easy
  • 12. Example: Apache Airflow
    Apache Airflow: workflow management tool
    Source: https://github.com/GoogleCloudPlatform/airflow-operator/blob/master/docs/design.md
    Quick deployment with
    ● helm install stable/airflow, or
    ● Airflow operator https://github.com/GoogleCloudPlatform/airflow-operator/
  • 13. Benefit #2: Analytics - Easy and Efficient
  • 14. Apache Spark @ Devsisters
    A unified analytics engine for large-scale data processing
    Use cases of Spark @ Devsisters:
    ● Case #1: Store logs (batch)
    ● Case #2: Transform (batch)
    ● Case #3: Ad-hoc analytics (Data Scientist, Engineer, Marketer, PM, ..)
    Diagram labels: Amazon S3 Data Lake: Raw; Amazon S3 Data Lake: Analytics (Snapshot, DW, DM, ..)
  • 15. Apache Spark @ Devsisters
    Self-service web interface for Spark cluster provisioning
    Launches a Spark cluster (standalone mode)
    ● 1 master node with Jupyter Notebook
    ● Multiple worker nodes
    https://www.slideshare.net/JPark0426/ndc-2018-spark-flintrock-airflow
  • 16. Apache Spark @ Devsisters
    So, anyone can have their own Spark cluster with one click! Looks good?
    ● Many users (e.g. a marketing manager) only want to focus on data analysis!
      ○ r5.xlarge?
      ○ Spot? Bidding price?
      ○ Waiting a few minutes for Spark provisioning, with some failures
      ○ Having to back up their files before terminating the cluster
    ● Cost-ineffective
      ○ Idle resources
      ○ Forgetting to terminate the cluster after use
    Then share a centralized cluster? NO! (access control issues, etc.)
  • 17. Spark on Kubernetes
    ● Apache Spark: native K8S scheduler (experimental) https://spark.apache.org/docs/latest/running-on-kubernetes.html
    ● K8S Operator for Apache Spark https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  • 18. Jupyter + Spark on K8S
    # Execute below in Jupyter Pod
    PYSPARK_DRIVER_PYTHON=jupyter \
    PYSPARK_DRIVER_PYTHON_OPTS='notebook --allow-root' \
    ./bin/pyspark \
      --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/ \
      --name spark-$APP_NAME \
      --conf spark.executor.instances=$EXECUTOR_COUNT \
      --conf spark.kubernetes.container.image=<<custom built spark container>> \
      --conf spark.kubernetes.namespace=spark-test \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-test-server \
      --conf spark.driver.host=spark-driver \
      --conf spark.driver.port=5555 \
      --conf spark.driver.bindAddress=0.0.0.0 \
      --conf spark.driver.blockManager.port=5556 \
      --conf spark.kubernetes.executor.annotation.iam.amazonaws.com/role=<<AWS IAM role (KIAM)>>
    ● But users need a much easier interface!
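One way to hide the command on slide 18 from end users is to assemble it from a handful of parameters. The following is only a minimal sketch: `build_pyspark_args`, its defaults, and the example image and role names are hypothetical, while the `--conf` keys and fixed port values are the ones shown on the slide.

```python
import os
import shlex

# Driver environment from the slide: run the PySpark driver inside Jupyter.
DRIVER_ENV = {
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook --allow-root",
}


def build_pyspark_args(app_name, executor_count, image, iam_role,
                       namespace="spark-test",
                       service_account="spark-test-server"):
    """Return the ./bin/pyspark argument list from slide 18.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Inside a pod these variables are normally set by Kubernetes;
    # the fallbacks below are assumptions for running the sketch elsewhere.
    host = os.environ.get("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
    port = os.environ.get("KUBERNETES_PORT_443_TCP_PORT", "443")
    conf = {
        "spark.executor.instances": str(executor_count),
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.driver.host": "spark-driver",
        "spark.driver.port": "5555",
        "spark.driver.bindAddress": "0.0.0.0",
        "spark.driver.blockManager.port": "5556",
        "spark.kubernetes.executor.annotation.iam.amazonaws.com/role": iam_role,
    }
    args = ["./bin/pyspark",
            "--master", f"k8s://https://{host}:{port}/",
            "--name", f"spark-{app_name}"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    argv = build_pyspark_args("demo", 2, "example/spark:custom", "example-role")
    # Print the shell-quoted command instead of launching it.
    print(shlex.join(argv))
```

To actually launch the notebook, the list could be passed to `subprocess.run(argv, env={**os.environ, **DRIVER_ENV})` from within the Jupyter pod.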
  • 19. JupyterHub + Spark on K8S
    ● Multi-user version of Jupyter Notebook; supports k8s natively
    ● Launches a notebook pod per user
      ○ Data is persisted (stored in a PersistentVolume)
    ● Let's integrate Spark on K8S with JupyterHub!
  • 20. JupyterHub + Spark on K8S
    ● Demo
  • 23. But many users only need a SQL interface
    ● Spark has a distributed SQL engine https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
      ○ Thrift JDBC/ODBC server (corresponds to HiveServer2)
    ● Integration with Apache Superset (business intelligence web app)
  • 25. For applications?
    ● Example #1:
      ○ Send push messages to users who played the game for more than an hour yesterday
        ■ JDBC!
    ● Example #2:
      ○ Give special items to users who are likely to stop playing the game
        ■ spark-submit?
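Example #1 comes down to a single aggregate query that an application issues over JDBC. Here is a minimal sketch of such a query, assuming a hypothetical `play_sessions` table with `user_id`, `play_date`, and `play_seconds` columns. Python's built-in `sqlite3` stands in for the Spark Thrift JDBC/ODBC endpoint purely so the sketch runs on its own, and the sample rows are made up.

```python
import sqlite3

# Users whose total play time on the given day exceeds one hour.
QUERY = """
SELECT user_id
FROM play_sessions
WHERE play_date = ?
GROUP BY user_id
HAVING SUM(play_seconds) > 3600
ORDER BY user_id
"""


def users_to_notify(conn, play_date):
    """Return the user ids that should receive a push message."""
    return [row[0] for row in conn.execute(QUERY, (play_date,))]


def demo_connection():
    """In-memory stand-in for the real SQL endpoint, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE play_sessions "
        "(user_id TEXT, play_date TEXT, play_seconds INTEGER)"
    )
    conn.executemany(
        "INSERT INTO play_sessions VALUES (?, ?, ?)",
        [
            ("alice", "2019-12-08", 2500),
            ("alice", "2019-12-08", 1500),  # two sessions, 4000 s in total
            ("bob", "2019-12-08", 3600),    # exactly one hour, not more
            ("carol", "2019-12-07", 9000),  # long session, but another day
        ],
    )
    return conn


if __name__ == "__main__":
    print(users_to_notify(demo_connection(), "2019-12-08"))
```

Against the Spark SQL endpoint, the placeholder syntax and connection setup would follow whichever JDBC/ODBC client library the application uses; the `?` style here is specific to `sqlite3`.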
  • 26. Benefit #3: Fine-tuning Access Controls
  • 27. Still Stressful but Easier
    ● Web apps: use an auth proxy sidecar
    ● For apps that don't support RBAC:
      ○ Pod-level data access control (S3 in Devsisters' case)
        ■ In AWS, with KIAM or AWS IAM Authenticator
      ○ Or node-level (with nodeSelector)
      ○ Pods/Deployments per group/role + benefit of bin-packing
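The pod-level and node-level options on slide 27 can be pictured as one manifest per group. The sketch below is illustrative only: `pod_manifest`, the `analytics-` name prefix, the `pool` node label, and the example role and image names are all hypothetical. The `iam.amazonaws.com/role` annotation key is the one that appears in the Spark command on slide 18.

```python
import json


def pod_manifest(group, image, iam_role, node_pool=None):
    """Build a Pod manifest (as a dict) for one user group.

    Hypothetical helper. `group` is assumed to be a lowercase,
    DNS-compatible name because it is used in the pod name.
    """
    spec = {"containers": [{"name": "app", "image": image}]}
    if node_pool is not None:
        # Node-level isolation: pin the pod to nodes carrying this label.
        # The "pool" label key is an assumption, not a Kubernetes default.
        spec["nodeSelector"] = {"pool": node_pool}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"analytics-{group}",
            "labels": {"group": group},
            # Pod-level data access: KIAM reads this annotation to decide
            # which AWS IAM role the pod may assume.
            "annotations": {"iam.amazonaws.com/role": iam_role},
        },
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = pod_manifest(
        "marketing", "example/superset:custom",
        "example-marketing-s3-read", node_pool="marketing",
    )
    print(json.dumps(manifest, indent=2))
```

Because each group gets its own pods rather than its own cluster, the scheduler can still pack pods from different groups onto shared nodes unless a `nodeSelector` is set.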
  • 28. Then, let's move to Kubernetes now..?!
  • 29. Still in Early Stage
    ● Spark on Kubernetes
      ○ Doesn't support dynamic allocation / external shuffle service yet
        ■ You can't change the number of executors
      ○ Pod templating: available in the 3.0 preview
        ■ For now, you can't even specify the service account of executor pods
    ● Many other projects are also in alpha stage and require code-level modifications / forking to meet the team's requirements
    ● If storage and compute resources are coupled, the move may be somewhat harder
  • 30. But it's worth the move!
    ● A data engineer with no prior knowledge of Kubernetes was able to implement the Spark on K8s / JupyterHub / Superset customizations in less than 1.5 months (more than 4 months expected in a non-k8s environment)
    ● Kubernetes' abstractions allow users to focus on what they do best