Herding Kats -
Netflix’s Kubernetes Journey
Andrew Spyker - @aspyker
10/27/2020
Hint: Don’t know him? Search: “Tiger King”
About the Speaker (@aspyker)
● Manager, Compute Platform
○ OS Engineering (OS, Kernel, Images)
○ Container Management Platform (Titus)
○ Image Baking and Registry
○ 20 engineers who design, implement, operate
● Part of Cloud Infrastructure Engineering
○ Compute, Networking, Reliability, Performance, Traffic/Capacity
● Part of Platform Engineering
○ Infrastructure, Productivity, Data Platform, TPM
○ Supports all of Product Engineering
Netflix Container Management History (Titus)
Timeline markers: 2015, 2016, 2018, 2019
Batch first, added services later with focus on developer productivity
- Simpler compute management, deployment packaging, local dev
As organic adoption grew, focused on scale and reliability
- First very large scale users - Flink, Media Encoding
- Netflix Streaming critical services onboarded
Focus on efficiency - largest “cluster” in Netflix fleet
- First to make Titus cheaper, later focus on workloads
Titus and containers become supported first class (as much as VMs)
Design Principles of Titus
● Runs any “Docker” container (IaaS, not PaaS)
○ Deep integration with existing Netflix platform services
● AWS workloads work unchanged
○ Not trying to abstract AWS EC2 away
○ Deep AWS Collaboration, making containers work well on EC2
● Just what Netflix needs
○ Not extendable to all vendors/companies’ needs
What runs on Titus today
The Netflix Streaming Product
Open Connect CDN monitoring/planning
Machine learning and recommendations
Workloads that power our Studios and Creatives
Content Engineering (Encoding, Artwork personalization, etc.)
Big Data platform (Presto, Notebooks, ETL, etc.)
We started with Mesos and then moved on
Mesos in 2015?
● Used at Netflix in other systems
● Already proven at scale for general use (Twitter, Apple, etc.)
● Just enough model, very adaptable to our needs
● Worked very well!
NOT Mesos in 2019?
● Web Scale users moving away
● “Northbound” agent to control plane communication
“Easy” to move as we don’t leverage Mesos features
Goal: remove Mesos …
bringing all existing workloads along seamlessly
Existing Stack (diagram components)
● Titus API (Federation)
● Titus Scheduling & Coordination
● Mesos
● Existing Agents: Mesos agent, Titus container runtime, User Containers
Kubernetes Powered Stack (diagram components)
● Titus Scheduling & Coordination
● API Server, etcd
● Existing Agents: Virtual Kubelet, Titus container runtime, User Containers
Move workloads over time
Rollout timeframe (Q1 2019 through Q3 2020)
● Design
● Implementation and test env rollout
● Production rollout
● Major scale-up
● Multiple production issues
● Reliability Focus
● Mesos turned off
Production Incidents
Foundational change to our production systems during unprecedented time
Already partially deployed by the start of shelter in place
All team members impacted by world events across 2020
A story of learning and eventual success
A story of success
Kinds of failures we saw
Scale
● API Server tuning for timeouts and caching for list and watch
Distributed
● Load balancing for web hooks (timeouts, network security)
● Overloading etcd with custom CRDs
Other “normal” challenges in running new infrastructure
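The list and watch caching mentioned above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not Netflix's or Kubernetes' implementation: `WatchCache`, its method names, and the dictionary-shaped objects are hypothetical, and the integer resource versions are a simplification (real Kubernetes resource versions are opaque strings that clients should not compare numerically). The idea it illustrates is that one full list plus a stream of watch events lets readers be served locally instead of re-listing from the API server.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class WatchCache:
    """Local read cache kept current by one list call plus watch events.

    Hypothetical illustration; integer versions are a simplifying assumption.
    """

    objects: Dict[str, dict] = field(default_factory=dict)
    resource_version: int = 0

    def replace(self, items: Iterable[dict], resource_version: int) -> None:
        # Initial (or recovery) full list: swap in the whole snapshot at once.
        self.objects = {item["name"]: item for item in items}
        self.resource_version = resource_version

    def apply(self, event_type: str, obj: dict, resource_version: int) -> bool:
        # Skip events at or before the version already held (e.g. replays
        # after a watch reconnect). Returns True if the cache changed.
        if resource_version <= self.resource_version:
            return False
        if event_type == "DELETED":
            self.objects.pop(obj["name"], None)
        else:  # "ADDED" or "MODIFIED"
            self.objects[obj["name"]] = obj
        self.resource_version = resource_version
        return True

    def get(self, name: str) -> Optional[dict]:
        # Reads are answered from memory and never reach the API server.
        return self.objects.get(name)


cache = WatchCache()
cache.replace([{"name": "pod-a", "phase": "Pending"}], resource_version=10)
cache.apply("MODIFIED", {"name": "pod-a", "phase": "Running"}, 11)
cache.apply("ADDED", {"name": "pod-b", "phase": "Pending"}, 12)
cache.apply("DELETED", {"name": "pod-b"}, 13)
```

The trade-off in a design like this is staleness: readers see the state as of the last applied event, so a caller that needs a fresh view still has to go to the server.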
Time to Recovery >> Root Cause
Recovery time was much longer than our previous system
Why?
● Didn’t know the new system as well as old system
● It takes real incidents to learn all areas of a new system
● Recency bias in diagnosis and remediation actions
Actions taken
● Implement recovery escape valves, use first, then diagnose
● Game days including breaking our systems
Understanding and Observability
From Mesos architecture - “two monoliths”
⮑ Kubernetes architecture - “distributed system”
Likely rushed the productization of this change
Realized we under-invested in
● Cross component dashboarding
● Centralized log analysis
● Distributed tracing
Our current production architecture (diagram components)
● Batch/Workflow Systems, Service CI/CD
● Titus Control Plane: API, Scheduling, Job Lifecycle Control
● Fenzo
● Cassandra
● Kubernetes API Server and etcd
● EC2 Autoscaling
● Titus Hosts: Virtual Kubelet, Docker, Titus System Services, User Containers
● AWS Virtual Machines
● Docker Registry
Current and Future work
Transitioning to K8s native scheduler
Control-plane responsibilities
- Fleet capacity management
- Cluster ingress / load balancing
- Batch job mgmt
- Service cluster mgmt, autoscaling
- Disruption budgeting
Titus Scheduling and Coordination
Current | Soon
Mesos Based | Kubernetes Based
Monolithic | Distributed
Mostly Netflix Specific | Netflix Extensions
Java | Golang
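As an illustration of the disruption budgeting responsibility listed above, here is a minimal sketch of one common form of budget, a minimum-available floor. It is an assumption for illustration only and not necessarily the policy Titus applies; `DisruptionBudget`, `min_available`, `allowed_disruptions`, and `try_evict` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisruptionBudget:
    """Hypothetical budget: keep at least `min_available` tasks healthy."""

    min_available: int


def allowed_disruptions(budget: DisruptionBudget, healthy: int) -> int:
    # How many healthy tasks can be taken down right now without
    # dropping below the floor. Never negative.
    return max(0, healthy - budget.min_available)


def try_evict(budget: DisruptionBudget, healthy: int, requested: int) -> int:
    # Grant as many of the requested evictions as the budget allows.
    return min(requested, allowed_disruptions(budget, healthy))


budget = DisruptionBudget(min_available=8)
granted = try_evict(budget, healthy=10, requested=5)
```

With 10 healthy tasks and a floor of 8, only 2 of the 5 requested evictions are granted; the rest would wait until replacements are healthy.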
Transitioning to “Real” Kubelet
Benefits
● Full user controlled workload composition (pods)
● Networking and storage partners integration
● Off-the-shelf open source add-ons
Challenges
● User namespaces aren’t supported (security)
● Pod container start/stop ordering not guaranteed
● Changes in user experience (ssh, metrics, etc.)
Exposing the K8s API?
Most Netflix engineers do not need to understand K8s
Platforms and partners
● Abstract the infrastructure layer and use Titus API
● Some can better integrate at the K8s level
Current considerations
● Could lock our users into this API and tools
● Without it, tools (kubectl) are hard for infrastructure partners to use
● Need for reliable | scalable | operational | agile cluster aggregation
Machine Learning for Infrastructure
Resource Isolation (NUMA, etc.) for latency sensitive workloads
Easier for users + better efficiency
● Actual resource usage and overcommit
● Capacity of services, batch runtime length
● Compute, networking, memory, storage needs
Easier to operate
● Fleet wide capacity management
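To make the "actual resource usage and overcommit" item concrete, the sketch below compares what a workload requested with what it was observed to use. It swaps in a plain nearest-rank percentile for the machine learning the slide refers to, purely for illustration; the function names, the 95th-percentile default, and the sample data are all assumptions and do not describe Netflix's models.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in (0, 100])."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def overcommit_ratio(requested: float, usage: Sequence[float], pct: float = 95.0) -> float:
    """Requested amount divided by the observed usage at the given percentile.

    A ratio well above 1 suggests the request could be packed more tightly.
    """
    observed = percentile(usage, pct)
    if observed <= 0:
        raise ValueError("observed usage must be positive")
    return requested / observed


# Hypothetical CPU usage samples (in cores) for a task that requested 4 cores.
cpu_usage = [1.0, 1.2, 0.8, 2.0, 1.1, 0.9, 1.0, 1.3, 1.0, 1.1]
ratio = overcommit_ratio(4.0, cpu_usage)
```

A fixed percentile ignores seasonality and workload changes, which is one reason a learned model could do better than this heuristic.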
Q & A Time ...
Where is Carole
Baskin’s Husband?

More Related Content

What's hot

Kubernetes #2 monitoring
Kubernetes #2   monitoring Kubernetes #2   monitoring
Kubernetes #2 monitoring Terry Cho
 
Kubernetes a comprehensive overview
Kubernetes   a comprehensive overviewKubernetes   a comprehensive overview
Kubernetes a comprehensive overviewGabriel Carro
 
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...HostedbyConfluent
 
CD using ArgoCD(KnolX).pdf
CD using ArgoCD(KnolX).pdfCD using ArgoCD(KnolX).pdf
CD using ArgoCD(KnolX).pdfKnoldus Inc.
 
Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Ryan Jarvinen
 
Terraform GitOps on Codefresh
Terraform GitOps on CodefreshTerraform GitOps on Codefresh
Terraform GitOps on CodefreshCodefresh
 
Kubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the Ugly
Kubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the UglyKubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the Ugly
Kubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the Uglysmalltown
 
Improve monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsImprove monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsNilesh Gule
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at SpotifyNeville Li
 
The what, why and how of knative
The what, why and how of knativeThe what, why and how of knative
The what, why and how of knativeMofizur Rahman
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsWeaveworks
 
Service Mesh on Kubernetes with Istio
Service Mesh on Kubernetes with IstioService Mesh on Kubernetes with Istio
Service Mesh on Kubernetes with IstioMichelle Holley
 
Monitoring at the Speed of DevOps
Monitoring at the Speed of DevOpsMonitoring at the Speed of DevOps
Monitoring at the Speed of DevOpsDevOps.com
 
Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.Richard Langlois P. Eng.
 
Kubernetes Architecture
 Kubernetes Architecture Kubernetes Architecture
Kubernetes ArchitectureKnoldus Inc.
 
Knative, Serverless on Kubernetes, and Openshift
Knative, Serverless on Kubernetes, and OpenshiftKnative, Serverless on Kubernetes, and Openshift
Knative, Serverless on Kubernetes, and OpenshiftChris Suszyński
 
Deploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOpsDeploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOpsOpsta
 

What's hot (20)

Kubernetes #2 monitoring
Kubernetes #2   monitoring Kubernetes #2   monitoring
Kubernetes #2 monitoring
 
Kubernetes a comprehensive overview
Kubernetes   a comprehensive overviewKubernetes   a comprehensive overview
Kubernetes a comprehensive overview
 
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
 
CD using ArgoCD(KnolX).pdf
CD using ArgoCD(KnolX).pdfCD using ArgoCD(KnolX).pdf
CD using ArgoCD(KnolX).pdf
 
Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17
 
Terraform GitOps on Codefresh
Terraform GitOps on CodefreshTerraform GitOps on Codefresh
Terraform GitOps on Codefresh
 
Kubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the Ugly
Kubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the UglyKubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the Ugly
Kubernetes Summit 2021: Multi-Cluster - The Good, the Bad and the Ugly
 
Meetup 23 - 03 - Application Delivery on K8S with GitOps
Meetup 23 - 03 - Application Delivery on K8S with GitOpsMeetup 23 - 03 - Application Delivery on K8S with GitOps
Meetup 23 - 03 - Application Delivery on K8S with GitOps
 
Improve monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsImprove monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss tools
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at Spotify
 
Kubernetes PPT.pptx
Kubernetes PPT.pptxKubernetes PPT.pptx
Kubernetes PPT.pptx
 
The what, why and how of knative
The what, why and how of knativeThe what, why and how of knative
The what, why and how of knative
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 
Service Mesh on Kubernetes with Istio
Service Mesh on Kubernetes with IstioService Mesh on Kubernetes with Istio
Service Mesh on Kubernetes with Istio
 
Monitoring at the Speed of DevOps
Monitoring at the Speed of DevOpsMonitoring at the Speed of DevOps
Monitoring at the Speed of DevOps
 
Knative Intro
Knative IntroKnative Intro
Knative Intro
 
Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.Microservice Architecture Patterns, by Richard Langlois P. Eng.
Microservice Architecture Patterns, by Richard Langlois P. Eng.
 
Kubernetes Architecture
 Kubernetes Architecture Kubernetes Architecture
Kubernetes Architecture
 
Knative, Serverless on Kubernetes, and Openshift
Knative, Serverless on Kubernetes, and OpenshiftKnative, Serverless on Kubernetes, and Openshift
Knative, Serverless on Kubernetes, and Openshift
 
Deploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOpsDeploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOps
 

Similar to Herding Kats - Netflix’s Journey to Kubernetes Public

Pivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry MeetupPivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry Meetupcornelia davis
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsWeaveworks
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsSonja Schweigert
 
Google Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZoneGoogle Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZoneIdan Tohami
 
OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...NETWAYS
 
Episode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-ServiceEpisode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-ServiceMesosphere Inc.
 
Open shift and docker - october,2014
Open shift and docker - october,2014Open shift and docker - october,2014
Open shift and docker - october,2014Hojoong Kim
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps DevopsSreenivas Makam
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thingaspyker
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsAll Things Open
 
Pivotal Container Service Overview
Pivotal Container Service Overview Pivotal Container Service Overview
Pivotal Container Service Overview VMware Tanzu
 
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...Cisco DevNet
 
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...tdc-globalcode
 
Continuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteContinuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteWeaveworks
 
Migrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKS
Migrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKSMigrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKS
Migrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKSWeaveworks
 
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...Weaveworks
 
Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeTerry Wang
 
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...Weaveworks
 
Integration in the Cloud
Integration in the CloudIntegration in the Cloud
Integration in the CloudRob Davies
 

Similar to Herding Kats - Netflix’s Journey to Kubernetes Public (20)

Pivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry MeetupPivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry Meetup
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
 
Google Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZoneGoogle Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZone
 
OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...
 
Episode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-ServiceEpisode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-Service
 
The rise of microservices
The rise of microservicesThe rise of microservices
The rise of microservices
 
Open shift and docker - october,2014
Open shift and docker - october,2014Open shift and docker - october,2014
Open shift and docker - october,2014
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps Devops
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
 
Pivotal Container Service Overview
Pivotal Container Service Overview Pivotal Container Service Overview
Pivotal Container Service Overview
 
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
 
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
 
Continuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteContinuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event Keynote
 
Migrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKS
Migrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKSMigrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKS
Migrating from Self-Managed Kubernetes on EC2 to a GitOps Enabled EKS
 
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
 
Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud Native
 
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
 
Integration in the Cloud
Integration in the CloudIntegration in the Cloud
Integration in the Cloud
 

More from aspyker

Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientistsaspyker
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2aspyker
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemonsaspyker
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayentaaspyker
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containersaspyker
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talkaspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1aspyker
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17aspyker
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4aspyker
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integrationaspyker
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Programaspyker
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflixaspyker
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3aspyker
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016aspyker
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1aspyker
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talkaspyker
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015aspyker
 

More from aspyker (20)

Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientists
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talk
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talk
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Herding Kats - Netflix’s Journey to Kubernetes Public

  • 1. Herding Kats - Netflix’s Kubernetes Journey Andrew Spyker - @aspyker 10/27/2020
  • 2. Hint: Don’t know him? Search: “Tiger King”
  • 3. About the Speaker (@aspyker) ● Manager, Compute Platform ○ OS Engineering (OS, Kernel, Images) ○ Container Management Platform (Titus) ○ Image Baking and Registry ○ 20 engineers who design, implement, operate ● Part of Cloud Infrastructure Engineering ○ Compute, Networking, Reliability, Performance, Traffic/Capacity ● Part of Platform Engineering ○ Infrastructure, Productivity, Data Platform, TPM ○ Supports all of Product Engineering Cloud Infrastructure Engineering
  • 4. Netflix Container Management History (Titus), timeline markers 2015, 2016, 2018, 2019: Batch first, added services later with focus on developer productivity (simpler compute management, deployment packaging, local dev). As organic adoption grew, focused on scale and reliability (first very large scale users: Flink, Media Encoding; Netflix Streaming critical services onboarded). Focus on efficiency, largest “cluster” in Netflix fleet (first to make Titus cheaper, later focus on workloads). Titus and containers become supported first class (as much as VMs).
  • 5. Design Principles of Titus ● Runs any “Docker” container (IaaS, not PaaS) ○ Deep integration with existing Netflix platform services ● AWS workloads work unchanged ○ Not trying to abstract AWS EC2 away ○ Deep AWS Collaboration, making containers work well on EC2 ● Just what Netflix needs ○ Not extendable to all vendors/companies’ needs
  • 6. What runs on Titus today The Netflix Streaming Product Open Connect CDN monitoring/planning Machine learning and recommendations Workloads that power our Studios and Creatives Content Engineering (Encoding, Artwork personalization, etc.) Big Data platform (Presto, Notebooks, ETL, etc.)
  • 7. We started with Mesos and then moved on Mesos in 2015? ● Used at Netflix in other systems ● Already proven at scale for general use (Twitter, Apple, etc.) ● Just enough model, very adaptable to our needs ● Worked very well! NOT Mesos in 2019? ● Web Scale users moving away ● “Northbound” agent to control plane communication “Easy” to move as we don’t leverage Mesos features
  • 8. Goal: remove Mesos … bringing all existing workloads along seamlessly
  • 9. [Diagram] Titus API (Federation). Existing Stack: Titus Scheduling & Coordination; Mesos; existing agents (Mesos agent, Titus container runtime, user containers). Kubernetes Powered Stack: Titus Scheduling & Coordination; API Server; etcd; existing agents (Virtual Kubelet, Titus container runtime, user containers). Move workloads over time.
  • 10. Rollout timeframe (Q1 2019 through Q3 2020), phases in order: Design; Implementation and test env rollout; Production rollout; Major scale-up; Multiple production issues; Reliability Focus; Mesos turned off
  • 12. Foundational change to our production systems during unprecedented time Already partially deployed by the start of shelter in place All team members impacted by world events across 2020 A story of learning and eventual success A story of success
  • 13. Kinds of failures we saw Scale ● API Server tuning for timeouts and caching for list and watch Distributed ● Load balancing for web hooks (timeouts, network security) ● Overloading etcd with custom CRDs Other “normal” challenges in running new infrastructure
  • 14. Time to Recovery >> Root Cause Recovery time was much longer than our previous system Why? ● Didn’t know the new system as well as old system ● It takes real incidents to learn all areas of a new system ● Recency bias in diagnosis and remediation actions Actions taken ● Implement recovery escape valves, use first, then diagnose ● Game days including breaking our systems
  • 15. From Mesos architecture - “two monoliths” ⮑ Kubernetes architecture - “distributed system” Likely rushed the productization of this change Realized we under-invested in ● Cross component dashboarding ● Centralized log analysis ● Distributed tracing Understanding and Observability
  • 16. Our current production architecture [diagram components]: Batch/Workflow Systems; Service CI/CD; Titus Control Plane (API, Scheduling, Job Lifecycle Control); Fenzo; Cassandra; EC2 Autoscaling; Kubernetes API Server and etcd; Titus Hosts on AWS Virtual Machines (Virtual Kubelet, Docker, Titus System Services, User Containers); Docker Registry
  • 18. Transitioning to K8s native scheduler. Titus Scheduling and Coordination, Current → Soon: Mesos Based → Kubernetes Based; Monolithic → Distributed; Mostly Netflix Specific → Netflix Extensions; Java → Golang. Control-plane responsibilities: Fleet capacity management; Cluster ingress / load balancing; Batch job mgmt; Service cluster mgmt, autoscaling; Disruption budgeting
  • 19. Transitioning to “Real” Kubelet Benefits ● Full user controlled workload composition (pods) ● Networking and storage partners integration ● Off-the-shelf open source add-ons Challenges ● User namespaces aren’t supported (security) ● Pod container start/stop ordering not guaranteed ● Changes in user experience (ssh, metrics, etc.)
  • 20. Exposing the K8s API? Most Netflix engineers do not need to understand K8s Platforms and partners ● Abstract the infrastructure layer and use Titus API ● Some can better integrate at the K8s level Current considerations ● Could lock our users into this API and tools ● Without it, makes tools (kubectl) hard to use by infrastructure partners ● Need for reliable | scalable | operational | agile cluster aggregation
  • 21. Machine Learning for Infrastructure Resource Isolation (NUMA, etc.) for latency sensitive workloads Easier for users + better efficiency ● Actual resource usage and overcommit ● Capacity of services, batch runtime length ● Compute, networking, memory, storage needs Easier to operate ● Fleet wide capacity management
  • 22. Q & A Time ... Where is Carole Baskin’s Husband?
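
Slide 13 mentions API Server tuning through "caching for list and watch". The slides do not show how Titus does this, so the following is only a minimal sketch of the general pattern: seed a local cache from one full list, then keep it current by applying watch events, so readers query the cache instead of repeatedly listing from the API server. The class `WatchCache`, its method names, and the `name` and `phase` fields are hypothetical. Real Kubernetes resource versions are opaque strings; the integer comparison here is a simplifying assumption used only to show stale-event handling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WatchCache:
    """Hypothetical local read cache kept current by list + watch."""

    objects: Dict[str, dict] = field(default_factory=dict)
    resource_version: int = 0  # assumption: modeled as an int for illustration

    def replace(self, items: List[dict], resource_version: int) -> None:
        """Seed (or reseed) the cache from one full LIST response."""
        self.objects = {item["name"]: item for item in items}
        self.resource_version = resource_version

    def apply(self, event_type: str, obj: dict, resource_version: int) -> bool:
        """Apply one WATCH event; return False if the event is stale."""
        if resource_version <= self.resource_version:
            return False  # already reflected in the cache
        if event_type in ("ADDED", "MODIFIED"):
            self.objects[obj["name"]] = obj
        elif event_type == "DELETED":
            self.objects.pop(obj["name"], None)
        else:
            raise ValueError(f"unknown event type: {event_type}")
        self.resource_version = resource_version
        return True

    def snapshot(self) -> List[dict]:
        """Serve reads from memory, sorted by name for stable output."""
        return sorted(self.objects.values(), key=lambda o: o["name"])


if __name__ == "__main__":
    cache = WatchCache()
    cache.replace([{"name": "a", "phase": "Pending"}], resource_version=10)
    cache.apply("MODIFIED", {"name": "a", "phase": "Running"}, 11)
    cache.apply("ADDED", {"name": "b", "phase": "Pending"}, 12)
    print(cache.snapshot())
```

The design point is that only the initial `replace` costs a full list; every later read is served locally, and a dropped watch can resume from the last `resource_version` rather than starting over.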