An update on Netflix Compute's container management platform, Titus, covering the move from Mesos to Kubernetes: lessons learned, next steps, and challenges.
3. About the Speaker (@aspyker)
● Manager, Compute Platform
○ OS Engineering (OS, Kernel, Images)
○ Container Management Platform (Titus)
○ Image Baking and Registry
○ 20 engineers who design, implement, operate
● Part of Cloud Infrastructure Engineering
○ Compute, Networking, Reliability, Performance, Traffic/Capacity
● Part of Platform Engineering
○ Infrastructure, Productivity, Data Platform, TPM
○ Supports all of Product Engineering
4. Netflix Container Management History (Titus)
2015: Batch first, services added later, with a focus on developer productivity
- Simpler compute management, deployment packaging, local dev
2016: As organic adoption grew, focused on scale and reliability
- First very large scale users: Flink, Media Encoding
- Netflix Streaming critical services onboarded
2018: Focus on efficiency - largest “cluster” in the Netflix fleet
- First made Titus itself cheaper, later focused on workloads
2019: Titus and containers become first-class supported (as much as VMs)
5. Design Principles of Titus
● Runs any “Docker” container (IaaS, not PaaS)
○ Deep integration with existing Netflix platform services
● AWS workloads work unchanged
○ Not trying to abstract AWS EC2 away
○ Deep AWS Collaboration, making containers work well on EC2
● Just what Netflix needs
○ Not extendable to all vendors/companies’ needs
6. What runs on Titus today
The Netflix Streaming Product
Open Connect CDN monitoring/planning
Machine learning and recommendations
Workloads that power our Studios and Creatives
Content Engineering (Encoding, Artwork personalization, etc.)
Big Data platform (Presto, Notebooks, ETL, etc.)
7. We started with Mesos and then moved on
Why Mesos in 2015?
● Used at Netflix in other systems
● Already proven at scale for general use (Twitter, Apple, etc.)
● Just enough model, very adaptable to our needs
● Worked very well!
Why NOT Mesos in 2019?
● Web Scale users moving away
● “Northbound” agent-to-control-plane communication
“Easy” to move as we don’t leverage Mesos features
9. Existing Stack vs. Kubernetes Powered Stack
[Diagram] Existing stack: Titus API (Federation) → Titus Scheduling & Coordination → Mesos → existing agents, each running the Mesos agent and the Titus container runtime hosting user containers.
[Diagram] Kubernetes powered stack: Titus Scheduling & Coordination → Kubernetes API Server and etcd → agents, each running Virtual Kubelet and the Titus container runtime hosting user containers (a sketch of this Virtual Kubelet glue follows below).
Move workloads over time from the Mesos stack to the Kubernetes stack.
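To make the right-hand stack concrete: Virtual Kubelet lets the existing Titus container runtime answer to the Kubernetes API server without running a full kubelet. Below is a minimal sketch of that glue, assuming the open-source virtual-kubelet PodLifecycleHandler contract; the titusRuntime interface and its methods are hypothetical stand-ins for the real Titus executor, not Netflix's actual code.

```go
package provider

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// titusRuntime is a hypothetical handle to the local Titus container runtime.
type titusRuntime interface {
	Launch(ctx context.Context, pod *corev1.Pod) error
	Kill(ctx context.Context, namespace, name string) error
	Status(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)
}

// titusProvider adapts the Titus runtime to part of the PodLifecycleHandler
// contract that the open-source virtual-kubelet node controller expects
// (CreatePod / DeletePod / GetPodStatus shown here; the full interface has more).
type titusProvider struct {
	rt   titusRuntime
	pods map[string]*corev1.Pod // keyed by namespace/name
}

func (p *titusProvider) CreatePod(ctx context.Context, pod *corev1.Pod) error {
	// Translate the pod spec into a Titus container launch on this host.
	if err := p.rt.Launch(ctx, pod); err != nil {
		return err
	}
	p.pods[pod.Namespace+"/"+pod.Name] = pod
	return nil
}

func (p *titusProvider) DeletePod(ctx context.Context, pod *corev1.Pod) error {
	// Tear the container down, then forget the pod locally.
	if err := p.rt.Kill(ctx, pod.Namespace, pod.Name); err != nil {
		return err
	}
	delete(p.pods, pod.Namespace+"/"+pod.Name)
	return nil
}

func (p *titusProvider) GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error) {
	// Report container state back so the control plane can track job lifecycle.
	return p.rt.Status(ctx, namespace, name)
}
```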
10. Rollout timeframe (Q1 2019 through Q3 2020)
[Timeline] Design → implementation and test env rollout → production rollout → major scale-up → multiple production issues → reliability focus → Mesos turned off
12. A foundational change to our production systems during an unprecedented time
Already partially deployed by the start of shelter-in-place
All team members impacted by world events across 2020
A story of learning and eventual success
13. Kinds of failures we saw
Scale
● API Server tuning for timeouts, and caching for list and watch (see the informer sketch after this list)
Distributed
● Load balancing for web hooks (timeouts, network security)
● Overloading etcd with custom CRDs
Other “normal” challenges in running new infrastructure
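On the list-and-watch point: a common mitigation (not necessarily exactly what Titus shipped) is to have controllers read from a shared informer cache instead of issuing their own LIST calls against the API server on every pass. A minimal client-go sketch, with illustrative names:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared List+Watch per resource type feeds an in-memory cache,
	// instead of every controller issuing its own expensive LIST calls.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("observed pod", pod.Namespace, pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Block until the initial LIST has been served into the local cache.
	if !cache.WaitForCacheSync(stop, podInformer.HasSynced) {
		panic("cache never synced")
	}
	select {} // controllers would now read from the informer's lister
}
```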
14. Time to Recovery >> Root Cause
Recovery time was much longer than with our previous system
Why?
● Didn’t know the new system as well as the old system
● It takes real incidents to learn all areas of a new system
● Recency bias in diagnosis and remediation actions
Actions taken
● Implement recovery escape valves: use them first, then diagnose
● Game days, including deliberately breaking our systems
15. Understanding and Observability
From the Mesos architecture (“two monoliths”) to the Kubernetes architecture (a “distributed system”)
Likely rushed the productization of this change
Realized we under-invested in
● Cross-component dashboarding
● Centralized log analysis
● Distributed tracing
16. Our current production architecture
[Diagram] Titus Control Plane (API, Scheduling, Job Lifecycle Control) with Fenzo, backed by Cassandra, driving EC2 Autoscaling; a Kubernetes API Server and etcd; Titus hosts on AWS virtual machines, each running Virtual Kubelet, Docker, and the user containers, pulling images from the Docker Registry. Callers include Titus system services, batch/workflow systems, and service CI/CD.
18. Control-plane responsibilities
- Fleet capacity management
- Cluster ingress / load balancing
- Batch job management
- Service cluster management, autoscaling
- Disruption budgeting (see the sketch after this list)
Transitioning to the K8s native scheduler
Titus Scheduling and Coordination: Current → Soon
● Mesos based → Kubernetes based
● Monolithic → Distributed
● Mostly Netflix specific → Netflix extensions
● Java → Golang
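On disruption budgeting: Titus job disruption budgets are their own concept, but for readers mapping this onto stock Kubernetes, the nearest built-in primitive is a PodDisruptionBudget. A minimal client-go sketch; the namespace, label, and threshold are made up, and this is not the Titus implementation.

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical example: keep at least 90% of the "stream-api" pods
	// available during voluntary disruptions (node drains, upgrades, etc.).
	minAvail := intstr.FromString("90%")
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "stream-api-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvail,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "stream-api"},
			},
		},
	}

	if _, err := client.PolicyV1().PodDisruptionBudgets("default").Create(
		context.TODO(), pdb, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```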
19. Transitioning to “Real” Kubelet
Benefits
● Full user-controlled workload composition (pods) - see the pod sketch after this list
● Networking and storage partner integrations
● Off-the-shelf open source add-ons
Challenges
● User namespaces aren’t supported (security)
● Pod container start/stop ordering is not guaranteed
● Changes in user experience (ssh, metrics, etc.)
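To illustrate the composition benefit and the ordering challenge together: regular containers in a pod start concurrently, so init containers are the usual way to force "run this before the app." A hypothetical pod spec (names and images invented), emitted as YAML:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "stream-worker"},
		Spec: corev1.PodSpec{
			// Init containers are the one ordering guarantee Kubernetes gives:
			// they run to completion, in order, before the main containers start.
			InitContainers: []corev1.Container{
				{Name: "warm-cache", Image: "example/warm-cache:latest"},
			},
			// Regular containers start concurrently; no start/stop ordering
			// between the app and its sidecar is guaranteed.
			Containers: []corev1.Container{
				{Name: "app", Image: "example/app:latest"},
				{Name: "metrics-sidecar", Image: "example/metrics-agent:latest"},
			},
		},
	}

	out, _ := yaml.Marshal(pod)
	fmt.Println(string(out))
}
```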
20. Exposing the K8s API?
Most Netflix engineers do not need to understand K8s
Platforms and partners
● Abstract the infrastructure layer and use Titus API
● Some can better integrate at the K8s level
Current considerations
● Exposing it could lock our users into this API and its tools
● Without it, tools like kubectl are hard for infrastructure partners to use
● Need for reliable, scalable, operational, agile cluster aggregation
21. Machine Learning for Infrastructure
Resource Isolation (NUMA, etc.) for latency sensitive workloads
Easier for users + better efficiency
● Actual resource usage and overcommit
● Capacity of services, batch runtime length
● Compute, networking, memory, storage needs
Easier to operate
● Fleet wide capacity management
22. Q & A Time ...
Where is Carole Baskin’s Husband?