Coburn Watson, Director of Performance and Reliability Engineering at Netflix, discusses the differences between Cloud and traditional DC/On-Prem capacity planning models. He also covers some of the distinct methodologies applied at Netflix to improve the rate of innovation and overall reliability while keeping a pulse on efficiency.
4. ● > 83M households
● 190 Countries
● 35% of Internet traffic in US at peak
● Entirely on Cloud*, three regions
● Evacuate a region monthly...for 24 hours
● Capacity planning ~ 5 people! (in the room :-)
* Content served from homegrown OpenConnect CDN
5. Capacity Planning Concerns
● Facility considerations (Space, Power, Network, Cooling)
● Supply Chain Management Constraints and Relationships
● Hardware lifetime contour & failure rates (MTBF)
● Systems management staff
● Seasonal and unexpected burst considerations
● Workload colocation and performance demands
● Over-provisioning for reliability and rate of innovation
● Effective tooling
● Business continuity planning
6. (Cloud) Capacity Planning Concerns
● Facility considerations (Power, Network, Cooling)
● Supply Chain Management Constraints and Relationships
● Hardware lifetime contour & failure rates (MTBF)
● Systems management staff
● Seasonal and unexpected burst considerations
● Workload colocation and performance demands
● Over-provisioning for reliability and rate of innovation
● Effective tooling
● Business continuity planning
9. Netflix Model
● Depend on the AWS on-demand pool for elasticity
● Monitor insufficient capacity exceptions (ICEs) for boundaries
● Invest heavily in 3 year reservations
● Maintain relatively few, large reserved pools
● Cloud Capacity Analytics team develops tools for insight
● Leverage cross-account resource borrowing
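One way to picture "monitor ICEs for boundaries" is to count how often launch attempts in each pool fail for lack of capacity. The sketch below is illustrative only and is not Netflix's tooling: the record layout (`instance_type`, `zone`, `error_code`), the sample data, and the function name `ice_rate_by_pool` are assumptions made for this example. `InsufficientInstanceCapacity` is the EC2 error code behind an ICE.

```python
from collections import Counter

# EC2 error code returned when a pool cannot satisfy a launch request.
ICE_CODE = "InsufficientInstanceCapacity"


def ice_rate_by_pool(attempts):
    """Return {(instance_type, zone): fraction of launch attempts that hit an ICE}.

    `attempts` is a hypothetical list of dicts, one per launch attempt.
    """
    total = Counter()
    ices = Counter()
    for attempt in attempts:
        pool = (attempt["instance_type"], attempt["zone"])
        total[pool] += 1
        if attempt.get("error_code") == ICE_CODE:
            ices[pool] += 1
    return {pool: ices[pool] / total[pool] for pool in total}


# Hypothetical launch-attempt records, for illustration only.
attempts = [
    {"instance_type": "m4.xlarge", "zone": "us-east-1a", "error_code": None},
    {"instance_type": "m4.xlarge", "zone": "us-east-1a", "error_code": ICE_CODE},
    {"instance_type": "r3.2xlarge", "zone": "us-east-1b", "error_code": None},
]
rates = ice_rate_by_pool(attempts)
```

A pool whose ICE rate climbs is one where the on-demand pool is close to its boundary, which is a hint to spread load across other instance types or zones.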
10. The Triad Cloud Impact
[Figure: triad of Innovation, Reliability, and Efficiency, with "Default" and "Preferred" labels]
12. Considerations of Scale
● Capacity required for critical footprint might require “guarantees”
● API-based observability has limits
● All resources have capacity limits/throttles
● Resource limits by default set for lowest common denominator
● Get creative with unused but paid-for capacity
● Billing file size!
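Because every resource has limits and throttles, clients that call cloud APIs at scale usually retry throttled requests with backoff. The sketch below shows capped exponential backoff with full jitter as one common approach; the talk does not specify a method. `ThrottledError`, `call_with_backoff`, and all parameter defaults are hypothetical names and values chosen for this example, not part of any real SDK.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call fn(), retrying on ThrottledError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable so the retry logic can be exercised without real waiting. Jitter keeps many throttled clients from retrying in lockstep.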
14. Coburn Watson
● Director of Performance and Reliability at Netflix
○ Site Reliability Engineering, Performance and OS Engineering, Traffic Management, Chaos Engineering, Capacity Planning, Cloud Network Engineering
● @coburnw, cwatson@netflix.com
● Looking for some great capacity planning-minded folks
● Performance and Reliability YouTube Channel