Delivered at re:Invent 2015.
Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever-increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, operations engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth and how they integrate and create competitive advantages.
22. Availability vs. Rate of Change
Quality vs. Velocity
[Chart: Availability (nines, 0 to 6) plotted against Rate of Change (1, 10, 100, 1000)]
Availability and corresponding downtime per year:
99.9999%: 31.5 seconds
99.999%: 5.26 minutes
99.99%: 52.56 minutes
99.9%: 8.76 hours
99%: 3.65 days
90%: 36.5 days
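The downtime figures on this slide follow from simple arithmetic: N nines of availability leaves a fraction 10^-N of the year unavailable (so 1% of a 365-day year is 3.65 days). A minimal sketch, assuming a 365-day year; the function names are illustrative and not from the talk:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, no leap days


def availability_percent(nines: int) -> float:
    """Availability as a percentage, e.g. 3 nines is roughly 99.9."""
    return 100.0 * (1.0 - 10.0 ** -nines)


def downtime_seconds_per_year(nines: int) -> float:
    """Yearly downtime budget implied by the given number of nines."""
    return SECONDS_PER_YEAR * 10.0 ** -nines


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits (days, hours, minutes)."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for nines in range(6, 0, -1):
        budget = humanize(downtime_seconds_per_year(nines))
        print(f"{nines} nines: {budget} of downtime per year")
```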
23. Availability vs. Rate of Change
The Zero Sum Game
[Chart: Availability (nines, 0 to 6) plotted against Rate of Change (1, 10, 100, 1000); availability levels and yearly downtime as on slide 22]
25. Availability vs. Rate of Change
Shifting the Curve
[Chart: Availability (nines, 0 to 6; 90% to 99.9999%) plotted against Rate of Change (1, 10, 100, 1000)]
26. Operational Excellence is the continuous improvement
of the management, design, and function of operational
environments to achieve greater quality, velocity, and
competitive advantage.
30. Operations Engineering is the application of software
engineering practices and principles to achieve and sustain
operational excellence.
• automation
• modular components
• tools & services
• best practices
33. Our Artisanal Past
Data Center
● Delayed provisioning
● Hand-crafted servers
● Variations and complexity
Delivery
● Late night, manual deployments
● Repeated mistakes
● Painful delays to production fixes
44. Anomaly Detection
• Double exponential smoothing (DES) on time series data
• Predict the future based on history
• Favor recent history
• Threshold-based alerts
• 6-8 minute delay
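The bullets on this slide can be sketched in a few lines, reading DES as double exponential smoothing (Holt's linear method). This is an illustrative sketch, not the production implementation: the smoothing factors `alpha` and `beta`, the absolute `threshold`, and both function names are assumptions. Larger `alpha` and `beta` weight recent history more heavily.

```python
from typing import List, Sequence


def des_forecasts(series: Sequence[float], alpha: float = 0.5,
                  beta: float = 0.5) -> List[float]:
    """One-step-ahead forecasts; element i predicts series[i + 2]."""
    if len(series) < 2:
        raise ValueError("need at least two points to seed level and trend")
    # Seed the level and trend from the first two observations.
    level = series[1]
    trend = series[1] - series[0]
    forecasts = []
    for observed in series[2:]:
        predicted = level + trend  # prediction made before seeing `observed`
        forecasts.append(predicted)
        # Blend the new observation with the prediction (favoring recent data).
        new_level = alpha * observed + (1 - alpha) * predicted
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def find_anomalies(series: Sequence[float], threshold: float,
                   alpha: float = 0.5, beta: float = 0.5) -> List[int]:
    """Indices where the observation deviates from its forecast by more
    than `threshold` (a simple threshold-based alert)."""
    forecasts = des_forecasts(series, alpha, beta)
    return [
        index + 2
        for index, (observed, predicted) in enumerate(zip(series[2:], forecasts))
        if abs(observed - predicted) > threshold
    ]
```

Note that this sketch keeps updating the level and trend with anomalous points, so one large spike can also trigger alerts on the points that follow it.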
62. Chaos Engineering is the discipline of experimenting on
a distributed system in order to build confidence in the
system's capability to withstand turbulent conditions in
production.
63. Imagine a monkey loose in your data center…
[Diagram: Edge Cluster, Cluster A, Cluster B, Cluster C, Cluster D]
64. A State of Xen – Chaos Monkey & Cassandra
Xen Hypervisor vulnerability – 9/25/14
218 out of 2700+ Cassandra nodes rebooted
22 did not reboot successfully
Automation handled the rest
65. Fault-Injection Testing (FIT)
[Diagram components: Device, Internet, ELB, Zuul, Edge, FIT, Service A, Service B, Service C]
• Simulate service failures
• Override by device or account
• % of member traffic
82. Session | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Availability: The New Kind of Innovator’s Dilemma | Coburn Watson | Wed @ 4:15pm | Marcello 4501B
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B