Dianne Marsh, Director of Engineering at Netflix, discusses Netflix's DevOps practices for managing their large and growing global ecosystem. Key aspects include building a blameless culture where developers are responsible for operations, extensive automation using tools like Spinnaker and Atlas, and chaos engineering practices like Chaos Monkey to test system reliability. Netflix also leverages machine learning for tasks like anomaly detection and automated canary analysis to improve operations.
7. Approaching Global Reach
October - Spain, Portugal, Italy
Early 2016 - Korea, Taiwan, Singapore, Hong Kong
65m members → 100m
~60 countries → 200
8. Netflix ecosystem
• 100s of microservices
• 1000s of daily production changes
• 10,000s of instances
• 100,000s of customer interactions/minute
• 1,000,000s of customers
• 1,000,000,000s of metrics
• 10,000,000,000 hours of streamed
36. Anomaly Detection
• DES (double exponential smoothing) on time series data
• Predict the future based on history
• Favor recent history
• Threshold-based alerts
• 6-8 minute delay
[Chart annotation: Alert!]
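The bullets above can be illustrated with a minimal sketch of double exponential smoothing (Holt's linear method) driving a threshold-based alert. This is not Netflix's implementation: the function names, the smoothing factors `alpha` and `beta`, and the fixed `tolerance` are assumptions chosen for illustration. Larger smoothing factors weight recent history more heavily.

```python
def des_predictions(values, alpha=0.3, beta=0.3):
    """Return one-step-ahead predictions; predictions[i] is the forecast for values[i].

    alpha smooths the level and beta smooths the trend (both hypothetical defaults).
    """
    if len(values) < 2:
        raise ValueError("need at least two data points")
    level = values[0]
    trend = values[1] - values[0]
    predictions = [None]  # no history is available to predict the first point
    for x in values[1:]:
        # Forecast made before x is observed: current level plus current trend.
        predictions.append(level + trend)
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return predictions


def find_anomalies(values, tolerance, alpha=0.3, beta=0.3):
    """Return indices where the observation deviates from its forecast by more than tolerance."""
    predictions = des_predictions(values, alpha, beta)
    return [
        i
        for i, (x, p) in enumerate(zip(values, predictions))
        if p is not None and abs(x - p) > tolerance
    ]


if __name__ == "__main__":
    series = [10, 12, 14, 16, 40, 20]
    print(find_anomalies(series, tolerance=5))
```

Note that a spike also distorts the forecast for the points that follow it, so a single outlier can trigger more than one alert in this simple form.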
47. Canary Release Process
Customers → Load Balancer
• Old Version (v1.0): 100 Servers, 95%
• New Version (v1.1): 5 Servers, 5%
Metrics
48. Canary Release Process
Customers → Load Balancer
• Old Version (v1.0): 0 Servers
• New Version (v1.1): 100 Servers, 100%
Metrics
49. Automated Canary Analysis
Define
• Metrics
• A threshold
Every n minutes
● Classify metrics
● Compute score
● Make a decision
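One way to picture the classify, score, decide loop is the sketch below. The slide does not say how metrics are classified or how the score is computed, so everything here is an assumption: a metric "passes" when the canary mean is within a relative `tolerance` of the baseline mean, the score is the percentage of passing metrics, and the decision compares that score to the defined threshold. All names are hypothetical.

```python
from statistics import mean


def classify_metric(baseline, canary, tolerance=0.1):
    """Label one metric "pass" or "fail" by relative difference of means (assumed rule)."""
    b, c = mean(baseline), mean(canary)
    if b == 0:
        return "pass" if c == 0 else "fail"
    return "pass" if abs(c - b) / abs(b) <= tolerance else "fail"


def canary_score(baseline_metrics, canary_metrics, tolerance=0.1):
    """Return (score, labels), where score is the percentage of metrics that pass."""
    labels = {
        name: classify_metric(baseline_metrics[name], canary_metrics[name], tolerance)
        for name in baseline_metrics
    }
    passed = sum(1 for label in labels.values() if label == "pass")
    return 100.0 * passed / len(labels), labels


def decide(score, threshold=80.0):
    """Continue the rollout only if the score meets the threshold."""
    return "continue" if score >= threshold else "rollback"


if __name__ == "__main__":
    baseline = {"latency_ms": [100, 102, 98], "error_rate": [0.01, 0.01, 0.01]}
    canary = {"latency_ms": [101, 103, 99], "error_rate": [0.05, 0.05, 0.05]}
    score, labels = canary_score(baseline, canary)
    print(score, labels, decide(score))
```

In practice this function would be invoked every n minutes against fresh metric windows for the old and new versions.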
50. Chaos Engineering
The discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.
51. Imagine a monkey loose in your data center…
[Diagram: Edge Cluster, Cluster A, Cluster B, Cluster C, Cluster D]
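The "monkey loose in your data center" idea can be sketched as a routine that picks one random instance per cluster and terminates it. This is only an illustration, not the Chaos Monkey code: the `clusters` mapping, the `terminate` callback, and the function name are hypothetical, and a real deployment would call a cloud provider API instead of a local callback.

```python
import random


def unleash_monkey(clusters, terminate, rng=None):
    """Terminate one randomly chosen instance in every non-empty cluster.

    clusters:  dict mapping cluster name -> list of instance ids (hypothetical shape)
    terminate: callable taking (cluster_name, instance_id); stands in for a real API call
    rng:       optional random.Random, so runs can be made reproducible
    Returns a dict mapping cluster name -> terminated instance id.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in clusters.items():
        if not instances:
            continue  # nothing to terminate in an empty cluster
        victim = rng.choice(instances)
        terminate(name, victim)
        victims[name] = victim
    return victims


if __name__ == "__main__":
    clusters = {
        "Edge Cluster": ["edge-1", "edge-2"],
        "Cluster A": ["a-1", "a-2", "a-3"],
        "Cluster B": [],
    }
    unleash_monkey(clusters, lambda c, i: print(f"terminating {i} in {c}"))
```

If the services are resilient, customers should not notice any of these terminations.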
52. A State of Xen – Chaos Monkey & Cassandra
Xen Hypervisor vulnerability – 9/25/14
218 out of 2700+ Cassandra nodes rebooted
22 did not reboot successfully
Automation recovered those
53. Fault-Injection Testing (FIT)
[Diagram components: Device, Internet, ELB, Edge, Zuul, FIT, Service A, Service B, Service C]
• Simulate service failures
• Override by device or account
• % of member traffic
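The three FIT bullets can be read as a per-request decision: should this call to a given service be made to fail? The sketch below is an assumption-laden illustration, not the FIT API: `FaultRule`, `bucket`, and `should_fail` are hypothetical names, and hashing the account id into 100 buckets is just one stable way to select a percentage of member traffic.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRule:
    """Hypothetical description of one injected failure scenario."""
    service: str                         # service whose calls should fail
    device_ids: frozenset = frozenset()  # devices that always see the failure
    account_ids: frozenset = frozenset() # accounts that always see the failure
    traffic_percent: float = 0.0         # share of remaining member traffic to fail


def bucket(account_id: str) -> int:
    """Map an account id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_fail(rule: FaultRule, service: str, device_id: str, account_id: str) -> bool:
    """Decide whether a request's call to `service` should be simulated as failed."""
    if service != rule.service:
        return False
    # Explicit overrides by device or account take priority.
    if device_id in rule.device_ids or account_id in rule.account_ids:
        return True
    # Otherwise fail a fixed percentage of members, chosen by stable hashing.
    return bucket(account_id) < rule.traffic_percent


if __name__ == "__main__":
    rule = FaultRule(service="Service B", device_ids=frozenset({"tv-123"}), traffic_percent=5)
    print(should_fail(rule, "Service B", "tv-123", "acct-1"))
```

Hashing the account id rather than drawing a random number per request keeps a given member consistently inside or outside the experiment.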
55. More Chaos
Monkey – Single Instance
Gorilla – Availability Zone
Kong – Region