Monitorama opening keynote on the challenges of monitoring in a world of continuous delivery, cloud, and automated control feedback loops.
5. 5 | Battery Ventures
Twenty Years of Free and Open Source Monitoring
● 1994 The “SE Toolkit” and virtual_adrian.se
● 1998 Sun Performance Tuning, Java & The Internet Book
● 1999 Resource Management Sun Blueprint Book
● 2000 Capacity Planning for Web Services Sun Blueprint Book
● 2007 A. A. Michelson Award for Outstanding Contribution to Computer Metrics, by the Computer Measurement Group
● 2004-2008 Capacity Planning with Free Tools Workshop at CMG
● 2014 Monitorama!
State of the Art for Free Tools in 2008
http://www.slideshare.net/adrianco/capacity-planning-with-free-tools
History Lesson
http://sourceforge.net/projects/setoolkit/
SE is a C interpreter with built-in access to all Solaris metric data sources
Topics for Today
Minutes
Monoliths
Milliseconds
Monitoring tools
Challenges for monitoring
Continuous delivery & microservices
Analysis and closed loop control systems
Tools for developers who operate code in production
Challenges of dynamic, ephemeral, distributed cloud applications
Rule #1: Spend more time working on code that analyzes the meaning of metrics than on code that collects, moves, stores and displays metrics.
What’s wrong with minutes?
Takes too long to see a problem
Figure: a metric sampled once per minute against a threshold, Minute 1 to Minute 7. Something broke at 2m20, so the 40s of failure in that minute didn't trigger. The first high metric is then seen at the agent on the instance, arrives at the monitoring system, is processed (maybe), and is finally seen on the graph. With three datapoints on the user graph, it looks bad at 8m00.
Whoops! I didn’t mean that! Reverting…
Not cool if it takes 5 minutes to see it failed and 5 more to see a fix
No one notices if it only takes 5 seconds to detect and 5 to see a fix
Try that again by the second
More confidence more quickly
Figure: the same metric sampled once per second against the threshold, Minute 1 to Minute 7. Something broke at 2m20 and is measurable in 1s. The first high metric is seen at the agent on the instance, arrives at the monitoring system, is processed, and is seen on the graph. With three datapoints on the user graph, it looks bad at 2m25.
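The two timelines above can be reproduced with a little arithmetic. The sketch below is a model built on assumptions that are not stated in the slides. It assumes a sample only reads high once a whole interval falls after the failure. It assumes forwarding to the monitoring system and processing each cost one further interval. It assumes three high datapoints are needed before the graph looks bad. The function names and parameters are hypothetical.

```python
import math

def looks_bad_at(failure_s, interval_s, pipeline_hops=2, datapoints=3):
    """Seconds after time zero at which the failure is visible on a graph.

    Assumed model (not from the slides): one sample per interval, each
    pipeline hop adds one interval, and `datapoints` high samples are needed.
    """
    # The first interval starting at or after the failure is the first one
    # that is entirely bad; its sample is reported when that interval ends.
    first_bad_start = math.ceil(failure_s / interval_s) * interval_s
    first_high_at_agent = first_bad_start + interval_s
    first_high_on_graph = first_high_at_agent + pipeline_hops * interval_s
    return first_high_on_graph + (datapoints - 1) * interval_s

def as_clock(seconds):
    """Format a number of seconds in the 2m20 style used on the slides."""
    return f"{seconds // 60}m{seconds % 60:02d}"

failure = 2 * 60 + 20  # something broke at 2m20
print(as_clock(looks_bad_at(failure, interval_s=60)))  # 8m00
print(as_clock(looks_bad_at(failure, interval_s=1)))   # 2m25
```

Under these assumptions the model lands on the same 8m00 and 2m25 figures as the slides. Every stage of the pipeline multiplies the sample interval, so shortening the interval shrinks the whole delay.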
Continuous Delivery and DevOps Implications
● Changes are smaller but more frequent
● Individual changes more likely to be broken
● Changes likely to be deployed by developers
● Instant detection and rollback matter much more
SaaS Based Products Show What Can Be Done
www.vividcortex.com and www.boundary.com
Seeing Problems In Seconds
NetflixOSS Hystrix / Turbine Circuit Breaker Monitoring
http://techblog.netflix.com/2012/12/hystrix-dashboard-and-turbine.html
Streaming metrics directly from front end services to a web browser
Rule #2: Metric-to-display latency needs to be less than the human attention span (~10s)
What’s Wrong With Milliseconds?
A Millisecond is a Very Long Time!
● Some JVM based tools measure response times in ms
Network round trip within a datacenter/zone is less than 1ms
SSD access latency is usually less than 1ms
Cassandra (a Java app) response times can be less than 1ms
● Rounding Errors
Quantization loses too much information
Automated threshold warning “One is infinitely larger than zero”!
JVM does have nanosecond resolution times available
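The rounding problem is easy to see with a few numbers. The sketch below uses hypothetical latency samples held as integer nanoseconds, and a deliberately naive ratio check standing in for an automated threshold warning. It is written in Python rather than Java, so it only illustrates the arithmetic, not any JVM tool.

```python
import math

# Hypothetical response times in nanoseconds, all close to one millisecond.
latencies_ns = [180_000, 420_000, 950_000, 1_050_000]

# What a millisecond-resolution tool records: everything under 1 ms becomes 0.
as_whole_ms = [ns // 1_000_000 for ns in latencies_ns]

def relative_increase(before, after):
    """Ratio used by a naive 'N times worse than baseline' alert."""
    return math.inf if before == 0 else after / before

print(as_whole_ms)                                          # [0, 0, 0, 1]
print(relative_increase(as_whole_ms[2], as_whole_ms[3]))    # inf
print(relative_increase(latencies_ns[2], latencies_ns[3]))  # about 1.105
```

The last two samples differ by roughly ten percent, but after quantization to whole milliseconds the same change looks like a jump from zero to one.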
Rule #3: Validate that your measurement system has enough accuracy and precision.
Gauge Repeatability and Reproducibility matters, see
http://en.wikipedia.org/wiki/ANOVA_gauge_R%26R
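A first step toward Rule #3 is checking what the clock itself can resolve. The sketch below is not a gauge R&R study, only a precision check, and it uses Python's standard `time` module as a stand-in for whatever timer the measured system actually uses. The helper name `smallest_observed_tick_ns` is hypothetical.

```python
import time

def smallest_observed_tick_ns(samples=1000):
    """Smallest non-zero step seen between consecutive clock reads."""
    smallest = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        now = time.perf_counter_ns()
        # Spin until the clock visibly advances.
        while now == start:
            now = time.perf_counter_ns()
        step = now - start
        if smallest is None or step < smallest:
            smallest = step
    return smallest

info = time.get_clock_info("perf_counter")
print("advertised resolution (s):", info.resolution)
print("smallest observed tick (ns):", smallest_observed_tick_ns())
```

If the smallest observable tick is not comfortably below the response times being measured, the measurements mostly reflect the clock rather than the system.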
Monolithic Monitoring Systems
Simple to build and install, but problematic…
Monolithic: Services Being Monitored report to a single Monolithic Monitoring System
Distributed: Services Being Monitored report to Distributed Collection Systems, which feed Analysis / Display Aggregators
Monolithic Monitoring Issues
● Scalability
Problems scaling data collection, analysis and reporting throughput
Limitations on number of distinct metrics that can be collected
Traffic storms can overload the system and take it down
● Availability
Monitoring system needs to stay up when everything else dies!
Downtime for upgrades is always inconvenient
Gaps in the metric history can trigger alarms and lose confidence
In-Band, Out-of-Band, or Both?
In-band means deployed using same tools and infrastructure as your services
Dependencies lead to common mode failures that can leave you blind
Best option is both in-house in-band, and external SaaS
Diagram: Services watched by two monitoring systems, In-Band Monitoring and SaaS Based Monitoring
Very unlikely to have both fail at the same time
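The "very unlikely" claim can be put into rough numbers. The sketch below assumes the two monitoring systems fail independently, which is exactly what avoiding shared dependencies is meant to achieve. The availability figures are hypothetical.

```python
def blind_probability(in_band_availability, saas_availability):
    """Chance that both monitoring systems are down at once, if independent."""
    return (1 - in_band_availability) * (1 - saas_availability)

# Hypothetical: each system is up 99.9% of the time.
p = blind_probability(0.999, 0.999)
seconds_per_year = 365 * 24 * 3600

print(f"{p:.6f}")                     # 0.000001
print(f"{p * seconds_per_year:.1f}")  # about 31.5 seconds blind per year
```

Common mode failures break the independence assumption. If both systems share a dependency, the chance of being blind moves back toward the unavailability of that one dependency.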
Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored.
Issues with Continuous Delivery and Microservices
● High rate of change
Code pushes can cause floods of new instances and metrics
Short baseline for alert threshold analysis – everything looks unusual
● Ephemeral Configurations
Short lifetimes make it hard to aggregate historical views
Hand tweaked monitoring tools take too much work to keep running
● Microservices with complex calling patterns
End-to-end request flow measurements are very important
Request flow visualizations get overwhelmed
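One common way to keep a historical view when instances are short-lived is to aggregate on tags that outlive any single instance. The sketch below is an illustration under that assumption, not a description of any particular tool. The sample data, tag names and the `aggregate_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: (instance_id, app, zone, requests_per_second)
samples = [
    ("i-0a1", "api", "zone-a", 120.0),
    ("i-0b2", "api", "zone-b", 110.0),
    ("i-0c3", "api", "zone-a", 130.0),  # replacement instance after a push
    ("i-0d4", "web", "zone-a", 80.0),
]

FIELD_INDEX = {"instance_id": 0, "app": 1, "zone": 2}

def aggregate_by(samples, key_fields):
    """Sum a metric over stable tags so history survives instance churn."""
    totals = defaultdict(float)
    for sample in samples:
        key = tuple(sample[FIELD_INDEX[field]] for field in key_fields)
        totals[key] += sample[3]
    return dict(totals)

per_app = aggregate_by(samples, ["app"])
per_app_zone = aggregate_by(samples, ["app", "zone"])
print(per_app)       # {('api',): 360.0, ('web',): 80.0}
print(per_app_zone)
```

A series keyed by application or zone keeps accumulating across code pushes, while a series keyed by instance ID ends each time the instance is replaced.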
Microservice Based Architectures
See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
From a Gilt Groupe Presentation
“Death Star” Architecture Diagrams
As visualized by Appdynamics, Boundary.com and Twitter internal tools
Panels: Netflix, Gilt Groupe (12 of 450), Twitter
Closed Loop Control Systems
Autoscaled Ephemeral Instances at Netflix (the old way)
● Largest services use autoscaled red/black code pushes
● Average lifetime of an instance is 36 hours
Graph annotations: Push, Autoscale Up, Autoscale Down
Scryer - Predictive Auto-scaling at Netflix
See http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
and http://techblog.netflix.com/2013/12/scryer-netflixs-predictive-auto-scaling.html
Graph annotations: more morning load, Sat/Sun high traffic, lower load on Weds
24 hours of predicted traffic vs. actual
FFT based prediction driving AWS Autoscaler to plan minimum capacity
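The general idea of frequency-based prediction feeding a capacity floor can be sketched in a few lines. This is not Scryer's algorithm, which the slides do not describe in detail. It is a minimal illustration that keeps the strongest periodic components of past traffic, extends them forward, and converts the result into a minimum instance count. It uses a plain discrete Fourier transform from the standard library in place of an FFT library. It also assumes the history covers a whole number of cycles. All names and figures are hypothetical.

```python
import cmath
import math

def dft(xs):
    """Plain O(n^2) discrete Fourier transform (an FFT computes the same values)."""
    n = len(xs)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(xs))
        for k in range(n)
    ]

def predict(history, horizon, keep=3):
    """Extrapolate `horizon` samples from the strongest periodic components."""
    n = len(history)
    spectrum = dft(history)
    # Rank positive frequencies by magnitude and keep the strongest few.
    strongest = sorted(
        range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True
    )[:keep]
    kept = {0}  # the zero-frequency term carries the mean load
    for k in strongest:
        kept.update({k, (n - k) % n})  # mirrored bin keeps the result real

    def value(t):
        total = sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in kept)
        return total.real / n

    return [value(n + h) for h in range(horizon)]

def plan_min_instances(predicted_load, per_instance_capacity):
    """Minimum instance count per time slot for the predicted load."""
    return [max(1, math.ceil(load / per_instance_capacity)) for load in predicted_load]

# Hypothetical history: two days of hourly requests/sec with a daily cycle.
history = [1000 + 400 * math.sin(2 * math.pi * hour / 24) for hour in range(48)]
tomorrow = predict(history, horizon=24, keep=1)
print(plan_min_instances(tomorrow, per_instance_capacity=150)[:6])
```

The output is a schedule of minimum capacity rather than an exact instance count. A reactive autoscaler can still add instances on top of that floor when actual traffic exceeds the prediction.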
Monitoring Tools for Developers
● Most monitoring tools are built to be used by operations people
Focus on individual systems rather than applications
Focus on utilization rather than throughput and response time
Fiefdoms of sysadmin, network admin, storage admin, database admin…
Hard to integrate and extend
● Developer oriented monitoring tools
Application Performance Measurement (APM) and Analysis
Business transactions, response time, JVM internal metrics
Logging business metrics directly (NetflixOSS Servo, Yammer Metrics)
APIs for integration, data extraction, deep linking and embedding
http://techblog.netflix.com/2012/02/announcing-servo.html and http://metrics.codahale.com/
Dynamic and Ephemeral Challenges
● Datacenter Assets
Arrive infrequently, disappear infrequently
Stick around for three years or so before they get retired
Have unique IP and MAC addresses
● Cloud Assets
Arrive in bursts – a Netflix code push creates over a hundred per minute
Stick around for a few hours before they get retired
Often re-use the IP and MAC address that was just vacated!
Use NetflixOSS Edda to record a full history of your configuration
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
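Address re-use is why a configuration history has to be indexed by time as well as by address. The sketch below is a toy in-memory record written for illustration. It is not Edda's API or data model, and every class, method and identifier in it is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Lease:
    ip: str
    instance_id: str
    start: int            # seconds since some epoch
    end: Optional[int]    # None while the instance is still running

class ConfigHistory:
    """Toy append-only record of which instance held which IP, and when."""

    def __init__(self) -> None:
        self.leases: List[Lease] = []

    def launched(self, ip, instance_id, at):
        self.leases.append(Lease(ip, instance_id, at, None))

    def terminated(self, instance_id, at):
        for lease in self.leases:
            if lease.instance_id == instance_id and lease.end is None:
                lease.end = at

    def owner(self, ip, at):
        """Instance that held `ip` at time `at`, or None."""
        for lease in self.leases:
            still_held = lease.end is None or at < lease.end
            if lease.ip == ip and lease.start <= at and still_held:
                return lease.instance_id
        return None

history = ConfigHistory()
history.launched("10.0.0.7", "i-old", at=0)
history.terminated("i-old", at=3600)
history.launched("10.0.0.7", "i-new", at=3605)  # same address, new instance

print(history.owner("10.0.0.7", at=1800))  # i-old
print(history.owner("10.0.0.7", at=4000))  # i-new
```

Without the time dimension, a log line or metric tagged only with 10.0.0.7 would be attributed to whichever instance holds the address now.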
Traditional vs. Cloud Native Storage Architectures
Traditional: Business Logic in front of a Database Master and a Database Slave, each attached through a Fabric to Storage Arrays
Cloud native: Business Logic in front of Cassandra nodes in Zone A, Zone B and Zone C, with Cloud Object Store Backups
Distributed Cloud Applications Challenges
● Cloud provider data stores don’t have the usual monitoring hooks
e.g. no way to install an agent on AWS RDS MySQL, AWS DynamoDB
● Dependency on web services as well as code on instances
Integration of data sources like CloudWatch, measure use of S3 etc.
● Cloud applications span zones and regions
Monitoring tools need to span and aggregate zones and regions too!
● NoSQL data stores introduce new protocols and metrics
e.g. cross zone and cross regions replication traffic for Cassandra
Monitoring “New Rules” by @adrianco
1. Spend more time on analysis than data collection and display
2. Reduce key business metric latency to less than 10s
3. Validate your measurement system precision and accuracy
4. Be more available and scalable than the services being monitored
5. Optimize for distributed, ephemeral cloud native applications
Any Questions?
● Battery Ventures http://www.battery.com
● Adrian’s Blog http://perfcap.blogspot.com
● Slideshare http://slideshare.com/adriancockcroft
Appearances by @adrianco
● Migrating to Microservices – Qcon London - March 6th, 2014
● Monitorama Opening Keynote Portland OR - May 7th, 2014
● GOTO Chicago Opening Keynote May 20th, 2014
● DevOps Summit at Cloud Expo New York – June 10th, 2014
● Qcon New York – June 11th, 2014
● GOTO Copenhagen/Aarhus – Denmark – Oct 25th, 2014
Find me on LinkedIn or Twitter @adrianco