Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring Tools

Please, no More Minutes, Milliseconds,
Monoliths... or Monitoring Tools!
Adrian Cockcroft @adrianco #Monitorama May 2014

Enterprise IT Adoption of Cloud
By Simon Wardley http://enterpriseitadoption.com/
[Figure with a “You Are Here” marker]

Why am I at Monitorama?

Twenty Years of Free and Open Source Monitoring
● 1994 The “SE Toolkit” and virtual_adrian.se
● 1998 Sun Performance Tuning, Java & The Internet Book
● 1999 Resource Management Sun Blueprint Book
● 2000 Capacity Planning for Web Services Sun Blueprint Book
● 2007 A. A. Michelson Award for Outstanding Contribution to
Computer Metrics, by the Computer Measurement Group
● 2004-2008 Capacity Planning with Free Tools Workshop at CMG
● 2014 Monitorama!

State of the Art for Free Tools in 2008
http://www.slideshare.net/adrianco/capacity-planning-with-free-tools

History Lesson
http://sourceforge.net/projects/setoolkit/
SE is a C interpreter with built-in access to all Solaris metric data sources

Topics for Today
Minutes
Monoliths
Milliseconds
Monitoring tools
Challenges for monitoring
Continuous delivery & microservices
Analysis and closed loop control systems
Tools for developers who operate code in production
Challenges of dynamic, ephemeral, distributed cloud applications

No more monitoring tools?

We have too many of them already…
What’s needed is more analysis tools.

#Analysorama?

Rule #1: Spend more time working on code
that analyzes the meaning of metrics, than
code that collects, moves, stores and
displays metrics.

What’s wrong with minutes?
Takes too long to see a problem
[Chart: metric value (0 to 5) per minute, Minute 1 to Minute 7, against a metric threshold. Annotations, in time order:]
● Something broke at 2m20
● 40s of failure didn’t trigger
● 1st high metric seen at agent on instance
● 1st high metric arrives at monitoring system
● 1st high metric processed (maybe)
● 1st high metric seen on graph
● Three datapoints on user graph, so it looks bad at 8m00

Whoops! I didn’t mean that! Reverting…
Not cool if it takes 5 minutes to see it failed and 5 more to see a fix
No-one notices if it only takes 5 seconds to detect and 5 to see a fix

Try that again by the second
More confidence more quickly
[Chart: metric value (0 to 4) over Minute 1 to Minute 7, sampled by the second, against a threshold. Annotations, in time order:]
● Something broke at 2m20
● Measurable in 1s
● 1st high metric seen at agent on instance
● 1st high metric arrives at monitoring system
● 1st high metric processed
● 1st high metric seen on graph
● Three datapoints on user graph, so it looks bad at 2m25
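The gap between the two charts can be sketched as simple arithmetic. The model below is an assumption, not something stated on the slides: a partially bad interval does not trigger, each of two pipeline hops costs one collection interval, and a viewer wants three high datapoints before believing the graph. The function and parameter names are hypothetical. With those assumptions it lands on the 8m00 and 2m25 figures annotated on the charts.

```python
def seconds_until_visible(failure_at_s, interval_s, pipeline_hops=2, points_needed=3):
    """Estimate when a failure looks bad on a graph, in seconds from time zero."""
    # Assumption: a partially bad interval stays under the threshold, so the
    # first high datapoint covers the first full interval after the failure.
    first_full_start = -(-failure_at_s // interval_s) * interval_s  # round up to a boundary
    seen_at_agent = first_full_start + interval_s
    # Assumption: each hop (ship to the monitoring system, process) costs one interval.
    first_point_on_graph = seen_at_agent + pipeline_hops * interval_s
    # A viewer needs several consecutive high points before trusting the graph.
    return first_point_on_graph + (points_needed - 1) * interval_s


def as_clock(seconds):
    """Format seconds as the m/ss notation used on the slides."""
    return f"{seconds // 60}m{seconds % 60:02d}"


if __name__ == "__main__":
    failure_at = 2 * 60 + 20  # "Something broke at 2m20"
    for interval in (60, 1):  # per-minute vs. per-second collection
        print(f"{interval:>2}s interval:", as_clock(seconds_until_visible(failure_at, interval)))
```

Changing `interval_s` is the only difference between the two runs, which is the point of the slides: the collection interval multiplies through every stage of the pipeline.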

Continuous Delivery and DevOps Implications
● Changes are smaller but more frequent
● Individual changes more likely to be broken
● Changes likely to be deployed by developers
● Instant detection and rollback matters much more

SaaS Based Products Show What Can Be Done
www.vividcortex.com and www.boundary.com
Seeing Problems In Seconds

NetflixOSS Hystrix / Turbine Circuit Breaker Monitoring
http://techblog.netflix.com/2012/12/hystrix-dashboard-and-turbine.html
Streaming metrics directly from front end services to a web browser

Rule #2: Metric to display latency needs to
be less than human attention span (~10s)

What’s Wrong With Milliseconds?

A Millisecond is a Very Long Time!
● Some JVM based tools measure response times in ms
Network round trip within a datacenter/zone is less than 1ms
SSD access latency is usually less than 1ms
Cassandra (a Java app) response times can be less than 1ms
● Rounding Errors
Quantization loses too much information
Automated threshold warning “One is infinitely larger than zero”!
JVM does have nanosecond resolution times available
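A minimal sketch of the rounding problem, written in Python rather than on the JVM. The sample latencies are made up for illustration, and `to_whole_ms`, `percent_change` and `timed_ns` are hypothetical helper names. `time.perf_counter_ns` is the standard-library nanosecond timer (Python 3.7+).

```python
import time


def to_whole_ms(latency_ns):
    """Truncate a nanosecond latency to whole milliseconds, as a ms-only tool would."""
    return latency_ns // 1_000_000


def percent_change(before, after):
    """Relative change in percent; None when the baseline is zero (undefined)."""
    if before == 0:
        return None
    return 100.0 * (after - before) / before


def timed_ns(fn):
    """Time one call with the nanosecond-resolution monotonic clock."""
    start = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - start


if __name__ == "__main__":
    # Hypothetical sub-millisecond response times, in nanoseconds.
    samples_ns = [200_000, 450_000, 800_000, 1_200_000]
    print([to_whole_ms(s) for s in samples_ns])  # three distinct latencies collapse to zero
    # Going from 0 ms to 1 ms has no finite percentage change...
    print(percent_change(to_whole_ms(800_000), to_whole_ms(1_200_000)))
    # ...while the unrounded values show a well-defined increase.
    print(percent_change(800_000, 1_200_000))
    print(timed_ns(lambda: sum(range(1000))), "ns")
```

A threshold rule expressed as a percentage change has nothing sensible to say when the quantized baseline is zero, which is the “One is infinitely larger than zero” warning.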

Rule #3: Validate that your measurement
system has enough accuracy and precision.
Gauge Repeatability and Reproducibility matters, see
http://en.wikipedia.org/wiki/ANOVA_gauge_R%26R

Monolithic Monitoring Systems
Simple to build and install, but problematic…
[Diagram, monolithic: Services Being Monitored feed a single Monolithic Monitoring System]
[Diagram, distributed: Services Being Monitored feed Distributed Collection Systems, which feed Analysis / Display Aggregators]

Monolithic Monitoring Issues
● Scalability
Problems scaling data collection, analysis and reporting throughput
Limitations on number of distinct metrics that can be collected
Traffic storms can overload the system and take it down
● Availability
Monitoring system needs to stay up when everything else dies!
Downtime for upgrades is always inconvenient
Gaps in the metric history can trigger false alarms and erode confidence

In-Band, Out-of-Band, or Both?
In-band means deployed using same tools and infrastructure as your services
Dependencies lead to common mode failures that can leave you blind
Best option is both in-house in-band, and external SaaS
[Diagram: Services watched by two monitoring systems, one In-Band Monitoring and one SaaS Based Monitoring]
Very unlikely to have both fail at the same time
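As a rough illustration of why running both is attractive, the sketch below multiplies two unavailability figures. The numbers are hypothetical, and the calculation assumes the two systems fail independently, which only holds if the SaaS path shares no dependencies with the in-band one.

```python
def blind_probability(unavailability_in_band, unavailability_saas):
    """Chance that both monitoring paths are down at once, assuming independent failures."""
    return unavailability_in_band * unavailability_saas


if __name__ == "__main__":
    # Hypothetical figures: each monitoring system is unavailable 0.1% of the time.
    p = blind_probability(0.001, 0.001)
    print(f"blind for {p:.4%} of the time")
```

Common mode failures break the independence assumption, which is why the external system should not be deployed with the same tools and infrastructure as the services.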

Rule #4: Monitoring systems need to be
more available and scalable than the
systems being monitored.

Continuous Delivery

Issues with Continuous Delivery and Microservices
● High rate of change
Code pushes can cause floods of new instances and metrics
Short baseline for alert threshold analysis – everything looks unusual
● Ephemeral Configurations
Short lifetimes make it hard to aggregate historical views
Hand tweaked monitoring tools take too much work to keep running
● Microservices with complex calling patterns
End-to-end request flow measurements are very important
Request flow visualizations get overwhelmed

Microservice Based Architectures
See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
From a Gilt Groupe Presentation

“Death Star” Architecture Diagrams
As visualized by AppDynamics, Boundary.com and Twitter internal tools
[Figures, left to right: Netflix, Gilt Groupe (12 of 450), Twitter]

Closed Loop Control Systems

Autoscaled Ephemeral Instances at Netflix (the old way)
● Largest services use autoscaled red/black code pushes
● Average lifetime of an instance is 36 hours
[Chart annotations: Push, Autoscale Up, Autoscale Down]

Scryer - Predictive Auto-scaling at Netflix
See http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
and http://techblog.netflix.com/2013/12/scryer-netflixs-predictive-auto-scaling.html
[Chart annotations: More morning load; Sat/Sun high traffic; Lower load on Weds]
[Chart: 24 hours of predicted traffic vs. actual]
FFT based prediction driving AWS Autoscaler to plan minimum capacity
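The idea of frequency-based prediction can be sketched in a few lines. This is not Scryer's algorithm, which is described in the linked posts; it is a toy that assumes traffic is roughly periodic, keeps the strongest frequency components of the history, and extends them forward. It uses a plain discrete Fourier transform (an FFT computes the same coefficients faster), and the traffic numbers and `load_per_instance` figure are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Plain O(n^2) discrete Fourier transform; fine for short histories."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
            for k in range(n)]


def predict(history, horizon, keep=3):
    """Extrapolate `horizon` steps using the mean plus the `keep` strongest frequencies."""
    n = len(history)
    coeffs = dft(history)
    strongest = sorted(range(1, n // 2 + 1), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    forecast = []
    for t in range(n, n + horizon):
        value = coeffs[0].real / n  # mean level of the history
        for k in strongest:
            # For real input, bins k and n-k are conjugates, so count the pair once.
            weight = 1 if 2 * k == n else 2
            value += weight * (coeffs[k] * cmath.exp(2j * math.pi * k * t / n)).real / n
        forecast.append(value)
    return forecast


def minimum_instances(predicted_load, load_per_instance):
    """Turn a load forecast into a minimum instance count per step."""
    return [math.ceil(load / load_per_instance) for load in predicted_load]


if __name__ == "__main__":
    # Hypothetical hourly traffic: two days of a clean 24-hour cycle.
    history = [100 + 10 * math.sin(2 * math.pi * t / 24) for t in range(48)]
    forecast = predict(history, horizon=24, keep=1)
    print(minimum_instances(forecast[:6], load_per_instance=20))
```

The forecast sets a floor for capacity ahead of time, while a reactive autoscaler can still add instances if actual traffic exceeds the prediction.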

Netflix Automatic Code Deployment Canary - Bad Signature

Happy Canary Signature

Monitoring Tools for Developers
● Most monitoring tools are built to be used by operations people
Focus on individual systems rather than applications
Focus on utilization rather than throughput and response time
Fiefdoms of sysadmin, network admin, storage admin, database admin…
Hard to integrate and extend
● Developer oriented monitoring tools
Application Performance Measurement (APM) and Analysis
Business transactions, response time, JVM internal metrics
Logging business metrics directly (NetflixOSS Servo, Yammer Metrics)
APIs for integration, data extraction, deep linking and embedding
http://techblog.netflix.com/2012/02/announcing-servo.html and http://metrics.codahale.com/

Challenges of Dynamic, Ephemeral,
Distributed Cloud Applications

Dynamic and Ephemeral Challenges
● Datacenter Assets
Arrive infrequently, disappear infrequently
Stick around for three years or so before they get retired
Have unique IP and MAC addresses
● Cloud Assets
Arrive in bursts – a Netflix code push creates over a hundred per minute
Stick around for a few hours before they get retired
Often re-use the IP and MAC address that was just vacated!
Use NetflixOSS Edda to record a full history of your configuration
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
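Address re-use means an IP alone cannot identify an instance; a lookup also needs a timestamp. The sketch below is a hypothetical in-memory structure to illustrate that point, not Edda's data model or API, and the instance IDs and addresses are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Lease:
    instance_id: str
    ip: str
    start: int                  # epoch seconds when the instance took the address
    end: Optional[int] = None   # None while the instance is still running


def owner_at(leases: List[Lease], ip: str, when: int) -> Optional[str]:
    """Return the instance that held `ip` at time `when`, or None if it was unused."""
    for lease in leases:
        if lease.ip == ip and lease.start <= when and (lease.end is None or when < lease.end):
            return lease.instance_id
    return None


if __name__ == "__main__":
    # Hypothetical history: the same address handed to a new instance right after a push.
    history = [
        Lease("i-old", "10.0.0.5", start=1_000, end=8_200),
        Lease("i-new", "10.0.0.5", start=8_200),
    ]
    print(owner_at(history, "10.0.0.5", 5_000))  # metric from before the push
    print(owner_at(history, "10.0.0.5", 9_000))  # metric from after the push
```

Keying historical metrics by instance ID plus time range, rather than by address, keeps data from the old and new instances from being merged.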

Cloud Native Architectures

Traditional vs. Cloud Native Storage Architectures
[Diagram, traditional: Business Logic on a Database Master and a Database Slave, each backed by Fabric and Storage Arrays]
[Diagram, cloud native: Business Logic on Cassandra nodes in Zone A, Zone B and Zone C, with Cloud Object Store Backups]

Distributed Cloud Applications Challenges
● Cloud provider data stores don’t have the usual monitoring hooks
e.g. no way to install an agent on AWS RDS MySQL, AWS DynamoDB
● Dependency on web services as well as code on instances
Integration of data sources like CloudWatch, measure use of S3 etc.
● Cloud applications span zones and regions
Monitoring tools need to span and aggregate zones and regions too!
● NoSQL data stores introduce new protocols and metrics
e.g. cross-zone and cross-region replication traffic for Cassandra

Monitoring “New Rules” by @adrianco
1. Spend more time on analysis than data collection and display
2. Reduce key business metric latency to less than 10s
3. Validate your measurement system precision and accuracy
4. Be more available and scalable than the services being monitored
5. Optimize for distributed, ephemeral cloud native applications

Any Questions?
● Battery Ventures http://www.battery.com
● Adrian’s Blog http://perfcap.blogspot.com
● Slideshare http://slideshare.com/adriancockcroft
Appearances by @adrianco
● Migrating to Microservices – QCon London – March 6th, 2014
● Monitorama Opening Keynote Portland OR - May 7th, 2014
● GOTO Chicago Opening Keynote May 20th, 2014
● DevOps Summit at Cloud Expo New York – June 10th, 2014
● QCon New York – June 11th, 2014
● GOTO Copenhagen/Aarhus – Denmark – Oct 25th, 2014
Find me on LinkedIn or Twitter @adrianco
