Cloud-Native Observability

What is observability and how is it different from traditional monitoring? How do we effectively monitor and debug complex, elastic microservice architectures? In this interactive discussion, we’ll answer these questions. We’ll also introduce the idea of an “observability pipeline” as a way to empower teams following DevOps practices. Lastly, we’ll demo cloud-native observability tools that fit this “observability pipeline” model, including Fluentd, OpenTracing, and Jaeger.

Cloud-Native Observability

  1. Cloud-Native Observability. Tyler Treat (@tyler_treat) / Cloud Native - Madison / June 6, 2019
  3. Monitoring
  4-5. APM, Debugger, Profiler, SSH, grep
  6. APM, Debugger, Profiler, SSH, grep; System Behavior
  7. APM, Debugger, Profiler, SSH, grep; System Behavior; Actual Customer Impact
  8. Monitoring
  9. APM, Debugger, Profiler, SSH, grep
  10. APM, Debugger, Profiler, SSH, grep (“Testing in Production at Scale”, Amit Gud)
  11. APM, Debugger, Profiler, SSH, grep; System Behavior; Actual Customer Impact; ???
  12. “Observability”
  13. Post Hoc vs. Ad Hoc
  14-22. Data Available vs. Understanding, a 2×2 matrix built up quadrant by quadrant:
    • Known Knowns (FACTS): things we are aware of and understand. “The system has a 1GB memory limit”
    • Known Unknowns (HYPOTHESES): things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
    • Unknown Knowns (ASSUMPTIONS): things we understand but are not aware of. “We implemented an orchestrator to ensure the system is always running”
    • Unknown Unknowns (DISCOVERIES): things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
  23. Monitoring addresses Known Unknowns (HYPOTHESES); Observability addresses Unknown Unknowns (DISCOVERIES).
  24. Testing addresses Known Unknowns (HYPOTHESES); Exploring addresses Unknown Unknowns (DISCOVERIES).
  25. Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
  26. Some challenges with observability data:
    - Locked up inside a single vendor’s solution
    - Not readily available across the enterprise (or in some cases, too readily available)
    - Many tools and products needed for different data and use cases
    - Tool and data needs vary from team to team
    - Ever-changing landscape of tools, products, and services
    - Sheer volume of data can be overwhelming
  27. [Diagram: one System running a Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, Amazon Glacier, …]
  28. [Diagram: the same set of agents and clients repeated on every System]
  29. How big of a lift is it for your organization to change tools?
  30. How easy is it to experiment with new ones?
  31. What data to send? Where to send it? How to send it?
    • Data Sources: VMs, containers, load balancers, service meshes, audit logs, VPC flow logs, firewall logs, …
    • Data Sinks: centralized logging, SIEM, monitoring, APM, alerting, cold storage, BI, …
  32. A decoupled approach
  33. The same Data Sources and Data Sinks, now with an Observability Pipeline between them. What data to send? Where to send it? How to send it?
  34. The Observability Pipeline
  35. 1. Data Specifications. Structure your damn data.
  36. log.error("User '{}' login failed".format(user))
  37. ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed
  38. log.error("User login failed", event=LOGIN_ERROR, user="tylertreat", email="tyler.treat@realkinetic.com", error=error)
  39. { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "user": "tylertreat", "email": "tyler.treat@realkinetic.com", "error": "Invalid username or password", "message": "User login failed" }
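The difference between the free-text log line and the structured one can be sketched with the standard library alone. This is a minimal illustration, not the logging library used in the talk: `structured_log` is a hypothetical helper, and the field names simply mirror the JSON shown on the slide.

```python
import json
from datetime import datetime, timezone


def structured_log(level, message, **fields):
    """Return one JSON log line; keyword arguments become top-level fields."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    # Caller-supplied fields (event, user, error, ...) sit next to the
    # standard ones, so every value is addressable by key downstream.
    record.update(fields)
    return json.dumps(record, default=str)


line = structured_log(
    "ERROR",
    "User login failed",
    event="user_login_error",
    user="tylertreat",
    error="Invalid username or password",
)
print(line)
```

Because the user and the event type are separate fields rather than fragments of a sentence, a downstream consumer can filter on them without regular expressions.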
  40. JSON is fine.
  41. Pass a context object to everything.
  42. def login(ctx, username, email, password):
          ctx.set(user=username, email=email)
          ...
          log.error("User login failed", event=LOGIN_ERROR, context=ctx, error=error)
          ...
  43-44. { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "email": "tyler.treat@realkinetic.com" }, "error": "Invalid username or password", "message": "User login failed" }
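One way the context object might look, assuming a plain dictionary wrapper. The `Context` class, the `log_error` helper, and the placeholder password check are all hypothetical; the talk only shows `ctx.set(...)` and the resulting JSON.

```python
import json
import uuid


class Context:
    """Request-scoped bag of fields that travels with every call."""

    def __init__(self, **fields):
        # A random id lets all log lines from one request be correlated.
        self.fields = {"id": uuid.uuid4().hex}
        self.fields.update(fields)

    def set(self, **fields):
        self.fields.update(fields)

    def to_dict(self):
        return dict(self.fields)


def log_error(message, event, context, error):
    """Hypothetical logger: serialize the event together with its context."""
    return json.dumps({
        "level": "ERROR",
        "event": event,
        "context": context.to_dict(),
        "error": str(error),
        "message": message,
    })


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    if password != "s3cret":  # placeholder check, purely for illustration
        return log_error("User login failed", event="user_login_error",
                         context=ctx, error="Invalid username or password")
    return None


ctx = Context(http_method="POST", http_path="/login")
print(login(ctx, "tylertreat", "tyler.treat@realkinetic.com", "wrong"))
```

The request handler creates the context once; everything it calls adds to it, so the final log line carries the HTTP details without `login` knowing about them.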
  45. Create standard specs for each data type collected (logs, metrics, traces).
  46. Specs can enforce required fields (e.g. user id, license, trace id) and data types.
  47. { "timestamp": "2019-04-05 13:26.42", "level": "INFO", "event": "user_login", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2", "license": "942e6543f0844be680e72003d5e060fd", "email": "tyler.treat@realkinetic.com" } }
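A spec that enforces required fields and data types could be as small as a mapping from field name to expected type. The sketch below is an assumption about what such a spec might contain; `LOG_SPEC`, `CONTEXT_SPEC`, and the choice of required fields are illustrative, not the speaker's actual specification.

```python
# Hypothetical spec: field name -> required Python type.
LOG_SPEC = {"timestamp": str, "level": str, "event": str, "context": dict}
CONTEXT_SPEC = {"id": str, "user_id": str, "license": str}


def validate(record, spec):
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for field, expected in spec.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


def validate_log(record):
    """Check the top-level fields, then the nested context if present."""
    errors = validate(record, LOG_SPEC)
    if isinstance(record.get("context"), dict):
        errors += validate(record["context"], CONTEXT_SPEC)
    return errors


good = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
    },
}
print(validate_log(good))
```

In practice a schema language such as JSON Schema would likely replace the hand-rolled check, but the idea is the same: reject or flag records that do not match the agreed shape.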
  48. 2. Specification Libraries. Specs alone aren’t enough!
  49. We need libraries.
  50. There are many existing libraries for structured logging:
    • Java: log4j
    • Go: logrus
    • Python: structlog
    • Ruby: ruby-cabin
    • .NET: serilog
    • JS: structured-log
    • etc.
  51. For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.
  52. 3. Data Collector. We need a lightweight agent that can collect data from hosts/containers.
  53. Collect data, perform transformations/filters, and write it to the data pipeline.
  54. Typically runs as an agent on the host (DaemonSet in Kubernetes).
  55. Data is written to stdout/stderr or a Unix domain socket.
  56. Just use Fluentd or Logstash (+Beats).
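To make the collector's three jobs concrete (collect, transform/filter, write), here is a toy version built from generators. It is only a conceptual sketch and says nothing about how Fluentd or Logstash work internally; the function names, the `host` enrichment, and the DEBUG filter are assumptions chosen for illustration.

```python
import json


def collect(lines):
    """Parse raw lines into records, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def add_host(records, host):
    """Transformation: enrich each record with the host it came from."""
    for record in records:
        yield {**record, "host": host}


def drop_debug(records):
    """Filter: keep everything except DEBUG-level records."""
    for record in records:
        if record.get("level") != "DEBUG":
            yield record


def run_collector(lines, host, write):
    """Collect, transform, filter, then hand each record to the pipeline writer."""
    for record in drop_debug(add_host(collect(lines), host)):
        write(json.dumps(record))


sent = []
run_collector(
    ['{"level": "DEBUG", "event": "noise"}', "not json", '{"level": "ERROR", "event": "user_login_error"}'],
    host="host-1",
    write=sent.append,
)
print(sent)
```

Here `write` stands in for the pipeline client; in a real deployment the agent would read from stdout/stderr or a socket rather than from a list.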
  57. 4. Data Pipeline. We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.
  58. This also provides a buffer that decouples producers from consumers.
  59. Lots of options…
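The buffering role of the data pipeline can be illustrated in-process with a bounded queue. This stands in for a real streaming system only conceptually: the queue size, the `None` sentinel, and the function names are assumptions made for the sketch.

```python
import queue
import threading


def producer(buffer, records):
    """Write records to the buffer; blocks only while the buffer is full."""
    for record in records:
        buffer.put(record)
    buffer.put(None)  # sentinel: tells the consumer there is nothing more


def consumer(buffer, sink):
    """Drain the buffer at the consumer's own pace."""
    while True:
        record = buffer.get()
        if record is None:
            break
        sink.append(record)


buffer = queue.Queue(maxsize=4)  # small on purpose, to exercise back-pressure
sink = []
worker = threading.Thread(target=consumer, args=(buffer, sink))
worker.start()
producer(buffer, [{"n": i} for i in range(10)])
worker.join()
print(len(sink))
```

Neither side calls the other directly: the producer only knows about the buffer, so consumers can be added, removed, or slowed down without touching the systems that emit data.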
  61. 5. Data Router. We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.
  62. This is where the data spec comes into play.
  63. The data shape determines how incoming data is handled.
  64-66. [Diagram: Data Pipeline, Data Router, Amazon Glacier; logs, traces, metrics]
  67. This is primarily a stateless component writing to APIs.
  68. Good fit for “serverless” solutions.
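Routing on data shape might look like the following. The classification rules and field names (`trace_id`, `span_id`, `metric`, `value`) are assumptions standing in for whatever the data spec defines, and the backends are plain callables rather than real vendor clients.

```python
def classify(record):
    """Decide what kind of data a record is from the fields it carries."""
    if "trace_id" in record and "span_id" in record:
        return "traces"
    if "metric" in record and "value" in record:
        return "metrics"
    if "message" in record or "event" in record:
        return "logs"
    return "unknown"


class Router:
    """Stateless fan-out: each kind of data goes to its registered backends."""

    def __init__(self):
        self.backends = {}

    def register(self, kind, backend):
        self.backends.setdefault(kind, []).append(backend)

    def route(self, record):
        kind = classify(record)
        for backend in self.backends.get(kind, []):
            backend(record)
        return kind


logs, archive, metrics = [], [], []
router = Router()
router.register("logs", logs.append)
router.register("logs", archive.append)  # e.g. also send logs to cold storage
router.register("metrics", metrics.append)

router.route({"event": "user_login", "message": "User logged in"})
router.route({"metric": "login_latency_ms", "value": 42})
print(len(logs), len(archive), len(metrics))
```

Since `route` keeps no state between records, the same logic could run once per batch in a function-as-a-service handler, which is what makes the component a good fit for serverless platforms.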
  69. Piecing It All Together
  71. You don’t need to build it out all in one go.
  72. There are quick wins along the way!
  73. Evolving to an Observability Pipeline:
    • Adopt structured logging
    • Move log/data collection out of process
    • Use a centralized logging system
    • Introduce a streaming data solution
    • Start adding data consumers
  74-79. [Diagram: Dev/Ops/SRE, Systems, Production]
  80. CI/CD: Pre-Production (theorizing about known unknowns). Observability: Post-Production (learning from unknown unknowns).
  81. Part 2: Demo
  82. [Diagram: Book Trip; Trip Service, Flight Service, Hotel Service, Car Rental Service; DynamoDB (×4)]
  83. Structured logging + context
  84. Kubernetes
  85. And now here’s some YAML…
  88. Kubernetes
  89. +
  90. Kubernetes, Kinesis
  91. AWS Lambda
  92. Kubernetes, Kinesis, Lambda
  93. Kubernetes, Kinesis, Lambda, CloudWatch, Jaeger, Stackdriver
  94. Code: https://github.com/RealKinetic/cloud-native-meetup-2019
  95. Thank You. realkinetic.com, bravenewgeek.com
