What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker that will leave you an expert in identifying problems with the least amount of pain:
Under-replicated Partitions: The mother of all metrics
Request Latencies: Why your users complain
Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
6. Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life
7. Service Level Objectives
• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
9. The Three Metrics You Need to Know
• URP – Partitions that are not fully replicated within the cluster
• Request Handlers – The overall utilization of an Apache Kafka broker
• Request Timing – How long requests are taking, in which stage of processing
17. Request Handler Problems
CPU Time
• Anything that causes Kafka to expend CPU cycles
• Includes problems related to failing disks (IO wait)
• SSL and compression work can both use a lot of CPU
Timeout
• Most often due to failing to process controller requests
• Intra-cluster requests tend to be bound by partition counts
• Rapidly starves the pool of threads
Deadlock
• Should always be a code bug
• Usually looks exactly like a timeout problem
• Rare, but hard to identify
21. Brokers Don’t (Shouldn’t) Do Compression
Up Conversion
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Down Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to clients
• Producer clients update to new version
22. Request Timing
• Request Queue – Waiting to process
• Local – Work local to the broker
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send – Send to client
• Total – Request handling, end to end
29. Operating System and Hardware Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
• 100% clear signal
• 100% clear response
34. If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide
35. Getting (and Giving) Help
LinkedIn Open Source
• Kafka Monitor
• https://github.com/linkedin/kafka-monitor
• Burrow
• https://github.com/linkedin/Burrow
• Cruise Control
• https://github.com/linkedin/cruise-control
• kafka-tools
• https://github.com/linkedin/kafka-tools
Get Involved
• Community
• users@kafka.apache.org
• dev@kafka.apache.org
• Bugs and Work:
• https://issues.apache.org/jira/projects/KAFKA
Let me start off by telling you what we’re not talking about today. I won’t be going into the basics of what Kafka is – I assume that if you’re attending Kafka Summit, you have an idea of what it does and how it works. Regardless, you’re going to get some good data here on monitoring, even if you have very limited Kafka knowledge.
However, this also won’t be an encyclopedic look at monitoring. I’m going to discuss a few key sets of metrics, and how to use them. But I won’t even be covering all the Kafka metrics you should look at, never mind all that exist. I encourage you to spin up a JMX tool of choice and explore what’s exposed for sensors in Kafka. I also encourage you to share with the class, whether in posts, talks, or tweets, any gems that you have for your own monitoring.
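For those spinning up a JMX tool, the three metric families covered in this talk live under well-known broker MBean names. A minimal sketch of filtering a bean listing down to them — the bean list is hard-coded here for illustration; a real script would fetch it from a JMX client such as jmxterm or jolokia:

```python
# Sketch: the standard broker MBean names for the three metric families this
# talk covers. The list is a literal here so the filtering is easy to follow;
# in practice it would come from a JMX client, not be hard-coded.
beans = [
    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent",
    "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
    "kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce",
    "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs",
]

def filter_beans(bean_list, keywords):
    """Return beans whose object name contains any of the given keywords."""
    return [b for b in bean_list if any(k in b for k in keywords)]

relevant = filter_beans(beans, ["UnderReplicatedPartitions",
                                "RequestHandlerAvgIdlePercent",
                                "RequestMetrics"])
print(len(relevant))  # 4 of the 5 sample beans belong to the three families
```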
I’m also not going to talk about automation, even as it relates to handling alerts. There are many fine talks out there about automating responses and runbooks, and we could spend hours talking about just that.
So why am I here today talking about monitoring? There are lots of topics that could be covered, especially in an ecosystem as large as Kafka. And I could always deliver yet another “here’s how we do it at LinkedIn” talk. However, today I’m choosing to share a look at where we’re moving right now.
I recently wrote a post for DevOps.com about a term we use, “Code Yellow”. This is one of our tools for dealing with an application, or a team, in crisis. Typically this is due to something like communication problems, or a large amount of tech debt. Since I recently wrote this post, and you all know that I work on Kafka, you can probably guess that I’m currently in this state. In our case, it’s due to somewhat unexpected growth.
LinkedIn started using Kafka back in 2010, before it was open sourced. In September of 2015, we announced that we had hit a milestone, at one trillion messages a day produced into our Kafka clusters. Last year, at Kafka Summit in San Francisco, I noted that we had passed two trillion messages a day. At the beginning of the year, we clocked in at three trillion. And now, we’re over five trillion messages a day. That hockey stick at the end is the current source of my long days and sleepless nights.
Top this off with the fact that our monitoring is currently very noisy, partly due to scale problems around this growth, and partly because we alert on many things that are not providing clear signals. We’re currently overhauling our monitoring as a result of this.
So why do we have such noisy alerting? We’ve forgotten that monitoring and alerting are not the same thing.
Today, we're going to be talking about monitoring, not alerting. What is the difference, you ask? In our case, monitoring refers to all the data we have available to us from Kafka and our underlying systems, from high level metrics like partition counts down to the most minute sensor that is available. Alerting, on the other hand, we will use to refer to the metrics that are used to tell us about an imminent problem. They're the metrics that wake us up at night. These should be carefully chosen, and they should be clear signals that demand an immediate response 100% of the time.
Another thing to keep in mind is that events are almost always superior to metrics when alerting. We know this, right? Kafka is all about events. And yet we still have measurements that are rates where they should be discrete counts of events. We normally can’t work with individual events, like a failed request, at scale. But we do want to know the actual number of failed requests, and not a requests-per-second metric where we miss data due to time windows.
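A made-up illustration of the difference between a windowed rate and a discrete count:

```python
# Sketch: why discrete event counts beat windowed rates for alerting.
# A burst of 120 failed requests inside one second disappears when averaged
# over a 60-second window, but a monotonic counter captures every event.
# The numbers here are made up.

failures_per_second = [0] * 59 + [120]  # a quiet minute ending in a burst

# A one-minute average rate smooths the burst down to 2/sec.
avg_rate = sum(failures_per_second) / len(failures_per_second)

# A cumulative counter delta over the same window keeps the true total.
counter_delta = sum(failures_per_second)

print(avg_rate)       # 2.0 -- looks harmless
print(counter_delta)  # 120 -- the real number of failed requests
```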
We also need to make sure that we’re testing the code before we deploy it. My team has fallen prey to reactive alerting – we find a new problem, like a socket leak, and we add a new alert for file handles in use so we can catch it before it goes critical. The bug gets fixed, but we keep the alert, just in case we run into it again. It would be much better for everyone if we added a release test that checks for the general case of increased file handle usage, and dropped the alert on the live systems.
Alerting should always be aimed at maximizing the amount of sleep that your operations team gets. That means as few alerts as possible to keep everything running, and automating as much as possible.
When we're talking about alerting, the most important thing to watch is the metrics related to your service level objectives, or SLOs. Just as a note, an SLO and an SLA are not the same thing. A service level agreement is a contract: it's basically an SLO with teeth - a penalty. The SLO is the level of service that we're promising to our customers. For Kafka, this is typically going to be that the system will be available, and it will perform at a certain level for produce and consume requests. We'll cover what metrics to use for this in a bit.
In addition to these, your SLOs are whatever you’re guaranteeing to your customers. This may include a minimum amount of retention. If you’re working to GDPR, or another privacy standard, you may specify a maximum amount of time that data will be retained for (here’s a hint, that’s not necessarily the retention in time that you set for the topic).
I've talked at length about the under-replicated partition count metric. I dedicated a significant number of pages in a book you may have seen to how to respond to any non-zero value. At its heart, this number tells you that replication within the cluster is having a problem.
A stable count on all but one broker tells you that that broker is not working. It's either down, or the replication is not started.
A variable count on a single broker tells you that that broker is having a problem servicing consume requests
A variable count on multiple brokers indicates a more overall problem. In this case, you'll need to enumerate the partitions that are falling behind (using the CLI tools) and see if there is a common thread, such as a single broker that is having problems replicating from multiple cluster members.
But the most important thing about the URP metric is that it's overrated for alerting. That's right, I said it. I don't like getting paged for this metric. But why, you ask? If it illustrates so many problems, why wouldn't I want to get alerts for it? The problem is that it doesn't tell me that I'm breaching my SLO, and whatever problem it's telling me about is often not immediately actionable. More often than not, this metric tells me about one of two problems. The first is that a broker is down. I can detect that with a much clearer signal, however, by health checking the application. The other problem is that the cluster is operating over its capacity. I don't want to be paged for that either, because capacity is a proactive monitoring problem, not a reactive one. We'll talk about that more in a few slides.
Still, you should be collecting this metric, and you might want to consider generating warnings for it. It does illustrate a risky situation, because we depend on replication in the cluster for redundancy. When it's not zero, you have a problem that needs some attention.
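The diagnosis patterns described above can be sketched as a small classifier. The function name, input shape, and message strings are all illustrative, not a real tool:

```python
# Sketch: classifying an under-replicated partition (URP) signal using the
# rules from the notes above. Input is a window of per-broker URP samples.

def classify_urp(samples_by_broker):
    """samples_by_broker: {broker_id: [urp_count, ...]} over a recent window."""
    affected = {b: s for b, s in samples_by_broker.items() if max(s) > 0}
    if not affected:
        return "healthy"
    # A stable (unchanging) non-zero count points at a dead or stopped broker.
    stable = {b for b, s in affected.items() if min(s) == max(s)}
    if len(affected) == 1:
        broker = next(iter(affected))
        if broker in stable:
            return f"broker {broker} down or not replicating"
        return f"broker {broker} struggling to service consume requests"
    # Multiple brokers affected: dig in with the CLI tools.
    return "cluster-wide problem: enumerate lagging partitions with the CLI tools"

print(classify_urp({1: [0, 0, 0], 2: [12, 12, 12], 3: [0, 0, 0]}))
```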
As with most applications, Kafka has thread pools to do work. There are several different ones - network handlers, request handlers, log compaction, recovery (which are also used for handling log segments at startup and shutdown). When we’re talking about client traffic, the network and request handlers are the ones that do all the work, and the request handlers are far more important. This is because the network handlers just take care of the network connection, including reading and writing bytes on the wire.
The request handler does everything else for the client - it decodes and validates the protocol, handles produce and consume work, and assembles the response to send back. It even performs all of the broker internal work, responding to controller requests. This means that if you want a single indicator of how busy the broker is, you couldn’t ask for a much better measure than the utilization of the request handlers. But as with under-replicated partitions, there are a lot of different problems that could be indicated here.
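If you want to watch this single indicator, the broker exposes it as RequestHandlerAvgIdlePercent (despite the name, a ratio between 0.0 and 1.0). A sketch of turning it into a utilization signal — the 20%/10% idle thresholds are illustrative assumptions, not official guidance:

```python
# Sketch: turning Kafka's RequestHandlerAvgIdlePercent into a utilization
# signal. Despite the name, the metric is a ratio between 0.0 and 1.0.
# The thresholds below are illustrative, not official guidance.

def handler_pool_status(avg_idle_ratio):
    utilization = 1.0 - avg_idle_ratio
    if avg_idle_ratio < 0.10:
        return utilization, "critical: pool nearly saturated"
    if avg_idle_ratio < 0.20:
        return utilization, "warning: investigate CPU, disk, or conversion work"
    return utilization, "ok"

util, status = handler_pool_status(0.15)
print(f"{util:.0%} busy -> {status}")
```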
CPU - Slow disk performance, often due to a failing drive, is a particular problem for produce requests. As the request handler will have to take more time when writing to disk, it will manifest as higher utilization
Timeouts and deadlocks look very similar
Timeouts - all of the request handler threads are getting tied up. We most often see this when the broker is starting up, and it is failing to process requests from the controller within the controller socket timeout.
Deadlock - But if that doesn’t solve it, you may have hit a deadlock condition in handling requests. We’ve seen this recently with some shutdown code, but it was related to the authorizer we were using and not Kafka directly.
Timeouts most often happen when controller requests are not processed within the controller socket timeout. What happens is that the controller sends the request, it times out, and then the controller sends the request again. You’ll see this especially when the broker is starting up, and the controller is trying to send it the state of the world with leader and ISR requests
Deadlocks look almost identical, but they’re much more rare. We’ve seen them recently during shutdown, but that was caused by an issue in the authorizer module that we use, and not something that was endemic to Kafka itself. However, they’re almost always code issues. This makes them pretty tricky to debug.
Wait, the Kafka brokers don’t compress data anymore! We got rid of that with the bump to message format 1, and relative offsets in the produced batches. Right?
Yeah, that’s what I thought, too. Turns out that there are a couple cases, which are not as rare as you might think, that will result in the broker having to rewrite the incoming message batches.
Another common culprit for the request handlers being over-utilized, even at a low traffic volume, is compression. This happens when the client versions do not match the message format on disk. The message format is settable via a broker configuration (log.message.format.version), and controls how messages are written to disk. In an ideal world, the producer client version matches this configuration, such that the producer is sending the same message format. If the producer is an older version, the broker will have to up-convert the messages, and if the producer is using a higher message format version, the broker will need to down-convert. Both of these situations mean the broker will be forced to recompress the message batch before writing it to disk (this also happens if your brokers are still using message format zero). This is an expensive operation, and should be avoided. It’s also worth noting that you can set the message format on disk as a per-topic override. You will want to be very careful if you feel the need to do this, as it means the logs on disk are inconsistent, and you could easily have compression you’re not expecting.
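The conversion decision can be sketched as a simple comparison of message format versions. This is simplified to the numeric format versions only; the real broker logic also accounts for the client's protocol version and compression settings:

```python
# Sketch: when a broker must convert (and recompress) message batches.
# Versions are simplified to the message format numbers (0, 1, 2); the real
# decision also involves the client protocol version and compression codec.

def conversion_needed(producer_format, broker_format):
    if producer_format < broker_format:
        return "up-conversion"    # old client, newer on-disk format
    if producer_format > broker_format:
        return "down-conversion"  # new client, older on-disk format pinned
    return "none"                 # formats match: no recompression needed

print(conversion_needed(1, 2))  # up-conversion
print(conversion_needed(2, 1))  # down-conversion
```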
If you have slow request processing due to issues like this, you’re also going to have latency issues. Which gets us into the third set of metrics...
For each protocol request type, Kafka provides a set of timing metrics. These describe the amount of time that the request spends in various states while being processed:
Total time - this is the overall total time to process a request, from when it is received to when it is complete
Request Queue Time - how long the request sits in queue before being picked up by a request handler for processing
Local Time - The amount of local processing time required for the request. This can include a number of things, such as disk write time for produce requests
Remote Time - The amount of time that the request waits on non-local steps. This includes acknowledgements from followers for produce requests
Response Queue Time - how long the response for the request sits in queue before being sent to the client
Response Send Time - how long it takes to send the response to the client. This only covers getting it into the send buffers locally, not network time.
In addition to the time metrics, there is also a rate metric that gives you the number of requests of a particular type per second. The time metrics are provided as percentiles, and as such you can choose from 50th, 75th, 99th, and 99.9th percentiles, as well as an average and maximum value over the course of the running process.
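Putting the stages together, here is a sketch of decomposing a single request's total time to find the dominant stage. The stage names mirror Kafka's RequestMetrics beans (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs); the sample numbers are made up:

```python
# Sketch: breaking a request's total time into its component stages to find
# where the latency lives. The sample values below are invented for a request
# where waiting on follower acknowledgements (remote time) dominates.

timings_ms = {
    "request_queue": 0.5,   # waiting for a request handler
    "local": 3.0,           # e.g. disk write for a produce request
    "remote": 45.0,         # waiting on follower acks (acks=all)
    "response_queue": 0.3,  # waiting to send the response
    "response_send": 0.2,   # copying into the send buffers
}

total = sum(timings_ms.values())
dominant = max(timings_ms, key=timings_ms.get)
print(f"total={total:.1f}ms, dominant stage={dominant}")
```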
Request latency is typically going to be the first of your SLO measurements. Which means that you will probably want to be monitoring these metrics and possibly alerting off them. The problem comes in as you try to pick which attributes to monitor, and what the baseline values are.
Here are the produce TotalTime graphs, 50th percentile and 99.9th percentile, for a broker that is working perfectly well. It may be hard to see, but the scale of the first graph is in single digits, and the scale of the second is in thousands. If the broker is running well, why is there such a discrepancy? The reason is that the amount of time required for a produce request varies widely depending on the content of the request.
Let’s consider the local time. Again, these are the 50th percentile and the 99.9th percentile, and the first graph goes from zero to one, while the second graph is again in the thousands. What would impact the amount of time required to process the produce requests locally? In this case, most of our produce requests are really small - small batches, single topic - but some of them are very large. The bigger the produce request, the more time it takes to write the data to disk.
How about the remote time for the same produce requests? Yet again, these are the 50th and 99.9th percentile graphs, with the first one going from zero to two, and the second being in the thousands. The median value is small, but the 99.9th percentile is multiple orders of magnitude higher. The most common cause here is that most of our requests are being produced with the required acknowledgements set to 1, while some are requesting all acknowledgements. That easily drives up the amount of time spent in the remote step.
This isn’t to say that you can’t use these metrics effectively for alerting. It just means that you need to define your SLOs appropriately. Stating simply that produce requests will be handled in 20ms or less may not be reasonable, but specifying that value for the average produce request may be fine.
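A quick illustration, with made-up numbers, of why the average and the tail need different SLO treatment:

```python
# Sketch: why a blanket "20ms or less" SLO fails while "20ms on average"
# can hold. Most requests are tiny; a few (acks=all, large batches) are slow.
# The latency distribution below is invented for illustration.
import statistics

latencies_ms = [2] * 990 + [900] * 10  # 99% fast, 1% slow

mean = statistics.mean(latencies_ms)
p999 = sorted(latencies_ms)[int(len(latencies_ms) * 0.999) - 1]

print(mean)  # ~11ms: an average-based 20ms SLO holds
print(p999)  # 900ms: a blanket 20ms SLO would page constantly
```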
OK, so we’ve covered our three metrics, and we’ve still got X minutes left in this talk. I could sit here and just stare at my phone for the rest of the time. Or …
We could talk about what’s missing, since we only covered a very small slice of monitoring for Kafka.
The other side of your service level objectives is probably going to be the availability of Kafka to handle requests. But as with any system, you can’t truly measure the availability of a Kafka cluster from the brokers themselves. There are many factors that go into availability, including whether or not the network is working. Looking at the broker itself may tell you that everything’s fine, meanwhile none of your clients can connect.
For monitoring availability, you need to use something external to the Kafka cluster to look at it from the client’s point of view. This is why LinkedIn created, and open sourced, kafka-monitor (https://github.com/linkedin/kafka-monitor). This runs a producer and a consumer for each cluster, and assures that both requests work properly. It can assure that there is at least one partition on each broker in the cluster, so you check the entire cluster. It also provides latency metrics for requests, so you have an objective view of the request timings we were just talking about.
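The probe itself needs a live cluster, but the availability arithmetic on top of it is simple. A sketch, assuming an external checker records one success/failure sample per end-to-end probe:

```python
# Sketch: computing availability from an external probe, in the spirit of
# kafka-monitor. Each sample is True if a probe produce/consume round trip
# succeeded; the probe itself (real clients against the cluster) is assumed.

def availability(probe_results):
    """Fraction of successful end-to-end probes over the window."""
    return sum(probe_results) / len(probe_results)

window = [True] * 997 + [False] * 3  # e.g. one probe per second
pct = availability(window)
print(f"{pct:.2%}")  # measured from the client's point of view
```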
So what should we do about lower level OS and hardware metrics? Well, let me ask you this. I have a Kafka cluster that’s running at 95% CPU, what do I do? Well, if it’s serving requests properly and within the SLO, I go get a cup of coffee. I might need to look at it, but it’s not a crisis.
Most metrics, OS or otherwise, are a great recipe for creating lots of alert noise that is not actionable. CPU and memory usage could be high due to other applications, and in most cases relate to overall capacity and not to the application’s performance or current state of functionality. You should definitely collect them so that you can go back and debug problems later. If you’re thinking about setting up an alert you need to ask yourself two things:
Is this always actionable when the alert goes off?
Is the action 100% clear?
If the answer to either of these is something along the lines of “Yes, but…” you need to stop and rethink what you’re trying to accomplish. But, Todd! I need to monitor things like disk usage, don’t I? Yes, of course we do, but this falls under the heading of capacity planning.
My Kafka environment, like many of yours, is shared between many different applications. You may even have some of the tech debt that we have, where you have little control over when someone starts using it for a new service. This means that we should be keeping an eye on the capacity of the system, and preemptively adding more.
Preemptively is the key word here. You want to deploy new brokers before you’ve hit 100% capacity, which means that you need to order them earlier than that.
I am no magician, contrary to the perception that many have of my ability to solve problems. It does me no good to get an alarm in the middle of the night that we’re approaching saturation, as I can’t magically make new hardware appear. And if I already have the hardware, it should have been added to the clusters so that I never hit a crisis point.
The metrics that I’m mostly interested in for judging capacity are:
Request handler pool idle ratio
Disk utilization
Partition Count
Network utilization
You should be trending these metrics over time, and reviewing them on a regular basis. You may want to have some sort of alert once capacity is approaching the point where you need to get more, but that should be an email or, even better, an automatic work ticket in your system of choice. Additionally, make sure you’re making use of features like quotas and retention of messages by size so that you can minimize any surprises.
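A sketch of the kind of trending that turns capacity into a proactive ticket rather than a 3 a.m. page. The straight-line projection and weekly cadence are illustrative assumptions:

```python
# Sketch: proactive capacity planning via a simple linear trend. Given weekly
# disk-utilization samples, estimate when we cross a threshold so hardware
# can be ordered ahead of time. A straight-line fit is a deliberate
# simplification; real growth (see the hockey stick) may be steeper.

def weeks_until(samples, threshold):
    """Project the average per-week growth of the samples to the threshold."""
    growth = (samples[-1] - samples[0]) / (len(samples) - 1)  # per week
    if growth <= 0:
        return None  # flat or shrinking: no deadline
    return (threshold - samples[-1]) / growth

disk_pct = [52, 55, 58, 61, 64]    # five weekly samples, +3%/week
print(weeks_until(disk_pct, 85))   # weeks until 85%: file the ticket now
```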
If you take nothing else away from today’s talk, leave with this.
First, you must define what your service level objectives are for Kafka within your organization. Even if you’re running at a small scale, and with a limited number of customers. Even if you’re the only customer of your cluster. Make it clear what the expectations are, and hold to them.
Next, once you have those SLOs, that is what you need to be monitoring. David Henke, who led Engineering and Operations at LinkedIn for many years, would often say “What gets measured, gets fixed.” If you do not monitor your SLOs, then they do not really count.
But beware of metrics that inform you about many different problems. They are typically noisy, and they often make it difficult to determine what the underlying problem is. They are attractive, because it’s a single number that says “something is wrong”, but they will drive you crazy in the end.
And lastly, buy yourself a copy of Kafka: The Definitive Guide. In fact, you should buy two or three. Because reasons.