What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker that will leave you an expert in identifying problems with the least amount of pain:
Under-replicated Partitions: The mother of all metrics
Request Latencies: Why your users complain
Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
6. Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life
7. Service Level Objectives
• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
9. The Three Metrics You Need to Know
• URP – Partitions that are not fully replicated within the cluster
• Request Handlers – The overall utilization of an Apache Kafka broker
• Request Timing – How long requests are taking, in which stage of processing
17. Request Handler Problems
CPU Time
• Anything that causes Kafka to expend CPU cycles
• Includes problems related to failing disks (IO wait)
• SSL and compression work can both use a lot of CPU
Timeout
• Most often due to failing to process controller requests
• Intra-cluster requests tend to be bound by partition counts
• Rapidly starves the pool of threads
Deadlock
• Should always be a code bug
• Usually looks exactly like a timeout problem
• Rare, but hard to identify
21. Brokers Don’t (Shouldn’t) Do Compression
Up Conversion
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Down Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to clients
• Producer clients update to new version
22. Request Timing
• Request Queue – Waiting to process
• Local – Work local to the broker
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send – Send to client
• Total – Request handling, end to end
29. Operating System and Hardware Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
• 100% clear signal
• 100% clear response
34. If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide
35. Getting (and Giving) Help
LinkedIn Open Source
• Kafka Monitor
• https://github.com/linkedin/kafka-monitor
• Burrow
• https://github.com/linkedin/Burrow
• Cruise Control
• https://github.com/linkedin/cruise-control
• kafka-tools
• https://github.com/linkedin/kafka-tools
Get Involved
• Community
• users@kafka.apache.org
• dev@kafka.apache.org
• Bugs and Work:
• https://issues.apache.org/jira/projects/KAFKA
Let me start off by telling you what we’re not talking about today. I won’t be going into the basics of what Kafka is – I assume that if you’re attending Kafka Summit, you have an idea of what it does and how it works. Regardless, you’re going to get some good data here on monitoring, even if you have very limited Kafka knowledge.
However, this also won’t be an encyclopedic look at monitoring. I’m going to discuss a few key sets of metrics, and how to use them. But I won’t even be covering all the Kafka metrics you should look at, never mind all that exist. I encourage you to spin up a JMX tool of choice and explore what’s exposed for sensors in Kafka. I also encourage you to share with the class, whether in posts, talks, or tweets, any gems that you have for your own monitoring.
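For those spinning up a JMX tool, the three metric families covered in this talk live under well-known broker MBean names. A minimal sketch of filtering a bean listing down to them — the bean list is hard-coded here for illustration; a real script would fetch it from a JMX client such as jmxterm or jolokia:

```python
# Sketch: the standard broker MBean names for the three metric families this
# talk covers. The list is a literal here so the filtering is easy to follow;
# in practice it would come from a JMX client, not be hard-coded.
beans = [
    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent",
    "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
    "kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce",
    "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs",
]

def filter_beans(bean_list, keywords):
    """Return beans whose object name contains any of the given keywords."""
    return [b for b in bean_list if any(k in b for k in keywords)]

relevant = filter_beans(beans, ["UnderReplicatedPartitions",
                                "RequestHandlerAvgIdlePercent",
                                "RequestMetrics"])
print(len(relevant))  # 4 of the 5 sample beans belong to the three families
```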
I’m also not going to talk about automation, even as it relates to handling alerts. There are many fine talks out there about automating responses and runbooks, and we could spend hours talking about just that.
So why am I here today talking about monitoring? There are lots of topics that could be covered, especially in an ecosystem as large as Kafka. And I could always deliver yet another “here’s how we do it at LinkedIn” talk. However, today I’m choosing to share a look at where we’re moving right now.
I recently wrote a post for DevOps.com about a term we use, “Code Yellow”. This is one of our tools for dealing with an application, or a team, in crisis. Typically this is due to something like communication problems, or a large amount of tech debt. Since I recently wrote this post, and you all know that I work on Kafka, you can probably guess that I’m currently in this state. In our case, it’s due to somewhat unexpected growth.
LinkedIn started using Kafka back in 2010, before it was open sourced. In September of 2015, we announced that we had hit a milestone, at one trillion messages a day produced into our Kafka clusters. Last year, at Kafka Summit in San Francisco, I noted that we had passed two trillion messages a day. At the beginning of the year, we clocked in at three trillion. And now, we’re over five trillion messages a day. That hockey stick at the end is the current source of my long days and sleepless nights.
Top this off with the fact that our monitoring is currently very noisy, partly due to scale problems around this growth, and partly because we alert on many things that are not providing clear signals. We’re currently overhauling our monitoring as a result of this.
So why do we have such noisy alerting? We’ve forgotten that monitoring and alerting are not the same thing.
Today, we're going to be talking about monitoring, not alerting. What is the difference, you ask? In our case, monitoring refers to all the data we have available to us from Kafka and our underlying systems, from high level metrics like partition counts down to the most minute sensor that is available. Alerting, on the other hand, we will use to refer to the metrics that are used to tell us about an imminent problem. They're the metrics that wake us up at night. These should be carefully chosen, and they should be clear signals that demand an immediate response 100% of the time.
Another thing to keep in mind is that events are almost always superior to metrics when alerting. We know this, right? Kafka is all about events. And yet we still have measurements that are rates where they should be discrete counts of events. We normally can’t work with individual events, like a failed request, at scale. But we do want to know the actual number of failed requests, and not a requests-per-second metric where we miss data due to time windows.
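A made-up illustration of the difference between a windowed rate and a discrete count:

```python
# Sketch: why discrete event counts beat windowed rates for alerting.
# A burst of 120 failed requests inside one second disappears when averaged
# over a 60-second window, but a monotonic counter captures every event.
# The numbers here are made up.

failures_per_second = [0] * 59 + [120]  # a quiet minute ending in a burst

# A one-minute average rate smooths the burst down to 2/sec.
avg_rate = sum(failures_per_second) / len(failures_per_second)

# A cumulative counter delta over the same window keeps the true total.
counter_delta = sum(failures_per_second)

print(avg_rate)       # 2.0 -- looks harmless
print(counter_delta)  # 120 -- the real number of failed requests
```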
We also need to make sure that we’re testing the code before we deploy it. My team has fallen prey to reactive alerting – we find a new problem, like a socket leak, and we add a new alert for file handles in use so we can catch it before it goes critical. The bug gets fixed, but we keep the alert, just in case we run into it again. It would be much better for everyone if we added a release test that checks for the general case of increased file handle usage, and dropped the alert on the live systems.
Alerting should always be aimed at maximizing the amount of sleep that your operations team gets. That means as few alerts as possible to keep everything running, and automating as much as possible.
When we're talking about alerting, the most important thing to watch is the metrics related to your service level objectives, or SLOs. Just as a note, an SLO and an SLA are not the same thing. A service level agreement is a contract: it's basically an SLO with teeth - a penalty. The SLO is the level of service that we're promising to our customers. For Kafka, this is typically going to be that the system will be available, and it will perform at a certain level for produce and consume requests. We'll cover what metrics to use for this in a bit.
In addition to these, your SLOs are whatever you’re guaranteeing to your customers. This may include a minimum amount of retention. If you’re working to GDPR, or another privacy standard, you may specify a maximum amount of time that data will be retained for (here’s a hint, that’s not necessarily the retention in time that you set for the topic).
I've talked at length about the under-replicated partition count metric. I dedicated a significant number of pages in a book you may have seen to how to respond to any non-zero value. At its heart, this number tells you that replication within the cluster is having a problem.
A stable count on all but one broker tells you that that broker is not working. It's either down, or the replication is not started.
A variable count on a single broker tells you that that broker is having a problem servicing consume requests
A variable count on multiple brokers indicates a more overall problem. In this case, you'll need to enumerate the partitions that are falling behind (using the CLI tools) and see if there is a common thread, such as a single broker that is having problems replicating from multiple cluster members.
But the most important thing about the URP metric is that it's overrated for alerting. That's right, I said it. I don't like getting paged for this metric. But why, you ask? If it illustrates so many problems, why wouldn't I want to get alerts for it? The problem is that it doesn't tell me that I'm breaching my SLO, and whatever problem it's telling me about is often not immediately actionable. More often than not, this metric tells me about one of two problems. The first is that a broker is down. I can detect that with a much clearer signal, however, by health checking the application. The other problem is that the cluster is operating over its capacity. I don't want to be paged for that either, because capacity is a proactive monitoring problem, not a reactive one. We'll talk about that more in a few slides.
Still, you should be collecting this metric, and you might want to consider generating warnings for it. It does illustrate a risky situation, because we depend on replication in the cluster for redundancy. When it's not zero, you have a problem that needs some attention.
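The diagnosis patterns described above can be sketched as a small classifier. The function name, input shape, and message strings are all illustrative, not a real tool:

```python
# Sketch: classifying an under-replicated partition (URP) signal using the
# rules from the notes above. Input is a window of per-broker URP samples.

def classify_urp(samples_by_broker):
    """samples_by_broker: {broker_id: [urp_count, ...]} over a recent window."""
    affected = {b: s for b, s in samples_by_broker.items() if max(s) > 0}
    if not affected:
        return "healthy"
    # A stable (unchanging) non-zero count points at a dead or stopped broker.
    stable = {b for b, s in affected.items() if min(s) == max(s)}
    if len(affected) == 1:
        broker = next(iter(affected))
        if broker in stable:
            return f"broker {broker} down or not replicating"
        return f"broker {broker} struggling to service consume requests"
    # Multiple brokers affected: dig in with the CLI tools.
    return "cluster-wide problem: enumerate lagging partitions with the CLI tools"

print(classify_urp({1: [0, 0, 0], 2: [12, 12, 12], 3: [0, 0, 0]}))
```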
As with most applications, Kafka has thread pools to do work. There are several different ones - network handlers, request handlers, log compaction, recovery (which are also used for handling log segments at startup and shutdown). When we’re talking about client traffic, the network and request handlers are the ones that do all the work, and the request handlers are far more important. This is because the network handlers just take care of the network connection, including reading and writing bytes on the wire.
The request handler does everything else for the client - it decodes and validates the protocol, handles produce and consume work, and assembles the response to send back. It even performs all of the broker internal work, responding to controller requests. This means that if you want a single indicator of how busy the broker is, you couldn’t ask for a much better measure than the utilization of the request handlers. But as with under-replicated partitions, there are a lot of different problems that could be indicated here.
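If you want to watch this single indicator, the broker exposes it as RequestHandlerAvgIdlePercent (despite the name, a ratio between 0.0 and 1.0). A sketch of turning it into a utilization signal — the 20%/10% idle thresholds are illustrative assumptions, not official guidance:

```python
# Sketch: turning Kafka's RequestHandlerAvgIdlePercent into a utilization
# signal. Despite the name, the metric is a ratio between 0.0 and 1.0.
# The thresholds below are illustrative, not official guidance.

def handler_pool_status(avg_idle_ratio):
    utilization = 1.0 - avg_idle_ratio
    if avg_idle_ratio < 0.10:
        return utilization, "critical: pool nearly saturated"
    if avg_idle_ratio < 0.20:
        return utilization, "warning: investigate CPU, disk, or conversion work"
    return utilization, "ok"

util, status = handler_pool_status(0.15)
print(f"{util:.0%} busy -> {status}")
```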
CPU - Slow disk performance, often due to a failing drive, is a particular problem for produce requests. As the request handler will have to take more time when writing to disk, it will manifest as higher utilization
Timeouts and deadlocks look very similar
Timeouts - all of the request handler threads are getting tied up. We most often see this when the broker is starting up, and it is failing to process requests from the controller within the controller socket timeout.
Deadlock - But if that doesn’t solve it, you may have hit a deadlock condition in handling requests. We’ve seen this recently with some shutdown code, but it was related to the authorizer we were using and not Kafka directly.
Timeouts most often happen when controller requests are not processed within the controller socket timeout. What happens is that the controller sends the request, it times out, and then the controller sends the request again. You’ll see this especially when the broker is starting up, and the controller is trying to send it the state of the world with leader and ISR requests
Deadlocks look almost identical, but they’re much more rare. We’ve seen them recently during shutdown, but that was caused by an issue in the authorizer module that we use, and not something that was endemic to Kafka itself. However, they’re almost always code issues. This makes them pretty tricky to debug.
Wait, the Kafka brokers don’t compress data anymore! We got rid of that with the bump to message format 1, and relative offsets in the produced batches. Right?
Yeah, that’s what I thought, too. Turns out that there are a couple cases, which are not as rare as you might think, that will result in the broker having to rewrite the incoming message batches.
Another common culprit for the request handlers being over-utilized, even at a low traffic volume, is compression. This happens when the client versions do not match the message format on disk. The message format is settable via a broker configuration (log.message.format.version), and controls how messages are written to disk. In an ideal world, the producer client version matches this configuration, such that the producer is sending the same message format. If the producer is an older version, the broker will have to up-convert the messages, and if the producer is using a higher message format version, the broker will need to down-convert. Both of these situations mean the broker will be forced to recompress the message batch before writing it to disk (this also happens if your brokers are still using message format zero). This is an expensive operation, and should be avoided. It’s also worth noting that you can set the message format on disk as a per-topic override. You will want to be very careful if you feel the need to do this, as it means the logs on disk are inconsistent, and you could easily have compression you’re not expecting.
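The conversion decision can be sketched as a simple comparison of message format versions. This is simplified to the numeric format versions only; the real broker logic also accounts for the client's protocol version and compression settings:

```python
# Sketch: when a broker must convert (and recompress) message batches.
# Versions are simplified to the message format numbers (0, 1, 2); the real
# decision also involves the client protocol version and compression codec.

def conversion_needed(producer_format, broker_format):
    if producer_format < broker_format:
        return "up-conversion"    # old client, newer on-disk format
    if producer_format > broker_format:
        return "down-conversion"  # new client, older on-disk format pinned
    return "none"                 # formats match: no recompression needed

print(conversion_needed(1, 2))  # up-conversion
print(conversion_needed(2, 1))  # down-conversion
```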
If you have slow request processing due to issues like this, you’re also going to have latency issues. Which gets us into the third set of metrics...
For each protocol request type, Kafka provides a set of timing metrics. These describe the amount of time that the request spends in various states while being processed:
Total time - this is the overall total time to process a request, from when it is received to when it is complete
Request Queue Time - how long the request sits in queue before being picked up by a request handler for processing
Local Time - The amount of local processing time required for the request. This can include a number of things, such as disk write time for produce requests
Remote Time - The amount of time that the request waits on non-local steps. This includes acknowledgements from followers for produce requests
Response Queue Time - how long the response for the request sits in queue before being sent to the client
Response Send Time - how long it takes to send the response to the client. This only covers getting it into the send buffers locally, not network time.
In addition to the time metrics, there is also a rate metric that gives you the number of requests of a particular type per second. The time metrics are provided as percentiles, and as such you can choose from 50th, 75th, 99th, and 99.9th percentiles, as well as an average and maximum value over the course of the running process.
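Putting the stages together, here is a sketch of decomposing a single request's total time to find the dominant stage. The stage names mirror Kafka's RequestMetrics beans (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs); the sample numbers are made up:

```python
# Sketch: breaking a request's total time into its component stages to find
# where the latency lives. The sample values below are invented for a request
# where waiting on follower acknowledgements (remote time) dominates.

timings_ms = {
    "request_queue": 0.5,   # waiting for a request handler
    "local": 3.0,           # e.g. disk write for a produce request
    "remote": 45.0,         # waiting on follower acks (acks=all)
    "response_queue": 0.3,  # waiting to send the response
    "response_send": 0.2,   # copying into the send buffers
}

total = sum(timings_ms.values())
dominant = max(timings_ms, key=timings_ms.get)
print(f"total={total:.1f}ms, dominant stage={dominant}")
```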
Request latency is typically going to be the first of your SLO measurements. Which means that you will probably want to be monitoring these metrics and possibly alerting off them. The problem comes in as you try to pick which attributes to monitor, and what the baseline values are.
Here are the produce TotalTime graphs, 50th percentile and 99.9th percentile, for a broker that is working perfectly well. It may be hard to see, but the scale of the first graph is in single digits, and the scale of the second is in thousands. If the broker is running well, why is there such a discrepancy? The reason is that the amount of time required for a produce request varies widely depending on the content of the request.
Let’s consider the local time. Again, these are the 50th percentile and the 99.9th percentile, and the first graph goes from zero to one, while the second graph is again in the thousands. What would impact the amount of time required to process the produce requests locally? In this case, most of our produce requests are really small - small batches, single topic - but some of them are very large. The bigger the produce request, the more time it takes to write the data to disk.
How about the remote time for the same produce requests? Yet again, these are the 50th and 99.9th percentile graphs, with the first one going from zero to two, and the second being in the thousands. The median value is small, but the 99.9th percentile is multiple orders of magnitude higher. The most common cause here is that most of our requests are being produced with the required acknowledgements set to 1, while some are requesting all acknowledgements. That easily drives up the amount of time spent in the remote step.
This isn’t to say that you can’t use these metrics effectively for alerting. It just means that you need to define your SLOs appropriately. Stating simply that produce requests will be handled in 20ms or less may not be reasonable, but specifying that value for the average produce request may be fine.
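A quick illustration, with made-up numbers, of why the average and the tail need different SLO treatment:

```python
# Sketch: why a blanket "20ms or less" SLO fails while "20ms on average"
# can hold. Most requests are tiny; a few (acks=all, large batches) are slow.
# The latency distribution below is invented for illustration.
import statistics

latencies_ms = [2] * 990 + [900] * 10  # 99% fast, 1% slow

mean = statistics.mean(latencies_ms)
p999 = sorted(latencies_ms)[int(len(latencies_ms) * 0.999) - 1]

print(mean)  # ~11ms: an average-based 20ms SLO holds
print(p999)  # 900ms: a blanket 20ms SLO would page constantly
```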
OK, so we’ve covered our three metrics, and we’ve still got X minutes left in this talk. I could sit here and just stare at my phone for the rest of the time. Or …
We could talk about what’s missing, since we only covered a very small slice of monitoring for Kafka.
The other side of your service level objectives is probably going to be the availability of Kafka to handle requests. But as with any system, you can’t truly measure the availability of a Kafka cluster from the brokers themselves. There are many factors that go into availability, including whether or not the network is working. Looking at the broker itself may tell you that everything’s fine, meanwhile none of your clients can connect.
For monitoring availability, you need to use something external to the Kafka cluster to look at it from the client’s point of view. This is why LinkedIn created, and open sourced, kafka-monitor (https://github.com/linkedin/kafka-monitor). This runs a producer and a consumer for each cluster, and assures that both requests work properly. It can assure that there is at least one partition on each broker in the cluster, so you check the entire cluster. It also provides latency metrics for requests, so you have an objective view of the request timings we were just talking about.
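The probe itself needs a live cluster, but the availability arithmetic on top of it is simple. A sketch, assuming an external checker records one success/failure sample per end-to-end probe:

```python
# Sketch: computing availability from an external probe, in the spirit of
# kafka-monitor. Each sample is True if a probe produce/consume round trip
# succeeded; the probe itself (real clients against the cluster) is assumed.

def availability(probe_results):
    """Fraction of successful end-to-end probes over the window."""
    return sum(probe_results) / len(probe_results)

window = [True] * 997 + [False] * 3  # e.g. one probe per second
pct = availability(window)
print(f"{pct:.2%}")  # measured from the client's point of view
```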
So what should we do about lower level OS and hardware metrics? Well, let me ask you this. I have a Kafka cluster that’s running at 95% CPU, what do I do? Well, if it’s serving requests properly and within the SLO, I go get a cup of coffee. I might need to look at it, but it’s not a crisis.
Most metrics, OS or otherwise, are a great recipe for creating lots of alert noise that is not actionable. CPU and memory usage could be high due to other applications, and in most cases relate to overall capacity and not to the application’s performance or current state of functionality. You should definitely collect them so that you can go back and debug problems later. If you’re thinking about setting up an alert you need to ask yourself two things:
Is this always actionable when the alert goes off?
Is the action 100% clear?
If the answer to either of these is something along the lines of “Yes, but…” you need to stop and rethink what you’re trying to accomplish. But, Todd! I need to monitor things like disk usage, don’t I? Yes, of course we do, but this falls under the heading of capacity planning.
My Kafka environment, like many of yours, is shared between many different applications. You may even have some of the tech debt that we have, where you have little control over when someone starts using it for a new service. This means that we should be keeping an eye on the capacity of the system, and preemptively adding more.
Preemptively is the key word here. You want to deploy new brokers before you’ve hit 100% capacity, which means that you need to order them earlier than that.
I am no magician, contrary to the perception that many have of my ability to solve problems. It does me no good to get an alarm in the middle of the night that we’re approaching saturation, as I can’t magically make new hardware appear. And if I already have the hardware, it should have been added to the clusters so that I never hit a crisis point.
The metrics that I’m mostly interested in for judging capacity are:
Request handler pool idle ratio
Disk utilization
Partition Count
Network utilization
You should be trending these metrics over time, and reviewing them on a regular basis. You may want to have some sort of alert once capacity is approaching the point where you need to get more, but that should be an email or, even better, an automatic work ticket in your system of choice. Additionally, make sure you’re making use of features like quotas and retention of messages by size so that you can minimize any surprises.
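A sketch of the kind of trending that turns capacity into a proactive ticket rather than a 3 a.m. page. The straight-line projection and weekly cadence are illustrative assumptions:

```python
# Sketch: proactive capacity planning via a simple linear trend. Given weekly
# disk-utilization samples, estimate when we cross a threshold so hardware
# can be ordered ahead of time. A straight-line fit is a deliberate
# simplification; real growth (see the hockey stick) may be steeper.

def weeks_until(samples, threshold):
    """Project the average per-week growth of the samples to the threshold."""
    growth = (samples[-1] - samples[0]) / (len(samples) - 1)  # per week
    if growth <= 0:
        return None  # flat or shrinking: no deadline
    return (threshold - samples[-1]) / growth

disk_pct = [52, 55, 58, 61, 64]    # five weekly samples, +3%/week
print(weeks_until(disk_pct, 85))   # weeks until 85%: file the ticket now
```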
If you take nothing else away from today’s talk, leave with this.
First, you must define what your service level objectives are for Kafka within your organization. Even if you’re running at a small scale, and with a limited number of customers. Even if you’re the only customer of your cluster. Make it clear what the expectations are, and hold to them.
Next, once you have those SLOs, that is what you need to be monitoring. David Henke, who led Engineering and Operations at LinkedIn for many years, would often say “What gets measured, gets fixed.” If you do not monitor your SLOs, then they do not really count.
But beware of metrics that inform you about many different problems. They are typically noisy, and they often make it difficult to determine what the underlying problem is. They are attractive, because it’s a single number that says “something is wrong”, but they will drive you crazy in the end.
And lastly, buy yourself a copy of Kafka: The Definitive Guide. In fact, you should buy two or three. Because reasons.