Conf42 Observability 2025 - Online

- premiere 5PM GMT

Observability-First Kafka: Engineering Visibility at Scale

Abstract

Kafka doesn’t fail cleanly: it stalls, lags, and misfires beneath the surface. This talk cuts through the noise to show what signals actually matter, how to catch issues early, and how to make Kafka observable without drowning in metrics or noise.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
All right, hey everyone, good day. I'm here to talk about observability with Kafka: how you treat observability as a first-class citizen when you're dealing with Kafka, and how you ensure visibility at scale. Kafka is the backbone of modern data streaming and messaging ecosystems at this point, so as soon as you talk about Kafka you want to talk about observability, because if your Kafka cluster fails, it's going to be a real problem. So let's talk about how you can add observability to your Kafka ecosystem using what I like to call the MELT stack. MELT is an acronym for metrics, events, logs, and traces, and we're going to talk about all four of those for Kafka.

Let's start with why Kafka needs observability. The server rack in the photo is there to make a point: Kafka is not just one server, it's almost always a group of servers working together as a cluster to give you high availability, reliability, and replication for your data. It has become critical infrastructure for a lot of data streaming systems, so early detection is really critical. You want to know if a broker is about to crash, if there is memory pressure, if there is replication lag, if consumers are lagging, things like that. It also helps with fast troubleshooting: if you have proper observability set up for your Kafka clusters and your producers and consumers, it's easier for anybody to troubleshoot problems when they occur, and sometimes even before they occur. And because you have all those metrics, you can dig deep into the performance of the cluster: is it doing well, where are the bottlenecks, what does the heat map look like?

So let's get into the MELT stack. In the pyramid, the base is metrics: numerical measurements of whatever is happening within your system. Kafka brokers expose a lot of metrics out of the box, and you can use Prometheus and Prometheus exporters to convert those MBeans into Prometheus- or OpenTelemetry-compatible formats and expose them externally. Those metrics give you the performance and functionality traits of your ecosystem.
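[Editor's note: a minimal sketch, not from the talk, of what reading one of those broker MBeans over JMX looks like. It assumes JMX is enabled on the broker; the host, port, and MBean name are placeholders you would adjust for your cluster.]

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMBeanProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with JMX enabled on port 9999 (hypothetical host/port).
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // A standard Kafka broker MBean: incoming message rate across all topics.
            ObjectName messagesIn =
                new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object oneMinuteRate = mbeans.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1-min rate): " + oneMinuteRate);
        }
    }
}
```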
Next up are events. Events are things like: hey, my broker crashed and is restarting; something happened in the cluster; my disk is around 70% full, so I should generate an alert and add more disk space; my network throughput is saturated, so I need to think about adding more servers and rebalancing, or about scaling up the network endpoints or the switches. All of those could be events. Generally, events are not exposed by the Kafka brokers themselves; they mostly show up through the logs and metrics, so we won't discuss them explicitly in as much depth, but keep in mind that there are things within Kafka that can be categorized as events. Nowadays people even deploy Kafka on Kubernetes, so all the Kubernetes events are relevant too: when there is a network change, when a pod moves, crashes, or restarts, those events may correlate with the day-to-day performance of your Kafka cluster. Events are optional, but really helpful.

Logs are not optional at all. You want logs from your Kafka brokers no matter what, and you want logs from your client code as well (take a look at my older talks on that). You want all of those logs aggregated into one aggregation system, not five different ones, so you can track and search them easily. Logs are super critical.

Traces function not at the Kafka broker level but, more or less, at the client level. The idea is: I'm going to add a watermark to every one of my messages, and wherever I see that message flowing I'll be able to track the entire workflow. That gives you two things: first, how much time the message spent in a particular client if there was processing going on; and second, the entire workflow, where the message went, which hops it took, and where it eventually ended up. Traces are optional, but good to have, because they give you a detailed view of what your functional system is doing on top of how the operational system is doing. Put together, that's the pyramid of the MELT stack: complete, 360-degree visibility into your Kafka cluster ecosystem.

Tool chain recommendations are generally hard and I don't like to make them, but this is where I diverge, because I love Prometheus: open source, community-run, community-driven, a metrics aggregation TSDB, a time series database. OpenTelemetry: as the name says, it's open, and it's a standard at this point; everybody uses it, and because it's a standard you can move from Prometheus to something else in the future and stay compatible as long as it supports the OpenTelemetry formats. That's something I'd say is critical to have in your ecosystem. Grafana is one of my recommendations simply because I love Grafana. They do have an open source component that anybody in the world can use, but mostly it's easy, it's manageable, it talks directly to Prometheus, and it talks to Elasticsearch if it really needs to.
There are multiple ecosystems it can plug into and still display your graphs the way you need them.

Let's start diving into the MELT stack. Metrics are system health indicators. As I referenced earlier, these are numerical measurements that indicate the health of the system, its performance, and what it's doing at any point: the points of interest in your ecosystem. How many requests did I process? Is my disk under back pressure? Is my network thread count enough to tolerate the incoming rush of requests from producers and consumers? Replication lag, consumer lag: all of that is metrics. They matter because they form the basis of your view into Kafka; they are the foundation of our pyramid of observability. So make sure you have metrics available.

Events are the second signal, which we briefly talked about: discrete state changes. For Kubernetes that's crashes, restarts, a rollout, a new deployment; five hundred different things can happen, and those are all events. It's similar for Kafka: you could have a broker crash, a deployment, a maintenance window where your Kafka cluster is rolling. All of that is events, and they provide the context for why your metrics are spiking at some point in time. For example, say you initiated a rollout for your cluster and you're rolling one broker after the other. Your producers will see latency spikes, and those latencies are suffering because you're rolling the cluster; there's no other correlation, more or less. If you look at the metrics by themselves in that window, you probably won't understand why that's happening. Once you add the event into the mix, you can see: oh, okay, the latency spiked because my cluster was rolling; a broker restarted, it took some time to get promoted back to leader, and the latency kicked up a notch for a few milliseconds. That's why events are important.

Logs, like we said, are timestamped, detailed records of the activities the system was performing. Any error, any Java stack trace, comes out through your server logs. Things like authentication and authorization, whatever is happening within your system, are tracked through the logs. It depends on the verbosity you choose: you can go as granular as you want or keep it informational. They matter a lot because metrics plus logs plus events are how you determine what your system is doing and why, and that's just as true for Kafka. For example: if there was a controller election, what happened? If there are frequent client disconnections, what's going on? If a producer is looping while producing, was the network backing up, was the disk backing up, what was going on? Those kinds of things are exposed via the logs.
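[Editor's note: to make one of those health indicators concrete, here is a hedged sketch, not from the talk, of computing consumer lag with the Kafka AdminClient: latest end offset minus committed offset, per partition. The bootstrap address and group id are placeholders.]

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the consumer group (group id is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("orders-service")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(request).all().get();

            // Lag = end offset - committed offset, per partition.
            committed.forEach((tp, offset) -> {
                if (offset == null) return; // partition with no committed offset yet
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```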
Some of that is visible through the metrics, sure, but once you see the symptom you'll want to see why the symptom is occurring, and that's going to happen with the logs, only with the logs.

Traces, as we talked about, sit at the functional layer, so you may consider them optional, but if you want to think about an end-to-end request flow, that's where traces shine. It's not about Kafka anymore; this is about your application and your business logic. A lot of customers do use them, and generally they add headers to the Kafka message itself. Every Kafka message has headers, so you can add a header carrying that watermark and have it persist throughout the life of the message. Wherever the message goes, wherever it's read, that Kafka header basically says: hey, I came through here. The watermark persists through all the processing systems you run on top of Kafka, and they will tell you that this message came through them. It helps with transaction flows specifically.

Like I said, you want to use all four of the MELT signals. They have complementary perspectives; none of them is enough by itself. Metrics help you detect anomalies in real time. Events tie those anomalies to specific changes. Logs give you the deep details for debugging, and even when there was no anomaly they show you what happened in the system. And traces help you visualize the end-to-end paths above Kafka, in the business layer, to give you a holistic view of what's going on. You want the complete picture, you want the complementary perspectives, and that's why all the signals matter; you choose which ones to start with. My recommendation would be to start with metrics and logs at a bare minimum, and then graduate from there into events and tracing.

All right, let's dig a little deeper into Kafka metrics. This is one of the big goals for observability, and you want to make sure you have enough Kafka metrics, but Kafka can overwhelm you with them. The last time I counted, which was a fair two or three years ago, Kafka exposed about 3,500 MBeans, permutations and combinations included, not just unique MBean names, and that was a very bare-minimum cluster running on my home server. So it spits out a lot of data and creates a lot of time series, and you have to cherry-pick the ones you need. Not everything is as important as, say, replication lag, CPU consumption, network I/O load, or network thread usage; those are the things to start with. The biggest key indicators: replication health, partition distribution, and how well the leaders are distributed. You don't want a very hot broker where a lot of partition leaders concentrate while the others sit idle; you want leadership spread out, and you can use various tools for that.
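[Editor's note: one way to check that leadership isn't piling up on a hot broker, a sketch of my own rather than something shown in the talk, is to count partition leaders per broker with the AdminClient. The topic names and bootstrap address are placeholders, and allTopicNames() assumes a reasonably recent client.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderDistribution {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Topics to inspect (placeholders); you could also start from listTopics().
            Map<String, TopicDescription> topics =
                admin.describeTopics(List.of("orders", "payments")).allTopicNames().get();

            // Count how many partition leaders each broker id currently holds.
            Map<Integer, Integer> leadersPerBroker = new HashMap<>();
            topics.values().forEach(td ->
                td.partitions().forEach(p -> {
                    if (p.leader() != null) {
                        leadersPerBroker.merge(p.leader().id(), 1, Integer::sum);
                    }
                }));

            leadersPerBroker.forEach((brokerId, count) ->
                System.out.printf("broker %d leads %d partitions%n", brokerId, count));
        }
    }
}
```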
This is not a Kafka admin talk, but you want to make sure there is enough capacity available on each of those brokers and that the capacity is utilized as evenly as possible, so the cluster can perform the way a cluster should and doesn't get throttled by a single broker.

Error rates are very important. How many requests failed? How many fetch requests failed? How many produce requests failed? Maybe something failed in replication. You want to understand the error indicators and what they mean, to get a clearer picture of how your Kafka cluster is performing.

I've mentioned network utilization multiple times already. Network is by far the most precious resource to monitor for a Kafka cluster; everything is network in the end. Kafka replicates data in the background between brokers, across however many replicas you configure, and that needs network capacity. If you're producing at a thousand megabits per second, about 125 megabytes per second, you're already using most of the capacity of a one-gigabit Ethernet port; you'd probably want a 10-gigabit SFP+ port on your Kafka brokers. I'm not getting into the hardware, but you have to understand the capacity requirements at the scale your Kafka cluster is operating. If it's a small cluster, a one-gigabit port will probably work, maybe a bonded pair that gives you around 1 to 1.5 gigabits per second. But it depends on your use case and your scenarios, and how much you're actually utilizing is exactly the answer your metrics will give you. So make sure you have them.

How do you instrument all this? That's even more important. Kafka exposes all of those metrics as MBeans, and those MBeans are exposed via JMX. So you enable JMX metrics on all the brokers (and on the clients too, technically, if you want client metrics that way, but the brokers you definitely want). Then you deploy a metrics scraper. Something like the Prometheus JMX exporter is awesome; recently it has been getting a lot more attention than it was three years ago, so it's improving a lot. The whole point of a JMX exporter is to take the JMX MBeans and convert them from a Java-centric format into something HTTP-friendly, if I may: the JMX data becomes a metrics endpoint, an HTTP endpoint on a specific port that serves a single page where every line is a metric. That's it. Once you deploy that, the Kafka brokers expose those metrics in scrapable form, and your Prometheus server (or whatever you're using) goes and scrapes those endpoints on all the brokers and gathers everything into one place.
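[Editor's note: once the JMX exporter is running on a broker, that metrics page is plain HTTP text, one metric per line, so even a throwaway check is easy. A small sketch, assuming the exporter listens on port 7071 (a common but not universal choice) and that the metric name below matches your exporter's renaming rules.]

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ExporterScrape {
    public static void main(String[] args) throws Exception {
        // Port 7071 and the metric name are assumptions; both depend on your exporter config.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://broker-1:7071/metrics"))
            .GET()
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // Each line is "metric_name{labels} value"; keep only the lines we care about.
        response.body().lines()
            .filter(line -> line.startsWith("kafka_server_replicamanager_underreplicatedpartitions"))
            .forEach(System.out::println);
    }
}
```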
For clients specifically, this is where newer versions of Kafka shine. Kafka added a feature where clients can ship their metrics, or at least the really important ones, to the Kafka broker for exposition. That's KIP-714, and it's available in Kafka 3.7 and above. So if you're on 3.7 or above, you already have KIP-714 available; you implement the one abstract class they provide on the broker side, and it can then expose client metrics via OpenTelemetry- or Prometheus-based exposition formats. That lets you centrally collect client metrics from all your clients directly through the brokers, so you don't have to instrument every producer and consumer with JMX; you just do it for the Kafka cluster, and you still get the most important metrics from your clients. It's a pretty recent change, but it's a game changer.

So the simple flow is: Kafka exposes metrics in JMX format; the telemetry collectors (OpenTelemetry or the Prometheus JMX exporter, which interoperates with the OpenTelemetry format) expose them; Prometheus scrapes that exporter endpoint and stores the time series data; and Grafana connects to Prometheus and turns that data into dashboards and graphs.

We're not going to talk about events too much, because events sit in between the metrics and the logs. Some events can be inferred, for lack of a better term, from the metrics themselves: if your broker is not online, that could mean it crashed, it went down, or it's a broker roll of some kind. Similarly, if there is a replication issue, the logs are going to tell you about it much more cleanly. There is no inherent, built-in way to expose Kafka events as such, so you'll have to rely on a mixture of logs and metrics to get those event-related signals. That's why events aren't in the slide deck, and I'm going to talk about logs directly.

Kafka logs. This is another observability goal, and one of the more important ones, because logs give you detailed troubleshooting. Logs are really rich: they tell you why the errors you're seeing in the metrics occurred. Whenever there's an error, whenever there's an issue, every stack trace is important, and this is where logs help you. Root cause analyses cannot be completed without logs; don't even think about it if you don't have the logs available. Another thing I'd say: when you're exposing logs, expose them as JSON values if you can, and I'll get to that in a moment. Logs give you the error details and the state changes, which goes back to events: controller elections, partition movements, leadership changes could all technically be labeled as events, so you'll want to extract them from the logs. Hence, logs are important. They also show you client activity, for example clients connecting with older client or broker versions; that's one of the big things that shows up in the logs.
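[Editor's note: going back to KIP-714 for a moment, there is very little to do on the client side. A minimal sketch, assuming Kafka clients 3.7+ and a broker that has a ClientTelemetry-capable metrics reporter plugin installed; the broker-side plugin is the "one class" mentioned above and is not shown here. Bootstrap address is a placeholder.]

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class Kip714Client {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // KIP-714 metrics push; defaults to true on 3.7+ clients, shown explicitly for clarity.
        props.put("enable.metrics.push", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The client instance id ties the metrics pushed to the broker back to this client.
            // This call may fail if the broker has no telemetry plugin or push is disabled.
            System.out.println("client instance id: "
                + producer.clientInstanceId(Duration.ofSeconds(5)));
        }
    }
}
```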
All of those things come out through the logs as well, so they're important.

On instrumentation: you want your logs to be high quality. Log4j was the standard for Kafka; if I remember correctly, with 4.0 and above they moved to Log4j 2, and even the Log4j Kafka shipped before was a specifically patched version, not the stock Log4j that had the big CVE a while ago. Use a structured format like JSON if you can, because structured logs are really easy for a machine to parse. Otherwise you'll have to build all those parsers yourself, and parser formatting is not easy; believe me, I've tried it with Metricbeat and Filebeat, and it takes a while to make sure every use case is handled. So set your log4j properties up well, and if JSON is available, use JSON; it's much easier for the machines. Yes, it swells your log volume, but it helps the machines make sense of everything.

Definitely deploy something like Filebeat: you want collectors streaming the logs to a central location so you don't have to go to each broker every time there's an issue. That is not sustainable once you go beyond a single cluster; even with two clusters it becomes untenable. Ship the logs to a central place where they're filterable by broker, by cluster, by all of that.

A retention strategy is important. Don't set the retention to seven days; it's not enough. I can tell you that very confidently because I've done it, and it did not bode well for me. That was a few years ago, when I was still naive: hey, who needs more than seven days of logs? That's not how it goes; you'll definitely need more. A good way to size it is to ask: how long do I allow my application teams to come back to me and ask for deep dives into issues? If that window is around 30 days, you probably want something like 45 days of logs retained and then phased out. There are ways to keep some of it in hot storage, some in warm storage, and export the rest to cold storage after a while, so you don't have to delete everything but still reduce cost. Either way, establish a retention strategy; there are great talks on the internet about log retention strategies and their cost implications, so check those out.

A simple implementation would be: Kafka log files, structured as JSON if you can; something like an OpenTelemetry Collector or Filebeat (Filebeat is probably the right one here) to ship them; Grafana Loki to store them, which is Grafana's other product, like Grafana but for logs instead of metrics, an excellent product, check it out; and then the Grafana UI to search and filter through those logs. This is not to say you have to use this exact stack; you can mix and match, or replace these components with something else you or your enterprise already have. So please, don't treat these as set guidelines or parameters you have to follow; change and mix and match however you like.
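[Editor's note: a tiny illustration, my own sketch rather than tooling from the talk, of why structured logs pay off: once broker logs are JSON, even a quick one-off filter needs no custom parsing grammar. The file path and the "level" field name are assumptions about your log4j JSON layout; a real pipeline (Filebeat, Loki) would parse the JSON properly.]

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ErrorLogGrep {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to a broker log written with a JSON layout.
        Path serverLog = Path.of("/var/log/kafka/server.log");

        try (Stream<String> lines = Files.lines(serverLog)) {
            lines
                // Field name "level" depends on the JSON layout in use.
                .filter(line -> line.contains("\"level\":\"ERROR\""))
                .limit(20)
                .forEach(System.out::println);
        }
    }
}
```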
Traces, like I said, are the last part of this: the whole workflow. This is tracking at the application level rather than the Kafka level, but it gives you end-to-end visibility into what's happening: where each message is going, how much latency there is, which service is causing a bottleneck if there is one, and where a failure sits if there is one, so you can actually understand what is going on in your ecosystem. Think about the message journey: a producer produces a message, and it should carry a watermark; it goes to the Kafka broker; the broker persists it. There could be one consumer receiving that message or ten; it could be a one-to-one flow or a one-to-many flow, and the fan-out can be as large as you want with Kafka. That's the beauty of it, and with watermarking you can trace all of those paths. So it's really valuable if you're into those kinds of things and your organization wants it.

From an instrumentation perspective, like we talked about, you need context propagation: producers attach identifiers, the watermarks, to the message headers, and consumers continue the same trace; they use the same identifier and add whatever spans they need. There is auto-instrumentation available with OpenTelemetry; I won't go into detail, but there is a really good write-up about it on the OpenTelemetry website, so check that out. So: instrument the clients with OpenTelemetry; a collector receives the spans via the OTLP protocol; things like Jaeger, Zipkin, or similar systems store and process the trace data; and Grafana, again, can visualize those traces with span details. How you implement it is up to you; these are options, and you can find others if you have them. Please don't think Grafana is the only option in the world; as I said, these are just the preferences I use.

Here's a good example of a trace, just to include one. When you produce a message, maybe five milliseconds are spent there, so that span says five milliseconds. The message stays in the Kafka queue for about 50 milliseconds, waiting in the broker. Then a consumer comes in, pulls the message, and starts processing it: maybe 10 milliseconds to consume and 200 milliseconds to process. All of that adds up, and those spans tell you where the hot spot is; if you have processing performance issues, that's where you look.
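[Editor's note: if you are not ready for full OpenTelemetry auto-instrumentation, the header "watermark" idea described above can be hand-rolled. A minimal sketch; the header name, topic, key, and bootstrap address are conventions of this sketch, not Kafka standards.]

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TracedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-42", "{\"amount\": 10}");
            // The "watermark": a correlation id that travels with the message.
            record.headers().add("trace-id",
                UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}
// On the consumer side, read the same header from each ConsumerRecord and attach it
// to downstream logs/spans, e.g.:
//   var traceId = record.headers().lastHeader("trace-id");
//   String id = traceId == null ? "unknown" : new String(traceId.value(), StandardCharsets.UTF_8);
```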
So, MELT, going back to the whole thing. Metrics show you the symptoms: latencies dropping, replication statuses, all of that. Events: has my broker restarted, has leadership changed, is there a partition reassignment ongoing, is replication ongoing? Logs: error messages, what's going on in my system, are there client disconnections, are there issues related to replication, is there an issue with the disk? And traces: where is my message, what is it doing, where is it being consumed from and produced from? I want to know all of that, and that's what traces are for.

For a practical implementation, again, this is my example and my preferred way, because I've been dealing with these systems for a really long time; pick your own poison, if I may. OpenTelemetry is vendor-neutral, so it's basically the default for me at this point, because it's vendor-agnostic and doesn't care what sits behind it. Prometheus supports it, and Prometheus also has the community-built JMX exporter, which is pretty good; it has a really good query language, really good alerting, and it's a really nice TSDB that scales and performs well under pressure. Grafana gives you unified dashboarding for any of the signals, and alerts can go into Grafana as well; Prometheus can do alerting, Grafana can do alerting, choose whichever you like.

Using MELT for Kafka gets you better detection of errors, quicker root cause analysis, probably up to around 40% fewer incidents depending on how you instrument it, how you alert on it, and everything else considered, and definitely way more confidence when you're managing not one, not two, but many Kafka clusters at scale.

So the key takeaway for me has always been the synergy of signals: use the combination. The MELT stack is important; use the signals that matter most to you first, then instrument the others as needed. I love open source tooling, and you can pick your choices however you like. You get much better visibility if you instrument the entire stack, but you can start with metrics and logs, and that will give you more than enough to manage your Kafka clusters with confidence.

These are some helpful resources; this part is more like an appendix, but I wanted to call it out. There is a JMX monitoring stack that some friends and I built, with really good Kafka dashboards and really good Prometheus configurations tuned for performance, for dashboarding, and for the most critical signals. Take a look if it helps, but there are numerous other dashboards and Prometheus configurations available on the internet, so pick whichever you like.

These are some of the metrics to consider from a broker perspective; they're the ones I consider the most important of all the types available for brokers. There are numerous other metrics, yes, but these are the ones I see as the most important, so pause the video here and go through them. And these are the common client metrics I'd again call the most important ones to have; they are emitted by the producer, the consumer, and Connect, all three of them. Collect these as a baseline if you're building your own setup, that is, if you're not using the JMX monitoring stack, which already has most of them built in. If you go with that, you're already set; if you want to roll your own and learn, this is how you start.
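[Editor's note: since every client already computes these numbers internally, you can sanity-check what you would be exporting by reading them off the client itself. A sketch filtering for a few producer metrics that commonly matter; the metric names chosen here are examples, and in a real application you would read them from a live, traffic-carrying client rather than a freshly created one.]

```java
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClientMetricsPeek {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Example producer metrics often worth watching; the exact set you keep is up to you.
        Set<String> interesting = Set.of("record-error-rate", "request-latency-avg", "batch-size-avg");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.metrics().forEach((name, metric) -> {
                if (interesting.contains(name.name())) {
                    System.out.printf("%s (%s) = %s%n",
                        name.name(), name.group(), metric.metricValue());
                }
            });
        }
    }
}
```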
So, coming back to the client metrics: these are the producer metrics I consider the most important; there are more than these, so choose accordingly. And these are the consumer metrics; the slide is a little small, but these are the ones I'd want you to start with. Don't hesitate to collect more from the clients, because they don't expose that many; the Kafka brokers expose far too many, so be careful there. And that's it from me. Thanks a lot for listening, and if you have any questions, feel free to reach out to me. Thank you.
...

Abhishek Walia

Staff Customer Success Technical Architect @ Confluent

Abhishek Walia's LinkedIn account


