Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

AI in SRE: Unlocking Prometheus Insights with Natural Language


Abstract

While handling an incident, have you ever wished you could simply ask questions about the state of your systems and get immediate, actionable answers? I know I have.

As SREs, we know every second counts during an incident. So what if we could skip the complex queries and multiple dashboards and get straight to insights? I set out to answer a simple question: “What if you could chat with your monitoring metrics?” In this presentation, I will be sharing what I learned building an open source project that allows you to do just that!

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I am David. Welcome to this session, AI in SRE: Unlocking Prometheus Insights with Natural Language. In this session I'll be discussing a project I worked on that tries to answer the question: is it possible to chat with your monitoring metrics? I'm excited to share what I've learned, and I'd ask that you pay attention as we go through it.

The talk today is divided into five sections. We start with a general introduction, then we look at the approach by which the tool was built. The tool itself is called PromChat, so we'll spend some time reviewing the tool itself. After that we'll discuss the results, and then we'll close out.

Moving straight on to the first section: the introduction, covering the problem and the solution. As SREs, we understand the importance of speed when it comes to incidents, and a tool that allows you to chat with your metrics can speed up your incident response significantly. For me, there are two ways to think about it. The first is that it reduces cognitive load while responding to an incident, where you know there is pressure to restore the user experience as fast as possible. Having a tool that helps you write PromQL without you having to think about it reduces the cognitive load of responding to an incident. On the other side, you're able to get quick insights without having to go through multiple dashboards, which is another way it can lead to faster incident response.

Another benefit of a tool like this is that it makes data accessible to everybody. Traditionally, to expose data from your Prometheus instances, you build dashboards for people who are not able to write their own PromQL: you build dashboards in something like Grafana, which they can go and look at. Of course, that means for each type of data, or each question they have, you need to create a dashboard for them. However, if they have a tool they can simply ask questions, you don't need to keep building a new dashboard each time there's a new question; they can just interact with the tool and get instant answers back.

The other advantage of a tool like this is that it eases the learning curve for Prometheus. Starting out, it's easy to write simple Prometheus queries, but as you need to write more complicated queries, things can get harder, especially while you're still learning the language. Having a tool that can essentially cooperate with you, like a copilot, while you're trying to write PromQL is helpful because it flattens that learning curve, especially when you're just getting started.

And then one of the major motivations for me to work on the project is that I see a lot of projects in the data space that let you chat with your data, along the lines of "hey, ask questions about your database" and things like that. So I started wondering: is it possible to do the same thing for metrics, especially in the monitoring sense? That's the motivation behind trying to solve the problem and answering the question, is it possible to chat with your monitoring metrics?

The solution that came out of that, and that is discussed in this project, works this way. We are building a tool that allows users to ask questions in natural language, and then the AI agent takes the natural language and generates a corresponding PromQL query.
The PromQL query is then run on Prometheus to get the actual results back, but before that result gets back to the user, it is converted back into natural language. So the flow starts with natural language and ends with natural language: the user asks a question in natural language, and they get the result presented back to them in natural language.

The second section of the talk covers the approach. At this point we dive into how it was implemented, the architecture, and then we review a couple of the key components of that architecture. This diagram shows the architecture of the system. On the far left we have the frontend, which is how you interact with the system: either a web app or a Slack app, which lets you enter your query. That user query gets sent to a backend. The backend exposes a REST API that makes it easy for the frontend to send questions down to it. The backend also contains the AI agent. The AI agent handles the coordination with the LLM: when the user query comes in, it sends the request to the LLM, and the LLM in turn uses what is called tool calling, or function calling, to talk to Prometheus. On the Prometheus side, we need to be able to fetch metadata from Prometheus and run queries against Prometheus as well. Later in the talk we'll say more about the metadata so you have a better understanding of what it entails, but this gives a high-level overview of what happens when the system is answering a question.

So the flow is: the request comes in, say from the web app, and goes to the backend. The backend takes the question and sends it to the AI agent. The AI agent sends the question to the LLM, together with the metadata. The metadata in this case is something you can think of as the schema of your data: it contains information about which metrics are available in your Prometheus instance and the descriptions for them. You send that, together with the user query, to the LLM. The LLM then works out a PromQL query that would answer the question correctly and sends that PromQL back to the AI agent, which runs the query on Prometheus to get the actual result. Once the result comes back from Prometheus, it is passed back to the LLM so the LLM can interpret it in natural language, and then the answer goes all the way back to the frontend for the user to see. That's the overview of the system.

We'll also look at the key components of the architecture. The first one is Prometheus. If you're not familiar with Prometheus, I won't spend too much time diving into it, but you can check it out. What you need to know is that it's a time-series data store for metrics, and metrics stored in Prometheus have a format where you have the metric name and then you have labels. In this case the labels are method and handler, with the corresponding values post and /messages. So this is an entry for a metric in Prometheus. The metric is then scraped by Prometheus at regular intervals, based on whatever scrape interval you have configured, and for each scrape Prometheus records the time of the scrape and the value at that point in time. That's how this part works.
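As an illustration of the format just described, here is roughly what such a metric looks like when a target exposes it, and the sample Prometheus stores for it at scrape time. The metric name, label values, and numbers are made up for illustration, not taken from the demo.

```python
# Roughly what a labelled metric looks like in the Prometheus exposition
# format, and the sample Prometheus stores for it at each scrape.
# (Metric name, labels, and values are illustrative.)
exposed_line = 'http_requests_total{method="post", handler="/messages"} 1027'

stored_sample = {
    "metric": {"__name__": "http_requests_total",
               "method": "post",
               "handler": "/messages"},
    "timestamp": 1_745_836_800,  # when the target was scraped (Unix seconds)
    "value": 1027.0,             # the metric's value at that moment
}
```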
Part of what makes this architecture work is that Prometheus has a metadata API that gives you, as I described earlier, something like a schema of your data in the Prometheus instance. What that means is that it shows you the list of all the metrics available inside your Prometheus instance. In our case, for example, you can see things like disk usage, the CPU load average, and CPU system seconds: these are examples of metrics that are being scraped by Prometheus and are available in the Prometheus instance, which means you can write queries, or ask questions, about those metrics. But it doesn't just return the names; it also returns helpful information about them. One part is the type of the metric: the kind of operations you can perform on a metric depends on its type, so that's something that needs to be taken into account when writing your PromQL. The help text also gives additional information about each metric, which allows you to interpret it correctly. Additionally, you can sometimes add information about the labels. For example, for http_requests_total, the description could say it is the number of HTTP requests that have happened within a time period, and that it has labels like method. This metadata is the additional context that is sent along with the user query itself to the LLM.

So that's the first major component. The second major component, as mentioned, is the LLM, and it makes use of a technique called tool calling, or function calling, which extends the LLM so that it can interact with the environment outside the LLM itself using tools. To make that happen, we built two tools in our example. The first is the query-Prometheus tool, which allows the LLM to execute a Prometheus query on an instance and get the response back before returning an answer to the user. The other tool allows the LLM to query Prometheus for the metadata information, similar to the metadata shown on the previous page. So the first tool lets the LLM run the generated PromQL and get results back from Prometheus, while the second tool lets the LLM fetch the metadata about the metrics that are available. Those are the two major components in our architecture.
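Before moving on, here is a minimal sketch of how those pieces fit together: the two tools plus the natural-language-in, natural-language-out loop. The helper names and the prompts are illustrative assumptions, not the project's actual code; `llm` stands for any chat-completion client that takes a prompt string and returns a string. The Prometheus HTTP endpoints used (`/api/v1/metadata`, `/api/v1/query`) are the standard ones.

```python
# Minimal sketch of the PromChat-style flow described above.
# Function names, prompts, and the llm() callable are illustrative assumptions.
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus instance


def fetch_metadata() -> dict:
    """Tool 1: fetch the metric 'schema' (names, types, help text)."""
    return requests.get(f"{PROM_URL}/api/v1/metadata").json()["data"]


def run_promql(query: str) -> dict:
    """Tool 2: execute a PromQL query and return the raw result."""
    return requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}).json()["data"]


def answer(user_question: str, llm) -> str:
    """Natural language in, natural language out."""
    metadata = fetch_metadata()
    # 1. The LLM turns the question + metadata into a PromQL query.
    promql = llm(
        f"Metrics available:\n{metadata}\n\n"
        f"Question: {user_question}\n"
        "Reply with a single PromQL query only."
    )
    # 2. The agent runs the generated query against Prometheus.
    result = run_promql(promql)
    # 3. The LLM interprets the raw result back into natural language.
    return llm(
        f"Question: {user_question}\nPromQL: {promql}\nResult: {result}\n"
        "Answer the question in plain English."
    )
```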
Moving on to the third part, which is looking at the tool that was built. The tool itself is called PromChat, and we'll be exploring it from two points of view, classified by the data inside the Prometheus instance. The first example uses data from the node exporter, and the second uses data from a custom exporter. In both of them you're chatting with Prometheus; what differs is the source of data in the Prometheus instance, which affects the kind of questions you can ask and the kind of queries that get generated.

A little bit more about the node exporter. The node exporter is essentially an agent you can run on a VM: you can run it on a single VM, or on your own machine, for example; pretty much any machine, VM, or node. What it does is expose system metrics to Prometheus so that Prometheus can scrape them at intervals. For example, things like CPU seconds, available storage, memory utilization, and network traffic can all be exposed by the node exporter. These are standard system metrics and you don't need to configure anything; the node exporter simply makes them available, and that data can then be ingested into Prometheus. Here are samples of some of the metrics that are available: node CPU seconds, the filesystem space available in bytes, the number of network bytes received, and the average network traffic in bytes over the last minute. You can also see some sample queries, using things like rate to get averages over a period of time, which is what a Prometheus query looks like.

Moving on, this is the actual PromChat tool and the responses the system generated when it was asked certain questions. In this example, PromChat is connected to a Prometheus instance that has data from a node exporter, so the information available, and the kind of questions you can ask, is based on the metrics exposed by the node exporter. Here we can see four or five exchanges in the chat. In black you have the user's question, and in white you have the response from the PromChat AI system. It starts by asking about the current CPU utilization on the node, and we get an answer. It then asks about the battery percentage on the node, and for information about the OS running on the node. I ran this particular example on my laptop using the macOS node exporter, so you can see it reports that the node is running macOS. Then it shows the disk space available across all the filesystems on the node, and finally the memory utilization of the node as well. In a follow-up we'll look at the backend to see how queries are generated for examples like these and what they look like.
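In the meantime, for a sense of what such generated queries can look like, here are a few node exporter questions paired with the kind of PromQL they typically map to. These are illustrative queries using standard node exporter metric names (Linux naming), not the exact queries PromChat produced in the demo.

```python
# Illustrative question -> PromQL pairs for node exporter data.
# Standard node_exporter metric names (Linux); not the demo's actual output.
NODE_EXPORTER_EXAMPLES = {
    "What is the current CPU utilization?":
        '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))',
    "How much filesystem space is available?":
        'sum(node_filesystem_avail_bytes)',
    "How much memory is in use?":
        'node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes',
    "What is the average network traffic received over the last minute?":
        'rate(node_network_receive_bytes_total[1m])',
}
```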
The second Prometheus instance I'll be using is thanks to our friends at promlabs.com. They run a Prometheus instance that is publicly accessible; it comes from the team behind Prometheus, and they make it available so you can use it to experiment, learn, and interact with Prometheus. On that Prometheus instance we have some custom metrics. The first set of metrics, exposed in our first example by the node exporter, were system metrics, meaning metrics about the machine itself. But more often than not, when we set up monitoring, we also want to collect metrics about the state of our service or the application we're running. In those cases you come up with your own custom metrics describing the different states or events in your service. The demo Prometheus on PromLabs runs a service called the demo service, and the demo service exposes the following custom metrics, which is what we based our set of questions and interactions with PromChat on.

A couple of interesting ones: there are batch jobs, so you can ask questions about the success rate of batch jobs run by the demo service. You can ask questions about the HTTP request duration as well. Another interesting one is that the service exposes a metric that lets you know whether today is a holiday or not. It's called demo_is_holiday, and from the help information (shown here in a GUI rather than the raw API, but similar to the metadata API response shown earlier, with the metric name, its type, and its description) we can see the type is gauge, and the description says that when the value returned is one, the current day is a holiday, and when the value returned is zero, it is not. These are examples of custom metrics made available by the demo service on this instance. All of these metrics have been scraped and are available inside the Prometheus instance, which means we can connect our PromChat tool to it, ask questions, and let the AI generate the corresponding queries and answers.

Let's see what that looks like. In the next chat, PromChat is connected to the PromLabs Prometheus instance, and the first question is: is today a holiday? The AI system responds that, based on the metrics it was able to find from demo service 2, today is a holiday. In the next couple of slides we'll look at the actual query and how it arrived at this answer, but this gives an overview of what the interaction looks like from the web interface. The second question we ask is how many items have been shipped today. If we go back up, you'll see there's a custom counter metric for items shipped that keeps track of the number of items that have been shipped, and we expect the AI to make use of this counter to answer the question. Lastly, there's a third question where we ask whether the demo API has been taking longer than usual; unfortunately, as you can see from the screenshot, the AI responds that there is no data found. We'll look at that and try to figure out why it happened.

Now, behind the scenes, as promised: these are the logs from the backend system, letting us know exactly what transpired. In the case of the first message, the user query is "is today a holiday?" The AI agent generated the corresponding PromQL query, which is simply demo_is_holiday. This generated PromQL was run by the AI agent, using function calling, against the Prometheus instance, and Prometheus returned this response. If you look closely, you'll see the instance name is demo service 2, which is why the response talks about demo service 2, but the most important thing to pay attention to is the value: the value here is one, and because the value is one, based on the information from the metadata, we know that one means a holiday. So in this example the AI agent is able both to write the right corresponding query based on the metadata and, using the information provided in the metadata, to interpret that a value of one means today is a holiday. From that interpretation it replies to the user query in natural language: yes, based on the data we can see from demo service 2, today is a holiday. That's essentially what's going on behind the scenes for the first question.
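For illustration, the raw instant-query result behind that exchange looks roughly like the sketch below. The shape is the standard Prometheus query API response; the timestamp and label set are made up, and the instance label is an assumption.

```python
# The generated query and (roughly) the shape of the Prometheus response the
# agent has to interpret. Timestamp and label values are illustrative.
generated_promql = "demo_is_holiday"

prometheus_response = {
    "resultType": "vector",
    "result": [
        {
            "metric": {"__name__": "demo_is_holiday",
                       "instance": "demo-service-2",  # assumed label value
                       "job": "demo"},
            "value": [1_745_836_800, "1"],  # "1" => today is a holiday
        },
    ],
}
# Combined with the HELP text ("1 means today is a holiday, 0 means it is not"),
# the LLM can turn this raw vector into: "Yes, today is a holiday."
```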
The second question is even more interesting. We asked how many items have been shipped today; that's the user query, and again a PromQL query was generated. As expected, it makes use of the metric for total items shipped, it sets the time period to one day, which is right, and it looks at the increase over that one-day period. That's the generated PromQL query, which makes sense. Again, using tool calling, this query is executed against the Prometheus instance, and Prometheus returns the results to the AI agent. Now the AI needs to figure out how to express this in natural language, and this is a particularly interesting example because the demo service actually runs three copies of itself. As you can see here, there are three instances of the demo service: demo service 0, the first instance; demo service 1, the second; and demo service 2, the third. Each of these instances has been processing orders through the day, and each maintains a count of the number of orders it has processed. Demo service 0 has processed around 455,000 orders, demo service 1 returns about 453,000, and demo service 2 is close to 454,000.

Now the LLM does something interesting when it gets these results back. It is intelligent enough to know that the value we're interested in is an aggregation of the values across each of these instances, with the right grouping applied, so it figured out that it needs to sum the values from demo service 0, demo service 1, and demo service 2. You can check it later, but I've confirmed it: if you sum the value returned for demo service 0, plus the value for demo service 1, plus the value for demo service 2, you get the total it reported. So the LLM was able to figure out the right aggregation for the list of data returned in the result, sum it appropriately, and give the right number back to the user in natural language. This is another example of where it really shines: you're able to get instant, insightful answers while using the tool.
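To make the aggregation step concrete, here is a small sketch of what the agent effectively did, plus the variant that pushes the aggregation into PromQL itself. The metric name is as read from the talk, and the per-instance numbers are the approximate values from the demo.

```python
# What the agent effectively did with the per-instance results (approximate
# values from the demo), and the PromQL variant that aggregates server-side.
generated_promql = 'increase(demo_items_shipped_total[1d])'        # one series per instance
aggregated_promql = 'sum(increase(demo_items_shipped_total[1d]))'  # single total across instances

per_instance = {
    "demo-service-0": 455_000,
    "demo-service-1": 453_000,
    "demo-service-2": 454_000,
}
total_shipped = sum(per_instance.values())  # the number the LLM reported back
```

Pushing the sum into the query itself is generally the safer variant, since it avoids relying on the LLM to do arithmetic on the raw result.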
Now let's look at the third example, the case where we were unable to get any response back. Again, this is the user query: the user asks whether the demo API has been taking longer than usual. The AI agent generated a query, and the query is syntactically valid; there's nothing wrong with it, but it returns an empty data set. That's not because there is no data available for the demo API, but because of how the LLM interpreted the question and went about writing the query, which was incorrect, and as a result we get empty data back. Although it's valid PromQL, it doesn't answer the question the user asked and doesn't give us any response. So that's one of the limitations we discovered here, but there are ways to get around it, which we'll discuss next.

As you can see here, in the PromChat application you can modify the LLM configuration: you can set which provider you want to use and which model you want to use. For all the questions asked so far we've been using Google's Gemini, specifically the Gemini 1.5 Flash model, which is their free-tier model. That's what we used to answer our questions, but we've seen that Gemini 1.5 Flash in this example was unable to generate the correct PromQL query for us. So, by changing the model type in the PromChat configuration, I flipped the model to use the thinking model provided by Google, the Gemini 2.0 Flash Thinking model, instead. Using this model and asking exactly the same question, we can see that the AI agent is now able to answer the question correctly. We'll look at the backend to see what changed, but in this scenario, asking the same question, the newer thinking model was able to provide this insight and say: one of the /api paths seems to have a lot of 500 errors, and its latency is also significantly higher than the other paths, which suggests that this path has an issue. As an SRE, you can imagine how valuable an insight like this is when you're trying to debug an incident and figure out what's going on: maybe you get a latency alert, or users start complaining that your system is slow; you can simply fire up the tool, ask, and get insights like this.

Going back to the backend, we can see what happened. You'll see pretty much the same thing, the same user request, but if you pay attention to the generated PromQL, you'll see it produced a different query that is more appropriate: it considers the rate now, and, based on the type of the data, it makes use of the histogram. As such, it was able to generate meaningful results this time around. The full response is much longer than what's shown, because a lot of metrics match the query, but essentially, for each path you get the status and the value. Based on that, the LLM went over all of these results, similar to the first example: it grouped them appropriately, identified the outlier in the list after grouping, and figured out that, compared to the others, that particular /api path has more 500 errors, and its overall latency is also higher than the rest of the paths.
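For illustration, queries of the general shape below, breaking the error rate and latency down by path, would surface that kind of outlier. The metric and label names are assumptions based on the demo service shown in the talk, not the exact queries the thinking model produced.

```python
# Illustrative PromQL for "is the demo API taking longer than usual?",
# broken down by path. Metric and label names are assumptions based on the
# PromLabs demo service, not the generated queries from the demo.
error_rate_by_path = (
    'sum by (path) ('
    '  rate(demo_api_request_duration_seconds_count{status="500"}[5m])'
    ')'
)
p90_latency_by_path = (
    'histogram_quantile(0.9, '
    '  sum by (path, le) (rate(demo_api_request_duration_seconds_bucket[5m]))'
    ')'
)
```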
This shows how, by changing the LLM model used to a more powerful one, you can get better results in cases where the simpler models were unable to write useful queries, and compare the outcomes.

Moving on to the fourth section, where we summarize everything we learned from building and interacting with PromChat. The first part covers the lessons learned. You'll have noticed that throughout the architecture and the rest of the presentation, at no point did we attempt to retrain any of the models or do any sort of fine-tuning. What that tells us is that the LLMs available today are already capable of writing PromQL queries on their own. The other thing to note is that the only change we had to make was to use one-shot prompting, which is basically adding an example to ensure that the output we get from the LLM is formatted exactly how we want it. That's important because, early in the project, we ran into issues where the response coming from the LLM would include additional characters or tokens, and once you passed it to the Prometheus instance it would no longer be valid PromQL, which led to crashes or errors because Prometheus could not interpret the query. After using one-shot prompting, where we show the LLM exactly the format of response we expect, it started returning just the required PromQL without any additional characters or tokens around it, and we were able to avoid manually extracting the PromQL out of the LLM response.

From the last example, you'll also have realized that the thinking models are clearly better when it comes to writing complicated PromQL. The lighter models work for most cases, but when you want to ask more complicated questions, the more powerful the model and the more time it spends thinking, the better the PromQL it writes. We saw that in the last example.
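To make the one-shot prompting point concrete, here is a minimal sketch of the kind of prompt involved. The wording and the example query are illustrative, not the project's actual prompt.

```python
# Minimal sketch of a one-shot prompt that pins the LLM's output format to a
# bare PromQL expression. Wording and example are illustrative.
PROMPT_TEMPLATE = """You translate questions about Prometheus metrics into PromQL.
Reply with a single PromQL expression only: no backticks, no explanation, no extra tokens.

Example
Question: What is the average CPU utilization over the last 5 minutes?
PromQL: 100 * (1 - avg(rate(node_cpu_seconds_total{{mode="idle"}}[5m])))

Available metrics (name, type, help):
{metadata}

Question: {question}
PromQL:"""

prompt = PROMPT_TEMPLATE.format(metadata="...", question="Is today a holiday?")
```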
Some of the limitations we observed in the course of working on this project: as you'd expect, the ability to pull this off depends heavily on the quality of the documentation you add to your metadata. Prometheus will always have the metadata API available, and it will tell you which metrics exist and what their types are, but if you don't add any help information to interpret them, things break down. Look at the case of the demo service's demo_is_holiday metric: if the documentation did not say that one means today is a holiday and zero means it is not, there would have been no way for the LLM to interpret the Prometheus response correctly.

The other challenging bit is that even in cases where you do have help text included, most of the time the labels are missing. Labels, if you compare them to traditional databases, are something like the columns or fields. Because the LLM doesn't know which labels are available on a particular metric, it becomes harder for it to write correct queries, especially when you need to filter by an actual label value. That's one of the things that can be solved fairly easily by going over the documentation and adding as much useful information there as possible.

The other limitation we noticed is that you can sometimes see inconsistency in the results you get, because the generated queries differ slightly. If you frame your question slightly differently, the LLM can interpret it differently and generate a correspondingly different query, which then gives you a different result. That can be mitigated by putting more exact descriptions in your question. For example, if you ask "is any endpoint currently returning 500s?" without a timeframe, it might look at the last hour the first time, and five minutes or one minute the next. But if you ask specifically with a timeframe, the PromQL generated by the AI agent will contain that exact timeframe, and you get the same response back each time. So that's another limitation, but the way to improve on it is simple: the more exact your question is, the better and more consistent the answers you get.
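One practical mitigation for the documentation and label limitations above is to put the interpretation, and the meaning of the labels, directly into the HELP text of your custom metrics, since that is exactly what flows through the metadata API to the LLM. Here is a small sketch using the official Python prometheus_client library; the metric and label names are illustrative.

```python
# Sketch: richer HELP text for custom metrics, so the metadata the LLM sees
# explains both how to interpret values and which labels exist.
# Metric and label names are illustrative. Requires: pip install prometheus-client
from prometheus_client import Counter, Gauge

IS_HOLIDAY = Gauge(
    "demo_is_holiday",
    "1 if the current day is a holiday, 0 otherwise.",
)

HTTP_REQUESTS = Counter(
    "demo_api_http_requests_total",
    "Total HTTP requests handled by the demo API. "
    "Labels: method (GET/POST/...), path (request path), status (HTTP status code).",
    ["method", "path", "status"],
)
```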
In terms of future improvements: better support for complex queries. That includes being able to handle more complicated questions and also richer answers. Right now all the answers come back as text in natural language, but it would be useful to have a graph to look at from time to time, so for more complex queries it might be useful to return both the natural language answer and some form of visualization. Also, right now the project is limited to just Prometheus; that's the only metrics source it works with, so a next step is expanding the project to support more sources than just Prometheus. And lastly, another interesting improvement would be around the system learning from user interactions. Imagine you ask a question and, for example, the system didn't use the right labels, or it doesn't know which labels are available for that particular metric, so it tells you it can't answer, and then you provide those labels. Right now there is no memory in the system, so it doesn't store that information. An extension would mean that next time you don't have to supply the same labels again for it to answer correctly. Another direction is when queries are wrong, or the wrong metric was used, and you correct it: all of those interactions can be stored as the user's context, so that the next time it's trying to answer questions it can make use of them, and the system gets better over time because it's learning from user interactions.

Lastly, how can you contribute and join? The project is open source, and the source code is available on my GitHub profile, where you'll find the repository for the PromChat app. Issues and PRs are welcome as a means of contributing to the project. The web interface I was playing with, and that was shown in the slides, is also publicly available, so you can visit it; you don't need any authentication or payment or anything. The only issue might be that it's running on my own personal API key, so there are limits on the number of daily requests, and if a lot of people have been playing with it earlier in the day, you might not get any responses back; it will tell you that the LLM credits are exhausted. But you can also clone the project locally, put in your own API keys, and run it yourself, or use the hosted web interface and interact with it there.

So that's it from me. Thank you very much for listening to the session. I hope you've learned a lot and have a better understanding of how to implement something like this. And yes, the answer to the question: it is possible to chat with your monitoring metrics. I hope you enjoy the rest of the conference. Thank you.

David Asamu

Tech Lead Manager, SRE Team @ Nomba

David Asamu's LinkedIn account


