Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I am David.
Welcome to this session, AI in SRE: Unlocking Prometheus Insights with Natural Language.
In this session, I'll be discussing a project I worked on that tries to answer the question: is it possible to chat with your monitoring metrics? I'm excited to share what I've learned, and I ask that you pay attention as we go through it.
So the talk today is divided into five sections. We start with a general introduction. Then we look at the approach by which the tool was built. The tool itself is called PromChat, so we'll spend some time reviewing the tool itself. Lastly, we'll discuss the results, and then there's a closing section.
Okay, moving straight on to the first section: the introduction, which covers the problem and the solution.
So as SREs, we understand the importance of speed when it comes to incidents, and if you have a tool that allows you to chat with your metrics, it can speed up your incident response significantly.
For me, I think there are two ways to think about it. The first is that it reduces cognitive load while responding to an incident, where you know there is pressure to restore the user experience as fast as possible. Having a tool that helps you write PromQL without you having to think about it reduces the cognitive load of responding to an incident. On the other side, you're able to get quick insights without having to go through multiple dashboards, which is another way in which it can lead to faster incident response.
The other usefulness of having a tool like this is that it makes data accessible to everybody. Traditionally, to expose data from your Prometheus instances, you'll build dashboards for people who can't write their own PromQL. So you would build dashboards on something like Grafana, which they can go to. But of course that means for each type of data, or each question that they have, you need to create a dashboard which they can go and see. However, if they have a tool that they can simply ask questions, then you don't need to keep coming up with a new dashboard each time there's a new query or a new question; they can just interact with the tool and get answers back.
The other advantage of having a tool like this is that it eases the learning curve for Prometheus. Starting out, it's easy to write simple Prometheus queries, but as you need to write more complicated queries, things can get a bit harder, especially when you're just learning the language. So having a tool like this that can essentially cooperate with you, like a copilot, when you're trying to write PromQL is good, because it helps with the learning curve, especially when you're getting started.
And then I think one of the major motivations for me to actually work on the project is that I see a lot of projects in the data space that allow you to chat with your data, in terms of, hey, ask questions about your database and things like that. So I started wondering, is it possible to do the same thing for metrics, especially in the monitoring sense? So that's the motivation behind trying to solve the problem and answering the question: is it possible to chat with your monitoring metrics?
So the solution that I came up with, which is discussed in this project, works this way. The flow is: we are building a tool that allows users to ask questions in natural language, and then the AI agent takes the natural language and generates a corresponding PromQL query. The PromQL query is then run on Prometheus to get the actual results back, but before that result gets back to the user, it is converted back into natural language. So the flow starts with natural language and ends with natural language. The user asks questions in natural language, and they get a result presented back to them in natural language.
So the second section of the talk covers the approach. At this point, we'll dive deep into how we implemented it, the architecture, and then we review a couple of the key components of the architecture.
So this diagram shows the architecture for the system. On the far left side of it, we have the front end, which is how you interact with the system. You interact with the tool either via a web app or a Slack app, which allows you to enter your user query. That user query then gets sent to a backend. The backend has a REST API that makes it easy for the front end to interact with it and send questions down to it.
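As a rough illustration, a minimal version of that backend endpoint could look something like this sketch; the route name, payload shape, and the ask_agent helper are assumptions for illustration, not the project's actual API.

```python
# Minimal sketch of a backend REST endpoint that forwards a user question
# to the AI agent. The route, payload, and ask_agent() are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

def ask_agent(text: str) -> str:
    # Placeholder: the real agent coordinates with the LLM and Prometheus
    # (sketched later in this transcript).
    return f"(stub answer for: {text})"

@app.post("/api/ask")
def ask(question: Question):
    return {"answer": ask_agent(question.text)}
```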
As part of the backend as well, we have the AI agent. The AI agent basically handles the coordination with the LLM. When the user query comes in, it sends the request to the LLM, and the LLM in turn uses what is called tool calling, or function calling, to be able to talk to Prometheus. On the Prometheus side, we need to be able to fetch metadata from Prometheus and run queries against Prometheus as well. As we go further into the talk, we'll talk more about the metadata so that you have a better understanding of what it entails. But this gives a high-level overview of what happens when you're trying to answer questions.
So the flow is: the request comes in, let's say from the web app, and it goes to the backend. The backend takes the question and sends it to the AI agent. The AI agent takes the question and sends it to the LLM, that is, the user query together with the metadata. The metadata in this case is something you can think of like the schema of your data. It basically just contains information regarding what kind of metrics are available in your Prometheus instance and the descriptions for them. So you send that together with the user query to the LLM. The LLM then comes up with a PromQL query that would be able to answer that question correctly. The generated PromQL is sent back to the AI agent, which then runs that query on Prometheus to get the actual result. And then once the result comes back from Prometheus, it is passed back to the LLM so that the LLM can interpret the result in natural language. The result then goes all the way back to the front end for the user to see.
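To make that flow concrete, here is a minimal sketch of what the agent loop could look like; the llm() helper stands in for whatever LLM SDK is used, and all the names here are illustrative rather than the project's actual code.

```python
# Minimal sketch of the natural language -> PromQL -> natural language loop.
# llm() is a stand-in for a real LLM SDK call; all names are illustrative.
import requests

PROM_URL = "http://localhost:9090"  # hypothetical Prometheus instance

def llm(prompt: str) -> str:
    # Placeholder for a real LLM call via a provider SDK.
    raise NotImplementedError

def ask_agent(user_question: str) -> str:
    # 1. Fetch the "schema": metric names, types, and help text.
    metadata = requests.get(f"{PROM_URL}/api/v1/metadata").json()["data"]

    # 2. Ask the LLM to turn the question into a PromQL query.
    promql = llm(f"Metadata: {metadata}\nQuestion: {user_question}\n"
                 "Return only a PromQL query.")

    # 3. Run the generated query on Prometheus.
    result = requests.get(f"{PROM_URL}/api/v1/query",
                          params={"query": promql}).json()["data"]["result"]

    # 4. Ask the LLM to explain the raw result in natural language.
    return llm(f"Question: {user_question}\nPromQL: {promql}\n"
               f"Result: {result}\nAnswer in natural language.")
```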
So that's what the overview of the system looks like. We'll now look at a couple of the key components of the architecture as well.
So the first one is Prometheus.
If you're not familiar with Prometheus, I won't spend too much time diving into it, but you can check it out. What you need to know is that it's a time series data store for metrics, and metrics stored in Prometheus typically have this format, where you have the metric name and then you have the labels. In this case, the labels are method and handler, and they have the corresponding values of post and messages. So this is an entry for a metric in Prometheus. This metric is then scraped by Prometheus at regular intervals, based on whatever scrape interval you have. As time series data, Prometheus records the time at which it was scraped and the value at that point in time, and that's how this part works.
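For context, here is roughly what such an entry looks like from the scraping side; the metric name, labels, and port below are illustrative examples rather than the exact data from the slide.

```python
# Illustrative sketch: what a scrape target exposes and how Prometheus stores it.
# The metric name, labels, and port are examples, not the slide's exact data.
import requests

# A target's /metrics endpoint serves plain-text samples such as:
#   api_http_requests_total{method="post", handler="/messages"} 1027
# Prometheus scrapes this endpoint on every scrape interval and stores
# (timestamp, value) pairs per unique combination of name + labels.
resp = requests.get("http://localhost:8000/metrics")  # hypothetical app target
for line in resp.text.splitlines():
    if line.startswith("api_http_requests_total"):
        print(line)
```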
Part of what makes this architecture work is that Prometheus has this metadata API that allows you to get, again, as I described earlier, something like the schema of your data in the Prometheus instance. What that entails is that it shows you the list of all the metrics that are available inside of your Prometheus instance. In our case, for example, you can see metrics such as disk usage, CPU load average, and CPU seconds by system mode. These are examples of metrics that are being scraped by Prometheus and are available in the Prometheus instance.
That means you can write queries or ask questions regarding these metrics. But not only is the name returned, the API also returns helpful information about them, such as the type of each metric. The kind of operations that you can perform on a metric depends on its type, so that's something that needs to be taken into account while writing your PromQL. And then the help field gives additional information regarding each metric, which allows you to interpret the metric correctly. Additionally, you can sometimes also add information about the labels. For example, for an HTTP requests total metric, the description can be something like: this lets you know the number of HTTP requests that have happened within a time period, and it has labels like method. So this metadata is the additional bit that is sent as context along with the user query itself to the LLM.
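For reference, this metadata comes from Prometheus's /api/v1/metadata endpoint; here is a minimal sketch of fetching it, with the instance URL as an assumption.

```python
# Minimal sketch: fetching metric metadata (name, type, help) from Prometheus.
# The Prometheus URL is an assumption for illustration.
import requests

PROM_URL = "http://localhost:9090"
metadata = requests.get(f"{PROM_URL}/api/v1/metadata").json()["data"]

# The response maps each metric name to entries like:
#   {"type": "counter", "help": "Total number of HTTP requests.", "unit": ""}
for name, entries in list(metadata.items())[:5]:
    print(name, entries[0]["type"], "-", entries[0]["help"])
```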
So that's the first major component.
The second major component, as we explained, is the LLM, and it makes use of a technique called tool calling, or function calling, which basically extends the LLM so that it's able to interact with the environment outside of the LLM itself using tools. To make that happen, we have built two tools in our example. We have the query Prometheus tool, which allows the LLM to execute a Prometheus query on an instance and get the response back before returning a response to the user. The other tool allows the LLM to query Prometheus for the metadata information, similar to the metadata information shown on the previous page. So basically the first tool allows the LLM to run the generated PromQL and get responses back from Prometheus, while the second tool allows the LLM to fetch metadata from Prometheus itself, metadata regarding the metrics available. So those are the two major components in our architecture.
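To illustrate, tool definitions for function calling are usually declared as small schemas the LLM can choose to invoke; here is a provider-agnostic sketch of the two tools, where the names and fields are assumptions rather than the project's exact definitions.

```python
# Provider-agnostic sketch of the two tools exposed to the LLM via function
# calling. Names and schemas are illustrative, not the project's exact code.
import requests

PROM_URL = "http://localhost:9090"  # hypothetical Prometheus instance

def query_prometheus(query: str) -> dict:
    """Run a PromQL query and return the raw result."""
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    return r.json()["data"]

def get_metric_metadata() -> dict:
    """Fetch metric names, types, and help text (the 'schema')."""
    return requests.get(f"{PROM_URL}/api/v1/metadata").json()["data"]

# The schemas handed to the LLM so it knows when and how to call each tool:
TOOLS = [
    {"name": "query_prometheus",
     "description": "Execute a PromQL query against Prometheus.",
     "parameters": {"type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"]}},
    {"name": "get_metric_metadata",
     "description": "List available metrics with their types and help text.",
     "parameters": {"type": "object", "properties": {}}},
]
```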
So moving on to the third part, where we look at the tool that was built. The tool itself is called PromChat, and we'll be exploring it from two points of view. Based on the data inside the Prometheus instance, we've classified it into two examples. The first one uses data from the node exporter, and the second example of PromChat uses data from a custom exporter to Prometheus. In both of them you're chatting with Prometheus. What's different is just the source of data in the Prometheus instance, which basically affects the type of questions you can ask and the type of queries that will be generated as well.
A little bit more about the node exporter. The node exporter is essentially like a plugin that you can run on a VM. You can run it on a single VM, you can run it on your cloud machine, for example. Pretty much any machine, VM, or node, you can run it on. What it does is expose system metrics to Prometheus so that Prometheus can scrape them at intervals. For example, things like the CPU seconds, the available storage, memory utilization, network traffic, all of that can be exposed by the node exporter. These are standard system metrics and you don't need to configure anything; the node exporter basically just makes them available, and then that data can be ingested into Prometheus.
These are a sample of some of the metrics that are available. As you can see, there's node CPU seconds by mode, the filesystem space available in bytes, and the number of network bytes that have been received, plus the average network traffic in bytes over the last minute. So those are examples of the metrics, and then some of the queries, where you can see things like using rate to get averages over a period of time. So this is also what a Prometheus query looks like.
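As an illustration, queries in that style could look like the ones below, run against the node exporter's standard metrics over the HTTP API; the Prometheus URL, time windows, and label values are assumptions.

```python
# Illustrative PromQL over standard node exporter metrics, run via the HTTP API.
# The Prometheus URL, time windows, and label values are assumptions.
import requests

PROM_URL = "http://localhost:9090"

queries = {
    # Filesystem space currently available, in bytes.
    "disk_available_bytes": 'node_filesystem_avail_bytes',
    # Average network receive traffic in bytes/second over the last minute.
    "net_receive_rate": 'rate(node_network_receive_bytes_total[1m])',
    # Approximate CPU utilization: 100% minus the idle share over 5 minutes.
    "cpu_utilization_pct":
        '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100',
}

for name, q in queries.items():
    data = requests.get(f"{PROM_URL}/api/v1/query", params={"query": q}).json()
    print(name, data["data"]["result"][:1])  # show the first series for brevity
```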
So moving on. This is showing the actual tool, called PromChat, and the responses that were generated by the system when it was asked certain questions. In this example, PromChat is connected to the Prometheus instance that has data from a node exporter. So the information available, or the kind of questions you can ask, is based on the kind of metrics that are exported by the node exporter. In this case we can see four or five exchanges, and the first talks about the CPU utilization. So this is a question: in black here you have the user question itself, while in white is the response from the PromChat AI system itself. It starts by asking about the current CPU utilization on the node, and we get this. It asks about the battery percentage on the node, and also about the OS running on the node. I ran this particular example on my laptop using the macOS node exporter, so you can see it works out that the node is running macOS. There's also the disk space that is available across all of the node's filesystems, and then finally the memory utilization of the node is shown as well. Alright, so in a follow-up we'll actually see from the backend what queries were generated for these examples and what they look like.
The second Prometheus instance I'll be using is thanks to our friends at promlabs.com. They have this Prometheus instance that is publicly accessible. It is from the team that wrote Prometheus, and they make it available so that you can use it to experiment, to learn, and to interact with Prometheus.
On that Prometheus instance, we have some custom metrics. If you look at the first set of metrics, exposed in our first example using the node exporter, those are system metrics, meaning they are metrics regarding the machine itself. But more often than not, when we are setting up monitoring, we also want to collect metrics about the state of the service or application we are running. In those cases, you would come up with your own custom metrics describing the different states or events in your service. The demo Prometheus on PromLabs runs a service called the demo service, and the demo service exposes the following custom metrics, which is what we based the next set of questions and interactions with PromChat on.
that was done by the demo service.
Can see questions around the HCTP request duration as well.
Another interesting one is that the service itself as a matrix lets you
know whether today is an holiday or not.
So it's basically called demo is holiday and from the help information.
So again, if you look at the metadata information, this is GUI, not from
the API, but basically it's going to be similar to the metadata.
EPI response showed earlier where you can see the name of the metric.
The type of it, which is in this case for his early day.
We can see the type of set as gauge, and then the description takes that when
the value return is one, it means that the current days and early day, and when
the value rate return is zero, it means that the current days one early days.
So these are examples of custom metrics that have been made available by the demo service on this instance. All of these metrics have been scraped and are available inside the Prometheus instance. That means we can connect our PromChat tool to this Prometheus instance, ask it questions, and let the AI generate the corresponding queries and show us answers. So let's see what that looks like.
In the next chat here we can see the PromChat tool is connected to the PromLabs Prometheus instance, and the first question here is: is today a holiday? The AI system responds that, based on metrics it was able to find from demo service 2, today is a holiday. Again, in the next couple of slides we'll be looking at the actual query and how it got to this answer, but basically this just shows an overview of what the interaction from the web interface looks like. The second question we ask is how many items have been shipped today. If we go back up, we would see that there is a custom metric exposed called items shipped total, which is a counter that keeps track of the number of items that have been shipped. So we expect that the AI will make use of this counter to answer the question of how many items have been shipped today. And then lastly, there's a third question here, where we are trying to ask questions around the demo API and whether it's taking longer than usual. Unfortunately, as you can see from the screenshot, the AI system responds that there is no data found. So we look at that and try to figure out why that happened as well.
Now, behind the scenes, as promised. These are basically the logs from the backend system, letting us know exactly what transpired. In the case of the first example message, the user query is "is today a holiday?". The AI agent generated this corresponding PromQL query, which is demo_is_holiday. This generated PromQL was run by the AI agent, using function calling, on the Prometheus instance, and Prometheus returned this response. If you pay attention to it, you'll see the instance name is demo service 2, which is why the response talks about demo service 2. But the most important thing to pay attention to is the value here. You see the value here is one, and because this value is one, based on the information that we are able to get from the metadata, we know that one means holiday. So you can see in this example the AI agent is able to write the right corresponding query based on the metadata, but also, based on the information provided in the metadata, it's able to interpret that one means that today is a holiday. Based on that interpretation, it is able to answer the user query back in natural language, saying yes, based on the data that we can see from demo service 2, the value is one, and that means that today is a holiday. So this is essentially what's going on behind the scenes in the case of the first question.
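For reference, running that kind of generated query against the Prometheus HTTP API yields an instant vector whose value field carries the 0/1 flag; here is a minimal sketch, with the Prometheus URL as a placeholder rather than the demo instance's real address.

```python
# Minimal sketch: running the generated query and reading the 0/1 value.
# The Prometheus URL below is a placeholder, not the demo instance's address.
import requests

PROM_URL = "http://localhost:9090"
resp = requests.get(f"{PROM_URL}/api/v1/query",
                    params={"query": "demo_is_holiday"}).json()

# A typical instant-vector result looks like:
#   {"metric": {"__name__": "demo_is_holiday", "instance": "...", "job": "..."},
#    "value": [1712345678.0, "1"]}
for series in resp["data"]["result"]:
    timestamp, value = series["value"]
    is_holiday = value == "1"   # per the metric's help text: 1 = holiday, 0 = not
    print(series["metric"].get("instance"), "holiday:", is_holiday)
```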
In the case of the second question, this is way more interesting. We asked how many items have been shipped today. This is the user query, and again, the generated PromQL. As expected, it makes use of the metric demo_items_shipped_total. It sets the time period to one day, which is right, and then it looks at the increase over the period of one day. So this is the generated PromQL query, which makes sense. Again, using tool calling, this query is executed against the Prometheus instance, and Prometheus returns these results back to the AI agent. Now the AI needs to figure out how to return this in natural language, and I think this is interesting.
And this, I think this is interesting.
But this an a particularly interesting example because we would see that the demo
service actually runs three copies of it.
So as you can see here, there are three instances of the demo service.
There's demo service zero, which is the first instance.
There is demo service one, which is the second instance here, and then there is.
Demo service two, which is the third instance.
So for each of these instance, they have been processing orders through the day
and each of them maintain a count of the number of orders that they've processed.
So if you look at the demo service zero, it returns around five, 455,000 orders
have been processed by demo service one.
And then if you look at demo service.
No, the demo service zero rather, has processed 455,000.
If you look at demo service one, the value four eight is 453,000.
And then if you look at demo service two, the value process, so five a day
is 453,000, close to 454,000 as well.
Now, the LLM does something interesting when it gets these results back. It is intelligent enough to know that the value we are interested in is an aggregation of the value across each of these instances, applying the right grouping, so it was able to figure out that it needs to sum the values from demo service 0, demo service 1, and demo service 2. If you look at these values, you can check it later, but I've confirmed it: if you sum the value returned from demo service 2, plus the value returned from demo service 1, plus the value returned from demo service 0, you get the total value. So the LLM was actually able to figure out the right aggregation for the list of data that was returned in the results, sum it appropriately, and give the right number back in natural language to the user.
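As an aside, that aggregation could also have been pushed into PromQL itself so that Prometheus returns a single total; here is an illustrative version, where the metric name follows the one described above and the window and URL are assumptions.

```python
# Illustrative: pushing the aggregation into PromQL itself, so Prometheus sums
# the per-instance increases before the LLM ever sees them. Names and the URL
# are assumptions for illustration.
import requests

PROM_URL = "http://localhost:9090"

# Per-instance increase over one day (roughly what the agent generated):
per_instance = "increase(demo_items_shipped_total[1d])"

# Single total across all demo service instances:
total = "sum(increase(demo_items_shipped_total[1d]))"

for q in (per_instance, total):
    data = requests.get(f"{PROM_URL}/api/v1/query", params={"query": q}).json()
    print(q, "->", [s["value"][1] for s in data["data"]["result"]])
```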
So this is another example that shows how it really shines: you're able to get instant, insightful data that you can understand while making use of the tool.
Now looking at the third example, the case where we were unable to get any response back. Again, this is the user request, or the user query: the user asks whether requests to the demo API are taking longer than usual. The AI agent then generated this query. Now, the query in terms of its syntax is valid; there's nothing wrong with it. It returns an empty data set, and that's not because there is no data available for the demo API, but because of how the LLM has interpreted the question and gone about writing the query incorrectly, and as such we get empty data back. Although it's a valid PromQL query, it doesn't answer the question that the user asked, and it doesn't give us any response back. So that's one of the limitations that we've discovered here. But there is a way to get around this, so we'll discuss that next.
In the next example, as you can see here, this is the PromChat application itself. When you come here, you're able to modify the LLM configuration, where you can set which provider you want to use and which model you want to use as well. In this case, for all the questions that we asked so far, we were using Google's Gemini, specifically Gemini 1.5 Flash, which is their fast, free model.
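To give a sense of what is being switched here, the configuration could be expressed roughly like this; the keys and model identifiers below are illustrative rather than PromChat's actual settings.

```python
# Illustrative LLM configuration for the two runs discussed here.
# Keys and model identifiers are assumptions, not PromChat's actual settings.
DEFAULT_CONFIG = {
    "provider": "google",
    "model": "gemini-1.5-flash",           # fast, free-tier model used so far
}

THINKING_CONFIG = {
    "provider": "google",
    "model": "gemini-2.0-flash-thinking",  # "thinking" model used for the retry
}
```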
That's what we had used so far to answer our questions. But we saw that Gemini 1.5 in this example was unable to generate the correct PromQL query for us. So we change the model to a thinking model. If you check here, based on the PromChat configuration, I flipped the model to use the thinking model provided by Google, which is the Gemini 2.0 Flash Thinking model instead. By making use of this model and asking exactly the same question, we can see that, if you look at this example, the AI agent is now actually able to answer the question correctly.
We'll look at the backend and see what has changed, but in this scenario you basically ask the same question, and based on the newer thinking model it was able to provide this insight and say: we can see that this particular /api path seems to have a lot of 500 errors, and its latency is also significantly higher than the other paths, which suggests that the path has an issue. As an SRE, you can imagine how insightful something like this is when you are currently trying to debug an incident and you're trying to figure out what's going on. Maybe you get a latency alert, or users basically start complaining that your system is slow. You can just fire up the tool, ask, and get insights like this.
Now, going back to the backend, we can see essentially what happened. From the backend you would see pretty much the same thing, the same user request, but if you pay attention to the generated PromQL, it generated a different PromQL that is more appropriate: it is considering the rate now, and, based on the type of the data, it is able to use the histogram metric correctly. As such, it was able to actually generate meaningful results this time around. The full response is way longer than this, because there are a lot of series that match the query. Essentially, based on all of this, for each path you can get the status and the value, and based on that the LLM was able to go over all of these results, similar to the first example. It was able to group them appropriately and then identify the outlier in the list after grouping, and as such it was able to figure out that, compared to the others, this /api path has more 500 errors, and its latency is also higher compared to the rest of the paths. So this shows how, by changing the LLM model that is used, using a more powerful model, you're able to get a better result in cases where the simpler models were unable to write useful queries.
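To illustrate the kind of rate and histogram queries involved, here are two examples in that spirit; the metric names, windows, and quantile are illustrative and not the exact query the agent produced.

```python
# Illustrative rate/histogram PromQL for spotting a slow, error-prone path.
# Metric names, windows, and the quantile are illustrative; they are not the
# exact query generated by the agent in the talk.
import requests

PROM_URL = "http://localhost:9090"

# 5xx request rate per path over the last 5 minutes.
error_rate = ('sum by (path) ('
              'rate(demo_api_request_duration_seconds_count{status=~"5.."}[5m]))')

# 90th percentile latency per path, computed from the histogram buckets.
p90_latency = ('histogram_quantile(0.9, sum by (path, le) ('
               'rate(demo_api_request_duration_seconds_bucket[5m])))')

for q in (error_rate, p90_latency):
    data = requests.get(f"{PROM_URL}/api/v1/query", params={"query": q}).json()
    print(q, "->", len(data["data"]["result"]), "series")
```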
Moving on to the fourth section, where we basically just discuss a summary of everything that we have learned from interacting with PromChat. The first part talks about the lessons that we have learned. You would see, from the architecture and the rest of the presentation, that not at any point did we attempt to retrain any of the models or do any sort of fine-tuning. What that tells us is that the LLMs available today are actually capable of writing PromQL queries on their own.
The other thing to note as well is that the only change we had to make was to use one-shot prompting, which is basically adding an example to ensure that the output we get from the LLM is formatted exactly how we want it. That's important because, initially, when working on the project, we were running into issues where the response coming from the LLM would have additional characters or tokens around it, and once you pass it to the Prometheus instance, it would no longer be valid PromQL, which would lead to crashes or issues because Prometheus cannot interpret the query. But after using one-shot prompting, where we are able to show the LLM exactly the format of the response, it started returning exactly the PromQL required, without any additional characters or tokens around it, and we were able to avoid having to manually extract the PromQL out of the LLM response.
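As a rough illustration, a one-shot prompt in this spirit could look like the sketch below; the wording and the worked example are assumptions rather than the project's actual prompt.

```python
# Illustrative one-shot prompt: a single worked example pins the output format
# so the model returns bare PromQL with no surrounding prose or code fences.
# The wording below is an assumption, not PromChat's actual prompt.
ONE_SHOT_PROMPT = """\
You translate monitoring questions into PromQL.
Return ONLY the PromQL query, with no explanation, backticks, or extra text.

Example:
Question: How many items have been shipped today?
PromQL: sum(increase(demo_items_shipped_total[1d]))

Question: {question}
PromQL:"""

def build_prompt(question: str) -> str:
    return ONE_SHOT_PROMPT.format(question=question)

print(build_prompt("Is today a holiday?"))
```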
Based on the last example, you would also realize that the thinking models are obviously better when it comes to writing complicated PromQL. The lighter models work for most cases as well, but when you want to ask more complicated questions, the more powerful the model and the more time it spends thinking, the better the PromQL it writes. We've seen that in the last example.
Some of the limitations that we observed in the course of working on this project: as you would see, being able to pull this off depends largely on the quality of the documentation that you add to your metadata. Prometheus will always have the metadata API available, and it will tell you, okay, these are the metrics that I have, and these are the types of those metrics. But if you don't add any help information to interpret them, for example, in the case of the demo service's is-today-a-holiday metric, if the documentation did not contain information saying that one means today is a holiday and zero means today is not a holiday, then there would have been no way for the LLM to correctly interpret the Prometheus response that it got.
The other challenging bit is that, even in cases where you have the documentation or help text included as part of the metadata, most of the time the labels are missing. Labels in this case are, if you compare them to traditional databases, something like the columns or the fields. Because you don't know what fields or labels are available in that particular metric, it makes it harder for the LLM to write correct queries, especially when you need to filter by things like an actual label value. That's one of the things that can easily be solved by essentially going over the documentation and adding as much useful information there as possible.
The other limitation noticed is that sometimes you might have inconsistency in the results that you get, because the queries generated differ slightly. That matters because if you frame your question slightly differently, the LLM can interpret it differently and generate a correspondingly different query, which would then give you a different result. But that can be eliminated by having more exact descriptions in your question. For example, say you ask: is there any endpoint currently returning 500s? If you don't put in a timeframe, the first time it might do it over an hour, or maybe five minutes, or maybe one minute. But if, for example, you ask specifically with a timeframe, then the PromQL generated by the AI agent will contain that exact timeframe, and you get the same response back. That's another limitation, but the way that can be improved upon is basically: the more exact your question is, the better the answers that you get.
In terms of future improvements: better support for complex queries. This includes things like being able to handle more complicated queries, and also being able to return more than text. Right now all the answers come back as text in natural language, but of course it might be useful to have a graph to look at from time to time. For more complicated queries, for example, it might be useful to return both the natural language answer and also some form of visualization. Also, right now the project is limited to just Prometheus, so that's the only metrics source it works with.
In terms of next steps, we're looking at expanding the project so that it supports more than just Prometheus as the source. And then lastly, I think another interesting improvement would be around the system being able to learn from user interactions. Imagine you ask a question and, for example, you didn't use the right labels, or it doesn't know which labels are available for that particular metric. So it says, okay, I can't answer this, and then you provide those labels. Right now there is no memory in the system, so it doesn't actually store that information. An extension would be that next time you don't have to go back and supply the same labels to the system for it to answer correctly. That's a future improvement when it comes to actually learning from user interactions. Another way that can go is, for example, when queries are wrong or the wrong metric was used and you correct it; all those kinds of interactions can be stored as the user's context, so that next time it's trying to answer questions it can make use of that, and the system gets better over time because it's learning from user interactions.
So lastly, how can you contribute and join? The source code is available on our GitHub profile; the project is open source. If you go to next I HQ on GitHub, you'll see the source code for the PromChat app. Issues and PRs are welcome as a means of contributing to the project. Also, the web interface that I was playing with, or that was shown in the slides, is available at promchat dot co, so you can visit it. You don't need authentication, payment, or anything. The only issue might be that, because it's using my own personal API key, there are limits to the number of daily requests, or sometimes, if a lot of people have been playing around with it earlier in the day, you might not be able to get any responses back from the API; it would tell you that the LLM credits are exhausted. But you can clone the project locally, put in your own API keys, and run it if you want to. Or, if you want to use the web interface, you can visit promchat dot nestslide dot co as well, and you'll be able to interact with it.
So that's it, that's it from me. Thank you very much for listening to the session. I hope you've learned a lot and have a better understanding of how to implement something like this. And yes, the answer to the question is: it is possible to chat with your monitoring metrics. I do hope you enjoy the rest of the conference. Thank you.