Conf42 Site Reliability Engineering 2022 - Online

Exposing Log-Metrics To Prometheus With Best Practice

Abstract

In this age of fast-growing advancement in cloud implementations, there is a great need to manage logs effectively. In some cases, you have to study the metrics to understand what the system is doing; this helps you understand your system for decision-making, post-mortem analysis, and several other interesting functions.

First off, you should understand that Vector is a high-performance, end-to-end (agent and aggregator) observability data pipeline that puts you in control of your observability data. It orchestrates collecting, transforming, and routing all your logs, metrics, and traces to any vendors you want today or tomorrow. Vector enables dramatic cost reduction, novel data enrichment, and data security where you need it, not where it is most convenient for your vendors. Additionally, it is open source and up to 10x faster than every alternative in the space.

Summary

  • There are a couple of ways to move log metrics to Prometheus. We're going to highlight some of the best practices today. By the way, welcome to Conf42. I'm an infrastructure engineer.
  • In this age of fast-growing advancement in the cloud, there is a very serious need for us to understand how logs and metrics work. Observability, or log-metric study, gives an opportunity to transition between vendors without disrupting the real workflow of the application or the service.
  • Another use case is enhancing data quality and improving insights, along with consolidating agents and eliminating agent fatigue. Observability is about insights; it gives an outlook of your application.
  • Logs can reveal the communication between services on a server. The last line before a crash could tell you exactly what is wrong with that service or server. When viewing metrics in SRE, each metric should serve a purpose.
  • Vector is a product that makes it easy for you to ship logs to Prometheus. Prometheus does not accept logs; it accepts metrics. Before Vector does the transformation to metrics, you need to find a way to parse the logs. Explore the Prometheus scrape to view the metrics on a dashboard.
  • A well-defined alerting strategy can help you achieve effective performance monitoring. Setting a threshold when you send an alert is very important. Set actionable alerts that require action. A good monitoring system pays dividends.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. I'm here today to talk about exposing log metrics to Prometheus with the best practice possible, I mean industry standard. There are a couple of ways to go about moving log metrics to Prometheus, but we're going to highlight a couple of best practices today, and I really do hope you enjoy the conversation. By the way, welcome to Conf42. This is about me: I'm an infrastructure engineer. I work with O(1) Labs on the velocity team, so I am involved in ensuring that developers' code gets to production as fast as possible. The name of our product is Mina Protocol. That's my Twitter handle and that is my email, so you can reach out to me anytime.

Well, let's get to what we have to do today. This is an introduction. In this age of fast-growing advancement in the cloud, of debugging applications and managing services in the cloud, there is a very serious need for us to understand how logs and metrics work. The software engineering field is not built one way; it's not built for you to just garbage in and garbage out. Some things have to happen in the process for you to get to where you need to get to. I'll give you an example. Assuming you need to build a house, there will be a need for you to do some testing, to do load testing, and to ensure that the grade of concrete required for a certain spot is what gets used there. It's the same thing in software engineering: it's not a straightforward thing. At some point you have to understand why certain things fail and how you could correct those issues, and that's where log metrics come into play. They are a huge part of observability and a huge part of the success of any running application. So we're going to talk about different things: understanding your system, creating post-mortem analysis from what you studied in the logs and metrics, and several other functions.

There are tons of ways to ship logs and metrics to Prometheus, but today our case study is Vector. Vector.dev is a very interesting product, and by the time I'm done breaking down certain things, you will see the importance of using Vector. What this picture is showing is the server, the pipeline, and then Prometheus. We have the server, which is the blue part, then the pipeline, where the application actually runs, and then Prometheus. So in this case, we're talking about the logs and metrics in the pipeline before they get to Prometheus.

Some of the real-life use cases concerned with what we're talking about today: the first is reduction of total observability cost. You could call it an advantage, but I like to see it as a use case, and when it comes to revamping or following up on observability cost, what we are discussing today is very important. Secondly, improving observability performance and reliability overall: reliability in the sense that if you have a good view of how your application runs, the logs, the communication between the services, then at some point you'll be able to tell that this is where your infrastructure is and this is where you want to get it to. A study of the logs at this point gives you the opportunity to take it where you want it to go. So yes, this is another real-life use case of our conversation today. Another one is transitioning vendors without disrupting workflows. Workflows in this sense are in a system.
We've got different things that are involved in running our system and ensuring it gets where it's supposed to get to, but observability, or log-metric study, gives us an opportunity to transition between vendors without disrupting the real workflow of the application or the service. So you could change, say, the services that receive the logs; you could change different things, different options on the system, just because of the logs. You've seen the logs, you've seen exactly what is happening, and you can tell: okay, this application does not seem to do what we want; can we change it without disrupting the key workflow or the key function of the system as a service?

Another use case is enhancing data quality and improving insights. Observability is about insights; it gives an outlook of your application, and if you have insights, you'll be able to tell where your application is going, where it is right now, and where it used to be before. With this you can produce documents, spreadsheets, post-mortem analyses and so on. The next one is consolidating agents and eliminating agent fatigue. An agent in this case is not a human being, of course, I expect everyone to understand that; an agent here just means something that represents a certain service. I'll give you an example, say Ansible: you have to install an agent for Ansible on servers to ensure that Ansible can communicate with each server. So in this case we're going to be talking about how to eliminate agent fatigue, so you don't have to stress out any agent of some sort on the server, and you'll be able to look at different options as far as observability is concerned. This is very important for the logs and metrics that we're going to get from the application.

Okay, so coming down home, let's first understand the relevance of logs in SRE. You understand what SRE means: SRE is just about maintaining the production status of any application in the cloud; in some companies they're called production engineers. Log data contains stories. Log data contains information such as memory exceptions and HTTP errors, and this is very helpful in identifying the why behind a problem, either one that a user has brought to our attention or one that we have uncovered. "We" in this case is the engineering team. What this is saying is that logs have it in themselves to reveal the communication between services on a server. They can also tell you if there are memory leaks, if there are HTTP errors, if we have memory exceptions, if we need to increase the memory, if we need to review how certain calls, such as cURL requests, are made on the network, or if we need to look at the time something failed. In some cases, you may run a certain server and it crashes at some point, and when it crashes you can't even get into the server to check what is wrong. So you start the server again, make sure you get into it while it's running, and then watch the logs until it crashes. The last line before the crash could tell you exactly what is wrong with that service or the server. It could be the image it's running, it could be the server configuration itself, it could be memory, it could be whatever.
But that last line gives you an insight into what is happening with the server and why it's crashing. That is the core relevance of logs in SRE. Brian Redmond said that being an expert is more than understanding how a system is supposed to work; expertise is gained by investigating why a system doesn't work. Everything around what Brian said is tied to logs, to understanding log data.

Viewing metrics in SRE: each exposed metric should serve a purpose. Resist the temptation of exporting a handful of metrics just because they are easy to generate. Exactly. This is just saying that when you're dealing with metrics, you don't just ship everything in the application or on the server. You have to ship what is relevant; you have to check what the team needs. Instead of giving us updates on how the servers are operating every moment, give us updates on when they are down, and probably why they are down. This conversation could stray into alerting and all that, but just understand the point that when you're viewing metrics, it has to be exactly what needs to be viewed, what is going to help you get a better handle on the infrastructure. Metrics are just one part of observability, and that's why you have the picture there: you also have traces and you have logs. All of these combine to make observability a very successful run.

I said that we're going to talk about Vector, and here it is: utilizing Vector to expose metrics. I should perhaps call it Vector.dev, but the general name is Vector. Off the top of my head, Vector is just a product that makes it easy for you to ship logs to Prometheus. But catch this: Prometheus does not accept logs, it accepts metrics. So Vector provides that pipeline to convert logs to log metrics and then take them to Prometheus easily, with just a couple of things exposed on the servers. That is a summary of what Vector does. You can check them out; they have very interesting documentation, and I've used them a couple of times in the past as well.

Some of the best practices we are going to dissect are steps in the implementation, but I'm just going to point out a couple of small things that will help you when you're initializing Vector to expose metrics. One of them is that you need to set up the web server configuration in vector.toml. You can get this from the documentation; it's pretty easy, and five or six lines should get it running for you. The next thing is parsing logs before the transform to metrics. Before Vector does the transformation to metrics, you need to find a way to parse the logs. Remember, Prometheus does not accept logs, only metrics, so if the logs are not parsed properly, Vector won't be able to transform them. You need to consider this, because sometimes I have seen scenarios where people just want to get the logs and metrics straight away; the logs need to be parsed first.
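To make those first two points concrete, here is a minimal vector.toml sketch. It is an illustration under assumptions, not configuration from the talk: the source and transform names, the log path, and the idea that the application writes JSON log lines are hypothetical, and the [api] section is one reading of the "web server configuration" mentioned above.

    # vector.toml (minimal sketch; names and paths are hypothetical)

    # Vector's built-in API/web server, useful for introspection (e.g. `vector top`).
    [api]
    enabled = true
    address = "127.0.0.1:8686"

    # Source: tail the application's log files.
    [sources.app_logs]
    type = "file"
    include = ["/var/log/my-app/*.log"]

    # Transform: parse each raw line into structured fields before any
    # log-to-metric conversion, assuming the app emits JSON log lines.
    [transforms.parse_logs]
    type = "remap"
    inputs = ["app_logs"]
    source = '''
    . = parse_json!(.message)
    '''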
The next thing is effectively counting log components and strings. These responses are what we can imagine collecting in any observability process. The logs that are going to be converted to metrics will contain a lot of information, but we need to understand what we want to see. Do we want to see the request status? Do we want to see the service status? Is it a 200? Is it a 404? Is it a 308? What exactly in the log do we want to see? That is what effectively counting log components and strings is about, and Vector makes it very easy: you can set up a counter that counts certain components. How many times did the service fail? How many times did it return a 200, and all those kinds of responses? This is how you skew what is sent to Prometheus so that it gives you solid information on what you need. You don't have to ship everything; if you do, it will all run and it will all go in, but it will be difficult to decipher from the system, or from the logs, exactly what you need. So this is about collecting logs effectively, and it will give you proper visibility on exactly what happens.

Explore the Prometheus exporter. This is general, whether you use Vector or not: Prometheus works by scraping exporters. You can bring Prometheus into the equation after the URL has been exposed by Vector for scraping, using the Prometheus exporter sink feature provided by Vector. Vector has an exporter sink that you can use to work with Prometheus. Then explore the Prometheus scrape to view the metrics on a dashboard. I don't have to say a lot about this; it's just about scraping the metrics that have been converted and viewing them.
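As a rough continuation of the same sketch, the counting and exporter steps could look like the snippet below. Again, this is a hedged example rather than the exact setup from the talk: the field name status, the metric name, the tag values, and the listen address are illustrative assumptions.

    # Transform: count interesting log components instead of shipping every line.
    # This increments a counter tagged by HTTP status (200, 404, 308, ...).
    [transforms.count_requests]
    type = "log_to_metric"
    inputs = ["parse_logs"]

      [[transforms.count_requests.metrics]]
      type = "counter"
      field = "status"
      name = "http_responses_total"

        [transforms.count_requests.metrics.tags]
        status = "{{status}}"
        service = "my-app"

    # Sink: expose the resulting metrics over HTTP for Prometheus to scrape.
    [sinks.prometheus]
    type = "prometheus_exporter"
    inputs = ["count_requests"]
    address = "0.0.0.0:9598"

Prometheus can then be pointed at that address with an ordinary scrape job, and the resulting series are what you would build dashboards and alert thresholds on.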
The next thing, which is the last one, is that you need to set actionable alerts. A well-defined alerting strategy can help you achieve effective performance monitoring. I touched on this briefly before, but I'm going to talk about it a bit more now. Setting up a threshold when you are sending an alert is very important. You're not just going to send things to Prometheus; I have to be able to be alerted, maybe on Slack through some sort of webhook, or on Discord, or on WhatsApp, or by email, whatever the case may be. I have to be alerted about what is happening in Prometheus. But you can't alert me on every behavior of the system, because anything could happen. Maybe there is a restart procedure, or a replica set if it's Kubernetes we're talking about, and the node or the pod restarts itself. You'll be bugging me if I get an alert every time it restarts or every time it scales. It would be so much, it would be overwhelming; Google calls it alert fatigue. So what I'm encouraging is that you set actionable alerts, alerts that require action. If we have downtime, that's something to be aware of. If we have a CrashLoopBackOff error, that's something to be aware of. Things like this, which cannot easily be handled automatically by the system, are what the engineering team should be made aware of. So set actionable alerts, alerts that will require your attention, not alerts that just tell you, oh, this is happening. You should also ensure that notifications are properly configured to reach the appropriate team in a timely manner. In some teams you have on-call engineers who can handle this while they are on call. I think that's the last thing; there are lots of things to say, but I don't want to bog you down with too much information.

I just wanted to give you five concise ways to get this running and be at the top of your game when it comes to shipping logs, or converting logs to metrics, using Vector.dev and Prometheus. Now, in conclusion, a good monitoring system pays dividends. It's well worth the investment to put substantial thought into what solution best meets your needs and to iterate until you get it right. The success of a good monitoring system, the success of observability, is tied to how well the team can sort through logs. I have been in teams where we didn't have many of the best engineers, but they were so good at debugging and post-mortem analysis, and logs and metrics gave them a hand, and you'd think, oh, they are senior engineers, because they could read logs and interpret what happened to a system. So we must understand the best practices when it comes to monitoring the system. Just like this conclusion says, a good monitoring system pays dividends.

I think I've come to the end of my talk. Gratitude to my co-researcher, Edima Mark, and to my company, O(1) Labs, where we have the opportunity to do these things in real life, and of course to the Conf42 organizers. I really do appreciate the time, and I really hope that I've been able to teach someone a thing or two. If you have any questions, you can reach out to me; you can tweet at me, send me a message, or send me an email. I'm always available. I use WhatsApp as well, but I just thought, why would I add my phone number here? Anyway, it was really good speaking to everyone, and I really do hope we have an even more interesting time listening to the other speakers at this conference. Thank you.
...

Samuel Arogbonlo

Senior Infrastructure Engineer @ O(1) Labs

Samuel Arogbonlo's LinkedIn account
Samuel Arogbonlo's Twitter account


