Conf42 Cloud Native 2024 - Online

Practical AI with Machine Learning for Observability in Netdata

At SREcon19, Todd Underwood from Google gave a presentation titled “All of Our ML Ideas Are Bad (and We Should Feel Bad)”. Let’s see a few ML ideas implemented in the open-source Netdata that may not actually be that bad.


  • Netdata is a monitoring tool that was born out of a need, out of frustration. In most monitoring solutions, in order to have something meaningful, you have to learn query languages. Netdata has 800 integrations to collect data from and auto-discovers everything. Today we'll talk about machine learning in observability.
  • A similar methodology happens for logs. For logs, we rely on systemd-journal. systemd-journal itself has the ability to create multiple centralization points across the infrastructure. From this point you can have fully automated multi-node dashboards and fully automated alerts for the whole infrastructure.
  • Google asked engineers what they expect machine learning to do for them. None of their ideas worked. Machine learning is the simplest solution for learning the behavior of metrics. A single anomaly is noisy, but when noisy anomaly rates trigger for a short period of time across multiple metrics, something larger is happening. How can we use this to help troubleshoot problems more efficiently?
  • One query does everything, both samples and anomalies. NIDL stands for nodes, instances, dimensions, and labels. You can change the aggregation across time. The same information is available as a tooltip.
  • Netdata also has a scoring engine. A scoring engine allows Netdata to traverse the entire list of metrics on a server and score them based on anomaly rate or similarity. This allows you to see the spread of an anomaly across your systems.


This transcript was autogenerated. To make changes, submit a PR.
Welcome. Today we are going to talk about machine learning in observability and what we did in Netdata. Netdata is a monitoring tool that was born out of a need, out of frustration. I got frustrated by the monitoring solutions that existed a few years ago, and I said, okay, what is wrong? Why does monitoring have these limitations? Why is it so much of a problem to have high-resolution, high-fidelity monitoring across the board? And I tried to redesign, to rethink, let's say, how monitoring systems work. So traditionally, a monitoring system looks like this. You have some applications or systems exposing metrics and logs. You push these metrics and logs to some database servers: a time series database like Prometheus, or a monitoring provider like Datadog, Dynatrace, New Relic, et cetera, or, for logs, Elastic or Loki, or commercial providers like Splunk, et cetera. So you push all your metrics and logs to these databases, and then you use the tools that these databases have available in order to create dashboards and alerts, to explore the logs, the metrics, et cetera. This has some issues. The biggest issues come from the fact that as you push data to them, not just metrics but also logs, they become a lot more expensive. So you have to be very careful. You have to carefully select which metrics you centralize, how you collect and visualize them, and how frequently. You have to be very careful about the log streams that you push to them, the fields that you index, et cetera. And then, of course, in order to have a meaningful monitoring solution, you have to learn query languages. You have to build dashboards metric by metric, alerts metric by metric, et cetera. So for the whole process, the first thing is that it requires skills. You have to know what you are doing.
You have to have experience in doing that thing, knowing what you need to collect, how frequently you have to collect it, knowing what each metric means. Because at the end of the day, you need to provide dashboards, you need to provide alerts. So you need to have an understanding of the metrics and the logs, et cetera, that you have available. It has a lot of moving parts, so a lot of, I don't know, integrations or stuff to install and maintain: database servers, visualizers and the like. And what happens for most companies is that the skills of the engineers they have are reflected in the quality of the monitoring they will get. So if you have good engineers that have a deep understanding of what they are doing and experience in what they do, you are going to have good monitoring. But if your engineers are not that experienced, or they don't have much time to spend on monitoring, then most likely your monitoring will be childish, will be primitive; it will not be able to help you when you need it. And of course the whole process of working with monitoring follows a development lifecycle. So you have to set the requirements, design the thing, develop it, test it, and then consider it production quality. What I tried with Netdata is to zap everything. So I said, oh come on, monitoring cannot work like this. Can we find another way to do it? So I created this application. It's an open source application that we install everywhere. This application, Netdata, has all the moving parts inside it. It has 800 integrations to collect data from, and it actually auto-discovers everything, so you don't need to go and configure everything. And it tries to collect as many metrics as possible. The second is that it collects everything per second. Of course there are metrics that are not per second, but this is because the data source does not expose this kind of granularity.
So the data source updates the metrics that it exposes every five or every ten seconds. Otherwise Netdata will collect everything per second, and it will collect as many metrics as it can. It has a database, a time series database, in it. There is no separate application; it's just a library inside Netdata that stores metrics in files. It has a health engine that checks metrics against common issues that we know exist for your applications and your systems. And it learns the behavior of metrics. This is what we're going to discuss: machine learning and how it learns. And of course it provides all the APIs to query these metrics and these logs and visualize them, and it also has all the methodologies to be a nice citizen in the observability ecosystem. It has integrations to push metrics even to Prometheus, even to other monitoring solutions, and also to stream metrics between Netdata servers. So it allows you to create centralization points within your infrastructure. If we look at how this works behind the scenes, it may seem complicated, but it's not that much. So you have a local Netdata running on a Linux system, for example. It will discover all metrics using all the plugins that it has. It will collect data from these sources. This is with zero configuration. This is just the default behavior. It will automatically detect anomalies and store everything in its own time series database. Once the data are stored in the time series database, it provides all these features. It learns from the metrics in order to feed the trained models into the anomaly detection. It checks the metrics for common issues, congestion, errors and the like. It can score the metrics, so it can use different algorithms to find the needle in the haystack when you need it. It can query all the metrics and provide dashboards out of the box. And Netdata actually visualizes everything by itself.
So every metric that is collected is also visualized, correlated and visualized in a meaningful way. It can export the metrics to third-party time series databases and the like. And it can also stream metrics to other Netdata servers, so metrics can go from one Netdata to another. So you can create metrics centralization points on demand. You don't need to centralize everything on one node across your infrastructure. You can have as many centralization points as required. This provides efficiency, mainly cost efficiency, because there are no egress costs, for example. But it also allows you to use Netdata in cases where you have ephemeral servers. So you have a Kubernetes cluster where nodes come up and go down all the time. Where are my data? If the data are on these servers and the server is offline, then where are my data? So you can have a parent, a centralization point where your data are aggregated, that is permanently available even if the data collection server is not available. A similar methodology happens for logs. For logs, we rely on systemd-journal. systemd-journal is something that we all use. Even if we don't know it, systemd-journal is there inside our systems. And systemd-journal is amazing for logs. Why? Because it's the opposite of what all the other log solutions do. For all the other log database servers, the cardinality, so the number of fields that are there and the number of values that the fields have, is important. The more you have, the slower it gets, the more resources are required, more memory, et cetera, et cetera. But for systemd-journal, the cardinality of the logs is totally irrelevant. systemd-journal is designed to index all fields and all values, even if each log line has a different set of fields and a different set of values. So it doesn't care about the cardinality at all. It has been designed first to be secure.
It has sealing and tamper detection and a lot of features to improve security. And at the same time, it is designed to scale independently of the number of fields. The only drawback, of course, if you push huge cardinality to systemd-journal, is the disk footprint, but not CPU, not memory. So it's there, it's inside our systems. So what we do is the following. The first is that Netdata can use journal files: it can query journal files without moving the logs to another database server. So we don't have a logs database server, we rely on systemd-journal. And the second is that if you have text files and you want to push them to systemd-journal, we provide log2journal, a command line tool that you can configure to extract structured information from text log files and push it to systemd-journal. systemd-journal itself has the ability to create multiple centralization points across the infrastructure, much like Netdata. So while Netdata can do this with streaming, to push metrics from one Netdata agent to another, systemd-journal has the same functionality for logs. It provides systemd-journal-upload, which pushes logs to another journald, and it also provides systemd-journal-remote, which ingests these logs and stores them locally. If you put Netdata on a parent, let's say on a log centralization point, Netdata will automatically pick up all your logs. So the idea with this setup is that you install Netdata everywhere, on all your servers. If you want to centralize, if you have ephemeral nodes, et cetera, you have the methodology to create centralization points, and not just one. You can configure as many centralization points as is optimal for your setup in terms of cost or complexity, or none if you don't require any. And then the whole infrastructure becomes one. How does it become one? It becomes one with the help of the SaaS offering that we have. Of course, it has a free tier too.
So you install Netdata everywhere, and all these are independent servers. But then Netdata Cloud can provide dashboards and alerts for all of them, for metrics and logs. And if you don't want to use the SaaS offering, you can do the same with a Netdata parent. So with the same software that is installed on your servers, you can build a centralization point. You can centralize metrics and logs here, and this thing will of course do all the ML stuff and whatever else is needed. And from this point you can have fully automated multi-node dashboards and fully automated alerts for the whole infrastructure. If your setup is more complex, you can do it like this. In this setup there are different data centers or cloud providers, so this is a hybrid setup or a multi-cloud setup. Again, you can have multiple parents all over the place if you want, and then you can use Netdata Cloud to integrate the totally independent parents. If you don't want to use Netdata Cloud, again you can use a Netdata grandparent. But this time the grandparent needs to centralize everything. Now, what this setup provides is the following. The first thing is that we managed to completely decouple cardinality and granularity from the economics of monitoring, of observability. So you can have as many metrics as you want. Netdata is about having all the metrics available. If a metric is available, if there is a data source that exposes a metric, this is the standard policy that we have: grab it, bring it in, store it, analyze it, learn about it, attach alerts to it, et cetera. So all metrics in full resolution, everything per second, for all applications, for all components, for all systems, and even the visualization is per second. So take the data collection to visualization latency. While in most monitoring solutions it's a problem, in Netdata you hit enter on a terminal to make a change and boom, it's immediately on the dashboard. It's less than a second.
That is the time required from data collection to visualization. The second is that all metrics are visualized. So you don't need to do anything, you don't need to visualize metrics yourself. Everything is visualized, everything is correlated. The moment we create plugins for Netdata, we attach to them all the metadata required for the fully automated dashboards and visualization to work out of the box for you. The next is something we're going to see in a while: our visualization is quite powerful, so you don't need to learn a query language. You can slice and dice the data, any data set actually, on Netdata charts with just point and click. And actually Netdata is the only tool that is totally transparent about where data are coming from and whether there are missed samples somewhere. So all these work out of the box for you, including alerts. In Netdata, when we build alerts, we create alert templates. We say, for example, attach these alerts to all network interfaces, attach these alerts to all disk devices, to all mount points, attach these alerts to all NGINX servers or all Postgres servers. So we create templates of alerts that are automatically attached to your instances, your data. And of course we don't use fixed thresholds. Everything is about rolling windows and statistical analysis and the like in order to figure out whether we should trigger an alert or not. Some people may think, okay, since this is an application that we should install everywhere, on all our servers, and it has a database in it and it has machine learning in it, then it must be heavy. No, it's not. Actually, it's lighter than everything. Here we have a comparison with Prometheus as a centralization point. You can see that we tested it with 2.7 million samples per second, 2.7 million time series, everything collected per second, 40,000 containers, 500 servers.
And you see that Netdata used one third less CPU compared to Prometheus, half the memory, 10% less bandwidth, and almost no disk I/O compared to Prometheus. This means that when Netdata writes data to disk, it writes them at the right place. It compresses everything, it writes in batches, in small increments, and puts everything in the right place in one go. And at the same time, it has an amazing storage footprint. The average sample for the high-resolution tier, for the per-second metrics, is 0.6 bytes per sample. So every value that we collect needs just 0.6 bytes on disk, less than a byte, a little bit above half a byte. Of course, this depends on compression, et cetera. And what I didn't tell you is that on our blog we have a comparison with all the other agents that Dynatrace has, or Datadog has, or New Relic has, et cetera, a comparison on resources: what resources Netdata requires from you when it is installed. And Netdata is among the lightest. It's written in C, and the core is highly optimized. With ML, with machine learning enabled, it's the lightest across all the agents. So this means that with Netdata you can now build a distributed pipeline of all your metrics and logs without all the problems that you have from centralizing metrics and logs. So you can have infinite scalability, or let's not be arrogant, virtually infinite scalability, and at the same time have high-fidelity monitoring out of the box. You can even install Netdata mid-crisis. So you have a problem and you don't have monitoring in place: install Netdata, and it will tell you what is wrong. So let's move on to AI. In 2019, Todd Underwood from Google gave this speech. What Google did is that they gathered several engineers, SREs, DevOps and the like, and they asked them what they expect machine learning to do for them. And it turned out that none of their ideas worked.
Why didn't they work? Because when people hear machine learning, the expectations they have are a little bit different from what machine learning can actually do. So let's understand first how machine learning works. In machine learning, you train a model based on sample data. So you give some samples to it, some old data, and it trains a model. Now, the idea is that if you give new data to it, it should detect whether the new data are aligned with the patterns it saw in the past or whether they are outliers. If they are outliers, then you have to train it more in order to learn the new patterns, and repeat the process until you have the right model. What most people believe is that machine learning models can be shared. So assume that you have a database server. You can train a machine learning model on one database server and apply the trained model to another to detect outliers and anomalies. The truth is that you cannot. Let's assume that we have two database servers, A and B. They run on identical hardware, they have the same operating system, they run the same application, a database server, Postgres, same version. They have exactly the same data, so they are identical, both of them. Can we train a model on A and apply this model on B? Will it be reliable? Most likely not. Why not? Because the trained model has the workload incorporated into it. So if the workload on B is slightly different, say it runs some statistical queries, some reports that A doesn't, or if there is a load balancer or clustering software that spreads the load not completely equally among the two, then the machine learning model that was trained on A will not work on B. It will give false positives. So what can we do if this is the case? Let's understand the following. The first is that machine learning is the simplest solution for learning the behavior of metrics.
So given enough data, enough samples, it can learn the behavior of the metrics. If you grab a new sample and you give it to the model, it can tell you whether this sample that you just collected is an outlier or not. This is what anomaly detection is. You train the model, you collect a new sample, you check it against the model, and you have true or false: it is an anomaly or not. Now, the whole point of this is how reliable it is, whether it is accurate. And it's not accurate. If you train one machine learning model and you just give collected samples to it, it has some noise. So by itself, an anomaly should not be something to wake you up at 3:00 a.m., because it will happen. Of course, you can reduce the noise by training multiple machine learning models. So you train multiple models, and then when you collect a sample, you check it against all of them. If all of them agree that this sample is an anomaly, then you can say, okay, this is an anomaly. Still, there are false positives. Still, you should not wake up at 3:00 a.m. But look what happens. What we realized is that sometimes these inaccurate anomaly rates, these noisy anomaly rates, trigger for a short period of time across multiple metrics. So it's not random anymore. It's multiple metrics that, for a period of time, all trigger together. They all say "I am anomalous" together. Then we know for sure that something anomalous is happening at a larger scale. It's not just a metric now. It's the system level or the service level or the application level that triggers a lot of anomalies across many metrics. So how can this help us? This is what we did in Netdata.
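The consensus idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk does not say which kind of model is trained, so a plain mean and standard deviation (z-score) model stands in for it here, and the names `train_model`, `is_outlier`, `consensus_anomaly` and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, pstdev

def train_model(samples):
    """Summarize a training window as (mean, standard deviation).
    Assumption: a z-score model stands in for the real model type."""
    return mean(samples), pstdev(samples)

def is_outlier(model, value, threshold=3.0):
    """True when value is more than `threshold` deviations from the mean."""
    mu, sigma = model
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

def consensus_anomaly(models, value):
    """Flag a sample only when every trained model agrees it is an outlier."""
    return all(is_outlier(m, value) for m in models)

# Hypothetical metric history: a busy period (values near 30) happened a
# while ago, followed by a quiet period (values near 10).
history = [10, 11, 9, 10, 30, 31, 29, 30, 10, 11, 9, 10, 11, 10]

short_model = train_model(history[-6:])  # only remembers the quiet period
long_model = train_model(history)        # also remembers the busy period
models = [short_model, long_model]

for value in (30, 100):
    print(value, consensus_anomaly(models, value))
```

In this sketch the value 30 looks anomalous to the short-window model but not to the long-window one, so the consensus suppresses it, while 100 is flagged by both. That is the noise reduction the talk describes: a single model's false positive is dropped unless the other models agree.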
So we train multiple machine learning models, and we try to make anomalies useful, not because one metric had one anomaly at some point, this is nothing, but because a lot of metrics are anomalous at the same time, and we try to use this to help people troubleshoot problems more efficiently. So how do we use it in Netdata? The first is this. You understand that the moment you install Netdata and run it, it comes up with hundreds of charts that you will probably see for the first time. So how do you know what is important? This is the first question that we try to answer. You just sit in front of this amazing dashboard, with a lot of charts, hundreds or thousands of metrics, and hundreds of charts. What is important? Machine learning can help us with this, and we will see how. The second is that you face a problem. I know that there is a spike or a dive, or that for this time frame there is a problem. Can you tell me what it is? In most monitoring solutions, to troubleshoot this issue, you go through speculation. So if you use, for example, Prometheus or Grafana, you see a spike or a dive in your web server responses, or increased latency in your web server responses. There you start speculating. What if it is the database server? Oh no, what if it is the storage? And you start speculating, making assumptions and then trying to validate or drop these assumptions. What we tried with Netdata is to flip it completely. So you highlight the area, the time you are interested in, and Netdata gives you an ordered list, a sorted list of what was most anomalous during that time, hoping that your aha moment is within the first 20 or 30 entries. So the idea is that instead of speculating about what could be wrong in order to figure it out and solve it, we go to Netdata and Netdata gives us a list of what is most anomalous during that time, and there is our aha moment.
The disk did that, the storage did that, or the database did that: it is right in front of our eyes. The next is how we can find correlations between components. What happens when this thing runs? What happens when a user logs in? What is affected? Because if you have a steady workload and then suddenly you do something, a lot of metrics will become anomalous. And this allows you to see the dependencies between the metrics immediately. So you will see, for example, that the moment a cron job runs, a lot of things are affected. A lot of seemingly independent metrics get affected. So let's see them in action. Netdata trains 18 machine learning models for each metric. So on a default Netdata you may have 3,000 to 4,000 metrics on a server, and for each of these 3,000 to 4,000 metrics, it will train 18 machine learning models over time. Now, these machine learning models generate anomalies, and the anomaly information is stored together with the samples. So every sample on disk says, okay, this is the value I collected, and it was anomalous or not anomalous. It's one bit: anomalous or not anomalous. Then the query engine can calculate the anomaly rate as a percentage. So when you view a time frame for a metric, it can tell you the anomaly rate of that metric during that time. It's a percentage: the number of samples that were anomalous versus the total number of samples. And it can also calculate a host-level anomaly score. The host-level anomaly score is about all the metrics becoming anomalous together, as we were discussing before. Now, the Netdata query engine calculates the anomaly rates in one go. This is another thing that we did. The moment you query a chart for the samples that you want, whatever anomaly information is there is returned together with them. It's in the same output. One query does everything, both samples and anomalies. Now let's see a chart.
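The anomaly bit and the anomaly rate described above can be sketched as follows. This is a minimal illustration, not Netdata's storage format or query API: the `Sample` class and the `anomaly_rate` and `query` functions are hypothetical names, and the only thing taken from the talk is that each sample carries one anomalous or not anomalous bit, and that the rate is the share of anomalous samples in the queried window.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    value: float
    anomalous: bool  # the anomaly bit stored next to the collected value

def anomaly_rate(samples):
    """Percentage of samples in the window whose anomaly bit is set."""
    if not samples:
        return 0.0
    flagged = sum(1 for s in samples if s.anomalous)
    return 100.0 * flagged / len(samples)

def query(samples, start, end):
    """Return the values and the anomaly rate of samples[start:end] together,
    mirroring the 'one query does everything' idea."""
    window = samples[start:end]
    return [s.value for s in window], anomaly_rate(window)

# Hypothetical per-second series with two anomalous samples in the middle.
series = [
    Sample(1.0, False),
    Sample(1.1, False),
    Sample(9.0, True),
    Sample(8.5, True),
    Sample(1.0, False),
]

values, rate = query(series, 0, 5)
print(values, rate)  # two of five samples are flagged, so the rate is 40.0
```

Because the bit is stored at collection time, querying yesterday's window returns yesterday's verdicts without re-running any model.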
This is a Netdata chart. It looks like any chart, I think, from any monitoring system, but there are a few differences. Let's see the differences. The first thing is that there is an anomaly ribbon. This ribbon shows the anomalies, how many samples were anomalous across time. So this is for some time frame here, and you can see that at this moment there were anomalies, and at this moment, and at this moment there were anomalies. The second thing is that we created this NIDL framework. NIDL stands for nodes, instances, dimensions, and labels. Now look what this does. The moment you click nodes, so I clicked here on nodes, you get this view. This view tells you the nodes the data are coming from. This is about transparency. So you see a chart in Netdata and you immediately know which nodes contribute data to it, and as you will see, it's not just nodes, it's a lot more information. So for the nodes that are contributing data to it, you can see here how many instances; this chart is about applications, so about 20. It comes from 18 nodes, and you can see the number of metrics per node. You can see the volume. So this chart has some volume in total. What's the contribution of each node to the total? If you remove Bangalore from the chart, you are going to lose about 16% of the volume. And here is the anomaly rate, so for each of the nodes, the anomaly rate of the metrics of that node. Of course we have minimum, average, maximum, et cetera, for all metrics involved. And if you move on, the same happens for instances. So here, where we have applications, you can see that each application has two metrics, and you can immediately see, okay, SSH is anomalous on this server. And of course the same happens even for labels, not only for label keys but also for label values. So again you can see the volume, you can see the anomaly rate, and minimum, average, maximum values for everything.
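The per-node view described above boils down to two small calculations: each node's share of the chart's total volume, and a minimum, average, maximum summary of the anomaly rates of its metrics. A rough Python sketch follows. Only the node name Bangalore and the 16% figure come from the talk; the other node names, all numbers, and the function names `contributions` and `summarize` are made up for illustration.

```python
def contributions(volume_by_node):
    """Each node's share of the chart's total volume, as a percentage."""
    total = sum(volume_by_node.values())
    if total == 0:
        return {}
    return {node: 100.0 * v / total for node, v in volume_by_node.items()}

def summarize(rates):
    """Minimum, average and maximum of a list of anomaly rates."""
    return min(rates), sum(rates) / len(rates), max(rates)

# Hypothetical volumes per node; removing "bangalore" loses 16% of the total.
volume_by_node = {"bangalore": 16.0, "frankfurt": 50.0, "london": 34.0}
shares = contributions(volume_by_node)

# Hypothetical anomaly rates (percent) of the metrics one node contributes.
node_rates = [0.0, 2.0, 10.0]
low, avg, high = summarize(node_rates)

print(shares["bangalore"], low, avg, high)
```

The same two calculations apply unchanged when the grouping key is an instance, a dimension, or a label value instead of a node.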
Now, the same information is available as a tooltip. You hover over a point on the chart, and together with the values that you normally see, you have the anomaly rate of that point, for each of the time series, of the dimensions of the chart. Now, if we go back to the original chart that I showed you, you have more control here. You can change the aggregation across time. If you zoom out the chart, it has to aggregate across time because your screen has 500 points, but behind the scenes in the database, if this is per second, there are thousands and thousands of samples. So you can change the aggregation across time. You can say minimum or maximum here, so you can reveal the spikes or the dives, and you can change the group by and the aggregation. So you can pivot the chart. It's like a cube: you see it from different angles. So let's continue. Netdata also has a scoring engine. A scoring engine allows Netdata to traverse the entire list of metrics on a server and score them based on anomaly rate or similarity. We have many algorithms there. We also have a metric correlation algorithm that tries to find similarity in changes. So you highlight a spike and you say, correlate this with anything else, and it will find the dive, because the rate of change is similar. Now, how do we use this? The first thing is that this Netdata dashboard has one chart below the other and a menu where all the charts are segmented, as you can see. I don't see the number, but I think it's 500 charts or something like that. These charts are all in sections. You press a button and Netdata will score them according to their anomaly rate to tell you in which sections.
Which sections are anomalous, and how much. This allows you, if you have a problem, for example, to just go to the Netdata dashboard that reflects the current state, say, the last five minutes or the last 15 minutes. You press that button and you will immediately see which metrics across the entire dashboard are anomalous, so that you can check what's happening, what's wrong. The next is the host anomaly rate. For the host anomaly rate, what we do is calculate the percentage of metrics on a server that are anomalous concurrently. What we realized then is the following: anomalies happen in clusters. So look at this, for example. These are servers. Every line is a different server, but you see that the anomalies happen close together. This is up to 10%. So 10% of all the metrics collected on a server were anomalous at the same time. And as you see, it happened for each server with a little delta, it happened concurrently: one server spiked to 10%, but a lot of other servers spiked to 5%. Now look what happens when you view this dashboard. What you can do is highlight an area. So here we have highlighted from here to there, and Netdata will score the metrics. It will traverse all the metrics one by one, score them for that little time frame, calculate the anomaly rate, and then provide a sorted list of what is more important, what is more anomalous for that time frame. The whole point of this is to provide the aha moment within the top 20 or 30 items of the list. So instead of speculating about what could be wrong to cause this issue, Netdata tries to figure this out for you and gives you a list of the most anomalous things for that time frame, so that your aha moment is there within that list. Now, the highlights. The first is that the ML in Netdata is totally unsupervised, so you don't need to train it.
It is trained for every metric, multiple models for every metric, and it learns the behavior of metrics for the last few days. So you just let Netdata run, you don't need to tell it what is good or what is bad, and Netdata will automatically start detecting anomalies based on the behavior of metrics over the last two or three days. It is important to note that this is totally unsupervised. You don't need to do anything. Of course, if an anomaly happens it will trigger, but then after a while it will learn about it, so it will not trigger again. But if it happens for the first time in the last few days, it will detect it and reveal it for you. The second is that Netdata can start immediately, within minutes. So you install it, and after 10 or 15 minutes it will start triggering anomalies. It doesn't need to train all 18 models to detect anomalies; even one model is enough to trigger anomalies. But as time passes it becomes better and better, so it eliminates noise. The next is that it happens for all metrics. Every single metric, from database servers, web servers, disks, network interfaces, system metrics, every single metric gets this anomaly detection. The anomaly information is stored in the database. So you can query the anomalies of yesterday, not based on today's models but on yesterday's models, so as the models were at the time the anomaly was triggered. There is a scoring engine that allows you to score metrics across the board. So you are looking for what is most anomalous now, or what is most anomalous for that time frame, or you want to find something that is similar to this. All these queries are available with Netdata. And it has the host-level anomaly score that allows you to see the strength and the spread of an anomaly across your systems, inside each system but also across systems.
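The host-level anomaly rate and the scoring of a highlighted time frame can be sketched like this, building on the stored anomaly bits. It is a simplified illustration under stated assumptions: metric names are invented, each metric is reduced to a list of 0/1 anomaly bits per second, and the functions `host_anomaly_rate` and `rank_by_anomaly_rate` are hypothetical, not Netdata's API or scoring algorithms.

```python
def host_anomaly_rate(bits_by_metric, t):
    """Percentage of a host's metrics whose anomaly bit is set at index t."""
    if not bits_by_metric:
        return 0.0
    flagged = sum(1 for bits in bits_by_metric.values() if bits[t])
    return 100.0 * flagged / len(bits_by_metric)

def rank_by_anomaly_rate(bits_by_metric, start, end):
    """Score every metric over [start, end) by its anomaly rate and return
    (metric, rate) pairs sorted with the most anomalous first."""
    scores = []
    for name, bits in bits_by_metric.items():
        window = bits[start:end]
        rate = 100.0 * sum(window) / len(window) if window else 0.0
        scores.append((name, rate))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical host with four metrics and six seconds of anomaly bits.
host = {
    "disk.io":          [0, 0, 1, 1, 1, 0],
    "net.packets":      [0, 0, 0, 1, 0, 0],
    "cpu.user":         [0, 0, 0, 0, 0, 0],
    "postgres.queries": [0, 0, 1, 1, 0, 0],
}

print(host_anomaly_rate(host, 3))        # 3 of 4 metrics flagged at t=3
ranking = rank_by_anomaly_rate(host, 2, 5)  # the highlighted window
for name, rate in ranking:
    print(f"{name}: {rate:.1f}%")
```

The sorted list is what puts the likely culprit near the top: in this made-up window `disk.io` is anomalous for the whole highlight, so it ranks first, ahead of the metrics that were only briefly affected.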
So what are we going to do next? This is all there, this works, you can try it in Netdata, it's open source software. And actually it's amazing because you don't have to do anything: just install it and it will work for you. We are adding machine learning profiles. We see users that are using machine learning in Netdata for different purposes. Some people want it for security, some people want it for troubleshooting, some people want it for learning special applications, training on special applications, et cetera. So we are trying to create profiles, so that users can have different settings for machine learning according to their needs. Of course, there are many settings available now, but they are applied to all metrics. Everything is the same. The second is that we want to segment this across time. So instead of learning the last two days and then detecting anomalies based on the total of the last two days, we want to learn Mondays, to learn Tuesdays, and so detect anomalies based on Monday's models or Monday morning models. This profiling will allow better control in many industries where the days are not exactly similar, so they have some big spikes on Wednesdays and the systems are totally idle on Tuesdays. That's it. So thank you very much. I hope you enjoyed it.

Costa Tsaousis

Founder & CEO @ Netdata
