Conf42 DevOps 2024 - Online

Observability Standardization: the elephant is still in the room!

Observability standardization: Everyone is talking about open standards and protocol compatibility, but the elephant is still in the room. Are we really so different that each of us requires a custom-made monitoring solution to monitor our infrastructure?


  • A typical observability setup looks like this: time series databases or logs databases, plus dashboards. Some complexity also exists for alerting. What you need is a deep understanding of the metrics and a lot of best practices.
  • Most of the monitoring tools that exist today force us to think this way. This lack of standardization shifts our focus: instead of improving our infrastructure, we spend most of our time in a sea of challenges around the monitoring itself.
  • Netdata is an open source tool that provides opinionated observability. The idea of Netdata was to solve all the problems we saw so far. Netdata's thesis is that we all have a lot in common; even for custom applications, this becomes increasingly true.
  • Netdata aims to kill the console for monitoring. High fidelity monitoring means everything collected every second, all the metrics visualized directly on the dashboard, unsupervised anomaly detection, and the ability to be installed mid-crisis.
  • Netdata is a lot faster than, for example, Prometheus, and requires a lot fewer resources. This is because it needs no WAL (write-ahead log) and does not write to disk all the time; it relies on streaming and replication for high availability. Netdata is also one of the most energy-efficient platforms.
  • The first goal of machine learning in Netdata is to learn the patterns of the metrics. All 18 machine learning models need to agree that a sample is an outlier before it is reported as one. A Netdata dashboard visualizes every metric in a fully automated way.
  • The Anomaly Advisor is a tool we developed to find the needle in the haystack. It scores all the metrics and figures out the anomaly rate of each chart, which allows you to quickly spot anomalies.
  • Netdata on a single node requires about 5% of a single node's CPU utilization and about 100 megabytes of RAM. We want it to be extremely thin compared to what it monitors, so that it is affordable for everyone.


This transcript was autogenerated.
Hi. Today we are going to discuss observability standardization, and we are going to look at what I call the elephant that, I believe, is still in the room.

A typical observability setup looks like this. We have applications or systems exposing metrics. Then we have time series databases or logs databases where we centralize all our metric and log data. On top of those we have a dashboarding tool that queries the time series or logs and creates beautiful charts. And then we have an alerting engine that does the same: it queries the time series data and sends notifications if something exceeds a threshold.

This may sound simple, but there is hidden complexity here. For the time series databases, you have to maintain the database server itself; you have to deal with high cardinality and its memory requirements, with the disk I/O requirements, and you have to think about clustering and high availability. For the logs database you have the same, plus query performance issues, so in many cases you have to maintain indexes on the log streams, manage retention, and again handle clustering and high availability. For the dashboarding tool, you first have to learn the query languages of the database servers so that you can query them and create useful dashboards from the data they have. These query engines are good at converting counters to rates, aggregating and correlating metrics with aggregation functions, pivoting them, and grouping them for presentation.
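For example, even a basic "bytes received per second, per node" panel already requires knowing the query language. A typical PromQL expression for this looks roughly like the following (assuming the standard node_exporter metric `node_network_receive_bytes_total`; the 5m window is an arbitrary choice):

```promql
# convert a per-interface byte counter into a per-second rate,
# then aggregate the interfaces of each node together
sum by (instance) (
  rate(node_network_receive_bytes_total[5m])
)
```

This is exactly the kind of counter-to-rate-plus-aggregation knowledge the talk is describing: none of it is about your infrastructure, all of it is about the monitoring system.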
So you have to learn a query language, and you also have to learn what the visualization tool can do: what kinds of visualizations are available, what customizations are available, so that in the end you have a meaningful monitoring solution. A similar kind of complexity exists for alerting.

While building all this, what you need first is a deep understanding of the metrics you have. You need to know what the database contains, what labels are there, and how you should query them. To maintain this processing pipeline, you go through various configuration steps in different phases; for example, to reduce the cardinality of some metrics, you may have to relabel them. You have to learn the query languages we mentioned, you need a very good understanding of the tools, and of course you need experience. If you have never done this before, most likely you are doomed. You need to know what a heat map is, what a histogram is, how to visualize this kind of data. You need a lot of best practices for any of this to actually be useful to people while troubleshooting.

If you are at scale, you have additional problems: scalability challenges, how the database servers can be scaled, how to have high availability. You have to deal with data overload and the noise that may exist in all these dashboards and alerts. There is a cost management aspect: at some point you may need to cherry-pick what is really useful for you, because otherwise the cost of the monitoring infrastructure will skyrocket. And of course you also have to take care of compliance and security.
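As one concrete illustration of the "relabel to reduce cardinality" step, a Prometheus scrape job can drop a high-cardinality label before ingestion. The job name, target, and label here are hypothetical placeholders:

```yaml
scrape_configs:
  - job_name: my_app                 # hypothetical job name
    static_configs:
      - targets: ["app:9100"]        # hypothetical target
    metric_relabel_configs:
      # drop a hypothetical per-request "session_id" label so that
      # each request does not create a new time series
      - action: labeldrop
        regex: session_id
```

Knowing that this knob exists, where it lives, and when to reach for it is precisely the kind of expertise the setup demands.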
So you have to check what kind of information is in your logs, who has access to them, and so on. If you are in a team, there is an additional layer: you have to come up with monitoring policies, agree on a unified monitoring framework, and have some kind of change management and shared responsibility for the dashboards, the alerts, and the data in these database servers. You have to document everything, so others know how it works, what they should do, and how it needs to be changed. And you need quite some discipline to always follow your design principles; otherwise it usually becomes a mess.

To illustrate this, I asked ChatGPT to come up with the phrases that engineers working on such a monitoring setup have in their minds every day: the challenges they usually face. ChatGPT rated each phrase by how frequently it comes up, and I turned the result into a word cloud. It came up with things like: balance scrape intervals, structure multidimensional queries, test and validate queries, monitor exporter health, time series database I/O bottlenecks, dynamic alert thresholds, regular expressions for label matching, federation efficiency, fine-tune chart axes and scales, filter metrics in exporters. You see, it is all the details people have to go through in order to build monitoring with such a setup.

I also asked ChatGPT what people who monitor infrastructure should usually be thinking about, independently of the monitoring system: what anyone responsible for the performance, stability, and availability of some infrastructure should have in mind in general. And ChatGPT came up with another list.
This list is quite different: incident response, application stability, data encryption, cost optimization, service availability, client satisfaction, operational efficiency. It is completely different; it is another level, actually. This, I think, is the gap of the current monitoring tools. Most of the monitoring tools that exist today force us to think like the first list. Those are the challenges we face every day: how to do this little thing, how to create a heat map query for quantiles, how to configure external labels. These are details of the internals of the monitoring system. The second list is what we need our engineers to think about. This is their primary role; this is what they need to achieve and focus on.

Now, some may think: wait a moment, we have a lot of standardization bodies taking care of this. Of course we do; we have, for example, OpenTelemetry, we have the CNCF, we have the W3C, et cetera. But if you actually check what all these standardization bodies do, it is the following. Remember the graph we started with? They focus on data exchange. All our standardization effort is about data exchange: let everyone have all the data, and if you need the data, take them. But all the complexity we discussed lies outside this. Of course, data exchange is an enabler; we need it. It is good to have data compatibility and to be able to exchange data, because otherwise everything is a lot more complex. But the most important aspects of efficient monitoring are not there. So in practice we have no standardization where it matters. This lack of standardization is what shifts our focus: instead of improving our infrastructure, we spend most of our time in a sea of challenges around the monitoring itself, around how to achieve something with the monitoring system.

I also researched what analysts say.
Analysts have this DevOps role that is supposed to be the glue, supposed to fix the gaps in all this. What analysts believe is that a DevOps engineer is a data scientist who understands software engineering and is an IT network architect at the same time: someone who understands technology as a sysadmin, software as a developer, and data science as a data scientist, a combination of all three. So what analysts are effectively saying is that there is a guy, or a girl of course, who knows what a coefficient of variation is, who knows what IPC (instructions per cycle) is and can somehow figure out whether a running production application is memory bound or CPU bound, and who at the same time understands the utilization of an NVMe or SSD disk, what IOPS are, and under which conditions such a disk can get congested.

To my understanding, the whole thing is a utopia. People who actually know the vast area of data science, have huge experience in software engineering, and have vast experience in IT infrastructure and network architecture barely exist. Some bigger companies solve this problem with a lot of roles: they have an army of people to take care of a few things like monitoring. But for small and medium-sized companies, and also for a lot of Fortune 500 companies that do not invest in this, this is a breaking point. Engineers with all this magical knowledge, experience, and skill simply do not exist.

The result for most companies is something like this: monitoring is very expensive and a constant struggle. It is never complete, never finished, never enough, while being extremely expensive, and it requires a lot of skills.
And actually, for most companies, the quality of the monitoring they have reflects the skills of their engineers. If they have good engineers, they have good monitoring. If the engineers are not that experienced and do not have that many skills, then the monitoring is a toy, a joke. Frequently we see, even in very big companies, that monitoring is severely under-engineered, next to zero: they have some kind of visibility on the workload, but that is it; that is where it starts and ends. And it is frequently illusional. I say illusional because I experienced this myself. Before I built Netdata, I had spent quite some time, money, and effort on monitoring, and I had built a great team. But at the end of the day, I had the impression that everything I had built, all the dashboards and all the tools and everything that had been installed, was there just to make me happy, because I could not troubleshoot anything with it. It was not useful for what I needed. I still see this today with many companies we cooperate with: the monitoring gaps and inefficiencies lead them to wrong conclusions, to increased time to resolution, to a lot of frustration and issues, and of course to lost money.

So now I am going to talk to you about Netdata. Netdata is an open source tool that provides opinionated observability. The idea of Netdata was to solve all these problems we saw so far. Let me tell you a few things about it. The first is that it was born out of a need. I needed the damn thing. I had issues with some infrastructure and I could not solve them; I spent quite some time, and the problems remained. So after several months of frustration, lost money, and lost effort, I decided I should do something about it.
Since the monitoring tools that existed were not sufficient for this kind of job, we needed a new monitoring tool. Initially it was out of curiosity. I said, okay, let's see if we can build something, because I could not believe that all the monitoring tools had this design by accident. I thought, okay, maybe they tried different ways and this was the only one that actually worked. So it was curiosity: why didn't they do monitoring in real time? Why don't they ingest all the metrics? Why is cardinality such a big problem? Why don't monitoring systems work out of the box, predefined with the dashboards and the alerts we all need? So I started building it, and after a couple of years I released it on GitHub. It was on GitHub from the first day, but at some point I actually pressed the button to release it. Nothing happened; it was a funny story. Then one day I posted it on Reddit and boom, it skyrocketed.

The Netdata way says that we all have a lot in common. My infrastructure, your infrastructure, their infrastructure: we are using the same components, the same parts of a Lego. We use the same or similar physical and virtual hardware and similar operating systems. These are finite sets of things, not infinite. The combinations are infinite: we can combine these building blocks the way we see fit, but the building blocks themselves are pretty much packaged today. Even for custom applications, applications we build ourselves, in most cases we use standard libraries, and these standard libraries expose their telemetry in a standard, predictable, and expected way. So even for custom applications this becomes increasingly true. As time passes, even custom applications will be completely packaged.
They will provide a finite, predictable, and expected set of metrics that we can all consume to tell whether the application is healthy or not. Of course, there will always be some custom metrics. But we have a lot in common.

The next thing is that I wanted high fidelity monitoring. High fidelity monitoring means everything, collected every second, like the console tools. My original idea was to kill the console for troubleshooting. What most monitoring tools provide is a helicopter view: you can see that the road is congested, you can see the dive or the spike, but you cannot actually see what is happening and why. That is a helicopter view. It happens mainly because everyone, including commercial providers, including companies that provide monitoring solutions as a service, wants to ingest the minimum. It is expensive to ingest all the information, so they prefer to select which information to ingest and maintain, so that you can have the helicopter view you need. But when it comes to the actual details of what is happening and why, they have no information to help, and this is where the console tools usually come in. With Netdata, I wanted to kill the console for monitoring: every metric that you can find in the console should be available in the monitoring system, without exceptions, even if it is extremely rare or not commonly used.

The next thing is that I wanted all the metrics visualized. Come on, it is an Nginx, it is a Postgres database; I don't want to configure visualization for a packaged application myself. It exposes workload and errors, tables and index performance, and whatever else. I want this visualization to come out of the box.
The next thing is that I did not want to learn a new query language. I wanted all the controls required directly on the dashboard, to slice and dice the data the way I see fit. So in Netdata we added a nice ribbon above every chart that allows you to filter the data, to slice it by label, node, instance, whatever is there, and also to group it differently. It is like a cube: you can see the different aspects of the cube.

And of course we added unsupervised anomaly detection. This is an innovation we have among all the other monitoring solutions, mainly because our anomaly detection happens for all metrics, unconditionally, and it is totally unsupervised: you do not need to provide feedback to it. That is what unsupervised means; you do not have to train it yourself. It is also trained at the edge. We do not train somewhere what a "good Postgres" means and then give you the model to apply to your database; that would be full of false positives, mainly because in monitoring the workload, the actual queries you send to a database server, determines what the metrics will do. It is impossible to share learned models. The only viable solution is to train models for each metric individually, out of the box.

For alerts, what I wanted is predefined alerts that attach themselves automatically: they see a network interface, they attach to it; they see a disk, disk alerts attach to it; they see hardware sensors, a Postgres database, a web server, they attach to them automatically. Netdata today ships with hundreds of alerts. We actually counted a few days ago: it is 344 distinct alerts, and they are all dynamic. There are no fixed thresholds; they all use rolling windows, et cetera.
So they compare different parts of the metric to understand whether there is a big spike, a big dive, an attack, or something wrong. And of course there are plenty of alarms that simply count errors: things where, mathematically, even a single occurrence is an error condition users need to be alerted about.

I also wanted Netdata to be installable mid-crisis. You have a crisis and you never installed Netdata before: you can install Netdata right there, while the thing is happening, and Netdata will be able to help you figure it out. You will not have the help of the anomaly detection, because that needs time to learn what is normal, but Netdata has a lot of additional tools on top of anomaly detection that will help you identify what is wrong, correlate what is happening, find similarities in metrics, et cetera. It will also allow you to explore the journal logs directly on the servers.

All of this is about removing the monitoring system's internals from the users' focus and putting that knowledge into the tool instead. Take Prometheus and Grafana, for example: when you install them, they are blank; they cannot do anything yet. They are a great database server and a great visualization engine, but they are of no use until you go through the process of configuring them, setting them up, pushing metrics to them, et cetera. With Netdata the story is quite different. Netdata knows the metrics already when we ship it: it knows how to collect CPU metrics, memory metrics, container metrics, database metrics. It comes with preconfigured alerts, and it knows how to visualize and correlate the metrics into meaningful visualizations. The idea is that Netdata is monitoring out of the box, ready to be used.
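The "rolling windows instead of fixed thresholds" idea can be sketched like this. This is my own simplified illustration, not Netdata's actual alert logic: the alert compares a short recent window against a longer baseline window, so the threshold is derived from the data itself.

```python
from statistics import mean, stdev

def rolling_window_alert(samples, baseline_len=60, recent_len=5, sigmas=3.0):
    """Return True when the recent window deviates strongly from the baseline.

    samples: list of per-second values, oldest first. The threshold is
    mean + N standard deviations of the baseline window, so there is
    no fixed, hand-picked threshold anywhere.
    """
    if len(samples) < baseline_len + recent_len:
        return False  # not enough history yet
    baseline = samples[-(baseline_len + recent_len):-recent_len]
    recent = samples[-recent_len:]
    mu, sigma = mean(baseline), stdev(baseline)
    # guard against a perfectly flat baseline (stdev == 0)
    threshold = mu + sigmas * max(sigma, 1e-9)
    return mean(recent) > threshold

# steady traffic followed by a sudden spike
history = [100.0] * 60 + [100.0, 500.0, 520.0, 510.0, 505.0]
print(rolling_window_alert(history))  # True: the spike is detected
```

A real alert would of course also handle dives, seasonality, and missing data, but the core point stands: the alert adapts to each metric's own history.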
The internals, what is happening there and why, how to convert a counter to a rate, all this kind of stuff is already baked into the tool for each metric individually, including the monitoring of each component individually. For Netdata, when you have a disk, it is an object: it has its metrics attached to it and its alerts attached to it. We monitor infrastructure bottom-up. We do not take the helicopter view; we dive as deep as we can, we monitor and set alerts at that level, and we build up from there.

Netdata has a lot of innovations, even in deployment. Netdata is open source software, as I said; you can use it for free. The software you get is monitoring in a box. The moment you install the Netdata agent, it is not an agent in the Prometheus sense; it is not the same as an exporter. A Netdata agent is the exporters, the time series database, the visualization engine, and the alert manager, plus machine learning and the like, including logs, everything combined into one application. This application is modular. You can use it by itself: install it on one server and use it on that server. It has an API, it has a dashboard; you can see the dashboard there, explore the metrics, et cetera. Or, if you have a number of servers, you install Netdata on all of them and then use our SaaS offering to have a combined view of them all. If you do not want that, you can use the same software, the Netdata agent, as a parent. The parent receives all the metrics in real time from all the servers; all the other servers stream to it in real time.
This server can now take over all the functions for the others: it can alert for them, run anomaly detection for them, provide visualization for them, everything required. This allows you to offload the other servers, since these extra features take some CPU and some disk space; you can offload them onto the parent if you want. And it is infinitely scalable: you can have different data centers, or many places all over the world where you have infrastructure, put a parent in each, and then use Netdata Cloud to aggregate the parents, or add a grandparent, or a grand-grand-grandparent. You can scale it as you see fit, as your infrastructure grows.

The whole point is that Netdata is a lot faster than, for example, Prometheus, and requires a lot fewer resources. We stress-tested Netdata against Prometheus: we set up 500 servers with 40,000 containers, about 2.7 million metrics collected every second, and we configured Prometheus to collect all these metrics per second as well. Then we measured the resources required by Netdata and by Prometheus. The result: 35% less CPU utilization, so about one third less; half the memory of Prometheus; about 12% less bandwidth; and 98% less disk I/O. This is because we do not need a WAL (write-ahead log), so Netdata does not write to disk all the time. We rely on streaming and replication for high availability. Each of the parents can be a cluster; it is very easy, you just set them up in a loop. You can have three parents in a cluster, four parents in a cluster, all of them in a loop. The idea is that instead of committing data to disk and trying to make every single server sustain failures, we rely on replication and streaming to make sure we will not lose data in case of failures. So, 98% less disk I/O, and there are gains on retention too.
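For a flavor of how simple the parent-child streaming setup is, a child's `stream.conf` points at the parent with a shared API key, and the parent enables that key. The hostname and key below are placeholders; check the current Netdata streaming documentation for the authoritative options:

```ini
; on each child node: stream.conf
[stream]
    enabled = yes
    destination = parent.example.com:19999
    api key = 11111111-2222-3333-4444-555555555555

; on the parent node: stream.conf
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```

With this in place, the child streams its metrics to the parent in real time, and the parent can take over alerting, ML, and retention for it.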
You see we claim 75% less storage footprint for Netdata there, but actually it is a lot more. The key characteristic of Netdata here is that it can downsample data as time passes. It has tiers, up to five of them; we ship it with three, but you can configure up to five, and data is downsampled from tier to tier.

Netdata is also one of the most energy-efficient platforms out there. Last month a university published a study we did not know about; we saw it when it was published. It said that Netdata excels in energy efficiency, that it is the most energy-efficient tool, and that it excels in CPU usage, RAM usage, and execution time when monitoring Docker containers; the whole study was about Docker containers.

Let's move on to AI, to artificial intelligence: what is there, what happens there? In 2019, Todd Underwood from Google gave a talk saying that, effectively, all the ML ideas Google engineers had for operations were bad: they could not deliver the expected outcome. The engineers set goals and tried, but the goals were impossible to achieve. So "all the ML ideas are bad, and we should also feel bad", as Todd says.

In Netdata we do have ML. The first goal of machine learning in Netdata is to learn the patterns of the metrics: can we understand the pattern of a metric so that the next time we collect a sample, we can reliably tell whether it is an outlier or not? And we wanted this unsupervised, with no feedback given to the training. We train at the edge, or as close to the edge as possible; you can train at the parents if you want. The whole point was to have a way to detect whether a just-collected sample is anomalous or not, and I think we have achieved that.
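To make the tiering idea concrete, here is a minimal sketch of my own (not Netdata's actual storage engine) of downsampling a high-resolution tier into a coarser one by averaging fixed windows:

```python
def downsample(samples, factor):
    """Average consecutive windows of `factor` samples into one sample.

    E.g. per-second data with factor=60 becomes per-minute averages,
    so each higher tier stores `factor` times fewer points.
    """
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, len(samples) - factor + 1, factor)
    ]

tier0 = [float(x % 60) for x in range(3600)]   # one hour of per-second samples
tier1 = downsample(tier0, 60)                  # per-minute averages (60 points)
tier2 = downsample(tier1, 60)                  # per-hour average (1 point)
print(len(tier0), len(tier1), len(tier2))      # 3600 60 1
```

A production engine would keep min/max/count alongside the average so spikes survive downsampling, but the storage saving per tier is the same idea.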
Netdata trains 18 models per metric, learning the behavior of each metric individually over the last 57 hours, about two and a half days. It detects anomalies in real time, and all 18 machine learning models need to agree. This is how we remove the noise from ML, because ML is noisy; it has false positives. All 18 models have to agree that a sample is an outlier before we say it is an outlier. We store the anomaly rate together with the collected data in the time series, so the anomaly rate is an additional time series for every other time series; it is like having every time series twice, once for the values and once for whether they were anomalous. We also calculate a host-level anomaly score, and we will see how that is used.

One of the innovations in Netdata is the scoring engine. The scoring engine scores the metrics: given some parameters, it tries to understand which metrics, out of the thousands or millions available, are the most relevant to the query we make. It can score based on two windows, to find the rate of change; it can score based on the anomaly rate, so which metrics were the most anomalous from this time to that time; and we also have metric correlations, which try to correlate metrics together by similarity or by volume.

Now let's see how this appears on a Netdata dashboard. A Netdata dashboard is a single dashboard, one chart below the other, infinitely scrolling, hundreds of charts. There is a menu that groups everything into sections, so you can quickly jump from section to section and see the charts. This dashboard is fully automated; you do not have to do anything about it. Of course, if you want to cherry-pick charts, build custom dashboards, and change the visualization, all of that is there.
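The "all 18 models must agree" consensus can be sketched as follows. Netdata actually uses k-means models trained over different windows; this simplified stand-in uses trivial z-score models over windows of different lengths, purely to illustrate how requiring unanimity suppresses false positives:

```python
from statistics import mean, stdev

def zscore_model(history, window):
    """Build a trivial 'model' from the last `window` samples."""
    data = history[-window:]
    mu = mean(data)
    sigma = max(stdev(data), 1e-9)  # avoid division by zero on flat data
    return lambda x: abs(x - mu) / sigma > 3.0  # True -> this model flags an outlier

def is_anomalous(history, sample, windows):
    """A sample is anomalous only if ALL models agree it is an outlier."""
    models = [zscore_model(history, w) for w in windows]
    return all(model(sample) for model in models)

history = [10.0, 11.0, 9.0, 10.5, 9.5] * 40   # normal-looking readings
windows = [20, 50, 100, 200]                  # stand-in for 18 training windows
print(is_anomalous(history, 10.2, windows))   # False: every model sees it as normal
print(is_anomalous(history, 80.0, windows))   # True: every model flags it
```

A borderline value that only a few short-window models would flag gets vetoed by the longer-window models, which is exactly the noise-removal effect described above.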
But the whole point is that we wanted every metric to be visualized in a fully automated way. Each chart, as we will see, has a number of controls; the charts in Netdata are a little different from the others. The first thing: when you are on this dashboard, in this case with almost 600 charts (it shows the count right there), you can press this AR button, the anomaly rate button. What Netdata will do is fill the sections with their anomaly rates: for the duration selected in the date-time picker, it will score all the metrics and figure out the anomaly rate for each chart, and then for each section. This allows you to quickly spot problems. If you have a spike or a dive on a web server and you do not even know what is wrong, you can just hit that AR button and Netdata will tell you where the anomalies are. You will immediately see that, oh, my database server, my storage layer, or some application is doing something, has a crash or something.

Now, this is a Netdata chart. A Netdata chart looks like all the charts out there, but not quite. The first thing you will notice is the anomaly ribbon. This purple ribbon at the top indicates the anomaly rate of all the metrics included in the chart. In this case the metrics come from seven nodes and 115 applications, and there are 33 labels. The combined anomaly rate of all of them together is visualized there. Of course, you can also get individual anomaly rates: you click on the nodes, a modal comes up listing the seven nodes one by one, showing how many instances and how many components.
If the chart is about disks, these are disks; if it is about applications, these are processes, as in this case. It shows how many applications there are, how many metrics are available, and the relative volume, because in the chart some of them contribute more than others. There is a sorting by volume and a sorting by anomaly rate. If there are alerts related to, in this case, the context switches of applications, you would see them here. And you can see the minimum, average, and maximum value per node for this kind of data. The same goes for applications, dimensions, or labels: a similar modal shows all applications in a list with the volume, the anomaly rate, the minimum, average, and maximum, et cetera. Sorry, I didn't tell you: you can also filter from here. If you want to include or exclude something, you just include or exclude it and the chart changes automatically. Similarly, you can use the group-by feature to change how the data are grouped: group by node, by application, by dimension (in this case reads or writes, or whatever it is), by any label that is there, and actually by combinations: two labels, three labels, nodes and labels. You can do all the combinations, group the data, and see its different aspects from this menu, and this menu is standard on every chart.

Then we have the Anomaly Advisor. The Anomaly Advisor is a tool we developed to find the needle in the haystack. You have a problem; there is an anomaly. We saw the AR button that shows the anomaly rate per section. But how can I find the individual metrics, the most anomalous metrics, for a given time frame, current or past? For that we use the host anomaly rate, which looks like this.
So this is a chart, in percent, and it shows for each host the number of the node's metrics that are concurrently anomalous. You can see that when you have anomalies, they are widespread: a lot of metrics in that node become anomalous at once. If there is stress on the database server, disk I/O will have increased, CPU usage will be a lot higher, and the network interface will probably carry a lot more bandwidth. All this, combined with all the individual metrics that we check, like the containers, the page faults, how much memory each process allocates, the context switches, even how interrupts are affected in the system, all this information comes together and is aggregated here, so you see a huge spike when something anomalous happens. Now when something anomalous happens like this, what you can do is highlight this area: there is a toolbox with which you can highlight it, and Netdata will immediately give you a list of all the metrics sorted by relevance for that highlighted window. For that window it goes through all the metrics, no matter how many there are, scores them according to their anomaly rate, sorts them, and gives you the sorted list. The whole point is that within the top ten or twenty items you should have your aha moment: "oh, someone ssh'd into this server", or "oh, we have TCP resets, something is broken somewhere else and this doesn't play". Now, the way I see it, the way I understand it, is that we really have a lot in common. Our infrastructures under the hood are quite similar. We all deserve real-time, high-fidelity monitoring, and solutions like Netdata keep up to this promise.
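The two mechanisms just described, the host anomaly rate and the scoring of metrics inside a highlighted window, can be sketched roughly like this. The metric names, data, and functions are assumptions for illustration only:

```python
# Sketch (not Netdata's actual code) of:
# 1) host anomaly rate: percentage of metrics concurrently anomalous per tick
# 2) scoring: metrics sorted by anomaly rate inside a highlighted window
metrics = {  # per-metric 0/1 anomaly flags, one per second (invented data)
    "disk.io":       [0, 1, 1, 1],
    "cpu.iowait":    [0, 1, 1, 0],
    "net.bandwidth": [0, 0, 1, 0],
}

def host_anomaly_rate(metrics):
    """Per-tick percentage of the host's metrics that are anomalous."""
    ticks = zip(*metrics.values())
    return [round(100.0 * sum(t) / len(t), 1) for t in ticks]

def score_metrics(metrics, start, end):
    """Metrics sorted by anomaly rate within the highlighted [start, end) window."""
    scored = {m: sum(f[start:end]) / (end - start) for m, f in metrics.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(host_anomaly_rate(metrics))    # [0.0, 66.7, 100.0, 33.3]
print(score_metrics(metrics, 1, 4))  # disk.io first, then cpu.iowait, ...
```

The sorted list is what puts the most relevant metrics in the top few items, where the "aha moment" is supposed to happen.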
So we spread Netdata like this, in a distributed fashion, mainly to avoid the bottlenecks that all other monitoring solutions face. Netdata should scale better than anything else. Netdata Cloud, for example, our SaaS offering, today works at less than 1% of its capacity with 100,000 connected nodes, and it is just a small Kubernetes cluster, a few nodes actually. The idea is that we want monitoring to be high resolution, high fidelity, real time. We open sourced everything: Netdata is a gift to the world, and even the advanced machine learning techniques, everything we do, all our innovations in observability, are baked into the open source agent. And whether you view one agent or a parent with 2 million metrics, the dashboard is the same; we don't change dashboards. Netdata Cloud has exactly the same dashboard as the agent. And monitoring, to our understanding, should be simple: easy to use, easy to maintain. Netdata is maintenance free; it doesn't require anything. Of course there are a few things to learn, how the tool behaves, how to do streaming, how to build a parent, but there is nothing to maintain, no indexes; most things are zero configuration and work out of the box. At the same time, we believe a monitoring tool should be powerful at the hands of experts but also a strong educational tool for newcomers. People should be using tools like Netdata to learn, to troubleshoot, to understand the infrastructure, to feel the pulse of the infrastructure. And at the same time we are trying to optimize it all over the place. We want Netdata to be a thin layer compared to the infrastructure it monitors. It should never become huge.
This is why we wanted it to spread over the infrastructure, to utilize the resources that are already available. Netdata on a single node requires just 5% utilization of a single CPU core and about 100 megabytes of RAM. So we want it to be extremely thin compared to the whole thing, so that it can be affordable for everyone. Thank you very much for watching, try Netdata, and see you online.

Costa Tsaousis

Founder & CEO @ Netdata
