Conf42 Site Reliability Engineering 2022 - Online

Kubernetes monitoring - why it is difficult and how to improve it

The popularity of Kubernetes changed the way people deploy and run software. It also brought additional complexity: Kubernetes itself, microservice architecture and short release cycles all became a challenge for monitoring systems. The truth is, the adoption and popularity of Kubernetes had a severe impact on the monitoring ecosystem, on its design and tradeoffs.

The talk covers the monitoring challenges of operating Kubernetes, such as increased metrics volume, service ephemerality, pod churn and distributed tracing, and explains how modern monitoring solutions are designed specifically to address these challenges, and at what cost.


  • Aliaksandr Valialkin: Kubernetes exposes huge amounts of metrics on itself. Many users of Kubernetes struggle with its complexity and monitoring issues. He explains why monitoring is difficult and how to improve it.
  • There are no established standards for metrics at the moment. The community and different companies try to invent their own standards and promote them. This leads to big amounts of metrics in every application, and to outdated Kubernetes dashboards for Grafana. New entities like distributed traces need to be invented.
  • Kubernetes increases the complexity and metrics footprint of current monitoring solutions. The main complexities are a high churn rate of active time series and huge volumes of metrics at each layer. VictoriaMetrics believes there must be a standard for Kubernetes monitoring.


This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, today I will talk about Kubernetes monitoring: why it is difficult and how to improve it. Let's meet: I'm Aliaksandr Valialkin, VictoriaMetrics founder and core developer. I'm also known as a Go contributor and the author of popular Go libraries such as fasthttp, fastcache and quicktemplate. As you can see, the names of these libraries start with "fast" and "quick" prefixes. This means these libraries are quite fast, so I'm fond of performance optimizations. What is VictoriaMetrics? It is a time series database and monitoring solution. It is open source, it is simple to set up and operate, it is cost efficient, highly scalable, and it is cloud ready. We provide Helm charts and an operator for running VictoriaMetrics in Kubernetes. Now, according to recent surveys, the amount of monitoring data grows two to three times faster than the amount of actual application data, and this is not so good. For instance, some people on Twitter have noticed this and point out that the cost of storing monitoring data grows much faster than the cost of storing application data. According to the recent CNCF survey, many users of Kubernetes struggle with its complexity and monitoring issues: 27% of respondents are unhappy with the state of monitoring in Kubernetes. So why is Kubernetes monitoring so challenging? The first thing is that Kubernetes exposes big amounts of metrics on itself. You can follow this link and see how many metrics Kubernetes components expose, and the number of exposed metrics grows over time. Let's look at this graph: it shows that the number of unique metric names exposed by Kubernetes components has grown from about 150 in Kubernetes 1.10, released in 2018, to more than 500 in Kubernetes 1.24, which was released recently.
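To make the "unique metric names" figure concrete, here is a minimal sketch of how such names can be counted from a component's Prometheus /metrics output. The sample payload is illustrative, not real kube-apiserver output:

```python
# Count unique metric names in Prometheus text exposition format.
# Note that several time series (distinct label sets) can share one name.
sample = """\
# HELP apiserver_request_total Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{verb="GET",code="200"} 1027
apiserver_request_total{verb="POST",code="201"} 3
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.5
"""

def unique_metric_names(exposition: str) -> set:
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The metric name ends at '{' (if labels follow) or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names

print(len(unique_metric_names(sample)))  # 2 unique names across 3 series
```

Running a script like this against each release's components is essentially how such growth graphs are built.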
The number of unique metrics exposed by applications grows not only in Kubernetes components but in any application. For instance, node_exporter, a component which is commonly used in Kubernetes for monitoring hardware, also keeps increasing its number of unique metrics: it went from around 100 to more than 400 over the last five years. Every Kubernetes node exposes at least 2500 time series, and this doesn't count application metrics. These series include node_exporter metrics, kubelet metrics and cAdvisor metrics. According to our study, the average number of such metrics per Kubernetes node is around 4000. So if you have 1000 nodes, your Kubernetes cluster exposes 4 million metrics which should be collected. What is the source of such big amounts of metrics? It is the multilayer architecture of modern systems. Let's look at this picture: a hardware server contains virtual machines, each virtual machine contains pods, and each pod contains application containers. All these layers need some observability, which means they need to expose metrics. And if you run multiple pods and containers in Kubernetes, the number of exposed metrics grows with the number of pods and containers. Let's look at a simple example: when you deploy three replicas of Nginx in Kubernetes, they already generate more than 600 new time series according to cAdvisor, and these don't count application metrics, i.e. the metrics exposed by Nginx itself. Another issue with Kubernetes monitoring is time series churn, when old series are substituted by new ones. Monitoring solutions don't like a high churn rate because it leads to memory usage issues and CPU time issues.
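The cluster-wide figure above is simple arithmetic; a back-of-the-envelope sketch, using the per-node average from the talk:

```python
# Estimate cluster-wide active time series from per-node averages.
# 4000 is the talk's observed average (node_exporter + kubelet + cAdvisor);
# it does not include application-level metrics.
SERIES_PER_NODE = 4000
nodes = 1000

total_series = SERIES_PER_NODE * nodes
print(total_series)  # 4000000 series the monitoring system must collect
```

Application metrics come on top of this, so real clusters usually expose even more.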
Kubernetes tends to generate a high churn rate for active time series because of two things. The first is frequent deployments: when you roll out a new deployment, a new set of metrics is generated, because every such metric usually contains a pod label, and the pod name is generated automatically by Kubernetes. Another source of high churn rate is pod autoscaling events: when pods scale, new pod names appear, and metrics for these new pods must be registered in the monitoring system, which generates high churn. The number of new metrics generated by each deployment or autoscaling event can be estimated as the number of container-level metrics per instance plus the number of application metrics, multiplied by the number of replicas and by the number of deployments. As you can see, the churn rate grows with the number of deployments and the number of replicas. Do we need all these metrics? The answer is not so easy. Some people say no, we don't need all these metrics, because our monitoring systems use only a small fraction of the collected metrics. Others say yes, we need to collect all of them, because these metrics may be used in the future. How to determine the exact set of needed metrics? There is the mimirtool utility from Grafana, which scans your recording and alerting rules as well as your dashboard queries, decides which metrics are used and which aren't, and then generates an allowlist of used metrics. For instance, Grafana says that a small Kubernetes cluster with three nodes exposes 40,000 active time series by default, and if you run mimirtool and apply the allowlist via relabeling rules, this reduces the number of active time series from 40,000 to 8,000. That is five times less. So what does it mean?
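The churn estimate described above can be written down directly. The numbers plugged in below are illustrative assumptions, not measurements:

```python
def series_churn_per_rollout(container_metrics: int,
                             app_metrics: int,
                             replicas: int,
                             deployments: int) -> int:
    """Estimate new time series created when every deployment is rolled out,
    following the formula from the talk: each replica gets a fresh pod name,
    so all of its container-level and application-level series are re-created
    under new label values."""
    return (container_metrics + app_metrics) * replicas * deployments

# Illustrative numbers: 200 cAdvisor-style container series and 100 app
# series per instance, 3 replicas, 50 deployments in the cluster.
print(series_churn_per_rollout(container_metrics=200, app_metrics=100,
                               replicas=3, deployments=50))  # 45000
```

45,000 fresh series on a single cluster-wide rollout is exactly the kind of spike that stresses a monitoring system's in-memory index.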
It means that existing solutions like the kube-prometheus stack collect too many metrics, and most of them are unused. This chart shows that only 24% of the metrics collected by the kube-prometheus stack are actually used by alerts, recording rules and dashboards, while 76% of the metrics are never used by the current monitoring setup. This means that you can reduce your expenses on monitoring by 76%, that is, more than four times. Let's talk about monitoring standards. Unfortunately, there are no established standards for metrics at the moment. The community and different companies try to invent their own standards and promote them. For instance, Google promotes the Four Golden Signals, Brendan Gregg promotes the USE method, and Weaveworks promotes the RED method. So many different standards lead to the situation where nobody follows a single standard: everybody follows different standards, or doesn't follow any standard at all. This leads to big amounts of metrics in every application, and these metrics change over time. You can read many articles and opinions about the most essential metrics, but there is no single source of truth for monitoring. This also leads to outdated Kubernetes dashboards for Grafana: for instance, the most popular Kubernetes dashboards on Grafana are now outdated. Kubernetes also provokes you to use a microservices architecture, and microservices architecture has its own challenges. Every microservice instance needs to expose its own metrics. Users need to track and correlate events across multiple services. Ephemeral services make the situation even worse.
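One common way to enforce such an allowlist in Prometheus-compatible scrapers is a `keep` rule in `metric_relabel_configs`. This is a minimal sketch; the regex is an illustrative placeholder, and in practice it would be generated from an analysis of your dashboards and rules (e.g. by mimirtool):

```yaml
scrape_configs:
  - job_name: kubelet
    # ... kubernetes_sd_configs and auth settings omitted ...
    metric_relabel_configs:
      # Keep only metrics referenced by dashboards, alerts and recording
      # rules; everything else is dropped before ingestion.
      - source_labels: [__name__]
        regex: "up|kubelet_running_pods|container_cpu_usage_seconds_total"
        action: keep
```

Because `metric_relabel_configs` runs after the scrape but before storage, dropped series never consume memory or disk in the TSDB.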
Ephemerality means that every microservice can be started, redeployed or stopped at any time. Because of this situation, new entities like distributed traces had to be invented and used in order to improve the observability of microservices. Microservices talk to each other over the network, so you need to monitor the networking as well. Service colocation on a single node creates the noisy neighbor problem, and this problem also needs to be resolved. And service mesh introduces yet another layer of complexity which needs to be monitored. How does Kubernetes affect monitoring? As you can see from the previous slides, Kubernetes increases the complexity and metrics footprint of monitoring. Current monitoring solutions such as Prometheus, VictoriaMetrics, Thanos and Cortex are busy overcoming the complexities introduced by Kubernetes. The main complexities are the churn rate of active time series generated by Kubernetes itself and the huge volumes of metrics at each layer. The developers of current monitoring solutions spend big amounts of effort on adapting these tools to Kubernetes. Maybe, if there were no Kubernetes, we wouldn't need distributed traces and exemplars, because distributed traces and exemplars are used almost solely for microservices and Kubernetes. And maybe, if there were no Kubernetes, all this time spent on overcoming difficulties in current monitoring solutions could be invested into more useful observability tools, such as automated anomaly detection or metrics correlation. Who knows? How does Kubernetes itself deal with millions of metrics? The answer is that Kubernetes doesn't provide a good solution. It provides only two flags, which can be used for blacklisting, i.e. disabling, some metrics and restricting label values. That's not a good solution. How does Prometheus deal with Kubernetes challenges?
Actually, Prometheus version 2 was created because of Kubernetes: it needed to solve the Kubernetes challenges of a high number of time series and a high churn rate. You can read the announcement of Prometheus 2.0 in order to understand how the internal architecture of Prometheus was redesigned largely for solving Kubernetes issues. But still, Kubernetes issues such as high churn rate and high cardinality aren't fully solved in Prometheus and other monitoring solutions. VictoriaMetrics also deals with Kubernetes challenges. VictoriaMetrics originally appeared as a system which solves cardinality issues in Prometheus version 1. It is optimized for using lower amounts of memory and disk space when working with high-cardinality series, and it also provides optimizations to overcome the time series churn which is common in Kubernetes. At the same time, we at VictoriaMetrics also don't know how to reduce the number of time series: new versions of VictoriaMetrics increase the number of exposed time series over time. You can see that the number of unique metric names exposed by VictoriaMetrics grew around three times during the last four years, and only 30% of these metrics are actually used by VictoriaMetrics dashboards, alerts and recording rules. How can we improve the situation? We believe that Kubernetes monitoring complexity must be reduced, and the number of exposed metrics must be reduced. The number of histograms must be reduced, because a histogram is the biggest cardinality offender: it generates many new time series. The number of per-metric labels must be reduced: for instance, in Kubernetes it is common practice to attach all the labels defined at pod level to every metric exposed by this pod, and this is probably not correct. We should change this situation. The time series churn rate must be reduced: the most common sources of churn in Kubernetes are horizontal pod autoscaling and deployments.
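The claim that histograms are the biggest cardinality offender follows from how they are exported: a classic Prometheus histogram produces one `_bucket` series per `le` boundary (including `+Inf`), plus `_sum` and `_count`, for every unique label combination. A quick sketch of the arithmetic, with illustrative numbers:

```python
def histogram_series(bucket_series: int, label_combinations: int) -> int:
    """Count time series exported by one classic Prometheus histogram.

    bucket_series counts the `le` buckets including the +Inf bucket;
    _sum and _count add two more series per label combination."""
    return (bucket_series + 2) * label_combinations

# e.g. a latency histogram with 10 buckets, scraped from 20 pods
# across 5 HTTP paths (assumed numbers, for illustration only):
print(histogram_series(bucket_series=10, label_combinations=20 * 5))  # 1200
```

So a single well-labeled latency histogram can cost more series than an entire node's worth of gauges, which is why trimming buckets and labels pays off quickly.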
And we should think hard about how to reduce the churn rate from these sources. We believe that the community will come up with a standard for Kubernetes monitoring which will be much more lightweight and will require collecting far fewer metrics compared to the current state of Kubernetes monitoring. So let's do it together. Now you can ask questions.

Aliaksandr Valialkin

Founder & CTO @ VictoriaMetrics

