Conf42 Rustlang 2025 - Online

- premiere 5PM GMT

Beyond Static Dashboards: How Rust-Powered Dynamic UIs Transform Cloud Infrastructure Management


Abstract

Static dashboards are killing your uptime! Discover how Rust-powered dynamic UIs slash incident response by 63% and prevent 73% of critical failures. Real data, real results, real-time monitoring that transforms cloud chaos into operational excellence.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Tashi Garg, and I'm from Juniper Networks. Today we are going to address a very serious problem in cloud infrastructure management: the hidden limitations of static dashboards. Dashboards are supposed to be our eyes into the system, but in reality many of them operate on a delay. That delay creates blind spots, and those blind spots cost uptime, revenue, and trust. So in this talk I'll explain why this happens, how Rust helps us eliminate those blind spots with predictable low-latency monitoring, and what happens when organizations make that shift.

Let's see the agenda for the day. We will start by looking at the operational impact of static dashboards. Then I'll explain why Rust, with its concurrency model and deterministic performance, is ideal for real-time telemetry processing. We'll also explore some real-world results, including a large-scale case study, and we'll end with a phased roadmap for adoption.

So let's begin with the hidden cost of static dashboards. Most static dashboards rely on polling, collecting data at a fixed interval. It could be 30 seconds, it could be one minute; it depends on the widget and the dashboard being deployed. That means events can happen well before they appear on the UI. This is called data staleness. We have measured this across environments: 67% of network operators report detection delays averaging almost 8.5 minutes from incident onset to first visibility for the operator. And because most dashboards use binary, threshold-based health checks, 73% of anomalies, like intermittent packet loss or degraded latency, are never flagged at all. When we combine staleness with these missed anomalies, we end up with very dangerous blind spots.

This is what we call the reactive monitoring trap. What happens when these blind spots occur? They force teams into a reactive posture, responding only after the incident occurs, when the issue is already impacting customers. Statistically, reactive monitoring correlates with 2.8 times more service disruptions and 7.4 hours of unplanned downtime per month in production systems. In contrast, event-driven real-time dashboards, the kind we are talking about today, cut that downtime to 2.1 hours by detecting issues as they happen. To get there, though, we need a system that can process massive telemetry volumes in real time with consistent low latency, and that's where Rust fits in.

So how does Rust change everything? When we look at a platform for real-time monitoring, two things are non-negotiable: predictable low latency and reliability under load, and Rust delivers both. To call out the points: zero-cost abstractions let us push high-throughput data processing without runtime overhead. Memory safety without garbage collection means there are no unpredictable GC pauses when there is a telemetry spike. The ownership model and borrow checker prevent race conditions in concurrent workloads. Compile-time guarantees eliminate entire classes of runtime errors, like null pointer dereferences. This combination makes Rust uniquely well suited for latency-sensitive workloads like container monitoring.
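The talk itself doesn't include code, but as a minimal sketch of the ownership and borrow-checker point above (all names and numbers here are illustrative, not from the talk): several worker threads update one shared "data points processed" counter, and the program only compiles because the shared state is wrapped in an Arc around an atomic. Handing an unsynchronized mutable reference to multiple threads would be rejected at compile time, which is the class of race condition the speaker is referring to.

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use std::thread;

fn main() {
    // Shared counter for "telemetry points processed".
    // Sharing a plain `&mut u64` across threads would be rejected by the
    // borrow checker; Arc<AtomicU64> is one safe alternative.
    let processed = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let processed = Arc::clone(&processed);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // Simulate handling one telemetry data point.
                    processed.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().expect("worker thread panicked");
    }

    println!("processed {} data points", processed.load(Ordering::Relaxed));
}
```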
Now, the Rust ecosystem. The language itself is very powerful, but the ecosystem is what makes it completely production ready. To get to the production stage we can use Tokio as the asynchronous runtime, optimized for millions of concurrent I/O-bound tasks, which is perfect for telemetry ingestion; Serde for zero-copy serialization and deserialization of massive JSON, Protobuf, or even Avro payloads; Actix as a low-latency web framework that can deliver live updates to the dashboards; and Crossbeam for safe concurrency primitives for multi-core scaling of the telemetry processing. With this stack we can process millions of data points per second while keeping latency under control.

So let's see the transformational results when this kind of ecosystem was deployed. In a real-world deployment where static dashboards were replaced with a Rust-powered, event-driven system, we saw almost a 63% reduction in MTTR, from 52 minutes down to 19 minutes; a 37% improvement in critical incident response; 44% fewer operator errors, thanks to instant feedback; and 31% less downtime year over year. These aren't hypothetical improvements; they come directly from the production environments where these deployments were done.

Going forward, let's take a case study: a global financial services provider with more than 12,000 instances running across multiple cloud providers. The problems they faced were an average detection time of roughly 11 minutes and 9.4 hours of downtime per month. After migrating to a Rust-based real-time monitoring backend, they processed 3.2 million telemetry data points per second with less than five milliseconds of pipeline latency, anomalies were detected 65% faster, incident detection was reduced from 11 minutes to 47 seconds, and downtime dropped from 9.4 hours to almost 2.3 hours per month. That was a huge difference, and that's the operational difference real-time monitoring makes in practice.

Along with Rust, there are other things that can be done. There are a lot of emerging trends these days, and once we have real-time telemetry, powerful capabilities can be added on top of it. One is AI-driven anomaly detection: ML models can be layered on top to catch deviations that are invisible to static thresholds. NLP-based query interfaces can be added, so that questions asked in plain language are translated into structured telemetry searches. And on the UI side, graph-based visualization and interactive maps of dependencies can cut diagnosis time by over 40%. All of these can be added on top, but the prerequisite is a low-latency, reliable telemetry pipeline, which Rust already enables.

So let's see how we would implement this. A typical Rust-powered monitoring stack may include: an agent layer of lightweight Rust binaries, maybe less than 15 MB and under 2% CPU use, sending telemetry over gRPC or WebSockets; streaming pipelines doing async processing with multi-core partitioning; and a real-time API layer that delivers updates to the dashboard in less than 50 milliseconds.
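As a hedged illustration of the agent-layer pattern described above (Tokio runtime, bounded channel, batched shipping), here is a minimal sketch. The metric name, batch size, and JSON-over-stdout "transport" are placeholders I've assumed; a real agent would hand the batch to a gRPC or WebSocket client instead of printing it.

```rust
// Illustrative only. Assumed Cargo dependencies:
//   tokio = { version = "1", features = ["full"] }
//   serde = { version = "1", features = ["derive"] }
//   serde_json = "1"
use serde::Serialize;
use tokio::sync::mpsc;
use tokio::time::{interval, Duration};

#[derive(Debug, Serialize)]
struct Sample {
    metric: &'static str,
    value: f64,
    ts_ms: u64,
}

#[tokio::main]
async fn main() {
    // Bounded channel: backpressure instead of unbounded memory growth
    // when the upstream is slow.
    let (tx, mut rx) = mpsc::channel::<Sample>(1024);

    // Collector task: pretend to sample a metric every 100 ms.
    tokio::spawn(async move {
        let mut tick = interval(Duration::from_millis(100));
        loop {
            tick.tick().await;
            let sample = Sample {
                metric: "cpu_util",
                value: 0.42, // placeholder reading
                ts_ms: std::time::SystemTime::now()
                    .duration_since(std::time::UNIX_EPOCH)
                    .unwrap()
                    .as_millis() as u64,
            };
            if tx.send(sample).await.is_err() {
                break; // receiver dropped, shut down the collector
            }
        }
    });

    // Shipper task: batch samples and flush them upstream.
    // Printing JSON stands in for a gRPC/WebSocket client here.
    let mut batch = Vec::with_capacity(64);
    while let Some(sample) = rx.recv().await {
        batch.push(sample);
        if batch.len() >= 64 {
            let payload = serde_json::to_string(&batch).expect("serialize batch");
            println!("shipping batch: {} bytes", payload.len());
            batch.clear();
        }
    }
}
```

The bounded channel is the design point worth noting: if the upstream stalls, the collector blocks on send rather than letting memory grow without limit, which keeps the agent within the small footprint the talk describes.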
And then the reactive UI: instead of polling, it receives events from WebSocket- or SSE-based dashboards that update automatically. Now, while adopting this stack, there are three common challenges. One is integrating with legacy environments; that can be solved with FFI or gRPC bridges so that you don't have to replace everything at once. There is definitely a skill gap; for that, start with one self-contained component to build confidence. And then there is data volume; filter and aggregate at the agent before sending data upstream. These challenges are real, but manageable if the rollout is done in stages.

Considering the roadmap, a phased approach is always recommended. The first month can be dedicated to auditing all the blind spots and defining baseline MTTRs and downtime. Then a couple of months to deploy agents in staging and validate latency, throughput, and detection accuracy. After that, maybe a quarter for rolling out to production and integrating with incident response tools. Beyond that, time can be dedicated to the emerging-trend integrations, like adding ML anomaly detection and domain-specific visualization. Doing it in phases keeps the risk low while proving value early.

The key takeaways: static dashboards create dangerous visibility gaps due to polling delays and limited anomaly detection. Rust's deterministic performance, memory safety, and concurrency model make it ideal for building real-time monitoring pipelines in production. This translates to faster detection, fewer errors, and significantly lower downtime. So Rust-powered dashboards aren't just, I would say, a tech upgrade; they are more of an operational shift. They let teams move from reacting to protecting, from firefighting to preventing. Once you experience detecting incidents in under a minute, there won't be any going back. That's all from my side today. Thank you.
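One concrete point from the adoption challenges above is filtering and aggregating at the agent before sending data upstream. As a minimal, hypothetical sketch (the type name, threshold, and sample values are my own, not from the talk), a delta filter like the following drops steady-state samples so only meaningful changes leave the host:

```rust
/// Emit a sample only when it moves more than `threshold` away from the
/// last value actually reported, dropping steady-state noise at the agent.
struct DeltaFilter {
    threshold: f64,
    last_reported: Option<f64>,
}

impl DeltaFilter {
    fn new(threshold: f64) -> Self {
        Self { threshold, last_reported: None }
    }

    /// Returns Some(value) if the sample should be forwarded upstream.
    fn observe(&mut self, value: f64) -> Option<f64> {
        match self.last_reported {
            Some(prev) if (value - prev).abs() < self.threshold => None,
            _ => {
                self.last_reported = Some(value);
                Some(value)
            }
        }
    }
}

fn main() {
    let mut filter = DeltaFilter::new(0.05);
    let raw = [0.20, 0.21, 0.22, 0.40, 0.41, 0.90, 0.89];

    // Only the meaningful changes get sent upstream.
    let forwarded: Vec<f64> = raw.iter().filter_map(|&v| filter.observe(v)).collect();

    println!("raw samples: {:?}", raw);
    println!("forwarded:   {:?}", forwarded); // [0.2, 0.4, 0.9]
}
```

In practice an aggregation window (for example, a rolling average or percentile per interval) would sit alongside a filter like this, but the idea is the same: cut the data volume before it reaches the streaming pipeline.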
...

Tashi Garg

Senior Staff Software Engineer @ Juniper Networks

Tashi Garg's LinkedIn account


