Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm Tashi Kirk.
I am from Juniper Networks, and today we are going to address a very serious problem in cloud infrastructure management: the hidden limitations of static dashboards.
Dashboards are supposed to be our eyes into the system, but in
reality so many operate on a delay.
That delay creates blind spots, and those blind spots can cost us uptime, revenue, trust, a lot of things.
So in this talk, I'll explain why this happens, how Rust helps us eliminate those blind spots with predictable low-latency monitoring, and what happens when an organization makes that shift.
So let's see the agenda for the day.
We will start by looking at the operational impact of static dashboards. Then I'll explain why Rust, with its concurrency model and deterministic performance, is ideal for real-time telemetry processing. We'll also explore some real-world results, including a large-scale case study. And we'll end with a phased roadmap for adoption.
So let's begin with the hidden cost of using static dashboards.
Most static dashboards rely on polling, collecting data at a fixed interval. It could be 30 seconds, it could be one minute, depending on the widget being used and the dashboard being deployed. That means events can happen well before they appear on the UI. That is called data staleness.
We have measured that, across environments, 67% of network operators report detection delays averaging almost 8.5 minutes from incident onset to first visibility for the operator.
And because most dashboards use binary or threshold-based health checks, 73% of anomalies, like intermittent packet loss or degraded latency, are never flagged at all. When we combine staleness with these missed anomalies, we end up with very dangerous blind spots.
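A minimal sketch of where that staleness comes from; the 30-second interval and the poll_metric helper are illustrative assumptions, not from any specific product. A polling collector only sees what is true at each tick:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Placeholder for a real collector call (hypothetical helper).
fn poll_metric() -> f64 {
    42.0
}

fn main() {
    let interval = Duration::from_secs(30); // typical widget refresh interval
    loop {
        let tick = Instant::now();
        let value = poll_metric();
        println!("sampled {value} at {tick:?}");
        // Anything that happens during this sleep is invisible until the next
        // tick, so worst-case staleness is roughly one full interval, before
        // any aggregation or rendering delay is added on top.
        sleep(interval);
    }
}
```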
That is what we call the reactive monitoring trap. What happens when these blind spots occur? They force teams into a reactive posture. A reactive posture means you are reacting after the incident occurs, so you respond only when the issue is already impacting customers.
Statistically, reactive monitoring correlates with 2.8 times more service disruptions and 7.4 hours of unplanned downtime per month in production systems. In contrast, event-driven real-time dashboards, the kind we are talking about today, cut that downtime to 2.1 hours by detecting issues as they happen. To get there, though, we need a system that can process massive telemetry volumes in real time with consistent low latency, and that's where Rust fits in.
So how does Rust change everything?
When we look at a platform for real-time monitoring, two things are non-negotiable. One is predictable low latency. The second is high reliability under load. Rust delivers both.
To run through the points: zero-cost abstractions let us push through high-throughput data processing without runtime overhead. Memory safety without garbage collection means there are no unpredictable GC pauses when there is a telemetry spike. The ownership model and borrow checker prevent race conditions in concurrent workloads. Compile-time guarantees eliminate entire classes of runtime errors, like null pointer dereferences. This combination makes Rust uniquely well suited for latency-sensitive workloads like container monitoring.
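A minimal sketch of how the ownership model enforces safe sharing across threads; the packet-counter scenario is an illustrative assumption:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // Shared telemetry counter: Arc gives shared ownership, AtomicU64 gives
    // race-free updates. A plain `&mut u64` shared across threads would be
    // rejected by the borrow checker at compile time.
    let dropped_packets = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&dropped_packets);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(dropped_packets.load(Ordering::Relaxed), 4_000);
}
```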
Now, the Rust ecosystem. The language is very powerful, but the ecosystem is what makes it completely production ready. To get to the production stage, we might use Tokio as the asynchronous runtime, optimized for millions of concurrent I/O-bound tasks, which is perfect for telemetry ingestion; zero-copy serialization and deserialization for massive JSON, Protobuf, or even Avro payloads; Actix or any other low-latency web framework that can deliver live updates to the dashboards; and crossbeam, which provides safe concurrency primitives for multi-core scaling of telemetry processing. With this stack, we can process millions of data points per second while keeping latency under control.
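A minimal sketch of the ingestion side using Tokio channels, assuming the tokio crate with its full feature set; the Point type, source names, and batching are illustrative assumptions:

```rust
use tokio::sync::mpsc;

struct Point {
    source: &'static str,
    value: f64,
}

#[tokio::main]
async fn main() {
    // Bounded channel: producers apply backpressure instead of piling up memory.
    let (tx, mut rx) = mpsc::channel::<Point>(10_000);

    // Simulated agents; in a real pipeline these would be gRPC/WebSocket readers.
    for source in ["router-a", "router-b", "router-c"] {
        let tx = tx.clone();
        tokio::spawn(async move {
            for i in 0..5 {
                let _ = tx.send(Point { source, value: i as f64 }).await;
            }
        });
    }
    drop(tx); // close the channel once all producers finish

    // Single consumer: this is where serialization and anomaly checks would run.
    let mut batch = Vec::new();
    while let Some(point) = rx.recv().await {
        batch.push((point.source, point.value));
    }
    println!("ingested {} points", batch.len());
}
```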
So let's see the transformational results when this kind of ecosystem was deployed. In a real-world deployment where we replaced static dashboards with a Rust-powered, event-driven system, we saw almost a 63% reduction in MTTR, from 52 minutes down to 19 minutes. There was a 37% improvement for critical incidents, 44% fewer operator errors thanks to the instant feedback, and 31% less downtime year over year. And these aren't hypothetical improvements. They come directly from the production environments where these deployments were done.
Going forward, let's take a case study: a global financial services provider with an environment of more than 12,000 instances running across multiple cloud providers. The problems they faced were an average detection time of almost 11 minutes and 9.4 hours of downtime per month.
After migrating to a Rust-based real-time monitoring backend that processes 3.2 million telemetry data points per second with less than five milliseconds of pipeline latency, anomalies were detected 65% faster. Incident detection was reduced from 11 minutes to 47 seconds, and downtime dropped from 9.4 hours to almost 2.3 hours per month. That was a huge difference. That's the operational difference these real-time systems make, seen in the real world.
Along with Rust, there are other things that can be done. There are a lot of emerging trends these days, and once we have real-time telemetry, powerful capabilities can be added on top of it. One is AI-driven anomaly detection: ML models can be added that catch deviations invisible to those static thresholds. NLP-based query interfaces can be added: a natural language query interface can translate questions in plain language into structured telemetry searches. And on the UI, more graph-based visualization and interactive maps of dependencies can cut diagnosis time by over 40%. These can all be added on top, but the prerequisite is a low-latency, reliable telemetry pipeline, which Rust already enables.
So let's see how we want to implement this.
A typical Rust-powered monitoring stack may include an agent layer of lightweight Rust binaries, maybe less than 15 MB and less than 2% CPU use, sending telemetry over gRPC or WebSockets; streaming pipelines doing async processing with multi-core partitioning; a real-time API layer that can deliver updates to the dashboard in less than 50 milliseconds; and then the reactive UI, where instead of polling, WebSocket- or SSE-based dashboards receive events and update automatically.
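A minimal sketch of that push model using a Tokio broadcast channel (assuming the tokio crate); the WebSocket/SSE transport is omitted and the simulated sessions are illustrative assumptions:

```rust
use std::time::Duration;
use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    // Each dashboard session subscribes once and then receives pushes.
    let (tx, _) = broadcast::channel::<String>(1024);

    for session in 1..=2 {
        let mut rx = tx.subscribe();
        tokio::spawn(async move {
            // In a real stack this loop would forward frames over WebSocket/SSE.
            while let Ok(update) = rx.recv().await {
                println!("session {session} received: {update}");
            }
        });
    }

    // The pipeline publishes updates the moment they are computed; no polling.
    for i in 0..3 {
        let _ = tx.send(format!("latency_p99={}ms", 10 + i));
        tokio::time::sleep(Duration::from_millis(50)).await;
    }

    // Give the simulated sessions a moment to drain before the runtime exits.
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```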
Now, while adopting this stack, there will be three common challenges. One is integrating with the legacy environment; that can be solved with Rust's FFI or gRPC bridges, so that you don't have to replace everything at once. There is definitely a skill gap; we can start with one self-contained component to build confidence. And then there is data volume; filter and aggregate at the agent before sending data upstream, as in the sketch below. These challenges are real, but manageable if the rollout is done in stages.
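A minimal sketch of that agent-side aggregation, rolling raw samples into one per-window summary before shipping; the WindowSummary type and sample values are illustrative assumptions:

```rust
#[derive(Debug, Default)]
struct WindowSummary {
    count: u64,
    sum: f64,
    min: f64,
    max: f64,
}

impl WindowSummary {
    fn observe(&mut self, sample: f64) {
        if self.count == 0 {
            self.min = sample;
            self.max = sample;
        } else {
            self.min = self.min.min(sample);
            self.max = self.max.max(sample);
        }
        self.count += 1;
        self.sum += sample;
    }

    fn mean(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.sum / self.count as f64 }
    }
}

fn main() {
    let raw_latencies_ms = [12.1, 11.8, 13.4, 45.2, 12.0];
    let mut window = WindowSummary::default();
    for s in raw_latencies_ms {
        window.observe(s);
    }
    // One compact record is shipped upstream instead of every raw sample.
    println!(
        "count={} mean={:.1} min={:.1} max={:.1}",
        window.count, window.mean(), window.min, window.max
    );
}
```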
Considering the roadmap, a phased approach is always recommended. In a phased approach, the first month can be dedicated to auditing all the blind spots and defining baseline MTTRs and downtime. Then a couple of months to deploy agents in staging, validating latency, throughput, and detection accuracy. After that, maybe a quarter for rolling out to production and integrating with incident response tools. And beyond that, time can be dedicated to the emerging-trend integrations, like adding ML anomaly detection and domain-specific visualization. Doing it in a phased approach keeps the risk low while proving value early.
The key takeaways: static dashboards create dangerous visibility gaps due to polling delays and limited anomaly detection. Rust's deterministic performance, memory safety, and concurrency model make it ideal for building real-time monitoring pipelines in production. This translates to faster detection, fewer errors, and significantly lower downtime. Rust-powered dashboards aren't just, I would say, a tech upgrade. They are more of an operational shift. They let teams move from reacting to protecting, from firefighting to preventing. Once you experience detecting incidents in under a minute, there is no going back.
That's all from my side today.
Thank you.