Transcript
Hey everyone.
I'm really excited to be here at Conf42 SRE 2025.
Today I want to talk about something that's been gaining a lot of momentum,
especially in the world of distributed systems and that's edge intelligence.
As more and more data is generated outside traditional data centers like in factories
on vehicles or in remote locations.
We are seeing a shift in how we build and manage reliable systems.
And that's what this talk is all about:
how we can apply SRE principles to these new decentralized environments in a way
that's practical, scalable, and reliable.
Before we dive in, let me give you a quick intro about myself.
I am a senior software engineer, and
I work on building systems that are designed to scale reliably and securely.
My focus has been on cloud native platforms, distributed architecture
and observability, basically making sure things are running
smoothly, even as complexity grows.
Over the past few years, I've started looking more closely at edge
computing and what it means to bring the reliability mindset of SRE
into environments that are far more unpredictable and decentralized.
That's the foundation for today's session.
So now that we've got that set, let's talk about what's actually
driving the shift to the edge.
Perfect.
So what's actually driving the shift towards edge computing?
For a long time, the default approach was to send everything to the cloud, collect
data from devices, send it upstream, and process it centrally, and then respond.
That worked when data volumes were manageable and latency
wasn't a huge concern.
But now we are in a world with millions of sensors, real-time
feedback loops, and limited bandwidth,
especially in industrial or remote settings.
So what we're seeing instead is a move to process data right where it's generated,
whether that's on a factory floor, inside a vehicle, or at the network edge.
This has massive benefits, like lower latency, better responsiveness,
and reduced cloud costs.
But it also changes how we architect systems.
We are no longer relying on centralized infrastructure, and we are
building intelligence into the edge.
And that introduces new reliability challenges.
And that's where we start thinking: how do we bring SRE principles
into these kinds of environments?
So if you've worked in SRE or reliability engineering, a lot of the core ideas,
like error budgets, SLOs, automation, and observability, are probably familiar.
But when we take those ideas and try to apply them to edge environments,
things get a little trickier.
The edge is messy.
Devices might be low-power, they might have limited connectivity,
and they're often running in conditions
we can't fully control, so we can't always expect five-nines
reliability or seamless deployments like we do in the cloud, right?
That means we need to adapt SRE principles and practices to
fit this particular context.
For example, error budgets might need to account for connectivity
gaps or power limitations.
And SLOs should reflect what's realistic for each location or device
type, not just a global average.
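To make that concrete, here's a minimal sketch of what per-device-type error budgets might look like. The device types, targets, and the idea of excluding planned connectivity gaps are all illustrative assumptions, not a standard formula:

```python
from dataclasses import dataclass

# Illustrative per-device-type SLO targets; real values would come from
# measuring what each class of hardware and link can actually sustain.
SLO_TARGETS = {
    "factory_gateway": 0.999,  # wired power and network
    "vehicle_unit": 0.99,      # intermittent cellular connectivity
    "remote_sensor": 0.98,     # battery powered, scheduled sleep
}

@dataclass
class UptimeWindow:
    total_seconds: float
    down_seconds: float             # unplanned downtime
    planned_offline_seconds: float  # known connectivity/power gaps

def availability(w: UptimeWindow) -> float:
    """Availability that doesn't penalize planned offline periods."""
    eligible = w.total_seconds - w.planned_offline_seconds
    return (eligible - w.down_seconds) / eligible

def error_budget_remaining(device_type: str, w: UptimeWindow) -> float:
    """Fraction of the window's error budget left (negative = overspent)."""
    budget = 1.0 - SLO_TARGETS[device_type]
    burned = 1.0 - availability(w)
    return (budget - burned) / budget

# Example: a vehicle unit with 30 minutes of unplanned downtime over a
# 7-day window, with 4 hours of known dead zones excluded up front.
week = UptimeWindow(7 * 86400, 1800, 4 * 3600)
print(f"{error_budget_remaining('vehicle_unit', week):.0%} of budget left")
```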
And coming to automation, it still plays a huge role, but it needs to be
lightweight, resilient, and designed for patchy networks.
When it comes to observability, we have to rethink how we collect data because
shipping logs and metrics to the cloud in real time isn't always feasible.
So overall, it's not about abandoning SRE, but it's more about reshaping it to work
in the real world conditions of the edge.
Great.
Now coming to DevOps and containerization at the edge.
One of the ways we've brought consistency and speed to backend
systems is through DevOps practices and containerization.
And now we are extending those same ideas to the edge.
But of course edge environments have very different constraints,
so we have to adapt accordingly.
For starters, our traditional containers can be way too heavy for edge devices,
so we lean on lightweight container images, using things like Alpine-based
builds or even distroless containers that strip out everything you don't
need. That helps reduce memory usage and startup time dramatically,
sometimes by over 90%.
Then there's CI/CD.
Edge deployments need a different approach, especially because
devices may have limited bandwidth or unreliable connections.
So we use more progressive rollout strategies: automated canaries,
rollback mechanisms, and smart sync pipelines that can resume where they
left off if the connection drops.
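As a rough illustration of those smart sync pipelines, here's a small Python sketch that resumes an interrupted artifact download using HTTP Range requests. It assumes the artifact server supports Range headers, and the URL and paths are hypothetical:

```python
import os
import requests

def resumable_fetch(url: str, dest: str, chunk_size: int = 65536) -> None:
    """Download url to dest, resuming from the last byte written
    if a previous attempt died when the connection dropped."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
        if resp.status_code == 416:
            return  # requested range not satisfiable: file already complete
        resp.raise_for_status()
        # Append if the server honored the Range header, else start over.
        mode = "ab" if resp.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in resp.iter_content(chunk_size):
                f.write(chunk)

# Hypothetical usage, retried by the device agent until it completes:
# resumable_fetch("https://artifacts.example.com/app-v2.tar",
#                 "/var/cache/app-v2.tar")
```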
And finally, configuration management.
We still want GitOps-style workflows, but tailored to the edge.
That means treating device configs like versioned code and using
tooling that can sync and enforce config state at scale,
across hundreds or maybe thousands of devices.
So DevOps at the edge is absolutely possible, but it
just needs to be tuned for the environment it runs in.
So once we've deployed applications to the edge, how do we know they're
actually working the way we expect?
That's where observability comes in.
But again, just like with automation, we have to rethink
how we approach it for the edge.
In a typical centralized setup, we might collect a ton of logs, metrics,
and traces, send them all to the cloud, and analyze them there.
But at the edge, that's not always practical.
We might not have the bandwidth or the connection might be
intermittent, so the strategy shifts.
We focus on collecting the right data and doing smart aggregation locally
before pushing anything upstream.
For example, instead of sending every log line, we might detect
anomalies at the edge and only forward summarized or critical data.
Some platforms now even support lightweight anomaly detection directly on
the device so we can catch issues early without waiting for any cloud analysis.
And we still aim for end-to-end visibility using distributed
tracing so we can follow requests across the edge and into the cloud.
But we build it in a way that respects the resource constraints.
So in short, observability at the edge is about being smart,
selective, and efficient, so we stay informed without
overwhelming the system.
Okay.
Now here's where things get really interesting because we are not just
collecting data at the edge anymore.
We're actually running intelligence at the edge.
With the help of lightweight AI models, we can now deploy logic that
doesn't just react, but actually predicts and heals right on the device.
A great example is predictive maintenance.
Instead of waiting for a sensor to fail, edge devices can run small models
that look at vibration or temperature trends, detect anomalies and trigger
alerts before something breaks.
One real-world use case: a manufacturing facility reduced downtime by 37%
just by deploying local models
that caught early signs of machine wear.
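That facility's model isn't something I can show here, but to give the flavor of the idea, here's a toy, dependency-free trend check that could run on constrained hardware; the sample count and slope threshold are illustrative assumptions:

```python
def vibration_trend_alert(samples: list[float], max_slope: float = 0.05) -> bool:
    """Return True when vibration amplitude is climbing faster than the
    tolerance: a least-squares slope over equally spaced samples."""
    n = len(samples)
    if n < 10:
        return False  # too little data to call a trend
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    var = sum((x - x_mean) ** 2 for x in range(n))
    return (cov / var) > max_slope

# e.g. called every minute with the last hour of readings:
# if vibration_trend_alert(recent_amplitudes):
#     raise_maintenance_alert()  # hypothetical alerting hook
```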
Then there's autonomous healing.
Imagine a system that detects performance degradation and corrects it
automatically by rerouting traffic, restarting services, or simply
adjusting resource usage, without any human intervention.
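As a deliberately small sketch of the restart-on-degradation case: the loop below assumes a systemd-managed service and a local health check; real healing logic would also reroute traffic or adjust resources, as described above:

```python
import subprocess
import time

def heal_loop(service: str, is_healthy, strikes_allowed: int = 3) -> None:
    """Poll a local health check and restart the service after repeated
    failures, with no operator in the loop."""
    strikes = 0
    while True:
        strikes = 0 if is_healthy() else strikes + 1
        if strikes >= strikes_allowed:
            # systemd is an assumption; swap in whatever supervisor you run.
            subprocess.run(["systemctl", "restart", service], check=False)
            strikes = 0
        time.sleep(10)
```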
And this is happening.
We are seeing this in smart grids, where devices can reconfigure
power distribution paths when there's an outage,
keeping critical infrastructure online. What's amazing
is that these capabilities are running on resource-constrained edge hardware,
not some big cloud clusters.
It's a huge leap forward in reliability and a perfect complement to the kind of
self-healing systems we aim for in SRE.
We've talked about the why and the how, but let's take a look at what this
actually looks like in the real world.
Here are just a few examples of industrial IoT systems that have
embraced edge intelligence with really powerful outcomes.
The first one comes from smart manufacturing.
A major automotive company deployed edge-based analytics
for real-time quality control on their assembly lines.
By analyzing sensor data right onsite, they were able to reduce
defect rates by 23%, and at the same time, they cut down on cloud
data transfer costs by nearly 80%.
And coming to the oil and gas sector, remote well monitoring
is a perfect edge use case.
One setup was able to maintain visibility and control during a full
three-day cloud outage, avoiding downtime that would've cost
over a million dollars per day.
That kind of resilience simply isn't possible without edge first design.
And in maritime logistics, where connectivity is often unreliable, edge
devices on shipping containers allowed real-time tracking and monitoring
even when completely offline, making end-to-end supply chain
visibility possible.
So what ties these together isn't just the use of edge, it's the
way SRE principles were adapted to environments that are unpredictable,
distributed, and often disconnected.
It shows that with the right architecture and mindset, we can build
highly reliable systems anywhere, even in the harshest conditions.
Now, let's talk about how we actually design these edge systems to be resilient
by default, especially in environments where failure isn't a matter of if, but when.
We often say everything fails eventually, and at the edge that's even more true.
So we build with that in mind, always.
One of the most effective tools is circuit breaking, where a component that's
struggling or unstable can be isolated before it causes a chain reaction.
This helps us contain problems instead of letting them spread across the network.
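Here's a minimal sketch of the circuit-breaker idea; the threshold and cooldown values are illustrative:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, stop calling the struggling
    component for `cooldown` seconds instead of letting failures cascade."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, fallback=None):
        now = time.monotonic()
        if self.failures >= self.threshold and now - self.opened_at < self.cooldown:
            return fallback  # circuit open: fail fast, don't cascade
        try:
            result = fn(*args)  # after the cooldown, one trial call gets through
            self.failures = 0   # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            self.opened_at = now  # (re)opens once the threshold is reached
            return fallback
```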
We also use intelligent failover mechanisms, where systems can automatically
switch to backup nodes or fallback services when something goes wrong.
And then comes data replication.
It is another big one.
When devices are scattered across locations, we need to make sure that
data isn't lost if one node fails, so we replicate it smartly often in real time.
And then there's mesh networking, which is super valuable in
remote or mobile environments.
Devices can talk to each other directly rerouting around failed
connections, forming self-healing networks that maintain uptime even
when parts of the system go dark.
So together, these patterns give us a foundation for building robust
edge systems that stay operational even in unpredictable conditions.
Perfect.
So once we have designed these edge systems for resilience, the next
question is, how do we manage them at scale without losing our minds?
That's where infrastructure automation becomes critical.
First, we start with infrastructure as code.
This lets us define edge environments declaratively, just
like we would with cloud resources.
We version control everything, so we can repeat deployments consistently,
whether it's 10 devices or 10,000.
Next, we bring in policy as code. Before any deployment goes live,
we validate it against security, performance, and reliability rules.
It's like a safety net that prevents bad configurations
from going out to edge devices,
where rollback might be slow or impossible.
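As a rough sketch of what policy as code can look like, here's a tiny pre-deployment validator; the policy names and config fields are hypothetical:

```python
# Checks run in the pipeline before a config is allowed to ship.
POLICIES = [
    ("no privileged containers",
     lambda c: not c.get("privileged", False)),
    ("memory limit at most 256 MiB",
     lambda c: c.get("memory_limit_mib", 10**9) <= 256),
    ("rollout strategy must be canary",
     lambda c: c.get("strategy") == "canary"),
]

def validate(config: dict) -> list[str]:
    """Return the list of violated policies; empty means safe to deploy."""
    return [name for name, passes in POLICIES if not passes(config)]

candidate = {"privileged": False, "memory_limit_mib": 128, "strategy": "canary"}
violations = validate(candidate)
if violations:
    raise SystemExit(f"Blocked by policy: {violations}")
```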
And finally, we layer on GitOps.
This gives us a clean, automated sync between what we've defined in Git
and what's actually running in the field.
If something drifts, it gets reconciled automatically.
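And a minimal sketch of that reconciliation step: compare a stable hash of the desired config from Git against what the device reports, and re-apply on drift. The `apply` mechanism is left abstract here:

```python
import hashlib
import json

def digest(config: dict) -> str:
    """Stable hash of a config, so desired and running state can be compared."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def reconcile(desired: dict, running: dict, apply) -> bool:
    """Re-apply the Git-declared config if the device has drifted.
    Returns True when a correction was pushed."""
    if digest(desired) == digest(running):
        return False  # in sync, nothing to do
    apply(desired)
    return True
```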
What's great about this approach is that it gives us a single source
of truth and repeatable workflows,
even across disconnected, resource-constrained environments.
In other words, we bring the discipline and automation of cloud
operations to the edge without compromising reliability or control.
Okay.
One of the biggest architectural decisions we face with edge systems is this.
What should run locally and what should stay in the cloud?
Not everything makes sense to push to the edge.
Some workloads are better centralized, especially when they're compute-heavy
or require a global view.
But for other tasks, like real-time control, anomaly detection, or basic
analytics, we often can't afford the latency of cloud round trips. These
are the kinds of operations that need to happen right at the source.
So what this slide shows is a way to think about that balance.
On the left, we have workloads like real-time decisions, anomaly detection,
and basic analytics, things that benefit most from being processed at the edge.
As we move to the right, into things like historical analysis or machine
learning model training, those are typically still better handled in
centralized cloud environments.
So successful edge architectures find the sweet spot.
They combine local autonomy for fast, reliable reactions with centralized
intelligence for learning, optimization, and visibility.
This hybrid model, where we combine both, is really the key to getting the
best of both worlds: low-latency local action, and high-level
coordination and insight from the cloud.
Alright, let's wrap it all up with a few key takeaways.
First, implementing edge intelligence isn't just about performance.
It delivers real measurable impact.
On average, we have seen a 40% reduction in latency when time-sensitive
operations are processed locally.
That's the difference between reacting on time and reacting too late.
On the cost side, moving analytics to the edge can cut cloud data transfer
by up to 65%, especially in data-heavy environments like
manufacturing or logistics.
And when we combine that with proper SRE practices, customized for edge
environments, we can realistically aim for 99.99% reliability, even in
decentralized, resource-constrained systems.
So with all these key takeaways in mind, what are the next steps?
Start by looking at your architecture.
Identify which workloads need real time processing and which
ones can stay centralized.
Then define SLOs for your edge systems.
Think locally, not just globally.
And finally, build in the right observability, automation and
infrastructure management so you can scale with confidence.
Edge computing isn't a future trend.
It's already here, and with the right mindset and tools, we can make it just
as reliable as anything in the cloud.
Great.
Thank you so much for joining me for this session.
I hope it gave you a fresh perspective on how SRE can evolve for the edge,
where systems are more distributed,
the stakes are higher, and the challenges are different,
but remember, no less solvable.
So if you're working on anything in this space, or just curious about
edge infrastructure, observability, or reliability, I'd love to connect
and learn from your experience as well.
You can scan the QR code on the slide to connect with me on
LinkedIn, or just use the handle attached.
Thanks again for your time and enjoy the rest of the conference.