Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning, good afternoon, good evening, everyone.
My name is Kumar and I'm honored to be here at Conf42 to discuss what
I believe is one of the most impactful shifts in modern manufacturing:
applying site reliability engineering, or SRE, principles to achieve strategic
and long-lasting reliability.
Over the next half an hour or so, we will explore how embracing a
proactive data-driven SRE mindset can transform production flows, reduce
costly downtime, and pave the way for a more efficient and resilient future.
Before we dive into the details, let's take a quick look at
our roadmap for today's talk.
We're going to start by understanding the reliability crisis in manufacturing
and why traditional methods fall short.
Next, we'll examine how SRE concepts can bridge operational
technology and information technology into a unified approach.
Next, we'll explore the case studies and patterns that illustrate the real
world impact of SRE in manufacturing.
Next, we'll learn about practical challenges, solutions, and a step by
step roadmap for SRE implementation.
This will be followed by highlighting the quantifiable business outcomes
organizations are seeing from adopting these approaches.
We'll end by outlining the next steps so that you can begin your own SRE journey.
By the end of this session, I hope you will feel empowered to look at your
own existing processes with a fresh perspective and to champion strategic
reliability initiatives within your teams.
Let's begin by clarifying the why behind this approach.
When we talk about reliability in manufacturing, we are no longer
speaking about it as a reactive or post-incident measure.
Instead, we now have the tools and methodology to engineer reliability from the ground up.
So let's start by examining the state of manufacturing reliability today
and why so many organizations feel they are constantly putting out fires.
Let's try to understand what the manufacturing reliability crisis is.
When we discuss a reliability crisis, we are referring to how
unplanned downtime and maintenance emergencies have become the norm
in many manufacturing environments.
Research shows that, on average, a single hour of unplanned downtime can cost upward
of $260,000 across various industries.
That is an astonishing figure, and it does not even account for the ripple
effects, such as backlogs, delayed deliveries, and lost customer trust.
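To make that figure concrete, here is a rough back-of-the-envelope sketch in Python. The $260,000 hourly figure is the one quoted above; the outage durations are hypothetical inputs, and the calculation deliberately ignores the ripple effects just mentioned.

```python
# Direct cost of unplanned downtime, using the per-hour figure quoted in the talk.
COST_PER_HOUR_USD = 260_000  # cross-industry average cited above; real plants will differ


def downtime_cost(hours: float, cost_per_hour: float = COST_PER_HOUR_USD) -> float:
    """Return the direct cost of an outage, excluding ripple effects."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * cost_per_hour


# Hypothetical outage durations (in hours) for one month
outages = [0.5, 2.0, 1.25]
monthly_cost = sum(downtime_cost(h) for h in outages)
print(f"Direct downtime cost this month: ${monthly_cost:,.0f}")
```

Even under these made-up durations, fewer than four hours of downtime in a month approaches a million dollars in direct cost.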
As you can see on the slide, there are three main components of this crisis.
Number one is costly downtime.
The immediate loss in productivity and the scramble to address breakdowns
lead to a cascade of expenses.
Even a short equipment failure may push back production
schedules for days or weeks.
Next is reactive approaches.
Many facilities rely heavily on break fix methods.
They wait for things to fail before mobilizing teams to
repair or replace components.
This firefighting posture drains resources and generates a
cycle of unpredictability.
And last but not least is siloed information. In most plants,
critical operational data is scattered across different
systems, some legacy, some newer.
Without a unified view, it is incredibly challenging to spot
early warning signs or identify root causes.
Companies end up spending significant manpower and
budget on managing breakdowns.
Maintenance teams are stretched thin, and morale can suffer
when every day feels like another crisis.
Traditional metrics around uptime or availability may look good on paper,
but in reality, they don't reflect the continuous disruptions and the subtle
inefficiencies that add up over time.
Clearly, something needs to change.
We want to build a more resilient vessel instead of trying to
patch leaks in a sinking ship.
That's where site reliability engineering concepts come in.
Let's see how this shift transforms firefighting
into proactive prevention.
Let's spend some time talking about the SRE paradigm shift.
Site reliability engineering originated in the software and tech world at companies like Google,
where high availability and efficiency are paramount.
The idea is to treat operations not as an afterthought, but as a strategic
function that is engineered, measured, and iterated upon for constant improvement.
As you can see on the slide, there are two sections.
One is for the traditional manufacturing approach, and on the right side we
have the SRE manufacturing approach.
So let's focus on the right side, which is the SRE approach.
The first key component is proactive operational resilience, which means
that instead of waiting for machines or processes to fail, SRE encourages
building systems that anticipate issues.
It's about identifying single points of failure and mitigating
them before they cause downtime.
The next component is cross-functional collaboration.
SRE principles break down silos by bringing operations, engineering,
and quality teams together.
In software, we call it DevOps or something like that.
In manufacturing, it means bridging OT and IT
so that data, insights, and responsibilities are shared seamlessly.
The next key component is automated observability. Whether it
is a cloud-based analytics platform or edge computing for real-time monitoring,
SRE relies heavily on automation to provide continuous visibility into
every part of the production line.
The next main component is end-to-end ecosystem coverage, which means
manufacturing reliability isn't just about a single machine
but the entire production ecosystem, from robotics on the shop floor to ERP
systems in the cloud, which store transactions.
Last but not least is measurable reliability, which means SRE uses service level objectives, which we
call SLOs, and service level indicators, which we call SLIs, to quantify reliability.
We set clear numeric targets, such as 99.9% availability of a particular
assembly line, machine, or production system, or a certain threshold for
quality yield, and continuously measure performance against those goals.
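As a minimal sketch of what measuring against such a target can look like, the Python below evaluates an availability SLI against the 99.9% SLO mentioned above. The function and field names, the 30-day window, and the 30 minutes of downtime are illustrative assumptions, not something taken from the talk's slides.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float          # measured SLI, as a fraction between 0 and 1
    allowed_downtime_min: float  # downtime the SLO tolerates in this window
    budget_remaining_min: float  # negative once the SLO has been missed
    slo_met: bool


def evaluate_availability_slo(window_min: float, downtime_min: float,
                              slo_target: float = 0.999) -> SloReport:
    """Compare measured availability for one window against an SLO target."""
    if window_min <= 0 or not 0 <= downtime_min <= window_min:
        raise ValueError("downtime must lie between 0 and the window length")
    availability = (window_min - downtime_min) / window_min
    allowed = window_min * (1 - slo_target)
    return SloReport(availability, allowed, allowed - downtime_min,
                     availability >= slo_target)


# Hypothetical assembly line: a 30-day window with 30 minutes of downtime
report = evaluate_availability_slo(window_min=30 * 24 * 60, downtime_min=30)
print(f"availability={report.availability:.4%}, SLO met: {report.slo_met}")
```

The remaining budget is the useful number in practice: it tells the team how much more downtime the line can absorb before the target is missed.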
Adopting these principles has consistently reduced unplanned downtime
and improved mean time to recovery, or MTTR.
This shift is cultural as well as technical.
It is about redefining reliability as something we engineer and optimize
rather than something we hope to maintain through good luck and quick repairs.
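For readers who want to track the MTTR metric mentioned above, a minimal sketch follows. It assumes each incident is recorded as a pair of timestamps (when the failure started and when service was restored); the incident data is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery duration over (failure_start, service_restored) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = timedelta()
    for started, restored in incidents:
        if restored < started:
            raise ValueError("an incident cannot be restored before it starts")
        total += restored - started
    return total / len(incidents)


# Hypothetical incident log for one production line
incidents = [
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 8, 45)),     # 45 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 30)), # 90 minutes
    (datetime(2024, 1, 21, 22, 15), datetime(2024, 1, 21, 22, 30)),# 15 minutes
]
mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr}")
```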
Next, let's expand this idea beyond a single piece of equipment to encompass
an entire manufacturing environment.
Next, let's talk about the holistic manufacturing SRE model and what it is.
To truly reap the benefits of SRE,
we need to look at the entire manufacturing ecosystem holistically.
This includes, as you can see on the slide, four components.
Number one is shop floor systems.
This is about programmable logic controllers,
or PLCs, human-machine interfaces,
or HMIs, robotics, and machine controllers.
Each of these has unique protocols and timing constraints.
The next main component
is edge computing layers, which are basically manufacturing execution
system platforms that often run in real time near the production line.
Next is core infrastructure, and this includes on-prem servers,
databases, and networks that handle everything from batch scheduling
to material requirements planning, or MRP.
Last but not least is cloud services.
This consists of the array of remote platforms that host data analytics,
digital twins, and enterprise applications, integrating multiple plants or geographies.
An SRE approach demands that
each layer's reliability goals are aligned and measured.
For instance, if your cloud-based digital twin requires real-time data to
predict machinery failures, that data feed needs an agreed-upon SLO for uptime.
If the feed fails, your predictive maintenance cannot function,
potentially missing critical early warning signs.
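One way to put a number on a data feed's uptime is to measure how long the feed went stale. The sketch below is only an illustration under stated assumptions: message arrival times are given in seconds within the window, the feed counts as "up" while the newest message is no older than `max_gap`, and the window start is treated as if a message had just arrived. The function name and sample data are hypothetical.

```python
def feed_availability(timestamps, window_start, window_end, max_gap):
    """Fraction of the window during which the feed was not stale.

    Assumes every timestamp lies inside [window_start, window_end].
    """
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    stale = 0.0
    previous = window_start
    # Append the window end so a silent tail is also counted as stale time.
    for t in sorted(timestamps) + [window_end]:
        gap = t - previous
        if gap > max_gap:
            stale += gap - max_gap
        previous = t
    return 1 - stale / (window_end - window_start)


# Hypothetical feed: one message every 5 s, silent between 300 s and 400 s
arrivals = list(range(0, 301, 5)) + list(range(400, 601, 5))
availability = feed_availability(arrivals, window_start=0, window_end=600, max_gap=10)
print(f"feed availability: {availability:.1%}")
```

Comparing this value with the agreed SLO for the feed shows whether the digital twin had the data it depends on.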
This holistic model is what makes SRE so powerful. By engineering
reliability across the entire stack,
blind spots are eliminated and consistent data flow is ensured.
It also means that reliability is no longer just an engineering
problem or an IT problem.
Instead, it is a shared objective that spans the entire facility or global operation.
Organizations that implement a holistic SRE model usually
report significant gains.
A few such gains are
reduced unscheduled downtime, faster incident detection and response,
more predictable production schedules and inventory management, and
higher confidence from both internal teams and external customers.
Now, of course, to enable this holistic coverage, we
need comprehensive visibility.
Let's explore how advanced monitoring and observability are
pivotal in making reliability measurable and actionable.
Now let's talk about advanced monitoring and observability.
One of the main cornerstones of SRE is having robust real-time insight into
everything that is happening across the manufacturing or production environment.
We call this observability because we are interested in collecting data and
correlating it in meaningful ways.
As you can see on the slide, there is a
pyramid shape, which consists of four components: predictive
analytics, contextual correlation, real-time visualization, and
comprehensive data collection.
Let's start from the top,
which is predictive analytics. By gathering time series data, for example, related
to wear, vibration, temperature, throughput rates, and cycle times,
we can use machine learning models or rule-based algorithms to identify
issues before they cause failures.
Studies show that SRE-driven monitoring can reduce unplanned
downtime by as much as 40 to 42%.
The second layer from the top is contextual correlation.
What it means is that if a certain robotic arm in one area of the line is slowing down,
then SRE systems can correlate that event with upstream or downstream
signals, like a sensor reading that indicates a jam or a particular batch
of raw material that is out of spec.
The next layer is real-time visualization.
What it means is centralized dashboards that give operators, engineers, and
managers a live picture of system health.
This includes metrics such as overall equipment effectiveness,
or OEE, machine states, and SLO compliance.
Last but not least, at the bottom, is comprehensive data collection, which
means telemetry is gathered from PLCs, edge devices, server logs, cloud
applications, and any other system that may be integrated. Pulling all
this data together helps
transform a once-siloed environment into an integrated, transparent ecosystem.
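The OEE metric mentioned for the dashboards is conventionally computed as availability times performance times quality. The sketch below uses that standard definition; the function name and the shift figures are hypothetical and not taken from the talk.

```python
def oee(planned_min, downtime_min, ideal_cycle_min, total_count, good_count):
    """Return (availability, performance, quality, oee) as fractions."""
    if planned_min <= 0 or total_count <= 0:
        raise ValueError("planned time and total count must be positive")
    run_time = planned_min - downtime_min
    if run_time <= 0:
        raise ValueError("downtime must be less than planned time")
    availability = run_time / planned_min                     # share of planned time spent running
    performance = (ideal_cycle_min * total_count) / run_time  # actual speed vs. ideal speed
    quality = good_count / total_count                        # share of parts that are good
    return availability, performance, quality, availability * performance * quality


# Hypothetical shift: 480 planned minutes, 60 minutes down,
# 1.0 minute ideal cycle time, 400 parts produced, 380 of them good
a, p, q, score = oee(480, 60, 1.0, 400, 380)
print(f"availability={a:.1%} performance={p:.1%} quality={q:.1%} OEE={score:.1%}")
```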
These monitoring systems catch problems early and empower teams
to do more strategic planning.
For example, if you see a machine trending toward an out of tolerance
condition, you can schedule a preventive maintenance window instead
of waiting for a catastrophic failure.
This proactive stance ultimately leads to a calmer, more efficient operation.
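The idea of spotting a machine trending toward an out-of-tolerance condition can be sketched with a simple straight-line fit. This is a deliberately simple rule-based illustration, not the method used by any particular product; it assumes the reading rises roughly linearly toward an upper limit, and the sample temperatures and the limit of 80 are hypothetical.

```python
def hours_until_limit(hours, values, limit):
    """Fit a least-squares line and estimate hours from the last sample
    until the fitted trend reaches `limit`.

    Returns None when the trend is flat or falling.
    """
    n = len(hours)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples of equal length")
    mean_x = sum(hours) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in hours)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, values)) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical bearing temperature sampled once a day (hours, degrees C)
sample_hours = [0, 24, 48, 72, 96]
temperatures = [60.0, 62.0, 64.0, 66.0, 68.0]
remaining = hours_until_limit(sample_hours, temperatures, limit=80.0)
print(f"estimated hours until limit: {remaining:.0f}")
```

An estimate like this gives the maintenance planner a window in which to schedule the work, rather than a surprise failure.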
However,
advanced observability is only one facet.
The real gains come
when we layer specific SRE patterns on top of this data to drive continuous
improvement. Let's discuss some emerging manufacturing SRE
patterns that are proving especially effective in bridging OT and IT
and what they mean.
In the journey to adopt an SRE framework, organizations have developed
specialized patterns that address the unique complexities of manufacturing.
As you can see on the slide, there are three such specialized patterns.
The first is industrial control systems,
or ICS, reliability.
This comes with
unique constraints: ICS environments often run
on older proprietary protocols.
They can't always be updated or rebooted as easily as modern IT systems.
In the SRE adaptation,
this is addressed by introducing redundancy and specialized monitoring
tailored to ICS protocols.
Companies have seen up to a 45% reduction in downtime caused
by control system failures.
Next is the OT/IT integration framework.
Basically, it is about bridging the two worlds.
Historically, OT and IT teams have been separate, operating in silos, each
with its own security and data practices.
In the SRE adaptation,
unified observability across both domains helps teams spot issues that
span software and hardware layers.
It also strengthens security by creating shared accountability and
more transparent data governance.
Last but not least here is the digital twin deployment model.
This provides a virtual testing ground
by modeling the production environment virtually and experimenting
with changes before rolling them out on the factory floor.
The SRE adaptation here is predictive reliability
engineering: by simulating failures or throughput extremes,
incident response teams are trained in advance and can optimize
processes with minimal risk.
These patterns demonstrate how SRE is not a one-size-fits-all approach.
Each manufacturing environment can adopt the patterns that best
align with its machinery, data infrastructure, and corporate culture.
Let us see how one automotive manufacturer implemented these
principles and achieved tangible results.
Next, we'll talk about a case study
of an automotive manufacturer that implemented SRE principles. To
illustrate the power of these ideas, let's look at a prominent automotive
manufacturer that implemented an SRE strategy across pre-assembly plants.
What were the challenges?
They were dealing with frequent unplanned downtime events, particularly at
critical points along the assembly line.
This led to production bottlenecks,
missed delivery targets, and substantial financial losses.
For the SRE implementation, the organization
developed clear service level objectives, or SLOs, for the robotic
welding stations, paint shops, and final assembly steps, because these
were the three critical areas where the problem was repeatedly happening.
They also
introduced advanced observability dashboards that consolidated data from
sensors, controllers, and enterprise systems into a single interface.
What was the result?
Within nine to 12 months,
the mean time to recovery, or MTTR, dropped by 62%.
Unplanned downtime was reduced by 45 to 47%.
The company estimated almost $3.2 million in savings annually
from avoiding production losses.
All three results were considerable and of utmost
importance for the company.
Beyond the raw numbers, the most significant outcome
was the cultural shift.
Maintenance and engineering teams became more proactive, and management
began to see reliability metrics as a strategic asset. Instead
of solely measuring efficiency in terms of daily output, they started tracking
how reliably that output was achieved.
This example shows that SRE is not just a theory.
It delivers quantifiable benefits, especially when there is strong executive
sponsorship and cross-functional buy-in.
Of course, it is not always smooth sailing.
Let's take a look at some of the challenges that you might face when
implementing SRE and how to address them.
In this section, we'll look into some implementation challenges.
While the potential benefits are substantial, organizations
often encounter barriers during the SRE adoption process.
The three common challenges that organizations face are legacy system
integration, reliability metric definition, and cultural alignment.
Starting at the top,
that is legacy system integration.
The reality is that manufacturers operate machines and control systems that are
20, 30, 40, or even 50 years old.
These legacy devices may not have
straightforward data export or telemetry capabilities.
What is the solution for this?
Specialized industrial gateways
and protocol converters
can extract relevant performance metrics.
SRE teams often partner with control engineers to retrofit older equipment
with modern sensors, building a bridge between decades-old controllers
and state-of-the-art analytics.
Next is reliability
metric definition.
What is the reality around this?
It is always difficult to determine
what to measure and to set realistic yet challenging goals, or SLOs,
from an SRE perspective, which requires deep domain knowledge. A mismatch here
leads to metrics that don't reflect true production goals. As for the solution,
successful SRE initiatives typically
form cross-functional committees that include operations, engineering,
maintenance, and quality assurance teams.
They work collaboratively to define SLOs and SLIs, ensuring they align
with both production targets and real-time operational risk.
Last but not least, and most important, is cultural alignment.
The reality is that shifting from a reactive,
fix-it-when-it-is-broken culture to a proactive, prevent-it-before-it-breaks
culture or mindset can spark resistance.
Long-tenured staff might be skeptical of new methodologies.
What is the solution for this?
The solution is change management.
Change management is crucial:
providing training, running pilots that demonstrate clear wins, and
communicating the value of SRE in language that
resonates with the shop floor.
Success stories like the automotive manufacturer we talked about
earlier are powerful motivators.
Overcoming these challenges requires persistence, strategic planning, and
leadership support.
Most companies that tackle these systematically begin to
see measurable improvements in reliability within six to 12 months.
The question then is how best to structure an SRE rollout.
Let's explore a strategic roadmap that can guide implementing
organizations step by step.
In this section, we'll talk about the SRE implementation roadmap.
A common thread among successful manufacturing SRE rollouts
is a structured approach that grows progressively.
On this slide, we have a four-phase roadmap that can
be tailored when implementing SRE.
The phases are assess, pilot, scale, and optimize.
Let's start from the top, which is assess. During the
assessment, organizations need to conduct a complete system inventory,
identify key reliability pain points, and measure baseline metrics.
This sets a clear "before" picture to quantify improvements.
The outcome of this phase is a prioritized list of systems and processes
urgently needing SRE attention.
The next phase is the pilot, and the action item for this phase is to select one or
two critical systems that suffer frequent downtime or are crucial to throughput.
Define specific SLOs, say a 99% or 99.9% uptime goal, and implement
initial monitoring, alerting, and incident response protocols.
The outcome of this pilot phase is expected to be quick wins that demonstrate value.
This stage fosters organizational buy-in and provides a model for broader adoption.
The next phase in the process is scale.
The action item for this phase is expanding SRE practices to
additional production lines or processes,
integrating automation tools, and refining workflows.
Cross-functional teams become the norm, bridging IT and OT.
The outcome of this phase is that a
larger slice of the operation gains continuous observability,
consistent SLO reporting, and more
proactive maintenance strategies.
The final phase of this implementation is optimize.
The action item of this phase is leveraging advanced analytics,
machine learning, and digital twins
for predictive insights, and training operational staff so that SRE
knowledge becomes embedded across the organization.
The outcome of this phase is a mature SRE culture with
continuous improvement cycles.
Incident response gets faster and reliability moves closer
to engineered perfection across the manufacturing footprint.
Following this roadmap ensures a methodical rollout and helps sustain
the momentum needed for cultural change.
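To see what the pilot-phase choice between a 99% and a 99.9% uptime goal means in practice, here is a small sketch. The 30-day window is an assumption, and the burn-rate check is a common SRE alerting idea added here for illustration rather than something the talk prescribes; the downtime figures are hypothetical.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime budget, in minutes, that an uptime SLO leaves for one window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window_days * 24 * 60 * (1 - slo_target)


def burn_rate(downtime_min: float, elapsed_days: float,
              slo_target: float, window_days: int = 30) -> float:
    """How fast the budget is being spent; 1.0 means exactly on pace."""
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    consumed = downtime_min / allowed_downtime_minutes(slo_target, window_days)
    return consumed / (elapsed_days / window_days)


for target in (0.99, 0.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target:.1%} uptime allows {budget:.1f} minutes of downtime per 30 days")

# Hypothetical pilot line: 20 minutes of downtime in the first 3 days
rate = burn_rate(downtime_min=20, elapsed_days=3, slo_target=0.999)
print(f"burn rate: {rate:.2f}")
```

A burn rate well above 1.0 early in the window is a reasonable trigger for an alert, because the budget will run out before the window ends if nothing changes.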
And of course, such systematic efforts lead to tangible business results.
Let us discuss some of the quantifiable metrics companies are seeing when they
fully embrace SRE in manufacturing.
Next, let's talk about quantifiable business impact.
By this point, you may be wondering what kind of hard numbers you can expect
if you successfully implement SRE.
As you can see on the slide, based on collected data and case studies,
there are three components of it.
Number one is a 68% incident reduction.
On average, companies observe a 68% drop in production-impacting incidents
once the SRE framework is implemented and in place.
Number two is a 43% MTTR improvement. Mean time to recovery,
or how quickly you can fix issues when they arise,
improves significantly due to better observability and more robust processes.
The results have shown the improvement is almost 43%.
The next is annual savings.
That's a big one, right?
The question everyone always asks is how much the annual savings are going
to be. Organizations frequently report multimillion-dollar savings
from preventing downtime alone.
This doesn't include secondary benefits, like improved brand reputation or
reduced inventory carrying costs. On average, companies save almost $4.2 million,
based on the past data that we have.
These gains go beyond line items on a spreadsheet.
Reliable production schedules allow for leaner inventories,
more accurate demand forecasting, and higher employee satisfaction.
Once consumed by emergency repairs, maintenance teams can pivot to
higher-level tasks such as process optimization.
Customers also notice the improvement.
You build trust and strengthen your competitive advantage
when you consistently deliver on time.
SRE-related reliability becomes a true differentiator in markets
where zero downtime is increasingly an expectation.
Now that we have covered the business case, let's talk about the next
practical steps you can take to begin or accelerate your SRE journey.
Next, let's talk about the next steps and what your SRE journey looks like.
If all this resonates with you, you might wonder, where do we start?
Here are four immediate steps to help you kick off your transformation.
Number one is an assessment workshop.
Before diving into technical deployments, you need to gather stakeholders,
such as production managers, IT, OT, and maintenance leads, to identify core
bottlenecks and reliability gaps. And what is the benefit of it?
You will have a concrete cross-disciplinary understanding
of your current state and where SRE principles could deliver the most impact.
Next on the list is pilot program design, and the question will be why it is needed.
We need to pick a critical, high-visibility area to demonstrate
success rather than overhaul everything at once.
And what is the benefit of it?
It gives quick wins, which build momentum,
justify further investment, and give teams hands-on experience with SRE practices.
Next step is team enablement.
SRE is a collaborative discipline. Operations personnel
need the skills and mindset to leverage new tools and data effectively.
The benefit of it is basically continuous learning and professional
development, which help with the immediate pilot and ensure that
future expansions will be smooth.
Last but not least is the implementation roadmap.
This requires a real change
plan:
defining the milestones, budgets,
roles, and performance metrics that are going to be measured
and benchmarked.
The benefit of it is visibility and accountability.
Everyone knows where the effort is headed and how success will be measured.
Overall, remember that SRE in manufacturing is a marathon, not a sprint.
Challenges are natural, but each incremental improvement in reliability
pays off exponentially over the long term.
Let's wrap up our session by summarizing the key takeaways and looking at the
future for manufacturing reliability.
This brings us to the conclusion of this talk. As we conclude our 30-minute
journey into SRE for manufacturing,
let's recap the main points. We talked about the manufacturing reliability
crisis, which highlighted that unplanned downtime costs
are enormous and traditional reactive methods are no longer sufficient
in today's competitive landscape.
Next, we talked about the paradigm shift,
which is site reliability engineering introducing proactive,
engineered reliability with cross-functional collaboration
and metric-driven accountability.
We talked about taking a holistic approach:
effective SRE spans shop floor systems, edge devices, on-prem
infrastructure, and cloud platforms,
anything and everything that an organization is impacted by, all
sharing data and insights in real time.
Next, we talked about advanced observability, which is about monitoring
and predictive analytics that enable teams to spot
issues early, reduce incident rates, and accelerate recovery times.
Next, we talked about emerging SRE patterns,
including ICS reliability, OT and IT integration, and digital twin usage,
which are among the specialized patterns driving significant gains.
We also talked about a case study of an automotive
manufacturer, a real-world example that showed dramatic
reductions in downtime and meaningful ROI, sometimes in the millions of dollars.
We talked about overcoming challenges: legacy
integration, metric alignment, and cultural barriers can be
addressed through dedicated change management and phased adoption.
We covered the roadmap and business impact, which included a structured
rollout and precise measurement of SLOs,
which deliver quantifiable benefits, often tens of percentage
points in downtime reduction and millions of dollars in savings.
And finally, we talked about the next steps, which included four key
steps: an assessment workshop, pilot programs, team training, and an
overarching implementation roadmap, which are key to a sustainable transformation.
Ultimately, embracing SRE in manufacturing is not just about keeping machines running.
It is about fostering a culture of continuous improvement where reliability
is woven into every aspect of operation.
This cultural and technological alignment can yield incredible
competitive advantages,
with greater output consistency, lower operational costs, and
a workforce empowered rather than burdened by technology.
Finally, thank you for joining me in this exploration of SRE-based reliability.
I hope you leave here today with a few insights and the inspiration to
champion reliability as a strategic driver in your organization.
Thank you again and I look forward to any questions or suggestions you may have.
Thank you.