Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Maheh Lati.
Today we are going through a concept called Cloud-Powered
Retail Resilience and how AWS drives customer-centric incident response.
Let's jump into the topic.
Okay.
Let's discuss how the retail market works now.
In today's digital-first retail landscape, customer expectations
have fundamentally shifted.
Shoppers demand seamless experiences across all touch points, from mobile
apps to in-store interactions.
Any disruption, whether a website crash or an inventory system failure during
peak shopping hours, directly impacts customer satisfaction and revenue.
Modern retailers operate in an always-on economy where downtime is not just
inconvenient, it's business critical.
The challenge extends beyond simply keeping systems running to ensuring
optimal performance during traffic spikes, seasonal surges, and unexpected events.
So the main thing is that retailers have to learn how to keep the product, the website,
and the backend infrastructure available 24 by 7, no matter what the
season is. For example, during Thanksgiving we'll have
high sales, where a lot of our customers will log in and try to
buy our products.
During that time there should be no impact or downtime for the customers, which would
mean a bad reputation for the company.
The main things we need to consider are customer experience,
revenue impact, and brand reputation. First, the customer experience.
Every second of downtime erodes customer trust and drives shoppers to competitors.
Modern consumers expect instant gratification and seamless
interaction across all channels.
Even though we are the retailer, we always have to
think like a customer: how will a customer react if there is any issue,
impact, or downtime on our website when they are trying to buy a product?
Revenue impact: system failures directly translate to lost sales opportunities.
Peak shopping periods amplify these losses, making resilient infrastructure
a revenue protection strategy.
Another thing is brand reputation.
Service disruptions become social media events.
Poor incident response can damage a brand reputation built over years,
affecting long-term customer loyalty.
You may have noticed recently that a lot of customers, if they
face any issue, write about it directly on social media,
which impacts the revenue and also the brand of that organization.
So we need to be very careful.
So I'm going to tell you why I chose cloud when
compared with traditional infrastructure.
I considered multiple aspects
before choosing cloud infrastructure. These are the reasons.
The first one is infrastructure.
Traditional IT relies on physical data centers and static server environments.
Scaling requires manual hardware
provisioning, which makes it slow and costly.
Whereas cloud is built on virtualized,
cloud-based infrastructure. Resources are provisioned dynamically, enabling
rapid scaling across regions.
Suppose there is an impact to some hardware. In traditional infrastructure,
monitoring has to detect it and send an email or alert to some person, that
person has to be available, and they have to fix the issue manually, which may
take hours and cost revenue.
Whereas in cloud, everything is automated.
Even if there is an impact to one piece of hardware, it'll automatically spin up a new one
and make sure the website is available for the customer.
Scalability, which is a very important factor for any business. In the traditional
method, scalability is very limited and requires capacity planning
and over-provisioning to handle peak loads.
Whereas cloud is highly scalable. Auto scaling and
Elastic Load Balancing automatically adjust to demand in real time.
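As a rough illustration of scaling with demand, here is a minimal Python sketch of a proportional scaling rule. It is not AWS code and not the exact algorithm any AWS service uses; the function name `desired_capacity`, the 50 percent CPU target, and the instance limits are all assumptions chosen for the example.

```python
import math

def desired_capacity(current_instances, avg_cpu_percent, target_cpu_percent=50.0,
                     min_instances=2, max_instances=20):
    """Return how many instances would bring average CPU back toward the target."""
    if current_instances <= 0:
        return min_instances
    # Proportional rule: if load is about 2x the target, run about 2x the instances.
    needed = math.ceil(current_instances * avg_cpu_percent / target_cpu_percent)
    # Clamp to the configured floor and ceiling (hypothetical limits).
    return max(min_instances, min(max_instances, needed))

if __name__ == "__main__":
    print(desired_capacity(4, 25.0))  # quiet day: 2
    print(desired_capacity(4, 90.0))  # holiday peak: 8
```

The same rule scales down on a quiet day and up on a busy one, which is the point of not having to run peak-sized infrastructure all year.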
What this means is, suppose we are on traditional, on-prem infrastructure.
Generally there will be high traffic during holiday
seasons, whereas in the non-holiday season the traffic will be low or normal.
So in order to handle that in traditional with no impact for the
customer, we always have to run the high-performance infrastructure.
Whereas in cloud, we don't need to do that.
The automation will take care of it in such a way that when the
traffic is normal, it'll run with the regular resources, and in the holiday
season it'll by default auto scale and scale up all the resources to handle the new,
higher traffic. Recovery time objective.
In traditional, recovery is long. It
can take hours or days, depending on the DR site, the manual process, and how
we did the setup. Whereas cloud native gives rapid recovery using automated
failover, snapshots, and cloud-native DR tools like AWS Elastic Disaster Recovery.
In my personal experience, it happened once and I was able to
recover from the DR within minutes.
That is the reason why I prefer cloud to the traditional way.
In the traditional way, disaster recovery requires a secondary physical site
with duplicate infrastructure, leading to high maintenance and high cost.
Whereas in cloud, we host our infrastructure in multiple
regions where everything is replicated.
Every second, it'll be replicated.
Automated failover and managed services give significantly lower cost
and faster recovery. Incident response.
Traditionally, the reaction is very slow and often depends on
human intervention, and communication is manual and often delayed.
Whereas cloud is proactive, with automated incident detection
and response using AWS-provided tools like Amazon CloudWatch,
Lambda, and Step Functions. Monitoring and
observability. Traditional has basic monitoring tools and limited visibility
into real-time system health.
It is hard to trace the root cause quickly. Even if we want to monitor, we have
to buy third-party tools and
install them on our infrastructure, which
consumes more CPU and RAM and
results in a high cost of maintenance.
Whereas in cloud, the provider supplies the monitoring tools, with
full-stack observability using distributed tracing, real-time dashboards, and logs via
CloudWatch, X-Ray, and OpenTelemetry.
There is also an AWS dashboard where we can
monitor the health checks at any time.
And whenever there is an issue, it'll trigger an email or alert
to the responsible team or person.
Cost model. Traditional is capital expenditure
heavy and requires upfront investment in hardware, power, cooling, and space.
Whereas cloud is an operational expenditure model: pay only for what you
use, with the ability to scale up and down on demand. Impact during
failure. In traditional, outages may last long, causing frustration and revenue loss.
Manual failover can be slow and error prone.
Whereas in cloud, failures are minimal and isolated, with auto healing,
redundancy, and intelligent routing to healthy instances or regions.
For example, if there is an issue in one region, the traffic will
be rerouted to another region that we have already configured.
Or suppose we have an issue with one server: all the traffic will be
rerouted to the healthy servers.
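The rerouting idea can be sketched in a few lines of Python. This is only a toy model of health-aware routing, not how any AWS load balancer is implemented; the server names, the `health` dictionary, and both function names are hypothetical.

```python
from itertools import cycle

def healthy_targets(targets, health):
    """Keep only the targets whose last health check passed."""
    return [t for t in targets if health.get(t, False)]

def route(requests, targets, health):
    """Assign each request to a healthy target in round-robin order."""
    pool = healthy_targets(targets, health)
    if not pool:
        raise RuntimeError("no healthy targets available")
    next_target = cycle(pool)
    return {request: next(next_target) for request in requests}

if __name__ == "__main__":
    servers = ["server-a", "server-b", "server-c"]
    # server-b failed its health check, so it receives no traffic.
    status = {"server-a": True, "server-b": False, "server-c": True}
    print(route(["r1", "r2", "r3"], servers, status))
```

The same shape applies one level up: replace servers with regions and the unhealthy region simply drops out of the pool.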
These are the key differences between
traditional and cloud native.
There are three services that make me prefer
AWS over any other cloud.
These are the three:
Amazon Personalize, Amazon Connect, and Amazon Forecast.
Amazon Personalize delivers personalized experiences that adapt during incidents,
maintaining customer engagement even when primary systems face disruption,
through intelligent recommendation fallbacks. Amazon Connect provides
scalable contact center capabilities that automatically route customer
inquiries during incidents, ensuring support quality remains
consistent during high-stress periods.
I want to give you one example, which recently happened to me.
We had a major infrastructure failure where we didn't know how to fix it.
So we immediately contacted Amazon customer service, and they immediately
connected with us. We all joined a call, and they helped us
troubleshoot the issue within minutes.
I never expected Amazon to be available within minutes and help us fix it.
And one more thing is Amazon Forecast.
This is an amazing service from Amazon, which predicts demand patterns
and potential system stress points.
That enables proactive capacity planning and incident prevention
before peak loads cause failures.
It'll review all the data from previous years and tell
us how much traffic is going to come in during a particular holiday season.
Depending on that, we can proactively either increase capacity manually or
let the automation take care of it.
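To make the "look at previous years and plan capacity" idea concrete, here is a deliberately naive Python sketch. It is a stand-in for illustration only and says nothing about the models Amazon Forecast actually uses; the function names, the growth-ratio method, the 20 percent headroom, and the sample numbers are all assumptions.

```python
import math

def forecast_peak(previous_peaks):
    """Project next season's peak requests per second from past peaks (oldest first)."""
    if len(previous_peaks) < 2:
        raise ValueError("need at least two past seasons")
    # Average year-over-year growth ratio across the seasons we have.
    ratios = [later / earlier
              for earlier, later in zip(previous_peaks, previous_peaks[1:])]
    growth = sum(ratios) / len(ratios)
    return previous_peaks[-1] * growth

def servers_needed(peak_rps, rps_per_server, headroom=0.2):
    """Convert a projected peak into a server count, with spare headroom."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_server)

if __name__ == "__main__":
    # Hypothetical Thanksgiving peaks for the last three years.
    projected = forecast_peak([1000, 1200, 1440])
    print(round(projected), servers_needed(projected, rps_per_server=100))
```

Once a number like this exists, it can either drive a manual capacity increase before the holiday or feed the limits used by the automated scaling.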
These are the main automation and intelligence layers from AWS.
The main service is AWS Lambda, which executes automated response workflows
instantly when incidents occur, reducing human response time and ensuring
a consistent remediation process.
What this means is, whenever there is a system failure or some
hardware failure, it immediately
tries to fix the issue.
If it is not able to fix the issue, it'll shut down that particular
system and spin up a new system,
making sure that customer traffic flows smoothly.
Amazon Rekognition.
This service monitors visual content and
customer interactions for anomalies, providing early warning signals
for potential experience degradation.
AWS IoT Core
connects in-store devices and sensors to provide real-time operational
intelligence, bridging physical and digital retail environments.
Building proactive resilience, which is very important for any retail business.
Proactive resilience moves beyond reactive incident response to
anticipate and prevent issues before they impact customers.
AWS services work together
to create a comprehensive early warning system. By analyzing patterns
in customer behavior, system performance, and external factors,
retailers can identify potential stress points and automatically adjust resources.
This approach transforms incident management from damage control into
customer experience optimization.
The key is creating intelligent systems
that learn from historical data and adapt to changing conditions in real time.
I'll give you one example
of how this works.
Suppose we have Thanksgiving coming and we don't know how much traffic is
going to come, or how many customers are going to hit our website at that time.
With this proactive resilience, it'll go and check the data
from the last few years,
analyze it, and tell us roughly how many customers
come in every holiday season
and what the traffic will be. Depending on the requirement, we can either
scale manually or automate scaling up the resources during
that holiday season.
Seamless
omnichannel during disruption.
There are basically four stages during a disruption.
The first one is detection.
AWS monitoring identifies performance degradation across any
channel: web, mobile, or in store.
Basically, there is a service called Amazon CloudWatch, which monitors all the
infrastructure and also the applications.
If it sees any infrastructure or application
issue, it'll immediately alert.
If it is an application issue, it'll alert the application owner.
If it's an infrastructure issue, it'll call Lambda, which will trigger some
automated functions that take care of the issue. For example:
We have a total of four servers in a cluster, and one of the
servers has high CPU and high memory.
To fix that, Lambda will either add a new server, which
will share the traffic and reduce CPU and memory usage,
or, if we cannot do that, it can
spin up a new server and delete the server that has issues.
This way it'll make sure the customer has a seamless transaction.
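The four-server example can be sketched as a small decision function wrapped in a Lambda-style handler. Only the `(event, context)` handler shape comes from AWS Lambda's Python runtime; the event layout, the 85 percent thresholds, the action names, and `plan_remediation` itself are hypothetical, and the sketch only decides what to do without calling any AWS API.

```python
CPU_LIMIT = 85.0     # percent; hypothetical threshold
MEMORY_LIMIT = 85.0  # percent; hypothetical threshold

def plan_remediation(servers, can_add_capacity=True):
    """Decide what to do about overloaded servers.

    servers: list of dicts such as {"id": "web-1", "cpu": 40.0, "memory": 55.0}
    """
    # A server counts as overloaded if either resource is over its limit.
    overloaded = [s["id"] for s in servers
                  if s["cpu"] > CPU_LIMIT or s["memory"] > MEMORY_LIMIT]
    if not overloaded:
        return []
    if can_add_capacity:
        # Option 1: add one server so traffic is shared across more hosts.
        return [{"action": "add_server", "because_of": overloaded}]
    # Option 2: replace each problem server with a fresh one.
    return [{"action": "replace_server", "server": sid} for sid in overloaded]

def lambda_handler(event, context):
    """Entry point in the (event, context) shape used by Python Lambda functions."""
    actions = plan_remediation(event["servers"],
                               event.get("can_add_capacity", True))
    return {"actions": actions}
```

In a real setup the returned actions would be carried out by further automation; keeping the decision separate from the execution makes the logic easy to test.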
The second one is adaptation.
Yeah.
Lambda functions automatically reroute traffic and adjust service levels to
maintain customer experience quality.
So whenever there is an issue, it's automatic: Lambda will always
do the health checks on the servers.
If there is an issue with any server, the traffic will not be
routed to the problematic server, but only to the servers that
are in good health. Communication.
Amazon Connect ensures customer support teams have real-time visibility into
issues and resolution status. Even when an issue resolves
on its own, it'll share the data with the customer on how it fixed the issue
and what happened during that time.
That will help us analyze it and make sure that it does not happen again in the future.
Recovery. Automated scaling and failover processes restore full functionality while
maintaining transaction integrity. Intelligent demand anticipation.
Amazon Forecast transforms historical sales data, external
events, and market signals into actionable capacity planning insights.
By understanding demand patterns before they materialize, retailers can prevent
system overload during peak periods.
So this is an amazing service from Amazon, Forecast, which
not only helps us maintain the infrastructure, but also the application.
It'll also help us review customer satisfaction.
Depending on the challenges the customer is facing, we can
update our application to provide
seamless transactions for the customer.
This predictive approach enables automatic resource scaling ahead of
anticipated traffic spikes, whether from marketing campaigns, seasonal
events, or viral social media moments.
The result is maintaining optimal performance exactly
when customers expect it most.
The main things in demand anticipation include data integration
and pattern recognition.
Pattern recognition identifies demand signals and capacity
requirements using machine learning. Real-world impact metrics.
There are four metrics that determine customer satisfaction and also how we need
to maintain our retail infrastructure.
The first and most critical one is reduced downtime.
Cloud-native architectures with automated failover capability
significantly minimize service interruptions. Distributed systems
and redundant infrastructure
ensure continuous availability during component failures.
The next one is enhanced support quality.
Amazon Connect's intelligent routing and real-time dashboards
empower support teams with context-aware customer interactions, reducing resolution
times and improving satisfaction scores.
Automated incident response workflows eliminate manual intervention delays.
Lambda-powered runbooks execute predetermined recovery actions within seconds of
detection. AWS security services
and compliance frameworks ensure customer data remains protected throughout
the incident response and recovery process.
All four of these are real-world metrics that will help us business-wise and also
help us keep our customer base. The other thing is that security is a very critical
part, and Amazon will take care of it.
We don't need to depend on any additional resources or any additional tools
to make sure our data is secure.
Turning disruptions into opportunities.
Learning from incidents.
Every disruption provides valuable data for enhancing system resilience.
AWS services meticulously capture detailed telemetry during incidents,
enabling in-depth post-event analysis to significantly strengthen future response
capabilities. Through continuous learning,
machine learning models refine their understanding of normal
and abnormal system behavior,
constantly improving detection accuracy and minimizing false positives over time.
I will give one example of what recently happened to our system.
We recently installed a new antivirus, which was consuming
a lot of CPU on our servers.
Because of this monitoring, it sent an alert stating that it was seeing
abnormal CPU spikes because of the
new antivirus.
We immediately identified the issue,
uninstalled that antivirus, and came up with a new antivirus that consumes
less CPU than the previous one.
In this way, we were able to save ourselves and save the systems.
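The "abnormal CPU spike" alert can be illustrated with a simple baseline comparison in Python. This is a basic z-score check written for this example, not the detection logic of CloudWatch or any other AWS service; the function name, the threshold of 3 standard deviations, and the sample readings are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(baseline, sample, threshold=3.0):
    """Flag a reading that sits far outside the normal range of the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is unusual.
        return sample != mu
    # Distance from the mean, measured in standard deviations.
    return abs(sample - mu) / sigma > threshold

if __name__ == "__main__":
    # Hypothetical CPU percentages before the antivirus was installed.
    normal_cpu = [20, 22, 19, 21, 23, 20, 22]
    print(is_anomalous(normal_cpu, 85))  # spike after the install: True
    print(is_anomalous(normal_cpu, 21))  # ordinary reading: False
```

As more history accumulates, the baseline gets better at separating real problems from ordinary variation, which is where the drop in false positives comes from.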
Implementation strategy.
There are four main stages for implementing cloud-based retail resilience.
The first one is the assessment phase.
Evaluate current incident response capabilities and identify
critical customer touch points that require enhanced resilience.
The second one is foundation building.
Establish core AWS services for monitoring, automation, and customer communication,
and implement basic observability across all the key systems.
Intelligence layer.
Integrate predictive services like Forecast and Personalize to enable
proactive response and maintain customer experience during disruptions.
Continuous optimization. Use incident data and customer feedback to refine
automated responses and improve overall system resilience over time.
There are some best practices for any retail team.
I'll give you the details of the four below.
Let's go through them one by one. Design for failure.
Don't assume that your system is robust and can handle any traffic
or any kind of customer flow.
Always prepare for the worst, so build your systems
assuming components will fail. Implement graceful degradation that maintains
core customer functionality, even when non-essential services experience issues.
Automate everything.
Reduce human error and response time by automating incident detection,
escalation, and initial response actions.
Reserve human intervention for complex decision making.
Monitor the customer impact. Focus monitoring on customer-facing metrics, not
just technical performance indicators.
Understanding business impact guides prioritization during incidents.
Practice regularly.
Conduct regular chaos engineering exercises and incident response drills to
validate automated systems and train team members on emergency procedures.
It's nothing other than practicing disaster recovery.
Imagine that some issue happened in one region and
immediately switch to the second
region, to test whether the second region has all the data and is working as expected.
Similar to the ion, you are part two.
Resilience.
Cloud-powered resilience isn't just about preventing downtime.
It's about creating competitive advantage through superior customer experience
during challenging moments. AWS provides the tools and services needed
to transform incident response from reactive firefighting into proactive
customer experience management.
Start your journey by assessing current capabilities, identifying
critical customer touch points,
and implementing foundational monitoring and automation. Build intelligence into
your system gradually, learning from each incident to strengthen future resilience.
Ready to build retail resilience that drives customer satisfaction
and business continuity?
The cloud-native approach to incident response is waiting for your implementation.
Thank you.
Thank you all, and if you have any questions,
please feel free to reach out to me anytime.
Thank you.