Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi.
Thanks for joining Conf42.
My name is Il Kumar Mati, and I work as a product owner at Salesforce.
I did my master's at the National Institute of Technology, Warangal.
Today I'm going to present a topic on AI-driven incident resilience:
scaling subscription systems in the $400 billion economy without losing uptime.
As for today's agenda, we are discussing the following topics:
subscription economy landscape, AI-driven incident detection, architectural
foundations, revenue protection strategies, and implementation roadmap.
When it comes to the subscription economy landscape,
it's a $400 billion opportunity: the market size is 400 billion.
That is the current valuation of the global subscription economy, and it has
been growing at a compound annual growth rate of 18.2% since 2018.
The main focus is on digital services.
The growth is driven by these digital services.
The shift represents a fundamental change in how businesses deliver
value and generate revenue.
At this scale, enterprise reliability is now directly
tied to revenue retention, so revenue retention is important
when considering incident management systems.
The cost of downtime in subscription businesses goes beyond immediate revenue
loss. Apart from the revenue loss, there are other impacts that show the
importance of having an AI-driven system.
The average enterprise-size company loses $300K per hour of downtime.
On top of that, customer acquisition cost is wasted
when new users encounter errors, and churn rates are 15 to 35% higher following
significant service disruptions.
Brand reputation damage requires six to 12 months to fully recover.
So to recover a brand name, with the branding, the marketing, and all, it
takes around a year.
To avoid this, consider the reliability paradox, that is, what
factors affect reliability.
Growth demands innovation.
To get growth, innovation is required.
Subscription businesses must continuously ship new features to remain
competitive and meet evolving customer expectations.
Without new features, you can't grow faster and compete with other companies.
So innovation is a must to grow.
And innovation introduces risk.
When you're innovating something,
risk comes along with the new innovation.
Each deployment, integration, and feature addition is a potential failure point
in increasingly complex systems. The more complex the system, the more
the new features and new integrations increase the failure points.
And reliability requires stability.
Stability is the key to becoming more reliable.
Maintaining near-perfect uptime traditionally meant
slowing release cycles and limiting changes. So to
increase reliability and stability,
we needed slower release cycles and limited changes.
So the tension between innovation and stability created
an impossible choice for engineering leaders.
Now, AI-driven incident resilience has emerged as a solution that enables both
rapid evolution and robust reliability.
So this AI-driven solution helps on both sides:
it gives robust reliability without giving up innovation.
That's what we are focusing on with AI-driven reliability.
Coming to the second topic: AI-driven incident detection, the new paradigm.
We are moving to AI as predictive models, so from reactive to predictive.
As of now, traditional incident management relies on monitoring predefined
thresholds and metrics like log time and call durations, so it detects problems
only after they have impacted users.
We get the metrics or KPIs after the users are impacted.
AI transforms this approach in four ways.
First, it learns normal behavior patterns. It can understand the behavior of
millions of system transactions by analyzing those patterns.
Second, it detects subtle anomalies that human operators would miss.
It can find those subtle anomalies where a human can't.
Third, it predicts potential failures.
AI predicts the failures before they cascade into major incidents.
Fourth, it continuously improves through feedback loops and outcome analysis.
With feedback loops, the AI systems are trained with the feedback responses,
and that gives a better outcome.
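
To make the idea of learning normal behavior concrete, here is a minimal sketch, assuming a single latency metric and a rolling z-score as the anomaly test. The class name `RollingAnomalyDetector`, the window size, and the threshold are illustrative assumptions, not something named in the talk; a production system would likely use a richer model.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flags values that deviate strongly from a recently observed baseline."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.values = deque(maxlen=window)  # recent observations treated as normal
        self.threshold = threshold          # z-score cutoff (hypothetical default)
        self.min_samples = min_samples      # wait for a baseline before judging

    def observe(self, value: float) -> bool:
        """Return True if value looks anomalous against the learned baseline."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                anomalous = value != mu
        # Learn only from values that look normal, so an incident
        # does not become part of the baseline.
        if not anomalous:
            self.values.append(value)
        return anomalous

    def confirm_false_alarm(self, value: float) -> None:
        """Feedback loop: an operator marks a flagged value as normal after all."""
        self.values.append(value)


detector = RollingAnomalyDetector()
latencies_ms = [100, 102, 98, 101, 99, 103, 100, 97, 250, 101]
flags = [detector.observe(v) for v in latencies_ms]
print(flags)  # only the 250 ms spike is flagged
```

The `confirm_false_alarm` method stands in for the feedback loop: human verification feeds back into the baseline so the same pattern is less likely to be flagged again.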
Let me go to the next topic:
architectural foundations of resilient subscription platforms.
It mainly focuses on four components.
The first is distributed processing.
This architecture lays the foundation:
horizontally scalable microservices with intelligent load balancing
and regional failover capabilities.
Intelligent load balancing routes the traffic to the different microservices
in highly scalable environments.
The second is data streaming: real-time event processing with buffering and
replay capabilities to prevent data loss during incidents or outages.
With event processing and a sufficient buffer,
we won't lose data during downtime, and we will still get the required data
even during incidents.
The third is multi-region resilience.
These are active-active deployment models with automatic traffic shifting
during regional incidents. If one region has an incident, the traffic is
routed to another region.
So it provides better regional resilience.
The fourth is stateful recovery.
Stateful recovery is transaction and session management with automated
recovery from partial failures.
These are failures
that can be recovered from without any manual
intervention or human intelligence.
AI can recover from partial failures.
So these four architectural components work together to ensure
that even during incident response, the customer experience remains minimally
impacted, with near-continuous uptime.
Near-continuous uptime means minimizing the downtime.
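
The buffering and replay idea from the data streaming component can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: `ReplayBuffer` and `FlakySink` are hypothetical names, and the sink signals an outage by raising `ConnectionError`. A real platform would typically rely on a durable log or message broker rather than process memory.

```python
from collections import deque
from typing import Callable, Deque, Dict


class ReplayBuffer:
    """Buffers events while the sink is unavailable, then replays them in order."""

    def __init__(self, sink: Callable[[Dict], None], capacity: int = 10_000):
        self.sink = sink
        self.pending: Deque[Dict] = deque()
        self.capacity = capacity

    def publish(self, event: Dict) -> None:
        """Try to deliver; on failure keep the event for later replay."""
        if self.pending:
            # Preserve ordering: never overtake events that are still buffered.
            self._enqueue(event)
            return
        try:
            self.sink(event)
        except ConnectionError:
            self._enqueue(event)

    def _enqueue(self, event: Dict) -> None:
        if len(self.pending) >= self.capacity:
            raise OverflowError("buffer full; apply backpressure or spill to disk")
        self.pending.append(event)

    def replay(self) -> int:
        """Deliver buffered events oldest first; stop at the first failure."""
        delivered = 0
        while self.pending:
            try:
                self.sink(self.pending[0])
            except ConnectionError:
                break
            self.pending.popleft()
            delivered += 1
        return delivered


class FlakySink:
    """Hypothetical downstream consumer that can be switched off to simulate an outage."""

    def __init__(self):
        self.up = True
        self.received = []

    def __call__(self, event: Dict) -> None:
        if not self.up:
            raise ConnectionError("sink unavailable")
        self.received.append(event)


sink = FlakySink()
buffer = ReplayBuffer(sink)
buffer.publish({"id": 1})
sink.up = False                # incident starts
buffer.publish({"id": 2})
buffer.publish({"id": 3})
sink.up = True                 # incident ends
replayed = buffer.replay()
buffer.publish({"id": 4})
print(replayed, [e["id"] for e in sink.received])
```

Events published during the outage are held back and delivered in their original order once the sink recovers, so nothing is lost and nothing is reordered.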
Next, intelligent storage and retrieval systems.
These storage systems minimize latency even during incidents.
Latency is the time a response takes.
So intelligent storage and retrieval systems are critical to maintaining
performance when any incident happens.
First, predictive caching
pre-positions frequently accessed data based on usage patterns.
It predicts what should be cached, so the frequently accessed data is
already in the cache.
Second, intelligent data tiering balances
performance needs with storage cost.
Data tiering decides at which tier we store the data, balancing the
performance when it is accessed against the storage cost.
A faster response requires more expensive storage,
like RAM.
Third, read replicas scale horizontally to handle traffic spikes.
Read replicas are commonly used for data that is read by
millions of users, and they scale horizontally so that traffic spikes are
distributed across multiple systems.
Fourth, graceful degradation prioritizes core functionality
during resource constraints.
The priority goes to the core functionality, so
during resource constraints it helps keep latency low.
And this is the metric: leading platforms achieve 99.99% availability
for read operations, even during significant backend incidents, ensuring
users can still access their content.
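
Graceful degradation, as described above, can be illustrated with a small priority-based selection. This is a sketch only: the `Feature` fields, the feature names, and the capacity units are made-up assumptions used to show the idea of keeping core functionality first.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    name: str
    priority: int  # 1 = core functionality, larger numbers = less critical
    cost: int      # hypothetical capacity units the feature needs


def select_features(features: List[Feature], capacity: int) -> List[str]:
    """Keep features in priority order; shed everything from the first one that does not fit."""
    enabled = []
    remaining = capacity
    for feature in sorted(features, key=lambda f: f.priority):
        if feature.cost > remaining:
            break  # this feature and all less critical ones are disabled
        enabled.append(feature.name)
        remaining -= feature.cost
    return enabled


FEATURES = [
    Feature("recommendations", priority=3, cost=40),
    Feature("playback", priority=1, cost=50),
    Feature("analytics", priority=4, cost=20),
    Feature("search", priority=2, cost=30),
]

print(select_features(FEATURES, capacity=140))  # normal operation: everything on
print(select_features(FEATURES, capacity=100))  # constrained: core features only
```

The strict cutoff is a design choice for predictability: a low-priority feature never stays on while a more important one is off.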
So, beyond detection.
AI does not only detect problems for incident management;
it also helps in building revenue protection.
AI extends beyond incident management to actively protect
revenue streams during outages,
through customer satisfaction and churn prediction.
So let me go to the next topic on this.
Adaptive churn models:
retention during incidents.
Advanced subscription businesses are implementing AI-driven retention
systems that specifically target at-risk customers during and after incident
windows. At-risk customers are those likely to churn because of their experience.
These systems have three parts.
Impact scoring:
real-time assessment of how an incident affects specific user segments
based on their interaction patterns.
Proactive engagement: automated, personalized outreach to
affected users with appropriate context and compensation.
Recovery monitoring:
tracking post-incident usage patterns to identify users showing
disengagement signals.
These three help in retaining
customers during incidents: scoring which segments
of users are impacted, deciding how we engage those users,
and monitoring them after the incident to see how they behave.
So the disengagement signals help in retention during incidents, and
these models help us retain the customer
and improve the user experience and customer satisfaction after incidents.
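
As one way to picture impact scoring, the sketch below combines how many errors a user saw with how much their usage dropped afterward. The field names, the 0.6 and 0.4 weights, and the 0.5 threshold are all hypothetical choices for illustration; the talk does not specify a formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserActivity:
    user_id: str
    errors_seen: int        # failed requests during the incident window
    sessions_before: float  # average daily sessions before the incident
    sessions_after: float   # average daily sessions after the incident


def impact_score(user: UserActivity, max_errors: int = 10) -> float:
    """Score in [0, 1]; the weights are illustrative, not taken from the talk."""
    error_part = min(user.errors_seen, max_errors) / max_errors
    if user.sessions_before > 0:
        drop = (user.sessions_before - user.sessions_after) / user.sessions_before
        drop = max(0.0, drop)  # increased usage counts as no disengagement
    else:
        drop = 0.0
    return round(0.6 * error_part + 0.4 * drop, 3)


def users_to_contact(users: List[UserActivity], threshold: float = 0.5) -> List[str]:
    """Return user ids above the threshold, most impacted first."""
    scored = sorted(((impact_score(u), u.user_id) for u in users), reverse=True)
    return [user_id for score, user_id in scored if score >= threshold]


users = [
    UserActivity("a", errors_seen=8, sessions_before=4.0, sessions_after=1.0),
    UserActivity("b", errors_seen=0, sessions_before=3.0, sessions_after=3.0),
    UserActivity("c", errors_seen=10, sessions_before=2.0, sessions_after=0.0),
]
print(users_to_contact(users))
```

The resulting list is what a proactive engagement step would consume, and re-running the score after recovery is a simple form of recovery monitoring.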
AI-enabled personalization frameworks.
AI uses these frameworks to improve both incident management and the
revenue model. The first part is user behavior analysis.
This is continuous monitoring of interaction patterns, content
preferences, and usage frequency: how frequently they use the same data,
and what their preferences are in using it.
This user behavior analysis is the key part of the framework.
Next is dynamic segmentation.
This is real-time
classification based on behavior.
It classifies users dynamically by behavior, not just by
static demographic attributes like region.
The segments are not static.
They change based on how the user is behaving. Next is personalized experience.
This means tailored interfaces,
content recommendations, and feature highlights.
AI can identify what a good recommendation is based on the user's
behavior.
The last part of the framework is engagement measurement.
Here we take comprehensive metrics capturing
the depth and quality of interaction.
For example, how long a video is and how much of it they watch.
That's an easy example of engagement measurement.
These systems maintain engagement even during incident recovery periods
by routing users to unaffected features and content, minimizing
the impact on their experience.
So even during incident recovery, the AI-enabled system tracks and provides the
unaffected features and content. Companies implementing advanced
personalization see 22 to 28% higher customer lifetime
value and significantly improved
resilience to service disruptions.
So those who take a one-year subscription continue
for the full year instead of
terminating at eight months.
These frameworks help retain customers over their lifetime and
significantly improve resilience, even during disruptions.
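
Routing users to unaffected features during recovery can be as simple as filtering a personalized ranking by feature health. The sketch below assumes the ranking and the set of degraded features are already available from other systems; the function name and feature names are hypothetical.

```python
from typing import List, Set


def recommend(ranked_features: List[str], degraded: Set[str], limit: int = 3) -> List[str]:
    """Return the top personalized features, skipping any that are currently degraded."""
    healthy = [feature for feature in ranked_features if feature not in degraded]
    return healthy[:limit]


# Personalized ranking for one user, best match first (made-up data).
ranking = ["live_stream", "downloads", "playlists", "podcasts", "radio"]

print(recommend(ranking, degraded=set()))
print(recommend(ranking, degraded={"live_stream", "playlists"}))
```

The user's preference order is preserved, so the experience stays personalized while the affected features are hidden until they recover.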
Next, dynamic pricing.
These algorithms help in improving the revenue model during an incident.
This is mainly about balancing value and revenue.
The pricing systems help subscription businesses maximize
both acquisition and retention.
There are four key parts. Value-based pricing aligns
cost with customer benefit.
It is derived from how satisfied the customer is and how useful the content is
for them. With value-based pricing, the price matches the value the company
is providing.
Elasticity modeling predicts conversion rates at various price points.
What is the best conversion rate that is achievable?
The model can predict that. Competitive positioning
dynamically adjusts to market changes.
Market changes are things like customer behavior
shifting toward a different experience.
So these algorithms help position
the company as the market changes. Risk-adjusted
offerings provide appropriate discounts to at-risk segments.
Some segments are at risk, so the system recommends a better offering for them.
These algorithms help in optimizing for those risks.
These systems automatically adjust during incident recovery periods.
So if an incident happens, the AI balances
the value and the revenue.
These four models help in balancing value and revenue during
incident recovery periods, after downtime, feature
unavailability, or system failures,
by providing targeted incentives to users who experienced a service disruption.
So it helps find the right segment of users and provide the right incentive
without broadly discounting for unaffected segments.
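
An elasticity model predicts conversion at each price point, and the pricing system then picks the point that balances conversion against revenue. Here is a minimal sketch, assuming a logistic demand curve with made-up parameters (`midpoint`, `steepness`); in practice the curve would be fitted from observed conversion data.

```python
import math
from typing import Iterable


def conversion_rate(price: float, midpoint: float = 20.0, steepness: float = 0.25) -> float:
    """Hypothetical logistic demand curve: conversion falls as price rises."""
    return 1.0 / (1.0 + math.exp(steepness * (price - midpoint)))


def expected_revenue(price: float) -> float:
    """Expected revenue per visitor at a given price."""
    return price * conversion_rate(price)


def best_price(candidates: Iterable[float]) -> float:
    """Pick the candidate price with the highest expected revenue per visitor."""
    return max(candidates, key=expected_revenue)


price_points = [5, 10, 15, 20, 25, 30]
for p in price_points:
    print(p, round(conversion_rate(p), 3), round(expected_revenue(p), 2))
print("best:", best_price(price_points))
```

Under these assumed parameters the cheapest price converts best but does not earn the most; the maximum expected revenue sits at a middle price point, which is the trade-off between acquisition and revenue described above.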
And what is the journey to optimize here?
That is user journey optimization:
what the user journey looks like during this incident management process.
It starts before the incident.
Pre-incident: standard personalized journeys based on engagement patterns
and conversion optimization.
This is before an incident happens.
We have these engagement patterns showing how the customers engage with the
features we have. Active incident: dynamic rerouting
to functional features, transparent communication, and expectation management.
So during an incident, the AI model helps reroute users to functional
features, decide which feature is most feasible, and transparently
communicate that to the users to manage their expectations.
Initial recovery:
re-enabling of features with prioritized access
for high-value customers.
So when you recover, you decide which set of users gets access first.
Then you can extend access to
lower-value customers. Stabilization:
targeted re-engagement campaigns and satisfaction monitoring
with compensatory offerings.
To stabilize, we need to understand the user behavior and how users are
engaging, and alongside the monitoring, the compensatory offerings
can be new offers or free features.
And post-recovery growth.
Post-recovery growth is about the growth of the company or the product
after the incident. It mainly focuses on accelerated feature
adoption and expansion opportunities with renewed trust emphasis:
try to rebuild the trust, try to provide the right service, try
to improve the offerings.
AI-leading subscription businesses don't just manage incidents.
They transform the entire user experience around service disruptions
to maintain engagement and protect customer lifetime value.
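
The initial recovery phase, with prioritized access for high-value customers, can be sketched as splitting customers into re-enablement waves. The customer names, lifetime values, and wave size below are made-up illustration data, and ordering purely by lifetime value is an assumption.

```python
from typing import List, Tuple


def recovery_waves(customers: List[Tuple[str, float]], wave_size: int) -> List[List[str]]:
    """Split customers into re-enablement waves, highest lifetime value first."""
    ordered = sorted(customers, key=lambda c: c[1], reverse=True)
    names = [name for name, _ in ordered]
    return [names[i:i + wave_size] for i in range(0, len(names), wave_size)]


# (customer, lifetime value) pairs; values are hypothetical.
customers = [
    ("acme", 12_000),
    ("solo", 300),
    ("globex", 45_000),
    ("initech", 5_000),
    ("hobby", 120),
]

for number, wave in enumerate(recovery_waves(customers, wave_size=2), start=1):
    print(f"wave {number}: {wave}")
```

Re-enabling in waves also limits load on a system that has just recovered, since each wave can be gated on health checks before the next one starts.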
And to implement this, what roadmap do we have in place?
We have four main foundational phases.
Foundational data collection:
it focuses on comprehensive telemetry across all the system
components, with unified logging and standardized event formats.
It takes around one to two months. The
next one is baseline modeling.
During this time, we establish normal operating patterns,
identify key performance indicators, meaning which indicators are required for
the specific business, and correlate them with user satisfaction to understand
the user behavior and which patterns and indicators matter.
All of that is modeled during baseline modeling.
This takes around
two to three months.
Next, initial detection models. Here we deploy first-generation
anomaly detection with human verification feedback
loops to minimize false positives. The AI system is trained with
the feedback that comes from human
verification.
So it starts as a first-generation system,
gets trained with human verification feedback, and keeps
getting better as a model.
It takes around three to four months. Then response automation: gradually
implement automated responses
for well-understood incident patterns with clear remediation paths.
So after the initial detection models, we go to response automation.
By then the system is trained, and the automated responses are
implemented gradually, because we captured the human responses in
the previous months.
And now we move to automated responses.
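
The response automation phase can be pictured as a lookup of well-understood patterns with a human fallback. The sketch below is illustrative only: the pattern names, the remediation functions, and the 0.9 confidence threshold are hypothetical, and the remediation steps just return strings instead of touching real infrastructure.

```python
from typing import Callable, Dict, Tuple

Incident = Dict[str, object]


def restart_service(incident: Incident) -> str:
    """Placeholder remediation: a real version would call an orchestration tool."""
    return f"restarted {incident['service']}"


def shift_traffic(incident: Incident) -> str:
    """Placeholder remediation for a regional problem."""
    return f"shifted traffic away from {incident['region']}"


# Only patterns with a clear remediation path get a runbook entry.
RUNBOOKS: Dict[str, Callable[[Incident], str]] = {
    "service_crash": restart_service,
    "regional_outage": shift_traffic,
}


def respond(incident: Incident, min_confidence: float = 0.9) -> Tuple[str, str]:
    """Automate high-confidence known patterns; otherwise escalate to a human."""
    action = RUNBOOKS.get(str(incident["pattern"]))
    if action is None or float(incident["confidence"]) < min_confidence:
        return ("escalated", f"paged on-call for {incident['pattern']}")
    return ("automated", action(incident))


print(respond({"pattern": "service_crash", "service": "billing", "confidence": 0.97}))
print(respond({"pattern": "service_crash", "service": "billing", "confidence": 0.60}))
print(respond({"pattern": "data_corruption", "confidence": 0.99}))
```

Rolling out gradually then means adding runbook entries one pattern at a time and lowering the threshold only as human verification confirms the detections.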
Only the incident patterns with clear remediation paths are automated.
Continuing the implementation roadmap with the advanced
capabilities, there are two phases. The first takes six to 12 months.
It mainly focuses on user impact correlation:
connect system metrics directly to user experience indicators.
So the system follows certain metrics and correlates them
with the user experience. Next is predictive modeling: shift from reactive
to anticipatory incident management.
With predictive modeling, the incident
management system shifts from reactive to predictive. Then retention
integration: link incident data with customer engagement systems.
This helps in retaining
customers.
The second phase is year two and beyond.
This phase includes autonomous recovery: self-healing
systems with minimal human intervention.
The system gains the knowledge by itself, which is why these are called
self-healing systems; they train themselves without human intervention.
The next item is cross-platform intelligence, which is shared
learning across multiple product lines.
Multiple AI models or multiple systems have common data,
and they share that data to train the models. The last is continuous
architecture evolution: systems that adapt based on incident patterns.
The architecture continuously improves its adaptability
based on incident patterns.
Incident patterns cover, for example, what type of incidents occurred in the
first quarter, in which geography, and in which product line.
The incident patterns are not fixed.
The system continuously finds new patterns, and the
architecture continuously evolves.
That is the roadmap.
Next, the guidelines for resource investment.
First is team structure:
hybrid squads combining SREs,
data scientists, and product engineers.
Depending on the project size, they range from five to seven members,
with executive sponsorship.
When it comes to the technology stack, it includes real-time data processing
frameworks, machine learning operations platforms, automated orchestration
tools, and visualization capabilities. We need visualization to see the
performance, the metrics, and the output, along with the machine learning
frameworks and real-time data processing
with mathematical models, which are all covered under the technology stack.
For investment returns,
enterprise implementations show around three to five times return
on investment within 18 months through reduced downtime, improved
retention, and decreased recovery costs.
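
To show how such a return multiple could be computed, here is a small worked calculation. Only the 300K per hour downtime cost comes from the talk; the hours avoided, the other savings, and the investment are hypothetical inputs chosen to keep the arithmetic easy to follow. The multiple here is total benefit divided by investment, which is one common convention.

```python
def return_multiple(downtime_hours_avoided: float,
                    cost_per_hour: float,
                    other_savings: float,
                    investment: float) -> float:
    """Total benefit divided by the amount invested."""
    benefit = downtime_hours_avoided * cost_per_hour + other_savings
    return benefit / investment


multiple = return_multiple(
    downtime_hours_avoided=10,   # hypothetical, over the 18-month window
    cost_per_hour=300_000,       # downtime cost per hour quoted in the talk
    other_savings=1_000_000,     # hypothetical retention and recovery savings
    investment=1_000_000,        # hypothetical program cost
)
print(f"{multiple:.1f}x")
```

With these assumed inputs the benefit is 10 × 300,000 + 1,000,000 = 4,000,000, which against a 1,000,000 investment lands inside the three to five times range mentioned above.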
That is what this AI-driven system is capable of giving: three to five
x ROI. The key takeaways with this AI-driven model: first,
AI transforms incident management from reactive to predictive.
So the model moves from reactive to predictive.
Early detection and automated response
dramatically reduce customer impact.
We are predicting and detecting early and providing an automated
response, which reduces customer impact. Second, architectural resilience
is your revenue protection strategy.
Intelligent systems maintain critical functionality
even during partial failures. Because the intelligent system knows which
functionality is more critical, it keeps running that critical functionality,
and user behavior and preferences determine which functionality is critical.
Third, user experience continuity requires cross-functional integration.
This means connecting technical monitoring with
customer engagement systems
like CRM and back-office systems.
Customers can do some operations by themselves, and
they can monitor by themselves.
By cross-functionally integrating the technical monitoring with the
customer engagement systems,
the user experience improves. Fourth, implementation follows
a clear maturity model.
Start with foundations and progressively build advanced capabilities:
start with the initial detection model and then grow to
advanced capabilities like automated response.
This helps in implementing with a clear maturity model.
Thank you for joining this conference.