Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey.
Hi everybody.
Welcome to Conf42 Machine Learning 2025.
My name is V Kushner and I work as a senior software engineer at
Walmart, where my primary work is in infrastructure engineering.
Today, I am honored to speak on a transformative shift
that's occurring in cloud ecosystems.
That is the journey from reactive to predictive with AI-driven API
resiliency in cloud ecosystems.
Imagine a world where your APIs don't just react
to failures, but predict and prevent them.
That's the power of AI-driven resilience.
APIs are critical business assets.
Any downtime can lead to lost revenue, diminished customer
trust, and operational chaos.
Okay.
Moving from a reactive to a proactive stance is no longer just beneficial,
it's essential.
Now, let's look at the API management landscape in today's microservices-driven world.
The following statistics highlight the immense scale at which these
architectures are being adopted.
By 2032, the API Management market is projected to reach a
staggering 27.3 billion US dollars, highlighting how crucial APIs
have become to business success.
Companies that integrate AI into their API management strategies
experience substantial gains, including 34.7% fewer errors
and 28.5% faster response times. Clearly, AI is becoming
an indispensable partner in ensuring API performance and reliability.
Moving on.
Next, we have the three main pillars of API resilience.
To transition effectively from a reactive to a predictive approach,
these three pillars are critical.
The first is intelligent optimization, which is nothing but dynamically
managing performance using AI: advanced AI algorithms dynamically
tune performance, balance traffic loads, and allocate resources
based on real-time demand patterns.
The second is preemptive security.
Security plays an important role in microservices: proactively detecting
threats before they escalate rather than reacting after an incident.
This involves advanced threat intelligence that uncovers potential
vulnerabilities early, and integrating behavioral analysis with
context-aware authentication mechanisms.
And the third is autonomous healing: automatically predicting, detecting,
and fixing issues before they even escalate.
These are intelligent systems that continuously monitor, predict potential
failure points, and automatically implement corrective measures
before service disruption occurs.
And now we have intelligent optimization in action.
How does it happen?
Imagine an API gateway that can dynamically scale to process
3,500 requests per second, achieving a 67% increase in throughput
with no drop in performance.
And that's the key point: such a huge reduction in
request processing time while handling large loads of requests.
User experience is the most essential part of a successful journey for any product.
We have AI deep learning algorithms that balance load intelligently,
achieving 25% lower latency during high-traffic scenarios and ensuring
a consistent user experience.
And reinforcement-learning-based caching
predicts frequently accessed content, reducing database load by an
impressive 35%; that capacity can be devoted to CDNs or any cache enabled in the workflow.
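To make the caching idea concrete, here is a minimal Python sketch in which a simple frequency counter stands in for the reinforcement-learning policy described above; the class and method names are illustrative, not from any real gateway.

```python
from collections import Counter

class PredictiveCache:
    """Toy sketch of prediction-driven caching: a frequency counter
    stands in for a learned access-prediction policy."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.access_counts = Counter()
        self.cache = {}

    def record_access(self, key):
        # Every request updates the popularity signal the predictor learns from.
        self.access_counts[key] += 1

    def prefetch(self, fetch_fn):
        # Keep only the keys predicted to be requested next
        # (here: the most frequently accessed ones), sparing the database.
        hot_keys = [k for k, _ in self.access_counts.most_common(self.capacity)]
        self.cache = {k: fetch_fn(k) for k in hot_keys}

    def get(self, key, fetch_fn):
        self.record_access(key)
        if key in self.cache:
            return self.cache[key]  # cache hit: no database load
        return fetch_fn(key)        # cache miss: falls through to the DB
```

A real system would replace the counter with a learned model and refresh the prefetched set on a schedule, but the shape of the loop is the same: observe accesses, predict, prefetch.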
And let's move ahead and look at what the preemptive security framework is.
Traditional security models wait for an attack to happen and then
work to contain it.
With AI-driven preemptive security, the narrative changes drastically.
First, continuous monitoring: real-time monitoring and analysis of
1 million security events per second.
So guys, just imagine 1 million security events, and that's a
lot, enabling instantaneous threat identification. AI-powered contextual
authentication cuts unauthorized access attempts by up to 73% through
behavioral and environmental analysis.
And predictive threat detection improves incident recognition speed by
58%, dramatically reducing the likelihood of a successful attack.
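As a rough illustration of contextual authentication, the following sketch combines behavioral and environmental signals into a single risk score; the field names, weights, and thresholds are invented for the example and are not any real product's API.

```python
def risk_score(attempt, profile):
    """Hypothetical contextual-authentication scorer: combines behavioral
    and environmental signals into a 0..1 risk score."""
    score = 0.0
    if attempt["country"] not in profile["usual_countries"]:
        score += 0.4  # unfamiliar location
    if attempt["device_id"] not in profile["known_devices"]:
        score += 0.3  # new device
    start, end = profile["active_hours"]
    if not (start <= attempt["hour"] <= end):
        score += 0.2  # unusual time of day
    if attempt["failed_attempts"] >= 3:
        score += 0.1  # possible brute-force signal
    return min(score, 1.0)

def decide(attempt, profile, step_up_threshold=0.4, block_threshold=0.8):
    # Low risk passes silently; medium risk triggers step-up auth
    # (e.g. an MFA prompt); high risk is blocked outright.
    score = risk_score(attempt, profile)
    if score >= block_threshold:
        return "block"
    if score >= step_up_threshold:
        return "challenge"
    return "allow"
```

The point of the sketch is the decision structure: most legitimate traffic sees no friction, and only contextually suspicious attempts pay an authentication cost.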
And then comes the self-healing architecture.
We don't just anticipate problems here;
we fix them proactively.
We have baseline models
which analyze 50,000 events per second,
learning what normal looks like, where normal is the ideal system behavior.
The early warning system detects issues 30 minutes before conventional tools.
Autonomous recovery capabilities can deploy solutions in milliseconds,
ensuring minimal downtime and seamless user experiences.
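A baseline model of this kind can be sketched as a streaming mean/variance estimate (Welford's online algorithm) with a z-score check; the threshold and warm-up count below are illustrative assumptions, not numbers from the talk.

```python
import math

class BaselineDetector:
    """Streaming baseline learner: maintains a running mean/variance of a
    metric (Welford's algorithm) and flags values that drift far from
    the learned 'normal'."""

    def __init__(self, z_threshold=3.0, warmup=30):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.z_threshold = z_threshold
        self.warmup = warmup

    def observe(self, value):
        # Update running statistics in O(1) per event, so millions of
        # events can be absorbed without storing them.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def is_anomalous(self, value):
        # Early warning: the value sits far outside the learned baseline.
        if self.n < self.warmup:
            return False  # need enough data to trust the baseline
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0:
            return value != self.mean
        return abs(value - self.mean) / std > self.z_threshold
```

Feeding this detector a latency or error-rate stream gives a cheap first-line anomaly signal that degrades gracefully: it stays silent until it has seen enough traffic to know what normal looks like.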
And again, I want to emphasize this point: user experience is very
important for any company's success. And these
are the downtime reduction results:
traditional versus AI-enhanced predictive versus autonomous. This graph
clearly proves the point that moving to autonomous helps
us reduce downtime in any system or application.
So what is autonomous incident resolution?
Advanced algorithms autonomously detect and resolve incidents that
humans might overlook; with machines, they won't be overlooked,
and that's why we train the systems with historical
data to help them understand the system.
AI-based incident detection rapidly identifies subtle anomaly
patterns invisible to human operators.
Instant root cause analysis instantly maps and
analyzes complex system dependencies across distributed environments.
Automated resolution successfully handles 70% of critical incidents, slashing
the mean time to resolution by 85%.
And that's the most important statistic, one that the operational
excellence team is proud of when it's low.
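One way to picture automated resolution is a playbook lookup: detected incident types map to remediation steps that run without human intervention, and anything unmatched escalates to on-call. The incident types and actions below are made-up examples, not the speaker's system.

```python
# Illustrative remediation playbooks: incident type -> ordered steps.
PLAYBOOKS = {
    "memory_leak":      ["restart_service", "verify_health"],
    "connection_spike": ["scale_out", "enable_rate_limit", "verify_health"],
    "bad_deploy":       ["rollback_release", "verify_health"],
}

def resolve(incident_type, execute):
    """Run the matching playbook via the supplied execute callback
    (e.g. a call into an orchestration or runbook API); return the
    steps taken, or None to signal that a human should take over."""
    steps = PLAYBOOKS.get(incident_type)
    if steps is None:
        return None  # unknown pattern: escalate to on-call
    for step in steps:
        execute(step)
    return steps
```

The interesting design point is the None branch: automation handles the known 70%, and anything it cannot confidently classify goes to a person rather than being guessed at.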
So in the AI world, it's all about continuous learning.
What happens in this continuous learning?
The point is that the more data points our systems have, the
better they understand the system.
So the AI systems continuously learn and improve, creating a robust,
ever-evolving defense mechanism.
Historical incident analysis provides actionable insights:
we systematically mine past incidents to extract actionable intelligence
and optimization patterns.
This always helps improve operational efficiency and system reliability.
Predictive models are refined constantly, achieving a 12 to 15%
quarterly improvement in incident response effectiveness.
Enhance the predictive models with real-world data so that they create
more precise response frameworks.
Train the system with previous incidents so that responses are
fine-tuned over time, which aids smooth system running.
Continuous knowledge expansion enhances long-term API resilience,
making your cloud ecosystem smarter and more resilient over time.
Continuously enrich the incident databases and solution repositories to
strengthen future response capabilities.
It's all about learning and how we apply it to future instances;
that's what training AI is all about.
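This continuous-learning loop can be sketched as an incident knowledge base that records resolved incidents and suggests the closest past resolution for new ones; the symptom-overlap matching below is a toy stand-in for a real learned model.

```python
class IncidentKnowledgeBase:
    """Toy continuous-learning loop: every resolved incident enriches the
    history, so suggestions improve as more incidents are recorded."""

    def __init__(self):
        self.history = []  # list of (symptom set, resolution) pairs

    def record(self, symptoms, resolution):
        # Enrich the knowledge base with a resolved incident.
        self.history.append((set(symptoms), resolution))

    def suggest(self, symptoms):
        # Recommend the resolution from the most similar past incident,
        # measured by symptom overlap; None means nothing relevant yet.
        symptoms = set(symptoms)
        best, best_overlap = None, 0
        for past_symptoms, resolution in self.history:
            overlap = len(symptoms & past_symptoms)
            if overlap > best_overlap:
                best, best_overlap = resolution, overlap
        return best
```

A production system would use embeddings or a trained classifier rather than set overlap, but the loop is identical: record, match, respond, and get better with every incident.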
And how do we implement this? What is the implementation roadmap?
Adopting predictive API resilience requires a structured approach;
it cannot happen overnight.
The most important phase is assessment and planning.
So we have to evaluate the existing infrastructure and
identify the weaknesses.
Conduct comprehensive API infrastructure analysis.
Identify critical resilience vulnerabilities and capability gaps.
Establish clear, quantifiable objectives with stakeholder alignment.
And then comes the pilot deployment. As responsible engineers using microservices
in today's world, we should use canary deployment techniques;
there are various techniques, but this is one of the most used
for any deployment.
So deploy AI solutions to non-critical APIs first
to refine the models, then launch AI-powered monitoring solutions on
these selected non-critical APIs.
Establish performance baselines and detection thresholds, and validate
the prediction accuracy against real-world incidents.
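A canary gate of this kind might be sketched as follows: compare the canary's error rate and latency against the stable baseline, and only ramp up traffic while it stays healthy. The thresholds, metric names, and ramp steps are illustrative assumptions.

```python
def canary_healthy(baseline, canary,
                   max_error_increase=0.005, max_latency_ratio=1.2):
    """Hypothetical canary gate: the canary fails if its error rate or
    p95 latency regresses past the baseline by more than the allowance."""
    if canary["error_rate"] > baseline["error_rate"] + max_error_increase:
        return False  # error budget burned: roll back
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_ratio:
        return False  # latency regression: roll back
    return True

def next_traffic_step(current_pct, healthy, steps=(1, 5, 25, 50, 100)):
    # Promote along the ramp only while the canary stays healthy;
    # any failed check drops canary traffic back to zero.
    if not healthy:
        return 0
    for step in steps:
        if step > current_pct:
            return step
    return 100
```

The same gate works for rolling out the AI monitoring itself: start it on non-critical APIs at a small traffic share, and promote only after the baselines confirm it behaves.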
And once these are done, we'll move on to the full integration.
So when I say full integration, we gradually expand AI-driven
resilience across all APIs,
ensuring seamless integration with existing tools.
Expand the implementation across production environments seamlessly,
connect with existing monitoring and management systems, and
develop and execute targeted technical team training programs. And always
make sure to have backup plans ready in case anything needs to be
corrected or rolled back in unforeseen circumstances.
Then comes continuous optimization: enhance
prediction models using real operational data, measure the return
on investment in downtime reduction and improved system reliability,
and progressively expand autonomous response capabilities.
So by now, we should be able to define the next steps in our organization.
And how do we practically begin this journey towards
this predictive API resilience?
Assess your current resilience maturity: understand your baseline,
evaluate your API ecosystem's resilience level, and identify the
critical services requiring the highest availability.
And always start small and scale fast; that's the mantra.
Always remember that even a journey of a thousand miles starts with a single step.
Incrementally integrate predictive capabilities, starting with
analytics and expanding step by step.
Begin with monitoring and analytics.
Add predictive capabilities, incrementally measure improvements systematically.
So build cross-functional teams.
So combine expertise from API developers, machine learning specialists
and site reliability engineers to maximize the effectiveness.
So when we bring in people from different parts of the world,
their different experiences help
bring more knowledge to the team.
So foster continuous knowledge sharing, for the team to learn and grow together.
So that's all.
And to conclude, the shift from reactive troubleshooting to
proactive and predictive management is not only transformative but
essential for sustained success.
Leveraging AI-driven API resilience will future-proof your
organization's operations, significantly enhancing reliability, security, and efficiency.
Thank you all for your attention and I hope everybody enjoyed the session today.
Thanks a lot.