Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone and thank you for joining me today.
My name is Manoj Kumar Awa, and I'm really excited to talk about
something that's shaping the future of AI platform engineering.
Now when I say that phrase, AI-driven platform engineering, it
might sound very technical, but to be honest, it comes down to a simple question.
And the question is: how do we make the lives of developers easier
while also making systems more reliable and cost effective?
So in this session, we are going to explore how artificial intelligence
and machine learning are stepping in to help with that challenge and why
it matters, not just for developers, but for the organization as a whole.
So here's what we will be covering today.
We will begin with the evolution of platform engineering:
a little story of how we got here and why the old ways just don't scale anymore.
Then we will dive into the idea of intelligent infrastructure
automation: what it means, what it looks like, and why it's so powerful.
After that, I will share a couple of real-world case studies,
stories of organizations that put this into practice and saw some excellent results.
Then we will talk about implementation strategies, architectural choices, and some of the
challenges you need to watch out for.
And finally, I'll wrap up with the future outlook and some practical next steps
that you can take if you want to explore this in your own context.
So let's get started with the story of how platform engineering has evolved.
So when platform engineering first emerged, the mission was simple:
give developers a standard set of tools so they could build and ship software more
consistently, things like pipelines, development scripts,
and monitoring dashboards.
And in the early days this was a huge win.
Developers had guardrails.
Teams did not need to reinvent the wheel every time
they wanted to deploy something.
But then everything grew.
Infrastructure became massive, and we moved from monoliths to microservices.
We went from managing a few servers to managing thousands of
containers across multiple regions.
Then the cloud came in and suddenly scale was unlimited.
We have all experienced this, but so were the moving parts.
So this created four major problems.
One is mental overload.
Developers now juggle dozens of dashboards,
alerts, and tools.
Every change means extra mental effort.
The second one is maintenance overhead: platform engineers
spend more time fixing and patching the system than actually improving it.
The third is scaling bottlenecks.
Manual work doesn't scale, so if your app traffic grows 10 times, you can't
just hire 10 times more engineers.
And the fourth one is rising cost.
Cloud is flexible, but if unmanaged, the bills grow faster than the business.
You have probably seen this in your own work.
Maybe a developer spends an entire sprint just dealing with broken builds.
Maybe the platform team spends nights firefighting instead of innovating.
This is the reality that teams live in.
And here's a big question.
How do we escape this cycle?
This is where AI opens a door.
Instead of treating platforms as things that need constant human
babysitting, we can start to see them as intelligent ecosystems,
systems that not only automate, but also learn, anticipate, and optimize.
Here's a simple analogy, from my point of view.
Think of it like driving a car. The old way of platform engineering
is like driving a stick-shift car,
where in rush hour you need to change gears constantly while watching
every mirror, and it's very stressful. AI-driven platforms are like
moving to a car with self-driving features like adaptive cruise control.
It manages to avoid hazards and even suggests the best route.
You still decide the destination, but the system takes care of the heavy lifting.
That's the transition we are stepping into.
So what's on the other side of the shift?
What's the promise of AI driven platforms?
There are four big benefits I want to highlight here.
First is less mental overload: developers can hand off
repetitive and routine tasks.
Think about things like selecting which tests to run or
rolling back after a failure.
AI can handle that,
leaving developers free to focus on actual problem solving.
Second, faster delivery.
With automation removing bottlenecks, deployments move
quicker and with fewer errors. Instead of a stressful once-a-week
release, teams can deploy many times a day with more confidence.
Third is self-managing infrastructure.
Imagine a system that notices a service is slowing down, so it adjusts itself
automatically and fixes the problem without anyone needing to log in.
That's no longer science fiction; it's becoming real. And
fourth, a better developer experience.
Developers don't want to fight their tools.
They want platforms that are responsive and even helpful.
AI can create environments that really adapt to the way
individuals and teams work best.
And here's a key point.
This isn't just about efficiency, it's about people.
Happier developers write better code. Teams with less frustration move faster.
And organizations that invest in their developers
gain a real competitive edge.
That's the true promise of AI driven platform engineering.
Now that we have talked about the promise of AI-driven platforms, the natural
question is how does this actually work?
So at a high level, there are five building blocks.
The first one is smarter CI/CD pipelines.
The second one is predictive analysis.
The third is automated remediation, the fourth is dynamic resource optimization,
and the fifth is intelligent observability.
So these are the five pillars.
Think of them as the toolkit that transforms a traditional
platform into one that can learn, adapt, and improve over time.
So let's go deeper into each of these.
So traditional pipelines are rigid, right?
They treat every build and deployment the same way.
Whether the change is small or huge, the pipelines run the
same long process, which wastes time and also creates more problems.
So with AI, pipelines become smarter. They can analyze code changes
and decide which tests actually matter and skip irrelevant ones.
They can also predict deployment risks.
So if something looks wrong after a release, they can roll back automatically
before a customer even notices.
And they can handle dependency checks and security scans without
any extra manual effort.
Think of it like a chef in a kitchen.
A traditional pipeline is like following the same 10 step recipe
every time, no matter what the dish is.
And a smart pipeline is like a chef who knows which steps really
matter for this specific meal,
saving time, reducing mistakes, and delivering food faster and more reliably.
So organizations that adopt this are reporting higher success rates,
fewer failed deployments, and faster cycles.
That's a big win for both developers and customers.
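The test selection idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the talk does not describe a specific mechanism, so a hand-written mapping from source directories to test suites stands in for whatever a learned model would predict. The names TEST_MAP and select_tests, and all the paths, are hypothetical.

```python
# Hypothetical mapping from source directories to the test suites that cover them.
# In an AI-driven pipeline this mapping would be learned from history; here it is fixed.
TEST_MAP = {
    "services/payments": ["tests/payments", "tests/integration/checkout"],
    "services/catalog": ["tests/catalog"],
    "libs/common": ["tests/payments", "tests/catalog", "tests/common"],
}


def select_tests(changed_files, test_map=TEST_MAP, fallback=("tests",)):
    """Return the sorted list of test suites relevant to a set of changed files.

    If any changed file is not covered by the mapping, fall back to the full
    suite so that an unknown change never silently skips tests.
    """
    selected = set()
    for path in changed_files:
        # A directory "owns" a file if the file lives under it.
        owners = [src for src in test_map if path.startswith(src + "/")]
        if not owners:
            return sorted(fallback)
        for src in owners:
            selected.update(test_map[src])
    return sorted(selected)


if __name__ == "__main__":
    print(select_tests(["services/catalog/api.py"]))
    print(select_tests(["README.md"]))
```

The fallback to the full suite is a deliberate design choice in this sketch: skipping tests is only safe when the change is fully understood.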
Now let's talk about predictive analysis and incident resolution.
The traditional way is reactive.
Something breaks, the pager goes off, an engineer rushes in, and
you need to work through the night
to fix it.
We all know how stressful it is.
So AI allows us to flip that into a proactive model. It can spot
unusual patterns in system behavior.
It can perform root cause analysis, sifting through logs and
metrics much faster than a human.
And it can also suggest fixes or even apply them
automatically if they're safe
and routine. A good example is CPU utilization,
where the system automatically detects the problem and increases the CPU
or memory for that specific process.
And it can filter alerts, reducing the noise and highlighting
only what actually matters.
So for engineers, this means fewer false alarms and less wasted time.
And for the business it means less downtime and faster recovery.
To put it in perspective, again, imagine your car.
The traditional approach is driving until the engine light comes
on, then rushing to a mechanic.
Predictive analysis is having sensors that warn you a week
earlier: hey, your oil is running low,
schedule a quick service. That early warning prevents a breakdown.
That's the power AI brings to incident resolution.
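To make "spotting unusual patterns" a little more concrete, here is a minimal sketch that flags metric samples that deviate sharply from their recent history. It assumes a simple trailing-window z-score, which is a basic statistical technique standing in for the machine learning models the talk refers to; the function name, window size, and threshold are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indices of samples that look unusual.

    A sample is flagged when it lies more than `threshold` standard
    deviations away from the mean of the `window` samples before it.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (percent) with one spike.
    cpu = [40, 42, 41, 39, 40, 41, 95, 42]
    print(find_anomalies(cpu))
```

In a real platform the flagged index would feed an alert or a remediation step rather than a print statement.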
Okay.
So one of the biggest hidden challenges in cloud computing is cost.
We often overprovision resources just to be safe, which means we pay for
capacity that we really don't use.
Machine learning helps us tackle this.
How does it do it? It does it by analyzing past usage patterns,
forecasting future demand, automatically adjusting resources
up or down to match the actual needs, and learning from the outcomes
to improve accuracy over time.
The results are really impressive.
Companies have seen 15 to 30% savings in cloud costs while maintaining
or even improving performance.
So let me give you a simple analogy here.
Imagine if your home electricity automatically
adjusted based on your habits.
Lights dim
when you leave a room, the AC adjusts, et cetera.
Appliances run at off-peak hours.
You would save money without even noticing it.
And that's what machine learning powered resource optimization does at
scale for this kind of infrastructure.
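The forecast-then-adjust loop described above could look roughly like this. It is a sketch under assumptions: a plain moving average stands in for a real forecasting model, and the function names, the 20% headroom, and the replica limits are hypothetical values chosen for illustration.

```python
import math


def forecast_demand(history, window=3):
    """Forecast next-period demand as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def recommend_replicas(history, capacity_per_replica, headroom=0.2,
                       min_replicas=2, max_replicas=50):
    """Recommend a replica count for the forecast demand.

    `headroom` adds a safety margin on top of the forecast, and the result
    is clamped so the system never scales below or above fixed limits.
    """
    forecast = forecast_demand(history)
    needed = math.ceil(forecast * (1 + headroom) / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Hypothetical requests per second over the last three periods.
    print(recommend_replicas([800, 900, 1000], capacity_per_replica=100))
```

The learning step the talk mentions would compare each forecast with what actually happened and tune the model; that part is left out here.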
Now let's see this in action with real examples.
So let's assume there is a global financial services
company that was struggling.
Their deployments were slow,
production incidents were frequent, and costs were climbing.
In a business where every second of downtime can affect trust and
revenue, that's a very serious issue.
And here's what they did.
They implemented predictive scaling.
They deployed machine learning anomaly detection, and they built
intelligent deployment pipelines with safe, staged rollouts.
The results speak for themselves.
By implementing these things, their deployment speed increased by 65%,
production incidents dropped by 42%, and infrastructure costs went down by 22%,
even though they were handling more transactions. This shows
that AI isn't just a buzzword.
It can really deliver concrete, measurable improvements in speed, reliability, and
cost, even in high-stakes industries.
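A staged rollout like the one mentioned in this example is usually gated by some check that compares the new version against the old one. The sketch below is not taken from the case study; it is an assumed, simplified gate that compares error rates, and the function name, the 1.5x ratio, the error-rate floor, and the minimum sample size are all hypothetical.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100, floor=0.001):
    """Decide what to do with a canary release: 'promote', 'rollback', or 'wait'.

    The canary is rolled back when its error rate exceeds the baseline error
    rate by more than `max_ratio`. `floor` keeps a near-zero baseline from
    making a single canary error trigger a rollback.
    """
    if canary_requests < min_requests:
        return "wait"  # not enough traffic yet to judge
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, floor)
    return "rollback" if canary_rate > allowed else "promote"


if __name__ == "__main__":
    print(canary_verdict(10, 10000, 1, 1000))
    print(canary_verdict(10, 10000, 5, 1000))
```

A learned risk model could replace the fixed ratio, but the promote, rollback, or wait structure would stay the same.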
I would like to take another example.
Here there is a fast-growing e-commerce platform.
Their main challenge was scaling: they needed to support hundreds of developers,
but they couldn't just keep hiring more people for the platform team.
Their solution was an AI-driven self-service platform with AI-assisted
configuration, so developers could handle
setup themselves; smart observability that not only monitored
but also automatically troubleshot the common issues they saw
on a daily basis;
and machine learning powered scaling to handle huge spikes
during shopping seasons.
And the result: the platform team supported three times more developers
without adding any headcount.
That's the real story of AI.
It scales people's impact.
It lets smaller teams achieve big results without burning out.
So we have talked about all this. Next we will look into
the implementation strategy.
So how do you start bringing AI into your own platform engineering?
Here are some practical strategies.
The first one is start small.
Don't try to automate everything at once.
Pick a high-value, low-risk area like test optimization or auto scaling.
This way, you know what you're getting into, get used
to it, and build trust.
Next is build a strong data foundation.
AI is only as good as the data it sees, so make sure all your logging,
metrics, and observability are solid.
Without data, AI can't make any good decisions.
Next is use feedback loops.
In the beginning, let humans review AI decisions.
For example, if the system recommends scaling down resources,
let an engineer approve it.
As trust builds, you can always increase the automation.
So increase automation gradually.
Think of it like learning to swim.
At first, you keep one foot on the ground.
Then, as you gain more confidence, you go deeper.
Same with AI.
Start with recommendations, then semi-automation, and then
full automation.
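The three stages above (recommendations, semi-automation, full automation) can be expressed as a small approval gate. This is a sketch under assumptions, not a prescribed design: the enum, the decide function, the risk score between 0 and 1, and the 0.3 risk limit are all hypothetical.

```python
from enum import Enum


class AutomationLevel(Enum):
    RECOMMEND = 1   # the system only suggests; humans act
    SEMI_AUTO = 2   # the system acts after an engineer approves
    FULL_AUTO = 3   # the system acts on its own for low-risk changes


def decide(level, risk, approved_by=None, risk_limit=0.3):
    """Return 'recommend_only', 'await_approval', or 'apply' for a proposed action.

    `risk` is an assumed score between 0 and 1. Even at FULL_AUTO, actions
    above `risk_limit` still require a human approval.
    """
    if level is AutomationLevel.RECOMMEND:
        return "recommend_only"
    if level is AutomationLevel.FULL_AUTO and risk <= risk_limit:
        return "apply"
    return "apply" if approved_by else "await_approval"


if __name__ == "__main__":
    # A scale-down recommendation waiting for an engineer, then approved.
    print(decide(AutomationLevel.SEMI_AUTO, risk=0.1))
    print(decide(AutomationLevel.SEMI_AUTO, risk=0.1, approved_by="alice"))
```

Raising the level for one kind of action at a time is one way to increase automation as trust builds.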
And remember, this isn't just technical, it's cultural.
Teams need to see value, not just change.
Share success stories internally, highlight time saved, and celebrate
the wins. That helps people embrace the journey.
So when you design AI-driven platforms,
architecture really matters.
Four principles stand out. First, modularity: build AI as plug-in components.
Don't just rip and replace your whole stack; that will only burn
more of your time and energy.
The second one is explainability.
Make sure people can see why AI made a decision.
Transparency builds trust.
Third, fallback mechanisms: always have manual override options so humans
can step in whenever needed.
Fourth, continuous learning:
let the system improve with every outcome and piece of feedback.
Think of it like hiring a new team member. At first, obviously, you
don't give them full access. They explain their reasoning, you double
check their work, and over time you trust them with more autonomy.
That's how
we should approach AI in platform engineering as well.
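One way to picture these four principles together is a small decision component. This is an illustrative sketch only: the class and function names, the metric key, and the CPU thresholds are assumptions, and the "learning" part is reduced to recording outcomes for later use.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str   # what the platform should do, e.g. "scale_up"
    reason: str   # human-readable explanation (explainability)
    source: str   # "ai" or "manual"


def cpu_policy(metrics):
    """Hypothetical pluggable policy that returns (action, reason)."""
    cpu = metrics["cpu_percent"]
    if cpu > 80:
        return "scale_up", f"CPU at {cpu}% is above the 80% threshold"
    if cpu < 20:
        return "scale_down", f"CPU at {cpu}% is below the 20% threshold"
    return "hold", f"CPU at {cpu}% is within the normal range"


class Advisor:
    def __init__(self, policy):
        self.policy = policy          # modularity: any callable can be plugged in
        self.manual_override = None   # fallback: operator-chosen action, if set
        self.feedback = []            # continuous learning: outcomes kept for later tuning

    def decide(self, metrics):
        # A manual override always wins over the automated policy.
        if self.manual_override is not None:
            return Decision(self.manual_override, "manual override set by an operator", "manual")
        action, reason = self.policy(metrics)
        return Decision(action, reason, "ai")

    def record_outcome(self, decision, succeeded):
        self.feedback.append((decision.action, succeeded))


if __name__ == "__main__":
    advisor = Advisor(cpu_policy)
    print(advisor.decide({"cpu_percent": 92}))
```

Because the policy is just a callable, a trained model can later replace cpu_policy without touching the override or feedback logic.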
Of course, as with any other technology, there are some
challenges. Some of the technical challenges
are data quality issues, integration with older systems, keeping
models up to date, and meeting security and compliance requirements.
And there are also some organizational challenges,
which most organizations do notice: there is
a skill gap in AI and machine learning,
and among existing employees there is often resistance to change,
concerns about losing control, and difficulty measuring
ROI beyond just cost savings.
So how do we actually overcome these problems? With a balanced approach.
Invest in upskilling, like training.
Start with small pilots, communicate clearly about what AI will and
won't do, and always combine automation with human oversight.
So if you address both the technical and cultural sides,
adoption becomes much smoother.
So I just want to recap the key takeaways here.
AI is transforming platform engineering.
It's creating more scalable, resilient, and developer-friendly systems.
Start small and scale gradually: begin with pilots
that solve clear pain points,
then gradually expand. Measure impact on developer experience.
Success isn't just about uptime or dollars saved.
It's about making developers' lives better so they can be more
creative and productive and do the actual innovative work.
That's the lens through which we should measure the progress of AI.
So here are some recommendations.
First, assess your current platform.
Where could AI make a quick, meaningful difference?
Next, evaluate your observability.
Do you have the data foundation that AI needs to learn?
The next one is build a roadmap.
Plan a progressive implementation.
Start with small wins, and invest in people: upskill your platform teams
in AI and ML for infrastructure.
This is very much needed
for continuous improvement.
So if you do these things, you will move from theory to
practice in a safe, sustainable way.
And thank you.
I really enjoyed talking about this, and I hope you implement it in your
environments and see a big difference.
Thank you for joining me today.