Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm GaN.
I have 15-plus years of experience in designing and implementing
scalable distributed systems in multi-cloud architectures.
Today we are diving into observability at scale: optimizing multi-cloud
data platforms for modern enterprises.
Modern enterprises rely on resilient, observable data platforms across multi-cloud environments.
I am excited to share how AIOps and FinOps are transforming operations,
boosting resilience, and optimizing cost.
Let's get started.
We are in a world where 94 percent of enterprises leverage
cloud services, and 70 percent use multiple clouds to
avoid vendor lock-in.
But here's the catch: achieving 99.999% uptime across these
environments is tough. Fragmented tools,
inconsistent metrics, and siloed teams create blind spots.
Today we will explore how observability addresses these challenges to deliver
reliable, cost-effective platforms.
Multicloud offers incredible opportunities.
You can cherry-pick the best features from AWS, GCP, or Azure, tap into specialized
capabilities, and avoid vendor lock-in.
But the challenges are real.
65 percent of enterprises struggle with fragmented visibility, inconsistent
metrics, and disparate monitoring tools.
This leads to reactive operations and costly inefficiencies.
Observability changes this narrative by providing unified insights.
So what exactly is multi-cloud observability?
It's about unified visibility: monitoring all your cloud
environments simultaneously. It enables proactive detection, catching
issues early for faster resolution.
It provides data correlation, linking metrics, logs, and traces across providers,
platforms, and customer usage, and standardization, normalizing
diverse data into consistent formats.
This is the foundation for operational excellence.
Do we have a single tool we can just buy to have this problem fixed?
No. Observability isn't just a tool; it's a critical engine of value.
It accelerates product delivery and enhances service reliability.
It optimizes resource allocation and utilization across all your platforms,
makes scaling efficient, and provides 360-degree insights from
unified metrics, logs, and traces.
We have seen observability empower teams to diagnose issues quickly,
predict failures, and make data-driven decisions that impact the bottom line.
Okay, so how do you build the foundation?
Step one is to start by creating a unified platform for consistent data collection.
You should define and document all your cloud resources and then
deploy a centralized system to aggregate metrics, logs, and traces.
This strategy ensures you have a single source of truth, making it
easier to monitor and optimize across AWS, GCP, Azure, or any other cloud.
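As a minimal sketch of that centralized aggregation step, the Python below normalizes metric records from different clouds into one consistent schema. The provider field names here are invented for illustration, not any provider's actual export format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Metric:
    """Unified schema every provider's telemetry is normalized into."""
    cloud: str
    service: str
    name: str
    value: float
    timestamp: datetime

def normalize(record: dict) -> Metric:
    """Map provider-specific field names (assumed here) onto the unified schema."""
    if record["source"] == "aws":
        return Metric("aws", record["Namespace"], record["MetricName"],
                      record["Value"], record["Timestamp"])
    if record["source"] == "gcp":
        return Metric("gcp", record["resource_type"], record["metric_type"],
                      record["value"], record["time"])
    if record["source"] == "azure":
        return Metric("azure", record["resourceId"], record["metric"],
                      record["average"], record["timeStamp"])
    raise ValueError(f"unknown source: {record['source']}")

# Example: records from different clouds land in the same single source of truth.
records = [
    {"source": "aws", "Namespace": "rds", "MetricName": "cpu_utilization",
     "Value": 71.0, "Timestamp": datetime.now(timezone.utc)},
    {"source": "gcp", "resource_type": "bigquery", "metric_type": "slot_usage",
     "value": 340.0, "time": datetime.now(timezone.utc)},
]
for m in (normalize(r) for r in records):
    print(m.cloud, m.service, m.name, m.value)
```

Once every record shares one shape, the same dashboards, alerts, and cost joins work regardless of which cloud produced the data.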
AI-driven operations is a game changer.
Our platforms are able to detect incidents 67% faster, reducing mean time to detection.
Advanced ML can pinpoint root causes instantly, which cuts
resolution times by 51 percent.
Anomaly detection prevents 78 percent of
incidents before they happen.
Through efficient scaling mechanisms, this shifts teams from
firefighting to strategic optimization, boosting reliability, and cutting costs.
Let's talk features.
Smart alerting reduces alert fatigue with dynamic thresholds and context-aware
notifications. How do you do it?
For example, say you have a latency-based alert that fires when latency increases by 20%.
There can be a lot of false positives: the system may be scaling
at that moment, the load may be an unusual but expected customer
pattern, or an underlying issue in some subsystem may already be
getting fixed automatically.
When you add all of this context,
and you make sure the alerts only fire when the context is very clear
that something is going beyond what the system can auto-resolve,
the alerts become more meaningful and they drive
the right actions as needed.
That is how we implemented smart alerting, to make sure the
alerts don't hurt our team's efficiency.
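Here is a rough Python sketch of that idea: the latency threshold only raises an alert when none of the known auto-resolving contexts explain the increase. The 20% threshold mirrors the example above; the context flags and numbers are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AlertContext:
    """Signals that often explain a latency spike without human action."""
    autoscaling_in_progress: bool
    known_traffic_spike: bool      # e.g. a tenant's scheduled batch load
    subsystem_self_healing: bool   # a dependent subsystem already recovering

def should_alert(baseline_ms: float, current_ms: float,
                 ctx: AlertContext, threshold_pct: float = 20.0) -> bool:
    """Alert only when latency exceeds the threshold AND no context explains it."""
    increase_pct = (current_ms - baseline_ms) / baseline_ms * 100
    if increase_pct < threshold_pct:
        return False
    # Suppress while the platform is expected to auto-resolve the condition.
    if (ctx.autoscaling_in_progress or ctx.known_traffic_spike
            or ctx.subsystem_self_healing):
        return False
    return True

# During a scale-out the same 35% jump stays silent; afterwards it pages someone.
scaling = AlertContext(autoscaling_in_progress=True, known_traffic_spike=False,
                       subsystem_self_healing=False)
steady = AlertContext(False, False, False)
print(should_alert(200.0, 270.0, scaling))  # False - context explains the spike
print(should_alert(200.0, 270.0, steady))   # True  - needs attention
```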
The next important thing is anomaly detection.
As time goes by and different types of customers and usage patterns come in,
anomaly detection becomes more important because it identifies the unusual
patterns and how they affect the system.
When this anomaly detection runs periodically, you are able to
handle these different types of patterns in a more efficient way,
and it also helps on the cost side.
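As one simple way to run that periodic check, the sketch below flags a usage sample as anomalous when it falls far outside the rolling statistics of recent samples. Real platforms would use richer models; this is only to make the idea concrete, and the window sizes are assumptions.

```python
import statistics
from collections import deque

class RollingAnomalyDetector:
    """Flags values more than `z_limit` standard deviations from the recent mean."""
    def __init__(self, window: int = 48, z_limit: float = 3.0):
        self.history = deque(maxlen=window)   # e.g. 48 five-minute samples = 4 hours
        self.z_limit = z_limit

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 12:           # need some history before judging
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > self.z_limit
        self.history.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for v in (100 + (i % 5) for i in range(40)):  # a regular tenant usage pattern
    detector.observe(v)
print(detector.observe(103))   # False - within the usual pattern
print(detector.observe(480))   # True  - unusual spike worth investigating
```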
The third one is predictive analytics.
With all this unified data and alerting
in a single place, you gain the
ability to forecast resource needs and any
potential failures that could happen.
How do you do it?
When we monitor and aggregate the data every five minutes and generate
alerts based on all this context,
and you collect this data over a period of time, you can
feed it to a predictive model.
It can then predict the next five minutes based on the historical trend.
This way you are able to really understand your various usage patterns,
how the system behaves under them, and what should be done or what will
happen next, so that you can scale the resources or generate an alert.
That way people understand there can be a potential problem
whenever the system cannot fix it automatically.
That is how predictive analytics helps you
catch these types of scenarios.
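As a toy illustration of feeding those five-minute aggregates to a predictive model, the sketch below fits a simple linear trend to recent history and forecasts the next interval, raising a pre-emptive warning when the forecast crosses a capacity limit. Production systems would use proper time-series models; the numbers and the limit here are made up.

```python
def forecast_next(samples: list[float]) -> float:
    """Least-squares linear trend over recent samples, extrapolated one step ahead."""
    n = len(samples)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples)) / \
            sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return intercept + slope * n        # value predicted for the next 5-minute window

# Last hour of five-minute CPU aggregates, trending upward.
cpu_history = [52, 55, 57, 61, 63, 66, 70, 72, 75, 79, 82, 85]
predicted = forecast_next(cpu_history)
CAPACITY_LIMIT = 85.0
print(f"predicted next interval: {predicted:.1f}%")
if predicted > CAPACITY_LIMIT:
    print("pre-emptive action: scale out or alert before the limit is hit")
```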
And finally, it becomes your correlation engine.
In your unified system, you should be able to understand all the different
signals coming in: in a multi-tenant case, the tenant-level
signals, the platform-level signals, each subsystem's
signals, the cloud-level signals, and, going one step further,
the cost-level signals.
Everything together in one engine lets you troubleshoot
efficiently and optimize the subsystems that have the biggest
impact on the final outcomes.
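A minimal Python sketch of such a correlation step: events from different signal layers are grouped by tenant and time window, so one incident can be viewed across platform, subsystem, and cost signals together. The layer names and events are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def window_key(ts: datetime, minutes: int = 5) -> datetime:
    """Bucket a timestamp into its five-minute window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second, microseconds=ts.microsecond)

def correlate(events: list[dict]) -> dict:
    """Group events from every signal layer by (tenant, five-minute window)."""
    grouped = defaultdict(list)
    for e in events:
        grouped[(e["tenant"], window_key(e["ts"]))].append(e)
    return grouped

t = datetime(2024, 1, 10, 14, 2)
events = [
    {"tenant": "acme", "layer": "platform",  "signal": "query_latency_high",  "ts": t},
    {"tenant": "acme", "layer": "subsystem", "signal": "cache_hit_rate_drop", "ts": t + timedelta(minutes=1)},
    {"tenant": "acme", "layer": "cost",      "signal": "compute_spend_spike", "ts": t + timedelta(minutes=2)},
]
for (tenant, window), related in correlate(events).items():
    print(tenant, window.isoformat(),
          [e["layer"] + ":" + e["signal"] for e in related])
```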
One of the biggest wins: with AIOps, our teams are able to
manage 3.4 times more infrastructure without additional staff.
Automation handles the routine tasks, freeing engineers to
focus on the next innovation.
It provides compounding results whether you are a small,
medium, or large organization.
AIOps enables efficient operations at scale.
Now let's integrate FinOps with observability and AIOps.
FinOps gives you comprehensive cost visibility in real time, alongside
the reasons why the cost occurred.
It enables data-driven forecasting for better budgeting.
You are now able to correlate the cost with real usage and see what
type of usage drives the cost.
So now we have a very clear formula to create a budget from the
usage trend and the cost trend,
based on the type of usage you expect from your various customers.
Intelligent rightsizing saves significant costs, and precise cost
allocation drives business accountability,
so you can focus and get the maximum value.
That's why having a clear, unified observability system
incorporating all these data points into one drives multiple benefits,
including on the cost and budgeting side.
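To make the cost-to-usage correlation concrete, here is a small sketch that joins daily spend with a usage driver per tenant and derives a unit cost plus a naive budget forecast. The tenants, figures, and growth factors are invented.

```python
# Daily spend and a usage driver per tenant (illustrative numbers only).
spend   = {"acme": 420.0, "globex": 910.0}        # USD per day
queries = {"acme": 1_400_000, "globex": 2_100_000}

# Unit cost: which type of usage drives how much of the bill.
unit_cost = {t: spend[t] / queries[t] for t in spend}
for tenant, cpq in unit_cost.items():
    print(f"{tenant}: ${cpq * 1000:.3f} per 1k queries")

# Naive next-month budget: current spend scaled by forecast usage growth.
expected_growth = {"acme": 1.15, "globex": 0.95}  # from the usage trend
budget = {t: spend[t] * 30 * expected_growth[t] for t in spend}
print("next-month budget by tenant:", {t: round(v, 2) for t, v in budget.items()})
```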
The results speak for themselves.
We were able to achieve a 28 to 37% reduction in cloud expenditure through strategic
resource placement and rightsizing.
Automated remediation eliminates idle resources,
boosting computing efficiency by 61%.
Underutilized instances drop by 43%, and precise cost attribution
enables more accurate budget forecasting, aligning spending with business goals.
These efforts also translate to tangible business impact.
We accelerated time to market by three times through operational
efficiency, improved service reliability by
41%, and enhanced availability, which has an impact on customer
satisfaction, and we were able to free up more resources to do innovative
things with AI and other technologies.
So that really had a great impact.
We now understand what AIOps and FinOps are, how to put everything together in observability
for a multi-cloud architecture, and how much benefit you can get out of it.
Let's see how to make this work.
Step one is to think observability-first architecture.
When designing any system,
you should first think: how many data points am I going to get? Are they
aligned to my standards? How will this be visualized in my unified system?
How will this tie to the cost, and how does this impact customer
usage and behavior?
So step one is a clear strategy.
You define your goals for observability
and, on top of them, the KPIs for all your systems, and implement comprehensive
instrumentation for telemetry collection.
On top of this, you should have a strategy for applying AI, ML, and automation;
not everything can be done manually.
Intelligent systems automatically extract insights and automate
your alerts and other activities.
Next, add the FinOps cost data; that will let you justify which
type of optimization drives how much cost savings.
It will also help you understand how user behavior
drives the cost across different usage.
This framework embeds observability from the start, creating
sustainable and efficient platforms.
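One way to make "define your goals and KPIs" tangible is to keep them as data the platform evaluates continuously, as in the sketch below. The KPI names and targets are examples, not prescriptions from the talk.

```python
# Observability goals expressed as machine-checkable KPIs (example targets).
KPIS = {
    "mean_time_to_detect_minutes": {"target": 5.0,  "lower_is_better": True},
    "alert_false_positive_rate":   {"target": 0.10, "lower_is_better": True},
    "telemetry_coverage_pct":      {"target": 95.0, "lower_is_better": False},
}

def evaluate(measured: dict) -> list[str]:
    """Return the KPIs that currently miss their targets."""
    misses = []
    for name, spec in KPIS.items():
        value = measured[name]
        ok = (value <= spec["target"] if spec["lower_is_better"]
              else value >= spec["target"])
        if not ok:
            misses.append(f"{name}: {value} vs target {spec['target']}")
    return misses

print(evaluate({"mean_time_to_detect_minutes": 7.2,
                "alert_false_positive_rate": 0.06,
                "telemetry_coverage_pct": 88.0}))
```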
Now let's look at a few best practices which are highly
important and which you cannot skip
when you start on this journey.
The first one is centralized collection of this data,
so that you are able to understand your systems, customers, and
cost in a single source of truth, and implement standardized
tagging across all these clouds.
The important thing is not the number of tags, but the type of value you
add and how you manage those values, so that everybody is able to understand
the different types of applications and the purpose all these systems serve.
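A small sketch of enforcing such a tagging standard: every resource, regardless of cloud, must carry the same few meaningful tags so cost and usage can be attributed consistently. The required tag names below are an example convention, not an established standard.

```python
# Example tag convention shared across clouds (an assumption, not a standard).
REQUIRED_TAGS = {"application", "environment", "cost_center", "owner"}

def missing_tags(resource: dict) -> set[str]:
    """Return which required tags a resource is missing or left empty."""
    tags = resource.get("tags", {})
    return {t for t in REQUIRED_TAGS if not tags.get(t)}

resources = [
    {"cloud": "aws", "id": "i-0abc",
     "tags": {"application": "ingest", "environment": "prod",
              "cost_center": "data-platform", "owner": "team-a"}},
    {"cloud": "azure", "id": "vm-17",
     "tags": {"application": "reporting"}},
]
for r in resources:
    gaps = missing_tags(r)
    if gaps:
        print(f"{r['cloud']}:{r['id']} missing tags: {sorted(gaps)}")
```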
Next, you should adopt observability as code to deploy monitoring alongside infrastructure.
It shouldn't be an independent effort.
You should make sure that when the system is designed, when we implement
the code, and when we deploy the systems, observability is part of that code.
You should create a framework that spans from the time the design starts
to the time it gets deployed, and align with the business on how much value
each piece of work drives.
In concrete financial terms, you should be able to prove it.
These practices will make observability scalable and very
impactful in the organization.
Are you ready to start?
Let me give you a high-level action plan for
how you can approach this.
Of course, every system has some level of observability.
So the first thing is to understand the current state.
Most of your systems are already built and you are adding things
on top, so in step one, make sure you are not
thinking of the ideal state immediately.
Just assess the current state.
Understand how mature this observability is, and
find the critical gaps.
Then document all the types of tools and systems currently
being used and what baseline performance you expect.
Then you find the opportunities for enhancement, once you have identified
the current state and the requirements you expect to achieve.
Now think of the target architecture.
Each application and system is different,
so I'm not going to give you a blueprint to implement.
But the thinking should be based on the different types of cloud services
and applications you deploy and the different workloads you run.
You should have a target architecture defined properly, and then
you select the suitable tools.
The architecture should define how the data is collected
and what aggregation or alerting interval is sufficient for your business.
For example, we have five-minute aggregation of these alerts, and
of course we have real-time alerts for platform failures, but
on top of that, the more important thing in a
multi-cloud, multi-tenant architecture
is that we should be able to aggregate every five minutes to understand
how the systems are working.
All the health check alerts can happen within milliseconds or seconds,
but performance and similar alerts can work at the five-minute interval.
So you can define your goals and target architecture based on that.
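Here is a brief sketch of that interval split: health-check events are evaluated immediately, while performance samples are rolled up into five-minute aggregates before alert evaluation. The event shapes and the 250 ms threshold are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bucket(ts: datetime) -> datetime:
    """Floor a timestamp to its five-minute aggregation window."""
    return ts - timedelta(minutes=ts.minute % 5, seconds=ts.second,
                          microseconds=ts.microsecond)

def process(events: list[dict]) -> None:
    perf_windows = defaultdict(list)
    for e in events:
        if e["type"] == "health_check" and not e["healthy"]:
            # Immediate path: evaluated as soon as the event arrives.
            print("IMMEDIATE alert:", e["service"], e["ts"].isoformat())
        elif e["type"] == "latency_ms":
            perf_windows[(e["service"], bucket(e["ts"]))].append(e["value"])
    # Five-minute path: performance alerts fire on the aggregated window.
    for (service, window), values in perf_windows.items():
        avg = sum(values) / len(values)
        if avg > 250:
            print(f"5-min alert: {service} avg latency {avg:.0f}ms at {window.isoformat()}")

t = datetime(2024, 1, 10, 9, 1)
process([
    {"type": "health_check", "service": "api", "healthy": False, "ts": t},
    {"type": "latency_ms", "service": "query-engine", "value": 240, "ts": t},
    {"type": "latency_ms", "service": "query-engine", "value": 310,
     "ts": t + timedelta(minutes=2)},
])
```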
Then you create comprehensive instrumentation for the
metrics coming into your system.
Set proper governance and compliance standards so this
isn't implemented differently in each place.
Next, focus on the high-impact workloads first.
Because this is a fundamental change to the platform, you should not
try to solve all the problems at once.
Take the highest-impact one, implement it, and show the value:
how much money you're able to save, how efficient the scaling is.
That will build success for the next phase and earn support
in the larger context over the long term.
So take the high-impact workload first, implement it efficiently,
and show the business value out of it, so that you can iterate over it for other services.
That's how it should be approached whenever you go for this.
Yeah, we have come to the end of it.
In closing, I want to say that observability at scale is about transforming
multi-cloud operations from reactive to proactive, from costly to efficient.
By leveraging AIOps and FinOps, you can build resilient, cost-effective data
platforms that drive business success.
So don't treat cost and customer usage as separate concerns;
they shouldn't be considered in isolation.
We should integrate AIOps and FinOps into the observability platform and we should
be able to drive outcomes based on that.
With the cost and FinOps data, the operational side of the platform
benefits, and with the platform data, the cost side and forecasting improve.
We should think about how this can provide maximum business
value, not operate as siloed functions.
Thank you for your time.
I'm looking forward to hearing how you take this journey
forward, and to you sharing your experience with everybody else.
Thank you.