Transcript
This transcript was autogenerated. To make changes, submit a PR.
Alright. Hi, welcome everyone.
Today we are going to talk about next-generation observability, leveraging AI in the data pipeline. I'm Sara, director of Cloud Infrastructure and Observability at Informatica, and I have Katie with me, who is an infrastructure architect at Informatica.
The agenda for today is to establish the problem statement: what is the current problem we have at Informatica, what do our existing telemetry pipeline and our next-generation telemetry pipeline look like, and how did we embed AI in each part of the telemetry pipeline, reducing cost and bringing better MTTR.
So, Informatica infrastructure. We operate on four different cloud providers, the hyperscalers Amazon, Azure, GCP, and Oracle, spread across 20 different regions, with 600-plus Kubernetes clusters and several thousand worker nodes and virtual machines, ingesting about 60 terabytes of logs every day and 3.8 trillion documents every month. And if you look at our storage, it's about 15 petabytes on S3, and on EC2 virtual machines it's about 2.5 petabytes.
So given the scale we operate at, and given that we operate in different clouds and different regions, it brings its own challenges in cost, compliance, and the operational aspect of things, right?
Logs from different regions and different availability zones get consolidated into one centralized message queue, for which we currently use Kafka, then get ingested into the data store and visualized using multiple tools like Kibana, Grafana, and a homegrown analytics tool. With the amount of data we operate on, this in turn increased cost and operational expense and slowed down dashboards.
So this problem made us rethink our existing observability pipeline and come up with a new pipeline, bringing a gen AI and ML flavor into each stage of the pipeline.
So we bring changes right at the source: we consolidated the collection using OpenTelemetry and brought filters and ML models to the source, so we can catch noisy, unwanted data and declutter it at the source itself. That gives us a 20% reduction right at the source, and then we move the data to the centralized message queue with a better-optimized pipeline, producing 50% savings right there. Then it is ingested into the data store. By doing so, we reduce noise right at the source, more to the left.
And using AI and ML models, we are able to apply better control over the data, bringing clean compliance results for us. As data ingestion is reduced at the source, we also reduce storage cost and backup cost, and quicker dashboards are instantaneously available to the end users.
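As a rough illustration of that source-side decluttering (the talk does not show code), here is a minimal Python sketch; the keyword rules and the noise_score placeholder that stands in for the ML model are hypothetical.

```python
# Hypothetical sketch of source-side filtering: drop noisy, low-value log
# records before they ever leave the node. The keyword rules and the
# noise_score() placeholder are illustrative, not the actual model.
import re
from typing import Iterable

NOISY_PATTERNS = re.compile(r"health[ _-]?check|ping|heartbeat", re.IGNORECASE)

def noise_score(record: dict) -> float:
    """Placeholder for an ML model that scores how noisy a record is (0..1)."""
    score = 0.0
    if record.get("level", "INFO") == "DEBUG":
        score += 0.6
    if NOISY_PATTERNS.search(record.get("message", "")):
        score += 0.5
    return min(score, 1.0)

def declutter(records: Iterable[dict], threshold: float = 0.7) -> list[dict]:
    """Keep only records below the noise threshold; the rest are dropped at the source."""
    return [r for r in records if noise_score(r) < threshold]

if __name__ == "__main__":
    batch = [
        {"level": "DEBUG", "message": "health-check ok"},
        {"level": "ERROR", "message": "payment service timeout"},
    ]
    print(declutter(batch))  # only the ERROR record survives
```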
So with that, I would like to have Katie join me to go a little deeper into the architectural pieces of this telemetry pipeline. Katie?
Hey.
Thank you, Sara, for this great overview.
Yeah.
So as Sara mentioned, we have been dealing with a huge infrastructure which ingests a lot of telemetry data, and it's actually massive; we have to operate at massive scale.
So what we did, to get control over this telemetry data, is implement something called a telemetry pipeline. We first started with a centralized telemetry pipeline, because observability is very important to every piece of the application and infrastructure; it is used by application and security teams for every purpose.
And in today's world, where there is a lot of vibe coding and AI is generating your code, it's very important to make sure that your observability is in the right shape, so that in case there is an incident, you have the right set of data to troubleshoot.
Since our system is getting more and more telemetry data, we put this telemetry pipeline in place to make sure that we are ingesting only the actionable data that is required for us and that we need to store for a longer time. By adding a centralized telemetry pipeline, we were able to reduce our logs.
But what we realized is that the majority of the cost is coming from the agent collectors, from where we are collecting the logs and sending them to the centralized system. So we started using an OpenTelemetry-based agent where we can add filters to filter out all the noise at the source, so that we don't have to send it over the network, where we incur a lot of network transfer or NAT gateway cost.
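As one concrete example of the kind of agent-side filter described here, the sketch below generates a hypothetical OpenTelemetry Collector filter-processor config that drops health-check and debug logs before they leave the node; the processor name and OTTL conditions are illustrative, and the exact syntax depends on the collector distribution and version.

```python
# Minimal sketch: emit an OpenTelemetry Collector filter-processor config that
# drops health-check and debug-level logs at the agent, before they incur
# network transfer or NAT gateway cost. Field names follow the contrib
# filterprocessor's OTTL syntax; verify against your collector version.
import yaml  # pip install pyyaml

agent_processors = {
    "processors": {
        "filter/drop-noise": {
            "logs": {
                "log_record": [
                    'IsMatch(body, ".*health[-_ ]?check.*")',
                    "severity_number < SEVERITY_NUMBER_INFO",
                ]
            }
        }
    }
}

print(yaml.safe_dump(agent_processors, sort_keys=False))
```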
And once that data comes to our centralized telemetry pipeline, there we massage the data: how we want to send it and how we want to store it. For example, sometimes people are sending a lot of logs which they are really using as metrics, but logs are quite expensive. So with the telemetry pipeline, we can convert them to metrics and store them in the metrics system instead of storing them in the logging system.
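A minimal sketch of that log-to-metric conversion, assuming simple structured request logs; the field names and metric names are hypothetical.

```python
# Hypothetical sketch: collapse verbose request logs into a request counter and
# an average-latency summary, so only metrics reach the metrics backend.
from collections import Counter, defaultdict

def logs_to_metrics(records: list[dict]) -> dict:
    request_count = Counter()
    latencies = defaultdict(list)
    for r in records:
        request_count[(r["service"], r["status"])] += 1
        latencies[r["service"]].append(r["latency_ms"])
    return {
        "http_requests_total": dict(request_count),
        "http_request_latency_ms_avg": {
            svc: sum(v) / len(v) for svc, v in latencies.items()
        },
    }

sample = [
    {"service": "orders", "status": 200, "latency_ms": 35},
    {"service": "orders", "status": 500, "latency_ms": 120},
]
print(logs_to_metrics(sample))
```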
Similarly, there will be a lot of health checks or similar kinds of things where we probably don't need all the raw data that is coming into our system; we just need a summarized version of the information, which will be useful for us. In that case, we can use the telemetry pipeline to downsample the amount of telemetry that's coming in and write less data to the backend observability system, which is more useful.
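One way to picture the downsampling step is the sketch below; the 1-in-N ratio and the summary shape are assumptions, not the actual pipeline rules.

```python
# Hypothetical downsampling: keep 1 in N raw health-check records and replace
# the rest with a compact summary written to the backend.
from collections import Counter

def downsample_health_checks(records: list[dict], keep_every: int = 100):
    kept, summary = [], Counter()
    for i, r in enumerate(records):
        summary[(r["target"], r["ok"])] += 1
        if i % keep_every == 0:          # keep a small sample of raw records
            kept.append(r)
    summary_records = [
        {"target": t, "ok": ok, "count": c} for (t, ok), c in summary.items()
    ]
    return kept, summary_records

raw = [{"target": "orders", "ok": True}] * 1000
kept, rollup = downsample_health_checks(raw)
print(len(kept), rollup)   # 10 raw samples plus one summary row instead of 1000 records
```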
With this, what we achieved is that we reduced the telemetry noise and got better control over the data. For example, if there are a lot of logs that contain some kind of PII data or something, we have a central place to manage that: basically, add some kind of masking on top of it before we write it to long-term storage.
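For the masking step, a minimal sketch along these lines; the regexes below cover only emails and card-like numbers and are illustrative, not the actual compliance rules.

```python
# Hypothetical central masking step applied before logs go to long-term storage.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def mask_pii(message: str) -> str:
    message = EMAIL.sub("<EMAIL>", message)
    message = CARD.sub("<CARD>", message)
    return message

print(mask_pii("user jane.doe@example.com paid with 4111 1111 1111 1111"))
# -> "user <EMAIL> paid with <CARD>"
```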
So with less data coming into the system, our dashboards are running faster, the cardinality of the telemetry is reduced, and it also speeds up our searches, which gives a good user experience.
This kind of gives us more control over the data, but it also introduced more human effort to maintain it: to create the telemetry pipelines, to maintain these pipelines, and also to keep track of all the different patterns that are coming into the system. So what we did is we leveraged AI to solve that problem.
With AI added to our telemetry pipeline, we implemented an AI agent which takes your natural-language query and knows what the transforms are and what the pipeline query languages are. So it can create new pipelines, it can manage existing pipelines, and, based on the structure of the data or the pattern of the data, it can recommend if there is any scope for optimization so that you can take action. At the same time, we are not completely handing pipeline creation over to AI agents. We are also keeping a human in the loop to review and approve the suggestions given by the AI agent. Once the human reviews and approves, the AI agent goes and deploys it. That way, fewer resources are required to maintain and build these pipelines.
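The review-and-approve flow could look roughly like this sketch, where suggest_pipeline stands in for the LLM call and deploy for the actual rollout; all names and the example transform syntax are hypothetical.

```python
# Hypothetical human-in-the-loop flow: the AI agent drafts a pipeline change
# from a natural-language request, a human approves it, then it is deployed.
def suggest_pipeline(request: str) -> dict:
    """Placeholder for the AI agent that turns natural language into a pipeline config."""
    return {
        "name": "drop-debug-logs",
        "description": request,
        "transform": 'drop where severity == "DEBUG"',
    }

def human_approves(proposal: dict) -> bool:
    answer = input(f"Deploy pipeline {proposal['name']}? [y/N] ")
    return answer.strip().lower() == "y"

def deploy(proposal: dict) -> None:
    print(f"deploying {proposal['name']}: {proposal['transform']}")

proposal = suggest_pipeline("Drop all debug logs from the checkout service")
if human_approves(proposal):          # human in the loop before any change ships
    deploy(proposal)
else:
    print("proposal rejected; nothing deployed")
```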
So with a smaller amount of data, we are able to get better observability systems built on top of it.
So this is how we operate. Basically, we have a single unified OpenTelemetry-based agent which collects all the different telemetry signals, and these agents are managed with fleet management, so we don't have to go to each location for any configuration change. That makes our configuration easy as well. And all our applications are instrumented with APM, and that instrumentation helps us propagate context across all the API calls traveling from service to service. With that, we are able to create these service maps and correlations between services.
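With OpenTelemetry instrumentation, that context propagation looks roughly like the sketch below; the service names and the outgoing HTTP call are illustrative, and the opentelemetry-api and requests packages are assumed to be installed (without a configured SDK the calls are no-ops).

```python
# Sketch of context propagation between services using the OpenTelemetry API:
# service A starts a span and injects the trace context into outgoing headers,
# so service B's spans join the same trace and the backend can draw service maps.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("service-a")

def call_service_b():
    with tracer.start_as_current_span("checkout"):   # span for the A-side work
        headers = {}
        inject(headers)                               # adds the traceparent header
        requests.get("http://service-b.internal/api/orders", headers=headers)
```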
This helps us reduce a lot of toil during our investigation of any issues. Moreover, you don't need an expert to be present on all the investigation calls to go through how a service is behaving, how each service is correlated, or when service A is calling service B or service C.
We have also implemented some ML models to detect anomalies in the system, so that we can alert in case, let's say, the API response time is high, or maybe errors have increased in a particular service.
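A toy version of that kind of anomaly check, using a rolling mean and standard deviation rather than the actual trained models; the window size and threshold are arbitrary illustrative choices.

```python
# Toy anomaly detector: flag an API latency sample that sits far outside the
# recent window. Real deployments would use trained ML models; the window size
# and 3-sigma threshold here are illustrative only.
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    if len(history) < 10:
        return False                       # not enough data to judge
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(latest - mu) > sigmas * sd

latencies = [102, 98, 110, 95, 105, 101, 99, 103, 97, 104]   # ms
print(is_anomalous(latencies, 480))   # True -> raise an alert
print(is_anomalous(latencies, 108))   # False
```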
With all of these things, it required massive effort, I mean human effort, to maintain everything. So that is where we have implemented our AI assistant. This is, again, another AI agent, where you don't have to really understand the different query languages that are required to build your dashboards or to get the data you need whenever you are investigating any incident.
You can just go to the AI assistant and ask for what you need in a simple natural-language query. It will convert that to the query language that the observability system understands, go through your observability system, correlate your traces, metrics, and logs, and provide you basic visualizations, which will help you understand what's going on in the system.
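In spirit, the assistant does something like the sketch below; translate is a placeholder for the LLM step, run_query for the backend call, and the query syntax is made up for illustration.

```python
# Hypothetical AI-assistant flow: natural language -> backend query -> correlate
# the resulting logs and traces by trace_id so one view shows both.
def translate(question: str) -> str:
    """Placeholder for the LLM that emits the observability backend's query language."""
    return 'service:"checkout" AND level:ERROR | last 1h'

def run_query(query: str) -> list[dict]:
    """Placeholder for the backend call; returns log records carrying trace ids."""
    return [{"trace_id": "abc123", "message": "payment timeout"}]

def correlate(logs: list[dict]) -> dict:
    return {log["trace_id"]: log for log in logs}   # join point for traces and metrics

question = "Show me checkout errors from the last hour"
results = correlate(run_query(translate(question)))
print(results)
```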
It can also help you automate some of the runbooks; it can run some of the runbooks to do regular maintenance tasks that would otherwise require human effort.
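The runbook automation can be pictured as a small registry the assistant is allowed to trigger; the runbook names and the confirmation gate below are hypothetical, not the actual implementation.

```python
# Hypothetical runbook registry: the assistant may only trigger vetted,
# pre-approved maintenance tasks, and destructive ones still ask for confirmation.
RUNBOOKS = {
    "rotate-logs": {"destructive": False, "action": lambda: print("log rotation kicked off")},
    "restart-service": {"destructive": True, "action": lambda: print("service restarted")},
}

def run_runbook(name: str, confirmed: bool = False) -> None:
    book = RUNBOOKS.get(name)
    if book is None:
        raise ValueError(f"unknown runbook: {name}")
    if book["destructive"] and not confirmed:
        print(f"{name} needs explicit human confirmation")
        return
    book["action"]()

run_runbook("rotate-logs")                 # routine task, runs immediately
run_runbook("restart-service")             # blocked until a human confirms
run_runbook("restart-service", confirmed=True)
```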
So this is the way we have leveraged AI in our observability: to reduce the human effort that is required to build and maintain these systems, while at the same time making sure that the cost of the observability stack we are running stays within range and the user experience is good. And also during incidents, whenever people need some specific data, they can easily get it instead of depending on others or searching for which dashboard to open or which query to run. So this is also reducing the MTTR of our incidents.
With that, I will conclude here. Thank you everyone for tuning in.
Thanks everyone. Thank you.
Thanks, Katie.