Transcript
This transcript was autogenerated. To make changes, submit a PR.
So guys, let's get started.
As I said before, logs are the lifeblood of our system.
They record everything that happens, whether it's a good thing, a bad thing, routine, or unusual.
And loggers have different levels, like warning, error, and so on.
But when you're running a modern platform, you generate not just hundreds, but
millions of log entries every single day.
So we have to separate the valuable signals, like an emerging outage or a security
incident, from all the background noise.
That's where embedded context agents really come in.
The challenge of logs: think about the scale of the problem.
In a microservices-based architecture, each service produces logs.
Multiply that by dozens of services and multiple environments,
like staging, production, and QA.
We also run in different regions, like Europe, Asia, and the US.
And as I said before, since we use microservices, their number can be
in the thousands at least.
For example, at a big company, as far as I know Netflix has at least 700
services, so just imagine how many Google has.
So most of those logs are boring things.
For example, a connection opened, a user disconnected, a websocket opened,
a user logged in, a cache hit, and so on.
There can be a lot of them.
But hidden in those streams are the needles in the haystack:
the early signs of something going wrong. Maybe requests are timing out
more than usual, maybe login failures are increasing
for a specific country, or some database shard is slower than the rest.
Each one alone is not really important, but when we have a huge amount of
them, it becomes a problem. So recognizing when things are okay,
and finding the real issues in all that noise,
is a real challenge.
And why is it really hard?
Log analysis is hard for three reasons.
As I said before, we have huge volume.
You simply cannot have humans reading through millions of entries.
Even automated dashboards can be overwhelming.
The second is heterogeneity. Just imagine a really common situation: one
team uses a strict structure, JSON with specific keys,
but another just uses plain strings, and that creates a mess.
And the third is the lack of context.
A single error message means nothing without knowing what
else is happening in the system.
A timeout may be fine, maybe it happens frequently and the team knows
what's going on, but when it happens during checkout in your e-commerce system
on Black Friday, it's a huge problem.
So traditional tools don't give us the context.
They just tell us an error occurred, and it's up to us what we do about it.
And by the way, there are some classical approaches that people use.
The first one is threshold-based rules.
For example, when we have more than a hundred errors in a minute, we
send an alert to Slack, or any other messenger, to go check the graphs.
It's simple and easy to implement, but it often creates
alerts when nothing is truly wrong, and sometimes it misses subtle issues.
For example, when a developer is doing something important and has to
switch context to check what's happening, it gets problematic.
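To make this concrete, here's a minimal sketch of such a threshold rule in Python; the Slack webhook URL is a hypothetical placeholder, and the numbers are just the example from above:

```python
# Minimal sketch of a threshold-based alert rule (illustrative only).
import time
from collections import deque

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # hypothetical placeholder
ERROR_THRESHOLD = 100   # more than 100 errors ...
WINDOW_SECONDS = 60     # ... within one minute triggers an alert

error_timestamps = deque()

def on_log(record: dict) -> None:
    """Called for every incoming log record; fires a Slack alert on error bursts."""
    now = time.time()
    if record.get("level") == "ERROR":
        error_timestamps.append(now)
    # Drop errors that fell out of the sliding window.
    while error_timestamps and now - error_timestamps[0] > WINDOW_SECONDS:
        error_timestamps.popleft()
    if len(error_timestamps) > ERROR_THRESHOLD:
        requests.post(SLACK_WEBHOOK, json={
            "text": f"{len(error_timestamps)} errors in the last minute"
        })
        error_timestamps.clear()  # avoid re-alerting on the same burst
```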
The second is statistical methods, using moving averages or
z-scores to detect deviations.
These are useful for metrics like latency or throughput, but they don't
understand logs in their semantic sense.
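A minimal sketch of that statistical idea, using a moving window and a z-score on latency; the threshold and window size are illustrative:

```python
# Flag latency samples whose z-score against a moving window exceeds 3.
import statistics
from collections import deque

window = deque(maxlen=500)  # recent latency samples in milliseconds

def is_latency_anomaly(latency_ms: float) -> bool:
    window.append(latency_ms)
    if len(window) < 30:        # not enough history yet
        return False
    mean = statistics.mean(window)
    stdev = statistics.pstdev(window)
    if stdev == 0:
        return False
    z = (latency_ms - mean) / stdev
    return abs(z) > 3           # classic "3-sigma" rule

# Note: this only sees the metric, not the meaning of the log lines,
# which is exactly the limitation described above.
```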
And of course the third one:
basic ML models like Isolation Forest or clustering. These treat logs as data
points and try to identify outliers.
It's promising, but the models often break down when new types of logs appear,
which is actually when we need them most, because they learn only from the
logs we saw before.
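For illustration, here's a minimal sketch of that third approach with scikit-learn's IsolationForest; the log lines and features are toy examples:

```python
# Treat each log line as a feature vector and let Isolation Forest mark outliers.
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

train_logs = [
    "connection opened", "user logged in", "cache hit",
    "user disconnected", "connection opened", "cache hit",
]
new_logs = ["cache hit", "database shard latency exceeded 5000 ms"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_logs)

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(X_train.toarray())

# -1 marks an outlier; the model only knows the vocabulary it was trained on,
# which is why genuinely new log types tend to confuse it.
print(model.predict(vectorizer.transform(new_logs).toarray()))
```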
So why do these methods fail in practice?
They create too many false positives.
If every little deviation triggers an alert, engineers stop paying attention.
We've all had that situation: the pager goes off, you look,
and it's something harmless.
They lack business context.
As I said before, engineers have to switch context to investigate.
And they have poor generalization: when the system encounters something new,
classical models simply say, "I don't know what that is,"
which means they either ignore it or they overreact.
In short, this approach floods us with noise
while missing the signals we really care about.
So let's move on to embedded context agents.
What is the alternative?
This is where embedded context agents really come in.
Instead of passive rules or static models,
we embed small intelligent components directly into the observability pipeline.
These agents are, first, embedded:
they live inside our system, analyzing logs in real time.
Second, contextual:
they add business and system context to every log.
And third, autonomous: they don't just look at the data, they act, alert,
correlate, even fix small problems.
In this case, we don't need to break the developer's context.
It's like moving from a passive CCTV camera to an active security guard.
And what does an agent do?
Step one, it ingests logs in real time.
Step two, it enriches them with context.
That means attaching metadata, like which service, which deployment,
which user, which region.
Step three, it analyzes these enriched logs using ML models,
embeddings, or pattern recognition.
And step four, it acts: that might mean raising an alert, clustering an anomaly
with previous incidents, or even recommending an automated mitigation.
It's not just about seeing data, it's about understanding it
and taking the right next step.
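Here's a minimal sketch of those four steps as one agent loop; the enrichment values and the anomaly score are placeholders for whatever your platform actually provides:

```python
# Sketch of the ingest -> enrich -> analyze -> act loop of one agent.
from dataclasses import dataclass, field

@dataclass
class LogEvent:
    message: str
    metadata: dict = field(default_factory=dict)

class EmbeddedContextAgent:
    def ingest(self, raw_line: str) -> LogEvent:
        return LogEvent(message=raw_line)

    def enrich(self, event: LogEvent) -> LogEvent:
        # Attach service, deployment, user and region metadata (assumed to be
        # resolvable from your own service registry / request context).
        event.metadata.update({"service": "checkout", "region": "eu-west-1"})
        return event

    def analyze(self, event: LogEvent) -> float:
        # Placeholder anomaly score; in practice an ML model or embedding lookup.
        return 0.95 if "error" in event.message.lower() else 0.05

    def act(self, event: LogEvent, score: float) -> None:
        if score > 0.9:
            print(f"ALERT ({event.metadata}): {event.message}")

    def handle(self, raw_line: str) -> None:
        event = self.enrich(self.ingest(raw_line))
        self.act(event, self.analyze(event))

EmbeddedContextAgent().handle("payment error: upstream timeout")
```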
So to make it more clear, let's use an analogy.
Consider a mall where we have security guards, and cameras that
can observe everything:
people walking around, shops opening, deliveries happening.
That's useful, but the cameras can't interpret anything.
For that, we need someone who watches everything
and says what's happening.
The guard doesn't just watch, they interpret. If someone is walking around
at night in a closed mall, that's suspicious and the guard needs to react.
But the same person walking around on Black Friday during a sale,
nobody will pay attention.
The context is different. Logs without agents are just raw video, but
logs with agents are actionable insights.
So let's talk a little bit about the architecture.
Initially it'll be something simple, but later we will deep dive.
Regarding the system architecture, here's how it looks. Logs are ingested;
for tools, it can be Fluentd, Logstash, or OpenTelemetry.
There are a lot of solutions.
Then they go through contextual enrichment, where agents add metadata like
user ID, region, and deployment details.
The next one is the detection core.
It can be all the things we discussed before, like classical approaches, but
it can also be ML models that analyze the sequence of logs and look for anomalies.
And finally the action layer.
This is where alerts are raised, dashboards are updated, and in some cases an
automated response is triggered.
It's a pipeline: ingest, enrich, analyze, and act.
So in the next slide we'll take a look at
a somewhat simplified architecture; a lot of details are omitted, but
some of them I will go through.
So let's imagine this pipeline.
The architecture on the screen illustrates full end-to-end anomaly detection.
The system processes logs from multiple sources, enriches them with contextual
information, applies machine learning, and finally routes the results to
different destinations for visibility,
action, and automation.
The first one is, as I said before, log sources.
The pipeline begins with diverse sources.
It can be application logs regarding APIs, like errors,
performance, and transactions.
It can be infrastructure logs, like servers, containers,
cloud resources, and so on.
And the third thing is security:
for example, access attempts, audit trails, or
firewall events. Together this represents the whole telemetry of the system.
The second step is a log collector.
It can be, for example, OpenTelemetry or Fluentd.
A log collector normalizes and ships the logs in real time.
Fluentd, for example, is responsible for collecting log files,
parsing them, and applying lightweight transformations.
When we have different formats, it standardizes them; this ensures a
consistent structure and reduces noise before ingestion.
The next step is the message bus.
Remember, we have billions of logs every day and we need to move them
somewhere without breaking the system.
The solution here is a message bus, and in my
opinion the best one is Kafka.
Once we normalize the logs, we push them to Kafka.
Kafka provides durable, scalable, high-throughput
streaming, and decouples the producers (the log collector) from the consumers
(the analysis engine). It enables replay and backpressure handling in
case of downstream slowdowns.
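As a minimal sketch, this is roughly how a collector could push a normalized log record to Kafka, assuming the kafka-python package, a local broker, and a hypothetical topic name:

```python
# Publish normalized logs onto a Kafka topic (broker and topic are placeholders).
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

normalized_log = {
    "service": "orders-api",
    "level": "ERROR",
    "message": "HTTP 500 on POST /orders",
    "region": "eu-central-1",
}

# The collector publishes every normalized record; downstream analysis engines
# consume at their own pace, which is the decoupling described above.
producer.send("logs.normalized", normalized_log)
producer.flush()
```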
And the next step is the embedded context agent, the core of the system,
which orchestrates anomaly detection.
It consists of three main layers.
The first one is the context enrichment engine.
It adds things like metadata, session info, and user device context; this helps
the models understand the story behind raw events.
The second is the ML models: Isolation
Forest detects outliers in high-dimensional data, and an autoencoder
reconstructs normal patterns and flags deviations.
By the way, LLM-based models may be used here as well.
That can be a little bit problematic,
and later in this presentation I will explain why, but they provide
semantic interpretation, anomaly reasoning, and correlation across logs.
The third is the decision layer: it aggregates results from multiple models
and applies rules and logic to classify anomalies, determine the severity
and type, and decide the required response.
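A minimal sketch of that decision layer, aggregating scores from a few models into a severity; the thresholds and model names are illustrative assumptions:

```python
# Combine anomaly scores from several models and map them to a severity.
def decide(scores: dict) -> dict:
    """scores: model name -> anomaly score in [0, 1]."""
    combined = max(scores.values())                     # simple "worst case" aggregation
    agreeing = sum(1 for s in scores.values() if s > 0.8)

    if combined > 0.9 and agreeing >= 2:
        severity = "critical"                           # multiple models agree
    elif combined > 0.7:
        severity = "warning"
    else:
        severity = "info"

    return {"severity": severity, "score": combined, "models": scores}

print(decide({"isolation_forest": 0.93, "autoencoder": 0.88, "llm_review": 0.60}))
```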
Of course, the next step is the backend services, where we need to
orchestrate everything we got from the previous layer.
This component acts as a hub that routes classified
anomalies to different destinations.
It can be, for example, dashboards like Grafana or Kibana.
And when we understand that something is urgent, we can
notify via Slack or other messengers.
We also need to store this data somewhere to use it again, and for this purpose
we have databases like Postgres, Redis, Elasticsearch, and so on.
And we can also react immediately.
For that we can use something like Kubernetes auto-mitigation.
It triggers a workflow, for example scaling pods if an overload is
detected, or blocking an IP if an intrusion is suspected.
This reduces MTTR, mean time to recovery,
by eliminating manual steps.
Other actions can also be implemented in code.
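As a minimal sketch of one such auto-mitigation, here's how scaling a deployment could look with the official Kubernetes Python client; the deployment name, namespace, and replica count are hypothetical, and a real setup would need guard rails:

```python
# Scale a deployment when the agent flags sustained overload (illustrative).
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    config.load_incluster_config()  # or config.load_kube_config() locally
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Triggered by the decision layer, not by a human paging through dashboards.
scale_deployment("checkout", "production", replicas=6)
```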
And the last step is the end users, who get this information and can react.
So this architecture provides a resilient, scalable, and intelligent anomaly
detection pipeline. Log collection and normalization ensure clean data,
Kafka guarantees scalable streaming and replay, the embedded agent engine
combines context, ML, and decisions, and the outputs ensure both human
visibility and automated remediation.
It balances real-time detection with historical analysis and automated action,
making it suitable for modern cloud-native and security-critical environments.
So now let's take a look at the detection core's ML models.
The detection core uses a combination of techniques.
Autoencoders compress and reconstruct log patterns,
flagging anything that doesn't fit.
Isolation Forests are great for identifying outliers.
LogBERT brings transformer-based embeddings, capturing the semantic meaning of logs.
We also have DeepLog, which uses LSTMs to model sequences of log events.
By combining these, agents can handle both structured and
unstructured logs, as well as new patterns we've never seen before.
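For illustration, here's a minimal sketch of the autoencoder idea in PyTorch: train on "normal" log feature vectors, then flag anything with a high reconstruction error; the feature dimensions and data are stand-ins:

```python
# Train an autoencoder on normal log feature vectors; flag high reconstruction error.
import torch
from torch import nn

autoencoder = nn.Sequential(
    nn.Linear(64, 16), nn.ReLU(),   # encoder: compress 64-dim log features
    nn.Linear(16, 64),              # decoder: reconstruct the original vector
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

normal_logs = torch.randn(1000, 64)            # stand-in for real feature vectors
for _ in range(50):                            # train on normal traffic only
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(normal_logs), normal_logs)
    loss.backward()
    optimizer.step()

def reconstruction_error(x: torch.Tensor) -> float:
    with torch.no_grad():
        return loss_fn(autoencoder(x), x).item()

# Anything the model cannot reconstruct well "doesn't fit" and gets flagged.
threshold = 1.5 * reconstruction_error(normal_logs)
print(reconstruction_error(torch.randn(1, 64) * 5) > threshold)
```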
Let's talk about multi-agent collaboration.
Another important point: we don't rely on a single agent to do everything.
Instead, we use specialized agents.
One focuses on network traffic, another on authentication patterns,
another on performance metrics.
Each of them acts as a domain expert, but they can of course
collaborate over a shared bus, usually Kafka or a similar system.
Together they form a system of small, specialized experts.
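A minimal sketch of that collaboration, with an in-memory list standing in for the shared Kafka bus; the agents and thresholds are purely illustrative:

```python
# Specialized agents each inspect only the events in their own domain.
from typing import Optional

class NetworkAgent:
    domain = "network"
    def inspect(self, event: dict) -> Optional[str]:
        if event.get("bytes_out", 0) > 10_000_000:
            return "unusual outbound traffic"
        return None

class AuthAgent:
    domain = "auth"
    def inspect(self, event: dict) -> Optional[str]:
        if event.get("failed_logins", 0) > 50:
            return "spike in failed logins"
        return None

shared_bus = [  # stand-in for a Kafka topic shared by all agents
    {"domain": "auth", "failed_logins": 120, "region": "eu"},
    {"domain": "network", "bytes_out": 1_000},
]

agents = [NetworkAgent(), AuthAgent()]
for event in shared_bus:
    for agent in agents:
        if agent.domain == event["domain"]:
            finding = agent.inspect(event)
            if finding:
                print(f"[{agent.domain}] {finding}: {event}")
```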
For example, let's take a look at different agents.
We have three main scenarios.
The first one: say we get a problem like an HTTP 500 when
placing an order. The agent compares this with embeddings of known error classes.
There's no match.
It's something new, so it raises an anomaly flag.
This gets the attention of DevOps before the error spreads. Without the agent,
this error might just be buried in thousands of lines of logs and go
unnoticed until customers complain.
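One way this comparison against known error classes could be sketched is with sentence embeddings and cosine similarity; the sentence-transformers model and the threshold are assumptions, not what any particular team uses:

```python
# Embed a new error message and compare it with embeddings of known error classes.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

known_errors = [
    "connection timeout to payment gateway",
    "user session expired",
    "cache miss for product catalog",
]
new_log = "HTTP 500 on POST /orders: null pointer in discount engine"

known_emb = model.encode(known_errors, convert_to_tensor=True)
new_emb = model.encode(new_log, convert_to_tensor=True)

best_match = util.cos_sim(new_emb, known_emb).max().item()
if best_match < 0.5:  # illustrative threshold: no known class is close enough
    print(f"anomaly flag: unseen error class (similarity={best_match:.2f})")
```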
The next scenario is suspicious logins: the
authentication agent notices a spike in login attempts from a region where we
don't usually see activity.
We enrich it with metadata.
For example, we see the same IP range, the same timeframe,
and hundreds of failed logins.
It's not just a technical error, it's a potential attack.
And by flagging it early, we can prevent account takeovers or data breaches.
And the last scenario I would consider is the early performance warning.
Sometimes the system at first glance looks fine:
we don't have problems with RAM or CPU, and the latency dashboards are green, but
agents are noticing more and more logs like "slow query detected". Individually,
each one doesn't trigger an alert, but the trend is obvious.
The performance agent raises an anomaly alert, something is
degrading, and engineers can fix it before it becomes an outage.
This is the power of context and sequence-based detection.
So let's take a closer look at the ML models.
DeepLog, from 2017, pioneered sequence-based anomaly detection.
Then there are more recent technologies: LogBERT,
from 2021, applied transformers to logs.
LogAI, from Salesforce, is an open-source library for log
parsing and anomaly detection.
And there is also Drain3, for example, which helps
build streaming log templates.
These are the building blocks, and the agents combine them with
domain-specific context. Now, something I haven't said yet:
where does an LLM like ChatGPT fit in?
It's actually a very good thing, for example for analyzing
patterns in human-friendly language, which makes it a great tool for engineers
in their day-to-day work. But it has some limitations, unfortunately.
For example, it's not real time.
You know how it works:
we make a request and we have to wait until we get the full response.
It doesn't continuously monitor logs.
It only works when you ask.
And it's really hard to scale to millions of log lines per minute.
You would spend a fortune trying to process all logs this way.
And finally, there are issues of privacy and cost.
It costs a lot, many logs contain sensitive data, and
sending everything to large models is not always safe or affordable.
Of course we can use local models, but that still doesn't
resolve everything. Now let's compare ChatGPT, or LLMs in general,
with the ML agents we discussed before.
The reality is the following: LLMs and ML agents serve different purposes.
Agents can be real time, scalable,
domain specific, and cheaper.
ChatGPT is great at explanations, of course, but it's not
suitable for billions of requests.
But the best thing we can use is a hybrid approach; I
think this is the best setup.
Agents continuously scan and detect anomalies in real time.
They raise alerts, feed dashboards, and trigger automations when
something serious is found.
The LLM, on the other hand, helps engineers quickly understand
what's happening, summarizing patterns and suggesting possible root causes.
I guess this approach will give us the best results.
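A minimal sketch of that hybrid setup: the agent does the cheap real-time detection, and only the handful of flagged incidents go to an LLM for a summary; the OpenAI client call and model name are assumptions:

```python
# Summarize only the incidents the agent has already flagged, not the raw stream.
from openai import OpenAI

def summarize_incident(anomalous_logs: list) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, purely illustrative
        messages=[
            {"role": "system",
             "content": "Summarize these anomalous logs and suggest a likely root cause."},
            {"role": "user", "content": "\n".join(anomalous_logs)},
        ],
    )
    return response.choices[0].message.content

# The agent flags maybe a dozen incidents per day, not millions of lines,
# so the LLM cost and latency stay manageable.
flagged = ["orders-api 500: discount engine NPE", "orders-api 500: discount engine NPE"]
print(summarize_incident(flagged))
```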
So let's take a quick look at the benefits of embedded context agents.
What are the benefits of using embedded context agents?
Easy.
They reduce noise by filtering out false positives.
They operate in real time, giving you early detection. They're extensible:
you can add new agents for new domains.
They integrate seamlessly with DevOps workflows and CI/CD pipelines and
bring intelligence closer to the data.
Let's take a look at some challenges.
But let's be honest, it's not easy.
There are challenges like training the models, especially when it comes to
LLM-based ones; it can be expensive.
Context must always be kept up to date:
new deployments, new services, snowflake configs.
Explainability is hard: if an agent says "this is an anomaly",
engineers want to know why.
And agent coordination is tricky:
multiple agents might overlap or even conflict.
These are serious issues that we as a community are still solving.
Now let's take a quick look at the future directions.
First, RAG for logs:
combining retrieval of similar incidents with generative models.
Second, self-healing systems: agents that don't just detect anomalies but also
fix simple issues automatically, as I said before, for example with
Kubernetes, which can kill and restart pods, but it's still nothing huge.
And third, federated anomaly detection: agents across clusters sharing their
knowledge to improve detection globally.
This is where the field is going, and it's a very exciting time to be working on it.
Let's take a quick look at the use cases.
For DevOps, agents are game changing.
Imagine a deployment pipeline as a new version rolls out.
Agents watch the logs in real time.
If unusual errors spike beyond what's expected during the rollout,
they adjust it automatically on the fly or even roll back the deployment.
This reduces downtime and makes releases safer.
And it can be used for security as well.
Agents provide another layer of defense.
They monitor login patterns, user behavior, and error rates.
If they detect an anomaly, say a sudden burst of failed logins or unusual API calls
from a specific region, they raise an alert. By catching
this early, you can prevent incidents before they escalate into breaches.
So let me conclude.
Logs by themselves are just noise.
But with embedded context agents, logs are transformed into
meaningful and actionable signals.
These agents bring context, intelligence, and autonomy into log analysis.
And if we combine them with systems like ChatGPT
or other LLMs, they empower engineers to detect, understand, and respond
to problems faster than ever before.
And most likely, this is the future of platform engineering:
smarter, more proactive, more adaptive systems.
So thank you all for your attention.