Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Anomaly Detection for Platform Logs using Embedded Context Agents


Abstract

This talk will explore how AI agents can be used to analyze logs and trace data, automatically tagging anomalies and alerting engineers only when critical outliers occur—making incident management much more efficient.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
So guys, let's get started. As I said before, logs are the live flow of our system. They record everything that happens, whether it's a good thing or a bad thing, routine or unusual, and they come at different levels like warning, error and so on. But when you are running a modern platform, you generate not just hundreds but millions of log entries every single day. So we have to separate valuable signals, like an emerging outage or a security incident, from all the background noise. That's where embedded context agents really come in.

The challenge of logs. Think about the scale of the problem. In a microservices-based architecture, each service produces logs. Multiply that by dozens of services and multiple environments, like staging, production and QA. We also operate in different countries and regions, like Europe, Asia and the US. And since we use microservices, their number can reach into the thousands. For example, as far as I know, Netflix has at least 700 services; just imagine how many Google has. Most of those logs are boring things: a connection opened, a user disconnected, a web socket opened, a user logged in, a cache hit, and so on. It can be a lot. But hidden in those streams are the needles in the haystack, the early signs of something going wrong. Maybe a request is taking longer than usual, maybe login failures are increasing for a specific country, maybe some database shard is slower than the rest. Individually none of these is very important, but at huge volume it becomes a problem, so we have to recognize what is normal, and finding the real issues in all that noise is a real challenge.

And why is it really hard? Log analysis is hard for three reasons. First, as I said before, the huge volume. You simply cannot have humans reading through millions of entries, and even automated dashboards can be overwhelming. Second, heterogeneity. Just imagine the fragmentation: one team uses a strict structure, JSON with specific keys, while another just emits plain strings, and that creates a mess. And third, the lack of context. A single error message means nothing without knowing what else is happening in the system. A timeout may be fine if it happens frequently and the team knows what is going on, but when it happens during checkout in your e-commerce system on Black Friday, it's a huge problem. Traditional tools don't give us that context. They just tell us an error occurred, and it's up to us to decide what to do.

And we have, by the way, some classical approaches that people use. The first is threshold-based rules. For example, when we have more than a hundred errors in a minute, we send an alert to Slack or any other messenger so someone checks the graphs. It's simple and easy to implement, but it often creates alerts when nothing is truly wrong, and it sometimes misses subtle issues. When a developer is in the middle of something important and has to switch context to check what's happening, it gets problematic. The second is statistical methods: using moving averages or z-scores to detect deviations. These are useful for metrics like latency or throughput, but they don't understand logs in a semantic sense. And of course the third one: basic ML models like isolation forest or clustering, which treat logs as data points and try to identify outliers.
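To make the statistical approach concrete, here is a minimal sketch of a rolling z-score check over per-minute error counts. The window size, threshold and function names are illustrative assumptions, not something specified in the talk.

```python
# Minimal sketch of the "statistical methods" approach: a rolling z-score
# over per-minute error counts. Thresholds and names are illustrative.
from collections import deque
import statistics

WINDOW = 60          # minutes of history to keep
Z_THRESHOLD = 3.0    # how many standard deviations count as anomalous

history = deque(maxlen=WINDOW)

def check_error_count(errors_this_minute: int) -> bool:
    """Return True if the current minute looks anomalous versus the baseline."""
    anomalous = False
    if len(history) >= 10:                          # need some history first
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0   # avoid division by zero
        anomalous = (errors_this_minute - mean) / stdev > Z_THRESHOLD
    history.append(errors_this_minute)
    return anomalous

# Example: a quiet baseline, then a sudden spike in the last minute.
for count in [12, 9, 11, 10, 13, 8, 10, 11, 9, 12, 95]:
    if check_error_count(count):
        print(f"ALERT: {count} errors this minute is well above the baseline")
```

As the talk notes, this kind of rule is cheap to run but has no idea what the errors mean, which is exactly the gap the context agents are meant to fill.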
It's promising, but the models often break down when new types of logs appear, which is exactly when we need them most, because they only learn from the logs we have seen before. So why do these methods fail in practice? They create too many false positives. If every little deviation triggers an alert, engineers stop paying attention. We've all been in that situation: the pager goes off, you look, and it's something harmless. They lack business context, as I said before, and the engineer still has to switch context to investigate. And they have poor generalization: when the system encounters something new, classical models simply say "I don't know what this is", which means they either ignore it or overreact. In short, these approaches flood us with noise while missing the signals we really care about.

So, enter embedded context agents. What is the alternative? This is where embedded context agents really come in. Instead of passive rules or static models, we embed small intelligent components directly into the observability pipeline. These agents are, first, embedded: they live inside our system, analyzing logs in real time. Second, contextual: they add business and system context to every log. And third, autonomous: they don't just look at the data, they act. They alert, correlate, and even fix small problems. In this case, we don't need to interrupt the developer and force a context switch. It's like moving from a passive CCTV camera to an active security guard.

And what does an agent do? Step one, it ingests logs in real time. Step two, it enriches them with context, attaching metadata like which service, which deployment, which user, which region. Step three, it analyzes these enriched logs using ML models, embeddings, or pattern recognition. And step four, it acts. That might mean raising an alert, clustering an anomaly with previous incidents, or even recommending an automated mitigation. It's not just about seeing data, it's about understanding it and taking the right next step.

To make it clearer, let's use an analogy. Consider a mall with security guards. The mall has cameras that observe everything: people walking, shops opening, deliveries happening. That's useful, but cameras cannot interpret. For that, we need someone who checks everything and says what is happening. The guard doesn't just watch, they interpret. If someone is walking around a closed mall at night, that's suspicious and the guard needs to react. But the same person walking around during a Black Friday sale attracts no attention at all. The context is different. Logs without agents are just raw video; logs with agents are actionable insights.

So let's talk a little bit about the architecture. Initially it will be something simple, and later we will go deeper. Regarding the system architecture: logs are ingested using tools like Fluentd, Logstash or OpenTelemetry; there are a lot of solutions. Then they go through contextual enrichment, where agents add metadata like user ID, region and deployment details. The next stage is the detection core. It can use the classical approaches we discussed before, but also ML models that analyze sequences of logs and look for anomalies. And finally the action layer. This is where alerts are raised, dashboards are updated, and in some cases automated responses are triggered. It's a pipeline: ingest, enrich, analyze and act. In the next slide we'll take a closer look.
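Here is a minimal sketch of the enrichment step just described: taking a raw log line and attaching service, deployment and region metadata before it reaches the detection core. The field names, pod names and the hard-coded lookup table are assumptions for illustration; a real pipeline would pull this from a service registry or the Kubernetes API.

```python
# Sketch of contextual enrichment: attach service/deployment/region metadata
# to each raw log record. Names and values here are purely illustrative.
import json
from dataclasses import dataclass

@dataclass
class DeploymentContext:
    service: str
    version: str
    region: str
    environment: str

# Stand-in for a service registry or Kubernetes API lookup.
CONTEXT_BY_POD = {
    "checkout-7f9c": DeploymentContext("checkout", "v2.4.1", "eu-west-1", "production"),
    "auth-5b21":     DeploymentContext("auth", "v1.9.0", "us-east-1", "production"),
}

def enrich(raw_line: str, pod_name: str) -> dict:
    """Parse a raw log line (JSON or plain text) and attach context metadata."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        record = {"message": raw_line}        # tolerate unstructured logs
    ctx = CONTEXT_BY_POD.get(pod_name)
    if ctx is not None:
        record.update(service=ctx.service, version=ctx.version,
                      region=ctx.region, environment=ctx.environment)
    record["pod"] = pod_name
    return record

print(enrich('{"level": "error", "message": "payment timeout"}', "checkout-7f9c"))
```

The point of this step is exactly the "Black Friday checkout" argument above: the same error string means very different things depending on which service, region and deployment it comes from.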
This is a somewhat simplified architecture; a lot of details are omitted, but I will cover some of them. So let's imagine this pipeline. The architecture on the screen illustrates a full end-to-end anomaly detection system: it processes logs from multiple sources, enriches them with additional information, applies machine learning, and finally routes the results to different destinations for visibility and automated action.

The first part is, as I said before, the log sources. The pipeline begins with diverse sources. There are application logs, for example around APIs: errors, performance, transactions. There are infrastructure logs from servers, containers, cloud resources and so on. And the third kind is security logs, for example access attempts, audit trails or firewall events. Together they represent the whole telemetry of the system.

The second step is the log collector, for example OpenTelemetry or Fluentd. A log collector normalizes and ships the logs in real time. Fluentd, for example, is responsible for collecting log files, parsing them and applying lightweight transformations. When we have different formats, it standardizes them; this ensures a consistent structure and reduces noise before ingestion.

The next step is the message bus. Remember, we have billions of logs every day and we need to move them somewhere without breaking the system. The solution here is a message bus, and in my opinion the best one is Kafka. Once we normalize the logs, we push them to Kafka. As a message bus, Kafka provides durable, scalable, high-throughput streaming and decouples the producers (the log collector) from the consumers (the analysis engine). It also enables replay and backpressure handling in case of downstream slowdowns.

The next step is the embedded context agent, which is the core of the system and orchestrates anomaly detection. It consists of three main layers. The first is context enrichment: metadata, session info, user and device context and so on, which helps the models understand the story behind raw events. The second is the ML models: an isolation forest detects outliers in high-dimensional data, and an autoencoder reconstructs normal patterns and flags deviations. By the way, LLM-based models may also be used here. That can be a bit problematic, and later in this presentation I will explain why, but they provide semantic interpretation, anomaly reasoning, and correlation across logs. The third is the decision layer, which aggregates results from multiple models and applies rules or logic to classify anomalies, determining the severity, the type and the required response.

The next step is the backend services, where we orchestrate everything that came out of the previous layer. This component acts as a hub that routes classified anomalies to different destinations. These can be dashboards like Grafana or Kibana. When we understand that something is urgent, we can notify via Slack or other messengers. We also need to keep this data somewhere so we can use it again, and for this purpose we have databases like Postgres, Redis, Elasticsearch and so on. And we can react immediately: for that we can use Kubernetes auto-mitigation. It triggers workflows, for example scaling pods, restarting a component where a fault was detected, or blocking an IP if an intrusion is suspected. This reduces MTTR (mean time to recovery) by eliminating manual steps.
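Below is a minimal sketch of the detection path described above: consuming enriched logs from Kafka, scoring them with an isolation forest, and routing anomalies onward. The topic name, feature choices, tiny training baseline and the kafka-python/scikit-learn stack are assumptions for illustration, not the exact setup from the talk.

```python
# Sketch: consume enriched logs from Kafka, score with an isolation forest,
# and route anomalies. Topic names and features are illustrative.
import json
import numpy as np
from kafka import KafkaConsumer                 # pip install kafka-python
from sklearn.ensemble import IsolationForest    # pip install scikit-learn

def features(record: dict) -> list:
    """Turn an enriched log record into a small numeric feature vector."""
    return [
        record.get("latency_ms", 0.0),
        record.get("status_code", 200),
        1.0 if record.get("level") == "error" else 0.0,
    ]

# Tiny illustrative baseline; in practice train on a large window of normal traffic.
baseline = np.array([
    [120, 200, 0], [95, 200, 0], [150, 200, 0], [110, 201, 0], [130, 200, 0],
])
model = IsolationForest(contamination=0.01, random_state=42).fit(baseline)

consumer = KafkaConsumer(
    "enriched-logs",                            # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    score = model.decision_function([features(record)])[0]
    if score < 0:                               # negative scores look anomalous
        # In the real pipeline this would go to Slack, Grafana annotations,
        # storage, or an auto-mitigation workflow instead of stdout.
        print("ANOMALY:", record.get("service"), record.get("message"), score)
```

This also shows why Kafka sits in the middle: the scoring consumer can fall behind, be restarted, or replay history without ever touching the log collectors.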
Other actions can also be implemented directly in your programming language of choice. And the last step is the end users, who get this information and can react. So this architecture provides a resilient, scalable and intelligent anomaly detection pipeline: log collection and normalization ensure clean data, Kafka guarantees scalable streaming and replay, the embedded agent engine combines context, ML and decisions, and the outputs ensure both human visibility and automated remediation. It balances real-time detection with historical analysis and automated action, making it suitable for modern cloud-native and security-critical environments.

Now let's take a look at the detection core ML models. The detection core uses a combination of techniques. Autoencoders compress and reconstruct log patterns, flagging anything that doesn't fit. Isolation forests are great for identifying outliers. LogBERT brings transformer-based embeddings, capturing the semantic meaning of logs. And DeepLog uses LSTMs to model sequences of log events. By combining these, agents can handle both structured and unstructured logs, as well as new patterns we've never seen before.

Let's talk about multi-agent collaboration, another important point. We don't rely on a single agent to do everything. Instead, we use specialized agents: one focuses on network traffic, another on authentication patterns, another on performance metrics. Each of them acts as a domain expert, but they can of course collaborate over a shared bus, usually Kafka or a similar system. Together they form a system of small, specialized experts.

For example, let's look at different agents in three main scenarios. In the first one, we hit a problem like an HTTP 500 on an order request. The agent compares it with embeddings of known error classes. No match: it's new, so it raises an anomaly flag. This gets the attention of DevOps before the error spreads. Without the agent, this error might just be buried in thousands of lines of logs and go unnoticed until customers complain. The second scenario is suspicious logins: the authentication agent notices a spike in login attempts from a region where we don't usually see activity. We enrich with metadata and see the same IP range, the same timeframe, and hundreds of failed logins. It's not just a technical error, it's a potential attack, and by flagging it early we can prevent account takeovers or data breaches. The last scenario I would mention is an early performance warning. Sometimes the system looks fine at first glance: no problems with RAM or CPU, and the latency dashboards are green. But the agents notice more and more logs like "slow query detected". Individually it doesn't trigger an alert, but the trend is obvious. The performance agent raises an anomaly alert, something is degrading, and engineers can fix it before it becomes an outage. This is the power of context and sequence-based detection.

So let's look a bit more at the ML models. DeepLog, from 2017, pioneered sequence-based anomaly detection. LogBERT, from 2021, applied transformers to logs. LogAI is an open-source library for log parsing and anomaly detection. And there is also Drain3, which helps build streaming log templates. These are the building blocks, and agents
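To illustrate the first scenario, here is a minimal sketch of comparing a new log message against known error classes and flagging it when nothing matches. TF-IDF stands in for the transformer embeddings (LogBERT and similar) mentioned above; the example messages and the threshold are illustrative assumptions.

```python
# Sketch of scenario one: flag a log message whose embedding matches none of
# the known error classes. TF-IDF is a stand-in for learned embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

KNOWN_ERROR_CLASSES = [
    "connection timeout to payment gateway",
    "cache miss for user session",
    "retry limit exceeded for upstream service",
]
SIMILARITY_THRESHOLD = 0.4   # below this, the message looks like a new class

vectorizer = TfidfVectorizer().fit(KNOWN_ERROR_CLASSES)
known_vectors = vectorizer.transform(KNOWN_ERROR_CLASSES)

def is_novel(message: str) -> bool:
    """Return True if the message resembles none of the known error classes."""
    vec = vectorizer.transform([message])
    best_match = cosine_similarity(vec, known_vectors).max()
    return best_match < SIMILARITY_THRESHOLD

new_log = "HTTP 500 while creating order: null pointer in pricing module"
if is_novel(new_log):
    print("ANOMALY FLAG: unseen error class ->", new_log)
```

A production agent would use richer embeddings and keep updating the set of known classes, but the core idea is the same: novelty relative to what the system has already seen, not a fixed rule.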
combine them with domain-specific context. Now, one thing I haven't covered yet: where do LLMs like ChatGPT fit? They are actually very good at, for example, analyzing patterns and explaining them in human-friendly language, which makes them a great tool for engineers' day-to-day work. But they have some limitations, unfortunately. They are not real time: think about how it works, we send a request and then wait for the full response. An LLM doesn't continuously monitor logs, it only works when you ask. It's also really hard to scale to millions of log lines per minute; you would spend a fortune trying to process all logs this way. And finally, there are issues of privacy and cost. It is expensive, and many logs contain sensitive data, so sending everything to large models is not always safe or affordable. Of course we can use local models, but that still doesn't resolve everything.

Now let's compare ChatGPT, or LLMs in general, with the ML agents we discussed before. The reality is the following: LLMs and ML agents serve different purposes. Agents can be real time, scalable, domain specific and cheaper. ChatGPT is great for explanation, of course, but it's still not suitable for billions of requests. The best thing we can do is use a hybrid approach; I think this is the best setup. Agents continuously scan and detect anomalies in real time: they raise alerts, feed dashboards, and trigger automation when something serious is found. The LLM, on the other hand, helps engineers quickly understand what's happening, summarizing patterns and suggesting possible root causes. I think this approach gives us the best results.
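Here is a minimal sketch of that hybrid hand-off: the real-time agent detects and groups anomalous log lines, and only then is an LLM asked to summarize them for the on-call engineer. The ask_llm helper is hypothetical and would be wired to whatever hosted or local model you trust with your data; the prompt format and function names are illustrative, not part of the talk.

```python
# Sketch of the hybrid approach: agents detect in real time, the LLM is
# called on demand (per alert, not per log line) to explain the anomaly.
def ask_llm(prompt: str) -> str:
    # Hypothetical helper: plug in your LLM client of choice (hosted or local).
    raise NotImplementedError("wire this to your LLM client")

def summarize_anomaly_cluster(service: str, sample_lines: list) -> str:
    """Build a bounded prompt from flagged log lines and ask the LLM for a summary."""
    prompt = (
        f"The anomaly detection agent flagged unusual logs for service '{service}'.\n"
        "Summarize what seems to be happening and suggest likely root causes.\n\n"
        + "\n".join(sample_lines[:20])     # cap how much context leaves the system
    )
    return ask_llm(prompt)

# Called only when an agent raises an alert, so LLM cost and data exposure stay bounded:
# summary = summarize_anomaly_cluster("checkout", flagged_lines)
```

Capping the sample and calling the model only per alert is what keeps the cost and privacy concerns mentioned above under control.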
So what are the benefits of embedded context agents? They reduce noise by filtering out false positives. They operate in real time, giving you early detection. They are extensible: you can add new agents for new domains. They integrate seamlessly with DevOps workflows and CI/CD pipelines, and they bring intelligence closer to the data.

But let's be honest, it's not easy; there are challenges. Training models, especially LLM-based ones, can be expensive. Context must always be kept up to date: new deployments, new services, snowflake configs. Explainability is hard: if an agent says this is an anomaly, engineers want to know why. And agent coordination is tricky: multiple agents might overlap or even conflict. These are serious problems we are still solving as a community.

Let's take a quick look at future directions. First, RAG for logs: combining retrieval of similar incidents with generative models. Second, self-healing systems: agents that don't just detect anomalies but also fix simple issues automatically, as I said before, for example with Kubernetes killing and restarting pods, even if that is still nothing huge. And third, federated anomaly detection: agents across clusters sharing their knowledge to improve detection globally. This is where the field is going, and it's a very exciting time to be working on it.

Let's look at the use cases. For DevOps, agents are game-changing. Imagine a deployment pipeline as a new version rolls out. Agents watch the logs in real time. If unusual errors spike beyond what is expected during the rollout, they flag it automatically or even roll back the deployment. This reduces downtime and makes releases safer. They can also be used for security. Agents provide another layer of defense: they monitor login patterns, user behavior and errors. If they detect an anomaly, say a sudden burst of failed logins or unusual API calls from a specific region, they raise an alert. By catching this early, you can prevent incidents before they escalate into breaches.

So let me conclude. Logs by themselves are just noise, but with embedded context agents, logs are transformed into meaningful and actionable signals. These agents bring context, intelligence and autonomy into log analysis. And if we combine them with systems like ChatGPT or other LLMs, they empower engineers to detect, understand and respond to problems faster than ever before. Most likely, this is the future of platform engineering: smarter, more proactive, more adaptive systems. So thank you all for your attention.
...

Konstantin Berezin

Back End Developer @ Rapyd



