Conf42 Observability 2025 - Online

- premiere 5PM GMT

Scaling Observability for Real-Time Personalization Engines


Abstract

Is your personalization engine working for you—or hiding its flaws? Dive into the high-stakes world of real-time AI at scale. Learn how we made the invisible visible with next-gen observability—tracing decisions, exposing failures, and unlocking true performance.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. This is Sai Kumar Bitra, and I'm working as a principal software engineer at AT&T. Today I'm going to talk about scaling observability for real-time personalization engines. So let's get into the talk.

First, why does observability matter for personalization? Personalization drives engagement, and engagement drives revenue. For example, companies that excel at personalization see up to 40% more revenue, because users respond to highly relevant content, so we have to be good at observability to reach that personalization-driven revenue growth. And for real-time experiences there is zero room for error: users expect responses within milliseconds, and there shouldn't be any deviation in the content we present to the customer, so delays or failures directly hurt satisfaction and conversions. Apart from that, complex ML decisions need validation. Whenever we present recommendations to customers, we have to ask: are we presenting the right content at the right time? Are we running into issues while serving it? A lack of insight essentially means lost user trust and missed opportunity, so we have to be careful about the recommendation engine inside a personalization product. There is also a competitive edge: collecting these metrics, logs, and traces helps us continuously improve the algorithms and the user experience, and keeps our personalization engine a step ahead of competitors building similar products. So observability is very important and plays a key role in personalization.

At the same time, there are challenges in getting to this observability for real-time machine learning systems. First, personalization systems are complex distributed pipelines where many components are involved: services, data pipelines, ML models, and so on. We have to keep good track of all the logs, traces, and metrics across them, otherwise we get lost, so end-to-end tracking is essential. The second complexity is high data volume: millions of events can arrive in real time as users interact with the website, and we have to capture this telemetry without adding latency, which is the difficult part. Third, ML decisions are a black box. When there is a drop in the lift from the recommendations we are suggesting, it is hard to tell whether the model is driving the recommendations properly, whether our rules are executing correctly, whether we lost something along the way, or whether the source systems are capturing the user-behavior data we need. Tracking how and why ML models fail is hard. Fourth, personalization architectures span multi-channel user journeys: a user can come to the website, then go to the store, or browse through the mobile app.
All of these channels will be there, and we have to track the customer across them so we can stitch the user journey together properly and troubleshoot if anything goes wrong; observability poses a real challenge for multi-channel user journeys. The fifth challenge is latency sensitivity. We keep saying "add observability and capture these metrics," but that is not easy to do without heavy overhead from telemetry collection. We all know it is important to collect the data, but at what cost? We have to see how we keep the real-time engine lightweight, so latency sensitivity is going to be a challenge.

Next is a system architecture overview. This is a typical architecture for a real-time personalization engine: the request comes into the API layer and goes to the decision engine, where the recommendation takes place. From there it interacts with the model service, where the model is served and the rules are executed, and the model service interacts with the data repository, where the user profiles and content recommendations live. Meanwhile, event ingestion is happening: user-interface event streams flow in, for example through Kafka, and feed the model. From all of these components and their interactions, the metrics, logs, and traces are aggregated and ingested into a centralized observability pipeline. That is how observability comes into the picture for personalization engines: the ML services, ingestion pipelines, decision engine, API layer, and data layer all act as telemetry sources feeding one centralized pipeline.

Now let's talk about scaling observability in a real-time architecture. These are genuinely high-scale architectures with real-time event consumption and personalization, so end-to-end tracing is very important: a user comes in, the API gets hit, the request interacts with the model interface and the data repository to fetch data and execute the rules, and enabling end-to-end visibility of each personalization response is critical. The second thing is unified instrumentation: adopt the OpenTelemetry SDK across services for standard metrics, logs, and trace collection, so applications instrument code once and export telemetry to multiple backends, instead of doing it in bits and pieces. The third thing is high-throughput metrics. As we said, millions of events can occur in production when external users access the application, so use a time-series database like Prometheus, and aggregate to avoid overwhelming storage, for example by computing percentile latencies and sampling fine-grained events.
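To make the end-to-end tracing and unified instrumentation points concrete before moving on, here is a minimal sketch (not code from our actual system) of what tracing a recommendation request with the OpenTelemetry Python SDK could look like. The service name, span names, attribute keys, and the placeholder model/profile values are illustrative assumptions, and the console exporter simply stands in for whatever backend you export to.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One tracer provider per service; ConsoleSpanExporter stands in for an OTLP
# exporter pointed at Jaeger / Tempo / a collector in a real deployment.
provider = TracerProvider(resource=Resource.create({"service.name": "decision-engine"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("personalization.decision-engine")

def recommend(user_id: str) -> list[str]:
    # Parent span covers the whole personalization decision for this request.
    with tracer.start_as_current_span("recommendation.request") as span:
        span.set_attribute("user.id", user_id)  # hypothetical attribute name
        with tracer.start_as_current_span("model.score"):
            scores = {"item-1": 0.92, "item-2": 0.35}  # placeholder for the model call
        with tracer.start_as_current_span("feature_store.lookup"):
            segment = "returning-shopper"              # placeholder for the profile lookup
        span.set_attribute("user.segment", segment)
        return [item for item, s in scores.items() if s > 0.5]

print(recommend("user-123"))
```

Each nested span shows up as one segment of the end-to-end trace, so the API, model, and data-store hops can be separated later when we look at bottlenecks.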
So those high-throughput metrics are very important to collect. Next is log aggregation: stream structured JSON logs, including the model decisions, into a central system like ELK. Then whenever we hit issues, troubleshooting through that kind of centralized logging becomes easy, because we can track everything by user session and trace ID. The last thing here is scalable storage and retention: employ a backend that can handle high data volumes, such as a scalable observability platform or a data lake, for long-term analysis. We need sensible retention policies, because sometimes we want to go back in time and review what the trend was and what an issue was about, and historical trends should be analyzable without infinite storage growth. So we have to be careful about storage scalability as well when scaling observability in a real-time architecture.

The other topic is the key metrics that matter. There are key metrics that are important for troubleshooting and that we have to collect. One is latency and throughput: measure how fast and how many recommendations we are serving, what SLAs the personalization APIs are meeting, and how the APIs are interacting with the model interface and the data repository. The second is error rates and timeouts: always watch out for 5xx and 4xx errors, and for whether we are returning default content or default recommendations to customers. All of that data helps when we see conversion rates go down, because spikes in these metrics indicate issues in the ML service or the data layer. The third is engagement metrics, which tie observability to business KPIs: click-through rates, conversion rates, dwell time, and so on, tracked in real time. A drop in engagement might signal that personalization isn't working well for the customer, which shows up as a drop in conversion rate. Another is model performance indicators: custom metrics from the ML model itself, for example the confidence score distribution of recommendations, the frequency with which each model variant is used, or drift metrics comparing live input data to the training data. These help detect when the model's quality is degrading. And finally resource utilization: we always have to keep an eye on the infrastructure level, what the CPU and memory consumption is, and what the throughput of our streams is, for example Kafka topics and their queue lengths, to ensure the infrastructure can handle the load, because sudden changes there can explain performance outliers. So these are the key metrics that matter for us.
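As a hedged illustration of how some of these key metrics might be defined with the Prometheus Python client, here is a small sketch. The metric names, labels, and buckets are assumptions made for this example, not the names used in any particular production system; the point is that we aggregate by coarse labels rather than per-user series.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Hypothetical metric names; label by endpoint or model version,
# never by user or item id, to keep cardinality under control.
REQUEST_LATENCY = Histogram(
    "personalization_request_latency_seconds",
    "End-to-end latency of personalization responses",
    ["endpoint"],
)
ERRORS = Counter(
    "personalization_errors_total",
    "5xx / timeout responses from the personalization API",
    ["endpoint", "reason"],
)
FALLBACKS = Counter(
    "personalization_default_content_total",
    "Requests that fell back to default (non-personalized) content",
)
MODEL_CONFIDENCE = Histogram(
    "personalization_model_confidence",
    "Confidence score distribution of served recommendations",
    buckets=[0.1, 0.25, 0.5, 0.75, 0.9, 1.0],
)
CLICK_THROUGH_RATE = Gauge(
    "personalization_click_through_rate",
    "Rolling click-through rate fed back from the engagement pipeline",
)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    with REQUEST_LATENCY.labels(endpoint="/recommendations").time():
        MODEL_CONFIDENCE.observe(0.87)  # example observation inside a timed request
```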
Now let's jump into tracing the personalization journey. As part of this personalization engine architecture, how easily can we trace that journey? That is the question. First, distributed tracing: when a user requests personalized content, a trace follows the request through the API to the decision engine, into the model service, and down to the database. At each segment we have a trace record, so we know how the interactions and handshakes are happening among these components. Apart from that, we use OpenTelemetry for tracing, with logs, traces, and metrics together: we have tracing at the recommendation-engine level, at the model-interface level, and at the database level. We chunk it into these levels to track what is happening and in which layer, which makes it easy later to visualize the call flow. That gives us bottleneck identification: traces help pinpoint which layer is taking time, for example the API is taking this many milliseconds, the database interaction that many, and the model-interface interaction that many. It gives a clear, in-depth view that metrics alone cannot, and it enables targeted optimization. The next thing is correlating with user actions. When we trace user requests through session IDs or trace IDs, we can correlate the outcome of a request with the experience, for example a low relevance score correlating with a poor experience where the user saw stale recommendations; we can identify that easily from the trace, and it provides context for debugging. Then there is the sampling strategy: tracing every request may be infeasible, so we go batch-wise and category-wise, collecting traces for anomalous scenarios plus a representative slice of traffic. This keeps the overhead low while still letting us collect useful traces.

The other topic is logging for ML decision-making. When logging comes into the picture for observability, we need rich, structured logs. Logging in a personalization engine should capture the key inputs and outputs of the ML engine for each request: details like the user ID, relevant features such as which segment the customer falls into, and maybe an explanation score or a reason code if one is available. These rich, structured logs make our lives easier later when we troubleshoot issues. They also enable root-cause analysis: when something goes wrong and we show irrelevant content to the customer, the logs can reveal what the model actually saw. For instance, a log might show that the input user segment was null, which means the model couldn't respond as anticipated and served a bad default recommendation. And we want trace-log correlation: integrate logs with the tracing context. As we discussed, the moment a request comes in, the trace ID, session ID, or request ID travels with it, so from a distributed trace you can jump into the associated logs across services for the same request. Correlation IDs make it easy to tie traces and logs together.
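Here is a minimal sketch of what such a rich, structured decision log with trace-log correlation could look like in Python. The field names and the log_decision helper are hypothetical, and the trace and span IDs are simply pulled from whatever OpenTelemetry span happens to be active for the request.

```python
import json
import logging

from opentelemetry import trace

logger = logging.getLogger("personalization.decisions")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_decision(user_id: str, segment: str, confidence: float, fallback: bool) -> None:
    # Attach the active trace/span ids so this log line can be joined with the
    # distributed trace for the same request in ELK or the tracing backend.
    ctx = trace.get_current_span().get_span_context()
    logger.info(json.dumps({
        "event": "recommendation.decision",      # hypothetical field names
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        "user_id": user_id,                      # pseudonymous id, not raw PII
        "segment": segment,
        "model_confidence": round(confidence, 3),
        "served_default_content": fallback,
    }))

log_decision("user-123", "returning-shopper", 0.87, fallback=False)
```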
Another point is to anonymize and protect data. With logging we tend to think we can simply log everything, but we always have to take care about whether we are logging any sensitive data, because of GDPR and similar concerns, and the observability platform should enforce data-handling policies while still giving engineers enough information to troubleshoot. So we have to be careful while logging: enough detail to troubleshoot, but cautious about logging customers' sensitive data. Also, use log levels wisely: whether something goes in at info, warning, or error level matters, and we shouldn't puff up the logs. Decide what belongs at the warning level, what is truly an error, and which log patterns are worth logging at all. That makes it easy to segregate when troubleshooting: go to the errors first, then the warnings, then info for additional context.

The next topic is scaling observability patterns and best practices. First, avoid metrics overload: be mindful of metric cardinality. For example, a metric labeled with every user or item ID would be huge if we record a series per customer, so aggregate by category, by percentile latency, or something similar; be smart so we avoid that kind of overload. Second, trace and log sampling: sample a subset of requests, observe the pattern, and group them, so we reduce noise and cost while retaining diagnostic power. Third, batching and buffering: use collectors such as the OpenTelemetry Collector, and Kafka, to buffer telemetry data, batch the trace exports, and send them in chunks so the application isn't blocked sending each individual span, and buffer logs so disk I/O doesn't become a bottleneck. Fourth, scalable storage and querying: implement tiered storage for observability data so that recent data stays fast and queryable. And fifth, resiliency of the observability systems themselves: treat your monitoring pipeline as a critical part of the system, scale out your metrics and logging backends, and don't forget to set up alerts on the observability stack itself, for example if the log forwarder lags or Prometheus falls behind. An observability outage can be especially painful during a production incident, so we have to be mindful of that as well.
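As an illustration of the sampling and batching practices above, here is a hedged sketch of configuring the OpenTelemetry Python SDK with head-based sampling and batched span export. The 5% ratio and the buffer sizes are arbitrary example values, not recommendations, and the console exporter again stands in for an OTLP exporter pointed at a collector.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-sample roughly 5% of root traces (children follow their parent's
# decision) and export spans in batches rather than one network call per span.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
provider = TracerProvider(sampler=sampler)
provider.add_span_processor(
    BatchSpanProcessor(
        ConsoleSpanExporter(),       # swap for an OTLP exporter to a collector
        max_queue_size=4096,         # buffer so the request path never blocks on export
        schedule_delay_millis=2000,
        max_export_batch_size=512,
    )
)
trace.set_tracer_provider(provider)
```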
Now for an incident: observability in action. The scenario is silent model drift, and it is a story of what observability looks like in practice. Imagine engagement gradually dropping over a week for a personalized component we show to customers. The component didn't crash; it simply got less attention from customers because the recommendations became less relevant. Without observability, this scenario is hard to track down. Detection came through metrics and alerts: with a strong observability setup, an alert fired when the click-through rate dipped 10% below baseline. At the same time, the custom model-drift metrics showed that the input distribution had shifted away from the training data, and that correlation hinted the model was no longer tuned to current user behavior. Without that detection it would have been very hard to identify what was happening. Next, the engineers used traces to pinpoint the impact: they pulled up distributed traces for user sessions and analyzed them. The traces showed normal latency, but a pattern emerged: many users were receiving default content. That was the telltale sign; it indicated the model was unsure and falling back to the default content recommended to customers. Then came the root cause via logs. As we discussed earlier about which details are worth logging, we log for the model whether we served a personalized recommendation or default content, and in the logs we could see that the default content was always being recommended because the customer wasn't falling into any specific segment. That is where the developers looked upstream, at whether the ETL was collecting the customer's segment attributes properly. For example, a segment rule might be "if age is greater than 30, recommend this content," and the age profile attribute was missing from collection. That was the root cause, and they identified it through the logs; the logs played a key role there. Resolution and takeaway: thanks to observability, the team identified the root cause within hours and fixed the ETL, and the postmortem showed that, compared to the time before they had these observability tools, debugging time was cut by about 3x, and the metrics recovered. So that is a sample success story of observability.
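To tie the incident back to instrumentation, here is a hypothetical sketch of how the two signals that mattered in that story, a missing profile attribute and the default-content fallback rate, could be exposed as Prometheus gauges that an alert can watch. The metric names, the rolling window, and the record_request helper are all assumptions made for illustration.

```python
from collections import deque

from prometheus_client import Gauge

# Hypothetical drift signals for the incident above: how often a profile
# attribute the segmentation rules depend on (age) is missing, and how often
# the engine serves default content instead of a personalized list.
MISSING_AGE_RATIO = Gauge(
    "personalization_profile_age_missing_ratio",
    "Share of recent requests whose user profile had no age attribute",
)
DEFAULT_CONTENT_RATIO = Gauge(
    "personalization_default_content_ratio",
    "Share of recent requests answered with default (fallback) content",
)

_window = deque(maxlen=1000)  # rolling window of recent requests

def record_request(age_present: bool, served_default: bool) -> None:
    _window.append((age_present, served_default))
    n = len(_window)
    MISSING_AGE_RATIO.set(sum(1 for a, _ in _window if not a) / n)
    DEFAULT_CONTENT_RATIO.set(sum(1 for _, d in _window if d) / n)

# Example: an ETL regression that drops the age attribute shows up as both
# ratios climbing, which is exactly the kind of signal an alert can fire on.
record_request(age_present=False, served_default=True)
```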
Next, tools and stack in action. There are multiple tools we use as part of these personalization engines, so let's talk about that stack now. OpenTelemetry is an open-source standard for instrumentation; it collects metrics, traces, and logs in a unified way, and it comes with SDKs we can integrate into the personalization microservices, allowing easy export of telemetry. Then Prometheus and Grafana: the metrics land there, Prometheus handles that heavy load at scale, and Grafana provides dashboards and visualization even at high request rates. For distributed tracing, systems like Jaeger or Zipkin are deployed, receiving spans from OpenTelemetry, so engineers can follow requests across services. For the logging pipeline there is, for example, ELK (Elasticsearch, Logstash, and Kibana), where session IDs or user IDs can easily be tracked and traced even when a request comes through multiple channels; we can stitch the customer's request together and troubleshoot it easily. And there are advanced APM tools: in addition to these open-source tools, there are Dynatrace and Datadog, and these APM solutions provide very good anomaly detection on metrics and AI-assisted root-cause analysis; they can complement the DIY stack by catching subtle issues.

To summarize the tools that enable observability at scale: for instrumentation and telemetry collection, OpenTelemetry; for metrics collection and dashboards, Prometheus and Grafana; for distributed tracing, Jaeger or Grafana Tempo; for log aggregation and search, the Elastic (ELK) stack; for advanced monitoring and APM, as we were just discussing, Dynatrace, Datadog, or New Relic are very good; and for trace and log storage at scale, Kafka, the OpenTelemetry Collector, ClickHouse, or a data lake. These are the tools that really help products built around personalization engines.

Next is the platform impact of scaled observability. If we use observability well, the first impact is a dramatic reduction in outages: with robust observability, issues are caught and resolved faster, and one case study saw resolution of issues improve by 75% after adopting modern observability practices, so it really helps reduce outages. Another is improved performance and reliability: the team can identify bottlenecks, see clearly where exactly performance is falling behind, improve it, and solve customer issues quickly, which is what matters in a real-time environment, delivering fast with no issues. There is also higher confidence in deployments: observability data gave developers a good boost, not just internally but toward external customers as well; they push updates faster, and whenever a new model or a new feature comes in, they can incorporate it immediately and see how it behaves. This improved the velocity of personalization experimentation. Another is better alignment with business goals: because telemetry is tied to engagement metrics, the platform team now speaks the same language as the product owners and can demonstrate how latency has improved, so product owners, business goals, and developers stay properly aligned when we have proper observability. Finally, the costs have been worth the benefits: initially, adding so much instrumentation raised concerns about cost, "we have to bring in this tool and that tool, and now the cost has gone up," but in fact it brought a lot of insight into what's happening in our product, and putting money into those tools is small compared to the revenue a good, healthy product brings in.

Next, lessons learned and best practices. Always build in observability from day one; don't bolt on monitoring later. Design personalization systems with observability in mind: don't think only about performance. Once we deploy a product into production, we should always be thinking about how easily we can troubleshoot if an issue comes in. And treat logs, metrics, and traces as first-class citizens; all three telemetry types go hand in hand.
We should have detailed logging for the important components and treat all of that telemetry with first-class status. Next, design for visibility, not just performance. Whenever developers build something, they always think about how they can improve performance; they always wear that hat. The other side we have to consider is observability: how well can I write the code so that logging and the visibility of system interactions are properly in place, and what tooling do I build in to track things once I push to production? Then, educate the team and democratize the data: train new engineers and data scientists on the importance of observability and on the tooling, and encourage that culture, so incident response becomes more collaborative and effective, issues can be fixed and troubleshot easily, and the product's health improves. Finally, iterate and improve: observability is not set-and-forget. Continuously refine what you monitor; today you build dashboards and logging for what you have, but tomorrow you introduce a new use case or a new feature, so you have to adapt and enhance the observability so the new use cases are tracked properly as well.

Final thoughts and call to action. Observability is product-critical: in real-time ML environments, monitoring isn't just an ops concern; it directly influences user experience and trust in the product, so we always have to treat observability as product-critical. Trust comes through transparency: as we discussed earlier, the model and recommendation-engine part is a black box, and whether or not we are always recommending the right content at the right time to customers, having proper observability at the model level is the key, so we can clearly see what is going on and why the conversion rate dropped. Take action: look at the observability in the projects you are working on right now, step back, review what you have, and ask how you would immediately fix things when they go bad and keep your product healthy so it keeps bringing in revenue without disruptions in production. And keep up continuous improvement: scaling observability is an ongoing journey, so use the telemetry insights to drive improvements and build them in, continuously improving the metrics and use cases in the product. The call to action is to embrace the mindset that if you can't measure it, you can't improve it. Bring clarity to the complexity of real-time personalization by observing it end to end, and your personalization engine and your users will thank you later. Okay, that's all from my end. You can reach me on LinkedIn for anything, and have a wonderful rest of the conference. Thank you.
...

Sai Kumar Bitra

Principal Software Engineer @ AT&T

Sai Kumar Bitra's LinkedIn account


