Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hello everyone. This is Ra. I'm working as a principal software engineer at AT&T. Today I'm going to talk about scaling observability for real-time personalization engines. So let's get into the talk. Let me share my screen.
            
            
            
So first, let's talk about why observability matters for personalization. That's the first question, right? Personalization drives engagement, and engagement drives revenue. For example, companies that excel at personalization see up to 40% more revenue, basically because users respond to highly relevant content. So you have to be good at observability to reach that personalization revenue growth. And for real-time experiences, there is zero room for error.
            
            
            
Users expect a response within milliseconds, and there shouldn't be any deviation in the content presented to the customer. Delays or failures directly hurt satisfaction and conversions.
            
            
            
Apart from that, complex ML decisions need validation, right? Whenever we present recommendations to customers, we have to ask: are we presenting the right content at the right time? Are we having any issues while presenting the content to the customers? Lack of insight essentially equals lost user trust and missed opportunity, so we have to be careful here, especially for personalization engine products.
            
            
            
Apart from that, there is a competitive edge: collecting these metrics, logs, and traces helps us continuously improve the algorithms and the user experience together, keeping our personalization engine a step ahead of the other competitors in the same space. So overall, observability is very important and plays a key role in personalization.
            
            
            
Next, we are going to talk about the challenges of achieving this observability for real-time machine learning systems. As I said, observability is important for personalization, but at the same time it has a few challenges. First, personalization systems are complex distributed pipelines, where many components get involved: services, data pipelines, ML models, and so on. We have to keep good track of all the logs, traces, and metrics across these components; otherwise we get lost. So end-to-end tracking is important.
            
            
            
The second complexity is high data volume: millions of real-time events can come in while users are interacting with the website, right? We have to capture this telemetry without adding latency, which is the difficult part. So that's one more challenge.
            
            
            
And ML decisions are a black box. When there is a drop in engagement with the recommendations we are suggesting to the customer, it's hard to tell whether the model is driving the recommendations properly, whether our rules have been executing properly, whether we lost anything, or whether the source systems are capturing the data we wanted for that specific user behavior, et cetera. So tracking these ML model failures is also hard.
            
            
            
And these personalization architectures involve multi-channel user journeys. A user can come to the website, and at the same time go to the store, or browse through the mobile apps. All sorts of channels will be there. We have to track the customer throughout these multiple channels so that we can stitch the user journey together properly and troubleshoot if anything goes wrong. So observability faces a real challenge here with multi-channel user journeys.
            
            
            
And the fifth challenge is latency sensitivity. We keep saying we'll add observability and track these metrics, but at the same time it's not that easy to do without heavy overhead from the telemetry collection. We all know that it is important to collect the data, but at what cost? We have to figure out how to keep the real-time engine lightweight. So latency sensitivity is going to be a challenge. Yeah.
            
            
            
Next we are going to talk about a system architecture overview. This is a typical reference architecture for real-time personalization engines: the request comes in through the APIs and goes to the decision engine, where the recommendation decision takes place. From there, it interacts with the model service, where models are retrieved and rules are executed. The model service in turn interacts with the data repository, where the user profiles and the content recommendations live.

Meanwhile, event ingestion is happening: user-interface event streams flow into the ingestion layer, for example through Kafka, and feed the model. From all of these component interactions, the metrics, logs, and traces all get fed and ingested into the centralized observability pipeline.
            
            
            
So this is a typical system architecture for these personalization engines and their services: ingestion pipelines, decision engine, API layer, and data layer, with all of these telemetry sources aggregated and ingested into a centralized observability pipeline. This is how observability comes into the picture for personalization engines.
            
            
            
And now we come to scaling observability in a real-time architecture, right? As we've been saying, these are really high-scale architectures, with real-time event consumption and personalization. So end-to-end tracing is very important: a user comes in, the API gets hit, then it interacts with the model interface, and then with the data repository to fetch the data and execute the rules. Enabling end-to-end visibility of each personalization response is critical.
            
            
            
And the second thing is unified instrumentation. We should adopt the OpenTelemetry SDK across services for standard metrics, logs, and trace collection. Unified instrumentation is important so that apps instrument code once and export telemetry to multiple backends, instead of doing it in bits and pieces.
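As a rough illustration of the instrument-once idea, here is a minimal stdlib-only sketch of a tracer facade with a swappable exporter; in practice the OpenTelemetry SDK provides this, and all class and field names here are illustrative:

```python
import time
from contextlib import contextmanager

class ConsoleExporter:
    """One possible backend; swap in another exporter without touching callers."""
    def export(self, span):
        print(f"span={span['name']} ms={span['ms']:.1f} attrs={span['attrs']}")

class Tracer:
    def __init__(self, exporter):
        self.exporter = exporter  # the only place a backend is wired in

    @contextmanager
    def span(self, name, **attrs):
        # Time the wrapped work and hand the finished span to the exporter.
        start = time.perf_counter()
        record = {"name": name, "attrs": attrs}
        try:
            yield record
        finally:
            record["ms"] = (time.perf_counter() - start) * 1000
            self.exporter.export(record)

tracer = Tracer(ConsoleExporter())

with tracer.span("recommend", user_id="u-123"):
    time.sleep(0.01)  # stand-in for model + rules work
```

The point is that services call `tracer.span(...)` once in their code, and changing where telemetry goes means changing only the exporter.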
            
            
            
And the third thing is high-throughput metrics. As we said, millions of events can happen in production environments when the application is being accessed by external users, right? We typically have a time-series database like Prometheus, and all sorts of metrics to be collected. Aggregate to avoid overwhelming storage; for example, compute percentile latencies and sample fine-grained events.
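The aggregation idea can be sketched like this: compute percentile latencies over a window of raw samples and export only a handful of gauges, instead of shipping every raw event to the time-series database (the nearest-rank method below is one simple choice):

```python
import random

def percentile(sorted_values, p):
    # Nearest-rank percentile over an already-sorted list.
    idx = round(p / 100 * (len(sorted_values) - 1))
    return sorted_values[max(0, min(len(sorted_values) - 1, idx))]

# 10,000 raw latency samples collapse into three exported gauges.
latencies_ms = sorted(random.uniform(5, 250) for _ in range(10_000))
summary = {f"p{p}": round(percentile(latencies_ms, p), 1) for p in (50, 95, 99)}
print(summary)
```

In practice a client library (e.g. a Prometheus histogram) does this bucketing for you; the sketch just shows why the exported volume stays constant as traffic grows.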
            
            
            
So high-throughput metrics are very important to collect. Then there's log aggregation: stream JSON-structured logs, including the model decisions, into a central system like ELK. Whenever we get issues, troubleshooting through that kind of centralized cloud logging becomes easy for us; we can track them by user session and through trace IDs.
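A hedged sketch of what those structured log lines might look like, using Python's stdlib logging with JSON payloads (the field names are illustrative, not a fixed schema):

```python
import json
import logging
import sys
import uuid

logger = logging.getLogger("decision-engine")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_decision(trace_id, user_segment, items):
    # One JSON object per line; an ELK-style store indexes each field,
    # so you can query by trace_id or user_segment directly.
    logger.info(json.dumps({
        "trace_id": trace_id,
        "event": "recommendation_served",
        "user_segment": user_segment,
        "items": items,
    }))

log_decision(str(uuid.uuid4()), "segment-a", ["item-1", "item-2"])
```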
            
            
            
And the last thing is scalable storage and retention: employ a backend that can handle high data volumes, like a scalable observability platform or a data lake for long-term analysis. We need to set retention policies, because sometimes we want to go back in time and review what the trend was and what an issue was about. Historical trends should be analyzable without infinite storage growth. So we have to be careful about storage scalability as well, as part of scaling observability in this real-time architecture.
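As a toy illustration of tiered retention, here is a sketch that keeps raw points for an assumed 7 days and compacts anything older into hourly averages (the policy numbers are made up, not a recommendation):

```python
from datetime import datetime, timedelta

RAW_RETENTION = timedelta(days=7)  # assumed policy: raw points for 7 days

def compact(points, now):
    """points: iterable of (timestamp, value) -> (raw, hourly_rollups)."""
    raw, buckets = [], {}
    for ts, value in points:
        if now - ts <= RAW_RETENTION:
            raw.append((ts, value))          # recent: keep full resolution
        else:
            hour = ts.replace(minute=0, second=0, microsecond=0)
            buckets.setdefault(hour, []).append(value)
    # Old data survives only as one average per hour.
    rollups = {h: sum(vs) / len(vs) for h, vs in buckets.items()}
    return raw, rollups

now = datetime(2024, 1, 31, 12, 0)
points = [
    (now - timedelta(days=1), 120.0),    # recent: kept raw
    (now - timedelta(days=30), 80.0),    # old: rolled up
    (now - timedelta(days=30), 100.0),   # same hour: averaged with the above
]
raw, rollups = compact(points, now)
print(len(raw), list(rollups.values()))  # 1 [90.0]
```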
            
            
            
And the other thing is the key metrics that matter, right? There are key metrics that are important for troubleshooting, and we have to collect them. One is latency and throughput: measure how fast and how many recommendations we are serving, what SLA the personalization APIs are meeting, and how the APIs are interacting with the model interface and the data repositories. So latency and throughput is one key metric for observability.
            
            
            
And the second thing is error rates and timeouts. We always have to watch out for 5xx or 4xx errors, and for whether we are returning any default content or default recommendations to the customers. All that data helps us when our conversion rates go down; any of these spikes indicate issues in the ML service or the data layer. So error rates and timeouts are another key metric we have to log.
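A minimal sketch of watching those error and fallback rates over a sliding window, with an assumed 5% alert threshold (window size and threshold are illustrative knobs):

```python
from collections import deque

class ErrorRateMonitor:
    def __init__(self, window=1000, threshold=0.05):
        self.window = deque(maxlen=window)  # 1 = error/fallback, 0 = ok
        self.threshold = threshold

    def record(self, status, served_default=False):
        # Serving default recommendations counts as a failure too:
        # the user got content, but not a personalized one.
        self.window.append(1 if status >= 400 or served_default else 0)

    def spiking(self):
        return bool(self.window) and sum(self.window) / len(self.window) > self.threshold

m = ErrorRateMonitor(window=100, threshold=0.05)
for _ in range(90):
    m.record(200)
for _ in range(10):
    m.record(503)
print(m.spiking())  # True: a 10% error rate exceeds the 5% threshold
```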
            
            
            
And the other thing is engagement metrics, which tie observability to business KPIs. We should track click-through rates, conversion rates, dwell time, et cetera, in real time. A drop in engagement might signal that personalization isn't working well for the customer, showing up as a drop in the conversion rate in the system. So engagement metrics are one of the important things.
            
            
            
And another thing is model performance indicators: custom metrics about the ML model itself. For example, the confidence score distribution of recommendations, the frequency with which each model variant is used, or drift metrics comparing live input data to training data. These help detect when the model's quality is degrading. So model performance indicators are also a good key metric for observability.
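For instance, one simple drift metric is the Population Stability Index (PSI), comparing the live feature distribution to the training distribution; the 0.2 alert threshold below is a common rule of thumb, not something prescribed by the talk:

```python
import math

def psi(expected, actual):
    """expected/actual: bucket proportions that each sum to 1."""
    eps = 1e-6  # guard against empty buckets
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

training = [0.25, 0.25, 0.25, 0.25]   # feature distribution at training time
live = [0.10, 0.20, 0.30, 0.40]       # what the model sees in production
score = psi(training, live)
print(f"PSI={score:.3f}", "drift!" if score > 0.2 else "stable")
```

A PSI near zero means the live traffic still looks like the training data; values creeping above ~0.2 are a common cue to investigate or retrain.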
            
            
            
And another is resource utilization. We always have to keep an eye on the infrastructure level: what is the CPU consumption, what is the memory consumption, and what is the throughput of our streams, for example Kafka topics and their queue lengths? We have to keep an eye on all of these to ensure the infrastructure can handle the load; certain changes can explain performance outliers. So these are the key metrics that matter for us.
            
            
            
And let's jump into tracing the personalization journey. As part of this personalization engine architecture, the question is how easily we can trace that journey, right? We use distributed tracing: for example, when a user requests personalized content, a trace follows the request through the API to the decision engine, into the model, and down to the database. At each segment we have a trace record, so we know how the interaction and handshake is happening among these components.
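One way to sketch that handshake in plain Python is propagating a single trace ID through the layers with contextvars, so every hop can log the same ID without it being threaded through every function signature (the layer names are illustrative):

```python
import contextvars
import uuid

trace_id = contextvars.ContextVar("trace_id", default=None)

def handle_request():
    # The API layer mints the trace ID once per request.
    trace_id.set(str(uuid.uuid4()))
    return decision_engine()

def decision_engine():
    return model_service()

def model_service():
    # A deeper layer reads the same ID without it being passed as an argument.
    return {"trace_id": trace_id.get(), "items": ["item-1"]}

print(handle_request())
```

Real tracing SDKs also propagate this context across process boundaries via request headers; the sketch only covers the in-process part.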
            
            
            
Apart from that, we use OpenTelemetry for tracing. We have OpenTelemetry logs, traces, and metrics together, with tracing at the recommendation engine level, at the model interface level, and at the DB level. We chunk it into different layers to track what is happening and in which layer it is happening. That makes it easy later to visualize the call flow and identify bottlenecks.
            
            
            
So traces always help pinpoint which layer is taking time. For example: the API is taking this many milliseconds, the DB interaction this many, and the model interface interaction this many. It gives a clear, in-depth view that metrics alone can't, and it enables targeted optimization for us.
            
            
            
The other thing is correlating with user actions. When we trace user requests through session IDs or trace IDs, we can correlate the breadcrumbs of that request, for example a low relevance score correlating with a poor experience, like a user seeing stale recommendations. We can easily identify that through the trace, and it provides visual context for debugging. So correlating with user actions is one of the good things about tracing.
            
            
            
Then there's sampling strategy. Tracing every request may be infeasible, right? So we go batch-wise or category-wise: collect traces for anomalous scenarios and for a representative slice of normal traffic. This keeps overhead low while still letting us collect useful traces. So a sampling strategy is good practice for tracing personalization journeys.
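A sampling decision like that can be as small as this sketch: always keep error traces, and keep an assumed 1% of normal traffic (the rate is an illustrative knob):

```python
import random

def should_trace(is_error, rate=0.01, rng=random.random):
    # Errors are always worth the overhead; normal traffic is sampled.
    return True if is_error else rng() < rate

kept = sum(should_trace(False, rate=0.01) for _ in range(100_000))
print(f"kept roughly {kept} of 100000 normal requests; errors always kept")
```

This is head-based sampling (decided at request start); tail-based sampling, which decides after seeing the whole trace, trades more buffering for smarter keep decisions.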
            
            
            
The other topic is logging for ML decisions. Whenever logging comes into the picture for observability, we should have rich, structured logs. Logging in a personalization engine should capture the key inputs and outputs of the ML engine for each request. Log details like the user ID, relevant features such as which segments the customer falls into, and maybe an explanation or confidence score, if available. These kinds of rich, structured logs make our life easy later when we troubleshoot issues.
            
            
            
Apart from that, logs enable root-cause analysis. When something goes wrong, we always want to show relevant content to the customer, and the logs can reveal what the model actually saw. For instance, a log might show that an input or user segment was null, meaning the engine didn't respond as anticipated and served a bad default recommendation to the customer. So root-cause analysis becomes possible when we follow this logging properly.
            
            
            
And we have trace-log correlation: integrate logs with the tracing context. As we discussed earlier, the moment a request comes in, everything travels with the trace ID, session ID, or request ID. Through that we can easily follow the distributed traces and jump into the associated logs across services for the same request. So it becomes easy to tie traces to logs with these correlation IDs.
            
            
            
The other is: anonymize and protect data. With logging, we tend to just log everything, but we always have to take care about whether we are logging any sensitive data, because of GDPR and other concerns, and the observability platform should enforce data-handling policies while still giving engineers enough info to troubleshoot. So we have to be careful while logging: we have the freedom to log enough to troubleshoot, but at the same time we must be cautious about logging customers' sensitive data.
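One common technique, sketched here, is hashing identifiers with a salt before they reach the logs, so log lines for one user remain correlatable without storing raw PII (the salt handling is deliberately simplified; in production the salt would be managed and rotated as a secret):

```python
import hashlib

def anonymize(user_id: str, salt: str = "example-salt") -> str:
    # Deterministic: the same user hashes to the same token across log
    # lines, but the raw identifier never lands in storage.
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

print(anonymize("alice@example.com"))
```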
            
            
            
Another is: use log levels wisely. Whether you go with info, warning, or error logs, we don't need to puff up the logs. Always decide which patterns belong at warning level and which at error level, so it's easy to segregate when we troubleshoot in the logs. Go to errors first, then warnings, then info for context.
            
            
            
The other thing is scaling observability: patterns and best practices. First, avoid metrics overload: be mindful of metric cardinality. For example, a metric labeled with every user or item ID would be huge if we logged one series per user. So aggregate them category-wise, or as percentile latencies, or something like that. Be smart about it so we can avoid the overload.
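To make the cardinality point concrete, here is a sketch that buckets latencies into three labels instead of labeling per user or item ID (the bucket edges are arbitrary examples):

```python
from collections import Counter

def latency_bucket(latency_ms):
    # Three label values total, no matter how many users or items exist.
    if latency_ms < 50:
        return "fast"
    if latency_ms < 200:
        return "ok"
    return "slow"

counts = Counter(latency_bucket(ms) for ms in [12, 45, 120, 300, 80, 500])
print(dict(counts))
```

With a per-user label, a time-series backend would create one series per user; with bucketed labels, the series count stays constant as the user base grows.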
            
            
            
Next, trace and log sampling. We can sample a subset of requests, see the pattern, and group them, which reduces noise and cost while retaining diagnostic power. So trace and log sampling is a pattern we should follow.
            
            
            
And batching and buffering. Use collectors, like the OpenTelemetry Collector or Kafka, to buffer telemetry data and export it in batches, sending chunks so the application doesn't block on sending each span. Buffer logs as well, to avoid disk I/O becoming a bottleneck. So that's one thing.
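A stdlib-only sketch of that buffering pattern: spans are enqueued on the hot path and a background worker exports them in batches (the batch size is an illustrative knob, and `exported` stands in for a real network backend):

```python
import queue
import threading

class BatchExporter:
    def __init__(self, batch_size=100):
        self.q = queue.Queue()
        self.batch_size = batch_size
        self.exported = []  # stand-in for a network/disk backend
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def emit(self, span):
        # Hot path: enqueue and return immediately, never block on I/O.
        self.q.put(span)

    def _run(self):
        batch = []
        while True:
            span = self.q.get()
            if span is None:  # shutdown sentinel: flush the remainder
                if batch:
                    self.exported.append(batch)
                return
            batch.append(span)
            if len(batch) >= self.batch_size:
                self.exported.append(batch)
                batch = []

    def shutdown(self):
        self.q.put(None)
        self.worker.join()

exporter = BatchExporter(batch_size=3)
for i in range(7):
    exporter.emit({"span": i})
exporter.shutdown()
print([len(b) for b in exporter.exported])  # [3, 3, 1]
```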
            
            
            
And scalable storage and querying: implement tiered storage for observability data, so recent data stays in fast, queryable stores. Everything stays fast enough if we handle storage and querying properly.
            
            
            
              And other thing is like a resiliency of observability.
            
            
            
              Systems like T, your monitoring pipeline, whatever that pipeline is
            
            
            
              about and critical part of the system.
            
            
            
Scale out your metrics and logging backends, and always set up alerts on the observability systems themselves. Don't forget about it.

For example, if the log forwarder lags behind, an observability outage can be especially painful during a production incident, so we have to be mindful of that as well.
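Monitoring the monitoring can be as simple as checking forwarder lag and queue depth; this sketch uses made-up alert names and thresholds to show the idea.

```python
def pipeline_alerts(forwarder_lag_s: float, queue_depth: int,
                    max_lag_s: float = 60.0, max_depth: int = 10_000) -> list:
    """Raise alerts when the log forwarder lags or its buffer backs up,
    so an observability outage is caught before it can hide a real
    production incident. Thresholds are illustrative."""
    alerts = []
    if forwarder_lag_s > max_lag_s:
        alerts.append("log_forwarder_lagging")
    if queue_depth > max_depth:
        alerts.append("telemetry_queue_backlog")
    return alerts
```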
            
            
            
And the other thing is incident observability in action. The scenario is silent model drift.

It tells a story of observability during an incident: imagine engagement gradually dropping over a week for a personalization feature.
            
            
            
We are showing a component to the customer, and its engagement dropped. You might think the component crashed, but it didn't crash, and customers could still see it. The recommendations just became less relevant.

So without observability, it's hard in this scenario to track down what was happening to that component, right?
            
            
            
So, detection through metrics and alerts. With a strong observability setup, an alert fired when the click-through rate dipped 10% below baseline.

At the same time, the custom model drift metrics showed the input distribution had shifted from the training data, and this correlation hinted the model was no longer tuned to current user behavior.

Without that detection, it would have been hard to identify what was happening, right?
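The two signals described here can be sketched as small checks: a CTR-below-baseline alert and a distribution-shift score. The Population Stability Index (PSI) is one common drift metric; the talk doesn't name the exact metric used, so treat it as an illustrative stand-in.

```python
import math

def ctr_alert(current_ctr: float, baseline_ctr: float, drop: float = 0.10) -> bool:
    """Fire when click-through rate dips more than `drop` below baseline."""
    return current_ctr < baseline_ctr * (1 - drop)

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between two histograms over the same
    bins (training-time vs current input distribution). A common rule
    of thumb treats PSI > 0.2 as meaningful drift."""
    te, ta = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        pe = max(e / te, 1e-6)   # floor avoids log(0) on empty bins
        pa = max(a / ta, 1e-6)
        score += (pa - pe) * math.log(pa / pe)
    return score
```

Correlating the two, CTR alert firing while PSI is high, is what points at drift rather than an infrastructure failure.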
            
            
            
And the next is using traces to pinpoint impact. Engineers pulled up distributed traces for user sessions and analyzed them, and the traces showed normal latency and everything.

But a pattern emerged: many users received the default content. That's where it came to light. It indicated that the model was unsure and fell back to the default content, which got recommended to the customers.
            
            
            
So, root cause through logs. The other thing is we could look at the logs, as we discussed earlier about what the good key metrics and events to log are.

We logged, for the model as well, whether a personalized recommendation or the default content was served to the customer.

We could see in the logs that the default content was always being recommended, because the customer wasn't falling into a specific segment.
            
            
            
That's where the devs looked upstream, at whether the ETL was collecting the customer's segment attributes properly. For example, a customer segment rule could be: if age is greater than 30, recommend this content.

The age profile attribute was missing from collection. That's where the root cause came to light, and they identified it through that.
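The failure mode in this story can be sketched as a segment check that returns a reason alongside the segment, so the reason gets logged with each served-content decision. The function, field names, and segment labels are hypothetical.

```python
def assign_segment(profile: dict):
    """Segment rule from the example: age > 30 gets the targeted content.
    When the ETL stops collecting the `age` attribute, every user falls
    through to the default, which is exactly what the logs exposed.
    Returns (segment, reason) so the reason can be logged alongside
    whether personalized or default content was served."""
    age = profile.get("age")
    if age is None:
        return None, "age_attribute_missing"   # serve default content
    return ("over_30" if age > 30 else "under_30"), "ok"
```

Logging the reason string is what turns "default content served" from a mystery into a direct pointer at the broken ETL field.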
            
            
            
Logs played a key role there. And the resolution and takeaway:
            
            
            
Thanks to observability, the team identified the root cause of the issue and fixed the ETL, and the postmortem showed that, compared to before their observability tooling, debugging time was cut by three x.

And all those metrics started looking good again.
            
            
            
So this is a sample story of observability in action.
            
            
            
And the next topic is tools and the stack in action, like OpenTelemetry. There are multiple tools we should use as part of these personalization engines, a stack of tools which we are going to talk about right now.
            
            
            
OpenTelemetry is one of them: it's an open source standard for instrumentation. It collects metrics, traces, and logs in a unified way, and it comes with SDKs, so we can integrate it into the personalization microservices, allowing easy export of telemetry.
            
            
            
The others are Prometheus and Grafana. All the metrics can be blended in, and Prometheus can handle heavy load at scale easily, while Grafana provides dashboards and visualization, even for high rates of requests.
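To show what Prometheus actually scrapes, here is a toy counter that emits the plain-text exposition format. In practice you would use the official prometheus_client library; this dependency-free stand-in and its metric name are made up for illustration.

```python
class Counter:
    """Toy stand-in for a Prometheus client counter. Prometheus scrapes
    an HTTP endpoint serving text in the exposition format produced by
    expose(), which Grafana then queries via PromQL for dashboards."""
    def __init__(self, name: str, help_text: str):
        self.name, self.help_text, self.value = name, help_text, 0.0

    def inc(self, amount: float = 1.0):
        self.value += amount

    def expose(self) -> str:
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {self.value}\n")
```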
            
            
            
And we have distributed tracing systems. Jaeger or Zipkin is deployed for tracing, receiving spans from the OpenTelemetry instrumentation. Distributed tracing systems like Jaeger and Zipkin are a great fit.
            
            
            
We can use those tools. For the logging pipeline, for example, there is the ELK stack: Elasticsearch, Logstash, and Kibana. There we use session IDs or user IDs that can be easily tracked and traced, even across the multiple channels a request comes through. We can stitch that customer request together and easily troubleshoot it.
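The stitching described here can be sketched as grouping structured log lines by a shared session ID, the same correlation Kibana does when every service tags its logs consistently. The field names are assumptions.

```python
import json
from collections import defaultdict

def stitch_by_session(log_lines):
    """Group JSON log lines emitted by different services by session_id,
    so one customer request can be followed end to end across services."""
    sessions = defaultdict(list)
    for line in log_lines:
        rec = json.loads(line)
        sessions[rec.get("session_id", "unknown")].append(rec)
    return dict(sessions)
```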
            
            
            
The other thing is advanced APM tools. In addition to these open source tools, there are Dynatrace and Datadog. These APM solutions provide very good anomaly detection on metrics and AI-assisted root cause analysis, and they can complement the DIY stack by catching subtle issues.
            
            
            
So these are the tools in the stack we went through, and here is the same thing summarized: what are the tools enabling observability at scale? For instrumentation and telemetry collection, we can use OpenTelemetry, and for metrics collection and dashboards, we can use Prometheus and Grafana.
            
            
            
For distributed tracing, we can use Jaeger or Grafana Tempo, and for log aggregation and search, we use the Elastic ELK stack. For advanced monitoring and APM, as we were discussing, Dynatrace, Datadog, or New Relic are very good.
            
            
            
And for trace and log storage at scale: Kafka, the OpenTelemetry Collector, ClickHouse, or a data lake. These are good tools that can be helpful for products built on a personalization engine. So these tools are really helpful for observability.
            
            
            
The next topic is the platform impact of scaled observability. If we use observability well, the impacts are going to be dramatic: a reduction in outages. With robust observability, issues are caught and resolved faster; one case study saw resolution of these issues improve by 75% after adopting modern observability practices. So it really helps reduce outages.
            
            
            
Another is improved performance and reliability. The team can identify bottlenecks and clearly see where exactly performance falls short, improve those components' performance, and easily solve customer issues. That is the important thing in a real-time environment: to deliver quickly, with no issues.
            
            
            
Then there is higher confidence in deployment. Observability data gave developers a good boost, and not just internal developers but external customers as well. They can push updates faster: whenever a new model is introduced or a new feature comes in, they can incorporate it immediately and see how it's behaving. So this improved the velocity of experimentation in personalization.
            
            
            
The other is better alignment with business goals. Because telemetry was tied into engagement metrics, the platform team now speaks the same language as the product owners; they can demonstrate how latency has improved and so on. So product owners, business goals, and developers can be aligned properly when we have proper observability.
            
            
            
And the costs have been worth the benefits. Initially, adding so much instrumentation raised concerns about cost: hey, we have to bring in that tool and this tool, and now costs have gone up. But in fact, it brought a lot of observability, a lot of insight into what's happening in our product. So putting money into those tools is worthwhile when compared to the revenue we're going to get once we have a good product.
            
            
            
Next is lessons learned and best practices. Always build in observability from day one; don't bolt on monitoring later. Design personalization systems with observability in mind. Don't think only about performance. Once we deploy any product into production, we always have to think about how easily we can track and troubleshoot if an issue comes in. So that's one thing.
            
            
            
And treat logs, metrics, and traces as first-class citizens. All three telemetry types go hand in hand. We need detailed logging of the important components and events. So always treat those things as first-class citizens.
            
            
            
The other is design for visibility, not just performance. Whenever developers build something, they always think: how can I improve the performance? They always wear that hat. But there is another side we have to look at, which is observability: how well can I code for observability, so that logging and all the system interactions are captured properly, and what tools have I built in to track these things before I push to production? And then there is educating the team.
            
            
            
Train new engineers and data scientists on the importance of observability and on these practices and tools, and encourage that culture, so that incident response becomes more collaborative and effective. Issues become easier to fix and troubleshoot, which improves our product health.

Another is iterate and improve: observability is not set-and-forget.
            
            
            
Continuously refine what you monitor. Today you build in some monitoring, dashboards, logging, and so on, but tomorrow you introduce a new use case or a new feature. You have to adapt and enhance the monitoring properly so the new use cases can be tracked as well.
            
            
            
And now final thoughts and a call to action. Observability is product-critical. Basically, in real-time ML environments, monitoring isn't just an ops concern; it directly influences user experience and trust in the product. So we always have to be careful.
            
            
            
So observability is product-critical, and it builds trust through transparency. As we discussed earlier, the model or recommendation engine part is a black box: are we always recommending the right content at the right time to the customers or not? Having proper observability at the model level is the key, so we can clearly see what is going on and why the conversion rate dropped.
            
            
            
And take action: go back to the current projects you're working on, step back, and look at the observability you have. When things go bad, how can I fix them immediately, and how can I keep my product healthy so it can bring more revenue without anyone having to intervene in production? And then there is continuous improvement.
            
            
            
Scaling observability is an ongoing journey. Use telemetry insights to drive improvements. We have to continuously improve the metrics, the use cases, and all those things in the product.
            
            
            
And the call to action: always embrace the mindset that if you can't measure it, you can't improve it, right? Bring clarity to the complexity of real-time personalization by observing it down to the model level, and your personalization engine and your users will thank you later. So once you have this mindset, suit up for observability.
            
            
            
Okay guys, thank you. That's all from my end. You can reach me for anything through my LinkedIn profile, and have a wonderful rest of the conference.
            
            
            
              Thank you.