Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
This is Ra. I'm working as a principal software engineer at AT&T. Today I'm going to talk about scaling observability for real-time personalization engines.
So let's get into the talk. Let me share my screen.
So first, let's talk about why observability matters for personalization. That's the first question, right?
Personalization drives engagement and revenue. For example, companies that excel at personalization see up to 40% more revenue, because users respond to highly relevant content.
So we have to be good at observability to reach that personalization-driven revenue growth. And for real-time experiences, there is zero room for error. Users expect responses within milliseconds, and there shouldn't be any deviation in the content presented to the customer. Delays or failures directly hurt satisfaction and conversions.
Apart from that, complex ML decisions need validation, right? Whenever we present recommendations to customers, we have to check: are we presenting the right content at the right time? Are we having any issues while presenting that content? A lack of insight is essentially equal to lost user trust and missed opportunity. We have to be careful about this in the recommendation engine, and really across personalization engine products.
Apart from that, there is a competitive edge: collecting these metrics, logs, and traces helps us continuously improve our algorithms and the user experience, right? It keeps our personalization engine a step ahead of the other competitors in the same space at the personalization product level. So overall, observability is very important and plays a key role for personalization.
Next, we are going to talk about the challenges of getting to this level of observability for real-time machine learning systems. As I said, observability is important for personalization, but at the same time it comes with a few challenges.
First, personalization systems are complex distributed pipelines where many components get involved: services, data pipelines, ML models, and so on. We have to keep good track of all the logs, traces, and metrics across these components; otherwise we get lost. So end-to-end tracking is important.
The second challenge is high data volume: millions of real-time events can come in while users are interacting with the website, right? We have to capture this telemetry without adding latency, which is the difficult part. So that's one more challenge.
Third, ML decisions are a black box. We don't always know when there is a drop in the quality of the recommendations we are suggesting to the customer. It's hard to track whether the model is driving recommendations properly, whether our rules have been executing properly, whether we lost anything, or whether the source systems are capturing the data we want for specific user behavior, and so on. So tracking how these ML models fail is also hard.
The fourth challenge is that personalization architectures span multi-channel user journeys. A user can come to the website, go to the store, or browse through the mobile app, so all sorts of channels will be there. We have to track the customer across these multiple channels so that we can stitch the user journey together properly and troubleshoot if anything goes wrong. So observability faces a challenge here with multi-channel user journeys.
And the fifth challenge is latency sensitivity. We all say we should add observability and track these metrics, but it's not that easy to do without adding heavy overhead from the telemetry collection. We all know it is important to collect the data, but at what cost? We have to see how we keep the real-time engine lightweight. So latency sensitivity is going to be a challenge.
Yeah.
Next, we are going to talk about the system architecture overview. This is a typical architecture for a real-time personalization engine: the request comes in through the APIs and goes to the decision engine, where the recommendation decision takes place. From there, it interacts with the model service, where models are retrieved and rules are executed. The model service interacts with the data repository, where the user profiles and the content recommendations are stored. Meanwhile, event ingestion happens: the user interface event streams flow into event ingestion, for example through Kafka, and are fed into the model.
Across all these components and their interactions, the metrics, logs, and traces all have to be fed, or ingested, into a centralized observability pipeline. So this is a typical system architecture for these personalization engines: the model services, ingestion pipelines, decision engine, API layer, and data layer all act as telemetry sources, and their data is aggregated and ingested into the centralized observability pipeline. This is how observability comes into the picture for personalization engines.
Now we get to scaling observability in a real-time architecture, right? As we said, these are really high-scale architectures with real-time event consumption and personalization.
First, end-to-end tracing is very important: a user comes in, the API gets hit, it interacts with the model interface, and it goes to the data repository to fetch the data and execute the rules. Enabling end-to-end visibility into each personalization response is critical.
The second thing is unified instrumentation. We should adopt the OpenTelemetry SDK across services for standard metrics, logs, and trace collection. With unified instrumentation, applications instrument code once and can export telemetry to multiple backends. So we want unified instrumentation instead of bits and pieces.
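As a rough illustration of that idea, here is a minimal Python sketch of wiring up the OpenTelemetry SDK once for both traces and metrics; the service name, exporters, and attribute names are assumptions for the example, not details from the talk.

```python
# Minimal OpenTelemetry setup: one SDK emits traces and metrics,
# and the exporter/backend can be swapped without touching app code.
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

resource = Resource.create({"service.name": "recommendation-api"})  # assumed name

# Traces: batch spans before export to keep per-request overhead low.
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)

# Metrics: periodically push aggregated readings to the backend.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

tracer = trace.get_tracer("personalization")
meter = metrics.get_meter("personalization")
request_counter = meter.create_counter("recommendation_requests_total")

def recommend(user_id: str) -> list[str]:
    # Instrument once; the configured exporters decide where telemetry goes.
    with tracer.start_as_current_span("recommend") as span:
        span.set_attribute("user.segment", "unknown")  # illustrative attribute
        request_counter.add(1)
        return ["item-1", "item-2"]  # placeholder recommendations
```

Swapping the console exporters for OTLP exporters pointed at a collector is all it takes to change backends, which is the "instrument once, export anywhere" point.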
The third thing is high-throughput metrics. As we said, millions of events can happen in production environments when the application is being accessed by external users, right? So we use a time series database like Prometheus, with all sorts of metrics being collected. Aggregate them to avoid overwhelming storage: for example, compute percentile latencies and sample fine-grained events.
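For instance, a Prometheus histogram pre-aggregates latencies into buckets on the client side, so the backend can derive percentiles without storing every individual event. This is only a sketch; the metric name and bucket boundaries below are assumed for illustration.

```python
import random
import time

# prometheus_client keeps only bucket counters in memory, not raw events,
# which is what keeps high-throughput metrics cheap to collect.
from prometheus_client import Histogram, start_http_server

RECOMMENDATION_LATENCY = Histogram(
    "recommendation_latency_seconds",                   # assumed metric name
    "End-to-end latency of a personalization response",
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),   # assumed SLO-driven buckets
)

@RECOMMENDATION_LATENCY.time()  # observes the duration of each call
def recommend(user_id: str) -> list[str]:
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for decision engine work
    return ["item-1", "item-2"]

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    while True:
        recommend("user-123")
```

Percentiles such as p95 would then be computed from these buckets in the backend (for example with histogram_quantile in PromQL) rather than from raw per-request events.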
So high-throughput metrics are very important to collect. Next is log aggregation: stream structured JSON logs, including the model decisions, into a central system like ELK. Then whenever we get issues, troubleshooting through ELK or cloud logging tools becomes easy for us, because we can track things by user session, trace ID, and so on.
And the last thing is scalable storage and retention: employ a backend that can handle high data volumes, like a scalable observability platform or a data lake for long-term analysis. We need sensible retention policies, because sometimes we want to go back in time and review what the trend was and what an issue was about. Historical trends should be analyzable without infinite storage growth. So we have to be careful about the scalability of storage as well, as part of scaling this real-time architecture for observability.
The other thing is the key metrics that matter, right? There are key metrics that are important for troubleshooting, and we have to collect them.
One is latency and throughput, which measures how fast and how many recommendations we are serving: within what SLA the personalization APIs are responding, and how the APIs are interacting with the model interface or the data repository. So latency and throughput is one key metric for observability.
The second thing is error rates and timeouts. We always have to watch out for 5xx or 4xx errors, and for whether we are returning default content or default recommendations to the customers. All of that data helps us when we see our conversion rates go down, and spikes in any of these indicate issues in the ML service or the data layer. So error rates and timeouts are another key metric we have to log.
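A small sketch of how those counters might look with the Prometheus Python client; the metric and label names, and the separate fallback counter, are my own illustrative choices.

```python
from prometheus_client import Counter

# Count responses by outcome class so alerting can key off error-rate spikes.
RESPONSES = Counter(
    "recommendation_responses_total",
    "Personalization API responses by status class",
    ["status_class"],  # e.g. "2xx", "4xx", "5xx" -- deliberately low cardinality
)
FALLBACKS = Counter(
    "recommendation_fallbacks_total",
    "Times default (non-personalized) content was returned",
)

def record_response(status_code: int, served_default: bool) -> None:
    RESPONSES.labels(status_class=f"{status_code // 100}xx").inc()
    if served_default:
        FALLBACKS.inc()

# Example: a timeout that fell back to default recommendations.
record_response(504, served_default=True)
```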
The other thing is engagement metrics, which tie observability to business KPIs: we have to track click-through rates, conversion rates, dwell time, and so on, in real time. A drop in engagement might signal that personalization isn't working well for the customer, which shows up as a drop in the conversion rate in the system. So engagement metrics are one of the important things.
Another thing is model performance indicators: custom metrics about the ML model itself. For example, the confidence score distribution of recommendations, the frequency of each model variant being used, or drift metrics comparing live input data to training data. These help detect when the model's quality is degrading. So model performance indicators are also a good key metric for observability.
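To make the drift idea concrete, here is a sketch of one common drift metric, the Population Stability Index (PSI), comparing a live feature distribution against the training distribution. PSI is an assumed choice here, and the 0.2 alert threshold is a conventional rule of thumb, not a number from the talk.

```python
import numpy as np

def population_stability_index(train: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Compare a live feature distribution to its training distribution.

    PSI near 0 means no shift; values above ~0.2 are often treated as
    significant drift worth alerting on (a rule-of-thumb threshold).
    """
    # Bin edges come from the training data so both samples share buckets.
    edges = np.histogram_bin_edges(train, bins=bins)
    train_counts, _ = np.histogram(train, bins=edges)
    live_counts, _ = np.histogram(live, bins=edges)

    # Convert counts to proportions, avoiding zeros in the log.
    train_pct = np.clip(train_counts / max(train_counts.sum(), 1), 1e-6, None)
    live_pct = np.clip(live_counts / max(live_counts.sum(), 1), 1e-6, None)

    return float(np.sum((live_pct - train_pct) * np.log(live_pct / train_pct)))

# Example: live user ages skew older than the training data did.
train_ages = np.random.normal(35, 8, 10_000)
live_ages = np.random.normal(45, 8, 10_000)
psi = population_stability_index(train_ages, live_ages)
if psi > 0.2:
    print(f"Drift alert: PSI={psi:.2f}")  # would emit a metric or page in production
```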
And another is resource utilization. We always have to keep an eye on the infrastructure level: what is the CPU consumption, what is the memory consumption, and what is the throughput of our streams, for example Kafka topics and what the queue lengths are? We have to keep an eye on all of that to ensure the infrastructure can handle the load; sudden changes there can explain performance outliers. So these are the key metrics that matter for us.
Now let's jump into tracing the personalization journey. As part of this personalization engine architecture, the question is how easily we can trace that journey, right?
We need distributed tracing. For example, when a user visits personalized content, a trace follows the request through the API to the decision engine, into the model service, and down to the database, right? At each segment we have a trace record, so we know how the interaction and handshake is happening among these components.
Apart from that, we use OpenTelemetry for tracing. With OpenTelemetry we get logs, traces, and metrics together, and we add tracing at the recommendation engine level, then at the model interface level, and then at the DB level. We chunk it into different levels to track what is happening and in which layer it is happening.
That makes it easy later to visualize the call flow and identify bottlenecks. Traces always help pinpoint which layer is taking time: for example, the API is taking this many milliseconds, the DB interaction is taking that many, and the model interface interaction is taking that many. It gives a clear, in-depth view that aggregate metrics alone can't, and it enables targeted optimization for us.
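Here is a hedged sketch of what per-layer spans could look like with the OpenTelemetry Python API, continuing the earlier setup; the span and attribute names for the decision engine, model service, and data store are illustrative, not the talk's actual components.

```python
from opentelemetry import trace

tracer = trace.get_tracer("personalization")

def handle_request(user_id: str) -> list[str]:
    # One root span per personalization request...
    with tracer.start_as_current_span("personalization.request") as root:
        root.set_attribute("user.id.hash", hash(user_id) % 10_000)  # avoid raw PII
        with tracer.start_as_current_span("decision_engine.evaluate"):
            with tracer.start_as_current_span("model_service.score"):
                scores = {"item-1": 0.62, "item-2": 0.31}            # stand-in scores
            with tracer.start_as_current_span("data_store.fetch_profile"):
                profile = {"segment": "returning_user"}              # stand-in profile
        # ...so the trace view shows exactly which child span ate the latency.
        return sorted(scores, key=scores.get, reverse=True)
```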
The other thing is correlating with user actions. When we trace user requests through session IDs or trace IDs, we can correlate the breadcrumbs of that request: for example, a low score correlating with a poor experience, such as a user seeing a stale recommendation. We can easily identify that throughout the trace, and it provides visual context for debugging. So correlating with user actions is a good practice for tracing.
The next piece is a sampling strategy. Tracing every request may be infeasible, right? So we sample by batch or by category: collect traces for abnormal scenarios and for a representative slice of traffic. This keeps the overhead low while still letting us collect useful traces. So a sampling strategy is an important part of tracing personalization journeys.
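A minimal sketch of head-based sampling with the OpenTelemetry SDK, assuming a ratio-based sampler is acceptable; the 1% ratio is an arbitrary number for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~1% of root traces; child services follow whatever the caller decided,
# so a sampled request stays fully traced end to end.
sampler = ParentBased(root=TraceIdRatioBased(0.01))

provider = TracerProvider(sampler=sampler)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```

Tail-based decisions, such as always keeping error traces, are typically made in the OpenTelemetry Collector rather than in the SDK.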
The next topic is logging. Whenever logging comes into the picture for observability, we have to have rich, structured logs. Logging in a personalization engine should capture the key inputs and outputs of the ML engine for each request: details like the user ID, relevant features such as which segment the customer falls into, and maybe an explanation score or a reason code if one is available. These kinds of rich, structured logs make our life easy later when we troubleshoot issues.
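A minimal sketch, assuming Python's standard logging module with a small JSON formatter, of what one structured decision log could carry; the field names (segment, model_variant, confidence, served_default) are illustrative, not the talk's schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so ELK can index the fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(getattr(record, "decision", {}))  # structured extras
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("personalization.decisions")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured record per ML decision: inputs, outputs, and whether we fell back.
logger.info(
    "recommendation served",
    extra={"decision": {
        "user_id": "u-123",          # or a hashed ID if data-handling rules require it
        "segment": "age_30_plus",
        "model_variant": "v7",
        "confidence": 0.18,
        "served_default": True,      # low confidence -> default content
    }},
)
```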
Beyond that, rich logs enable root cause analysis. When something goes wrong while we're trying to show relevant content to the customer, the logs can reveal what the model actually saw. For instance, a log might show that an input like the user segment was null, meaning the model didn't respond as anticipated and a default recommendation was served to the customer. So proper logging enables root cause analysis.
Then there is trace-log correlation: integrate logs with the tracing context. As we talked about earlier, the moment a request comes in, everything travels with the trace ID, session ID, or request ID. Through that we can easily follow the distributed trace and jump into the associated logs across services for the same request. So it becomes easy to relate traces and logs through these correlation IDs.
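As a small sketch of that correlation, the active span's trace ID can be stamped onto every log record with a logging filter; the trace_id field name is just a convention assumed here, and the opentelemetry-instrumentation-logging package can also do this injection for you.

```python
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID so logs and spans can be joined in ELK/Jaeger."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("personalization")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

tracer = trace.get_tracer("personalization")
with tracer.start_as_current_span("recommend"):
    logger.info("served default content")  # searchable by the same trace_id as the span
```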
The other thing is to anonymize and protect data. With logging we often feel we can just log everything, but we always have to take care about whether we are logging any sensitive data, because of GDPR concerns and other regulations. The observability platform should enforce data handling policies while still giving engineers enough information to troubleshoot. So we have to be careful while logging: we have the freedom to log enough to troubleshoot, but at the same time we have to be cautious about logging customers' sensitive data.
Another is to use log levels wisely: whether you go with info, warning, or error. We don't need to puff up the logs; always make sure of what deserves warning level, what deserves error level, and which log patterns we actually need to log. It becomes easy when we segregate them while troubleshooting: go to errors first, then warnings, then info for additional help.
The other thing is scaling observability: patterns and best practices.
First, avoid metrics overload. Be mindful of metric cardinality: for example, a metric labeled with every user or item ID would be huge if we emit one series per user. We should aggregate instead, by category or by percentile latency or something like that. Being smart about this is how we avoid overload when we instrument.
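A before-and-after sketch with the Prometheus Python client: labeling by user ID (shown only as a comment) would create one time series per user, while bounded labels like segment and model variant keep cardinality under control; the label values are made up.

```python
from prometheus_client import Counter

# Anti-pattern (sketched, don't do this): user_id as a label means one
# time series per user -- cardinality grows without bound.
# requests_bad = Counter("recs_total", "Recommendations served", ["user_id"])
# requests_bad.labels(user_id="u-123").inc()

# Bounded labels: a handful of segments and variants, not millions of users.
requests = Counter(
    "recommendations_served_total",
    "Recommendations served, aggregated by segment and model variant",
    ["segment", "model_variant"],
)

def record(segment: str, model_variant: str) -> None:
    requests.labels(segment=segment, model_variant=model_variant).inc()

record("age_30_plus", "v7")   # illustrative segment/variant values
record("new_visitor", "v7")
```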
Then there is trace and log sampling. We can sample a subset of requests, see the pattern, and group them, so we reduce noise and cost while retaining diagnostic power. So trace and log sampling is a pattern we should follow.
Next, batching and buffering. Use collectors, like the OpenTelemetry Collector or Kafka, to buffer telemetry data, batch the exports, and send them in chunks, so the application isn't blocked sending each span, and buffer logs to avoid disk I/O becoming a bottleneck. So that's one thing.
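Inside the SDK, a batch span processor already provides that buffering; this sketch shows the tuning knobs in Python, and the specific queue and batch sizes are illustrative, not recommendations from the talk.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        ConsoleSpanExporter(),        # swap for an OTLP exporter pointed at a collector
        max_queue_size=4096,          # spans buffered in memory before dropping
        schedule_delay_millis=5000,   # flush at most every 5 seconds
        max_export_batch_size=512,    # spans sent per export call
    )
)
trace.set_tracer_provider(provider)
# Request threads only enqueue spans; exporting happens on a background thread.
```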
Then scalable storage and querying: implement tiered storage for observability data, so recent data stays in fast, queryable stores and everything remains fast enough, as long as we set up storage and querying properly.
The other thing is resiliency of the observability systems themselves. Treat your monitoring pipeline, whatever that pipeline looks like, as a critical part of the system. Scale out your metrics and logging backends, and always set up alerts on the observability systems themselves; don't forget about that. For example, if the log forwarder lags or Prometheus falls behind, an observability outage can be especially painful during a production incident, so we have to be mindful of that as well.
The other thing is an incident: observability in action. The scenario is silent model drift; it's the story of an incident where observability came into play. Imagine engagement gradually dropping over a week for a personalized field: we are showing a component to the customer and its engagement drops. The component didn't crash; there was just less engagement with it from customers, because the recommendations became less relevant. Without observability, it's hard in this scenario to track down what was happening to that component, right?
First, detection through metrics and alerts. With a strong observability setup, an alert fired when the click-through rate dipped 10% below baseline. At the same time, the custom model drift metrics showed that the input distribution had shifted from the training data, and this correlation hinted that the model was no longer tuned to current user behavior. Without that detection, it would have been hard to identify what was happening, right?
Next was using traces to pinpoint the impact. Engineers pulled up the distributed traces for user sessions and analyzed them; the traces showed normal latency and everything, but a pattern emerged: many users were receiving default content. That's where the light came on: it indicated the model was unsure and was falling back to the default content recommended to customers.
Then, the root cause through logs. We could look at the logs, because as we discussed earlier about the key things to log, we logged for the model whether we served a personalized recommendation or default content to the customer. In the logs we could see the default content always being recommended because the customer wasn't falling into a specific segment. That's where the developers checked upstream whether the ETL was collecting the customer's segment properly. For example, a segment rule might be "if age is greater than 30, recommend this content", and the age profile attribute was missing from collection. That's where the root cause came to light, and they identified it through the logs.
Logs played a key role there. As for the resolution and takeaway: thanks to observability, the team identified the root cause of the issue and fixed the ETL, and the postmortem showed that, compared to before their observability tooling, debugging time was cut by about three times. After that, all the metrics were looking good again. So this is one good example story of observability.
Next is the tools and stack in action. There are multiple tools we use as part of these personalization engines, a whole stack we're going to talk about now.
OpenTelemetry is one of them: an open source standard for instrumentation that collects metrics, traces, and logs in a unified way. It comes with SDKs, so we can integrate it into the personalization microservices and easily export telemetry.
Then there are Prometheus and Grafana. All the metrics can be blended in, and Prometheus can handle that heavy load at scale, while Grafana provides dashboards and visualization, even with high request rates.
Then we have distributed tracing systems. Jaeger or Zipkin can be deployed for tracing, receiving spans from OpenTelemetry; Jaeger and Zipkin are great distributed tracing systems, so we can use those tools.
And there's the logging pipeline, for example ELK: Elasticsearch, Logstash, and Kibana, where session IDs or user IDs can be easily tracked and traced down, even across the multiple channels a request comes through. We can stitch the customer's requests together and easily troubleshoot them.
The other thing is APM and advanced tools. In addition to these open source tools, there are Dynatrace and Datadog; these APM solutions provide very good anomaly detection on metrics and AI-assisted root cause analysis, and they can complement the DIY stack by catching subtle issues.
We have through, and there is this this is the same thing,
like what are the tools enabling observability scale, instrumentational
and telemetry collection.
We can use the open telemetry and for metrics collection dashboards,
we can use Prometheus Grafana.
And for distributed tracing, we can use Jaguar Grafana tempo
and log aggregation and search.
We use Elastic Elk Stack and for advanced monitoring and a PM as
we discussing time now, trace and Log or New Relic they're too good.
And for trays, log storage at Scale.
Kafka Open Telemetry Collector, or Click House.
At, on data Lake those are, those, these are the good tools that could
be helpful for personalization engine built in products.
So these tools really helpful for observability.
Next is the platform impact of scaled observability. If we use observability well, the impacts can be significant.
First, a dramatic reduction in outages: with robust observability, issues are caught and resolved faster, and one case study saw the time to resolution of issues improve by 75% after adopting modern observability practices. So it really helps with reducing outages.
Another is improved performance and reliability. The team can identify bottlenecks, clearly see where exactly performance falls short, improve the component's performance, and more easily solve customer issues. That's the important thing in a real-time environment: delivering quickly with no issues.
There is also higher confidence in deployment. Observability data gives developers a good boost, and not just internal developers, it benefits external customers as well: teams can push updates faster, and whenever a new model or new feature is introduced, they can incorporate it immediately and see how it's behaving. This improves the velocity of experimentation in personalization.
Another is better alignment with business goals. Because telemetry is tied into engagement metrics, the platform team now speaks the same language as the product owners; they can demonstrate how latency has improved and so on. So product owners, business goals, and developers can be aligned properly when we have proper observability.
And the cost is worth it when the benefits are weighed. Initially, adding so much instrumentation raised concerns about cost: hey, we have to bring in this tool and that tool, and now costs have gone up. But in fact it brought a lot of observability, a lot of insight into what's happening in our product. So putting money into those tools is worthwhile when compared to the revenue we're going to get from having a good product.
Next are lessons learned and best practices.
Always build in observability from day one; don't bolt on monitoring later. Design personalization systems with observability in mind. Don't think only about performance; always think about observability too. Whenever we deploy a product into production, we should always think about how easily we can track and troubleshoot if an issue comes in. So that's one thing.
And treat logs, metrics, and traces as first-class citizens. All three telemetry types go hand in hand, so we should have detailed logging of the important components. Always treat those things as first-class citizens.
Another is to design for visibility, not just performance. Whenever developers build something, they always think, how can I improve the performance? They always wear that performance hat. But there is another side we have to look at, which is observability: how well can I write my code for observability, so that logging and all the system interactions are captured properly, and what tools have I built in to track these things once I push to production?
Next, educate the team. Train new engineers and data scientists on the importance of observability, give them knowledge of these practices and tools, and encourage that culture so that incident response becomes more collaborative and effective.
That way issues can be fixed and troubleshot easily, and our product health improves.
The other lesson is to iterate and improve: observability is not set-and-forget. Continuously refine what you monitor. Today you build in something to monitor, dashboards or logging or anything else, but tomorrow you introduce a new use case or a new feature, so you have to adapt and enhance your observability so that the new use cases are tracked properly as well.
Now, final thoughts and a call to action.
Observability is product-critical. In real-time ML environments, monitoring isn't just an ops concern; it directly influences user experience and trust in the product. So we always have to be careful: observability is product-critical.
And there's trust through transparency. As we discussed earlier, the model or recommendation engine part is a black box: are we always recommending the right content at the right time to the customers or not? Having proper observability at the model level is the key, so we can clearly see what is going on and why the conversion rate dropped.
And take action: evaluate the current observability in the projects you're working on. Step back and look at what observability you have, and ask: when things go bad, how can I fix them immediately and keep my product healthy so it keeps bringing in revenue, without any disruption to production?
And there's continuous improvement. Scaling observability is an ongoing journey: use telemetry insights to drive improvements, and always continuously improve the metrics, use cases, and everything else in the product.
As a call to action: embrace the mindset that if you can't measure it, you can't improve it, right? Bring clarity to the complexity of real-time personalization by observing it at the model level, and your personalization engine and your users will thank you later. So put on that observability mindset.
Okay, thank you. That's all from my end. You can reach me for anything through my LinkedIn profile. Have a wonderful rest of the conference.
Thank you.