Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm Gar Siner.
I work at Google Cloud and I'm excited to talk to you all today about
building intelligent retention engines that can help with real time churn
detection using AI on Kubernetes.
As part of my prior experience, I have worked across multiple organizations, mainly digital marketplaces: in the past, that would be eBay, Walmart, and StockX.
One of the common themes that most of the digital marketplaces had was trying
to understand and reduce customer churn.
You'd say, why is that important?
I would say for four major reasons.
One, organizations want to make sure that they're able to increase revenue, which is based on customer lifetime value, which in turn goes back to making sure that the customer is retained in the system.
Next, they also wanna make sure that the customer acquisition cost, or CAC, is low. It normally costs more to acquire a new customer than to retain an existing customer.
Third, targeted retention campaigns. Having a system that's able to identify and reduce churn ensures that marketing and customer success teams can deploy highly targeted and personalized offers or interventions instead of broad, untargeted campaigns.
And lastly, I would say, improved operational efficiency.
A system that's able to automate the process of identifying at-risk customers
and even trigger automated workflows can free up bandwidth to focus on more
strategic and important actions that can evolve the customer journey even further.
Alright, so without further ado, let's jump straight into it.
My agenda for today is gonna be, firstly, talking about the churn challenge, which will focus on addressing why traditional retention models fail and what makes them insufficient in today's digital marketplaces.
Next, we'll touch upon AI-powered detection and how AI really helps scale up where traditional models fail.
Third, we'll talk about the technical infrastructure, where I detail my proposed solution.
We'll talk about what this means in terms of successful implementation
and deploying into production.
Fourth, we look into ethical considerations. Whenever we talk about AI, fairness, model accuracy, user privacy, and consent are very important topics that come up.
And so we need to make sure that we are able to address those, and
then we finally close and look at what the future looks like.
All right, so touching upon why traditional retention models fail. I would say there are a few reasons why these models aren't necessarily well suited for today's modern digital environment.
One, they're reactive, not proactive. What this means is, by the time they have processed patterns and produced an output, it's generally too late in the game: the customer has already churned, and we can't do much retroactively.
Next, they generally are based on cohort segmentation logic, which means they're focused on homogeneous cohorts of customers but aren't necessarily able to take unique customer characteristics or behavioral trajectories into consideration.
Third, they normally are based on limited data, which is generally transactional, so they are missing attributes such as customer communications, support tickets, even payment requests and product behavior patterns. Those are things which I think are essential to make sure we have a holistic understanding of the customer.
And lastly, they aren't necessarily very flexible. They have fixed rules and are built on very static thresholds, which makes it hard to evolve in the modern environment and adapt to marketplace dynamics, different seasonality aspects, and changing customer expectations.
So then, talking about how AI helps in this case. This is a new age of AI.
I would say there's three major advantages that AI can offer here.
One, instantaneous response times.
So we want models that are able to detect churn signals and trigger interventions
right when the behavior happens.
And so they need to be proactive rather than reactive.
Second, we want these models to be very dynamic, so we wanna make sure that
they're able to continuously learn from different data streams, they're able
to adapt their detection algorithms and also evolve to more changing customer
dynamics as well as market dynamics.
And lastly, these models need to be multi-source capable, which means they're able to look at not just transactional data but also support interactions, product usage metrics, and engagement patterns.
So what all this would enable us to do is to have a very holistic and comprehensive
churn risk assessment, and that's where it can really scale up versus how
traditional models have normally worked.
Alright, talking about the Kubernetes-native AI architecture foundation, right?
Building this real time system requires a cloud native foundation
that can scale dynamically with data volume and model complexity.
So this is where Kubernetes comes in, right?
Kubernetes is the essential orchestration layer for these containerized AI workloads.
Let's break this down for simplicity.
What do we mean by containerized AI workloads?
What do we mean by orchestration layer?
Let's think about this like an orchestra, right? A musical orchestra that many of us must have attended.
We have different musicians playing instruments, but then we have a
central conductor who is coordinating across these different people.
He's making sure that the music we listen to is melodious. He makes sure that when the drums are playing, the banjo is probably gonna be silent, or when the flute is playing, the guitar is not gonna be playing at the same time. Orchestrating these different aspects is the essential role of a conductor.
So think about Kubernetes as that conductor in this cloud native environment, able to scale, deploy, and automate different workflows and different apps, in this case, different containers. It basically makes sure that we have a very scalable infrastructure to deploy different machine learning algorithms.
Alright, so then touching upon the machine learning techniques themselves that I would say are most essential for churn detection. As of today, there are three that I would want to touch upon. One is recurrent neural networks, or RNNs. RNNs are able to process sequential customer behaviors, right? They're able to identify patterns along a continuous timeline, and they generally look at aspects that would help flag changes in behavior, such as login frequency. But then, RNNs tend to focus more on recent behaviors and aren't necessarily able to track historical behavior patterns.
We also have LSTMs, which are long short-term memory models. How they differ is that they're able to capture both immediate and distant behavior signals; they can actually detect gradual disengagement that happens over weeks or even months. And then, bringing it all together, there are ensemble models, such as random forests and gradient boosting, which integrate deep learning with traditional machine learning for a much more comprehensive risk assessment.
So think about a customer who has been logging in every week, and then suddenly they don't log in for certain weeks. Trying to understand if that behavior is normal or truly out of the ordinary is something these models can help detect.
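To make that example concrete, here is a minimal illustration (not one of the models above, just the underlying intuition): a z-score check on a customer's own login gaps. The threshold and data shape are assumptions for illustration.

```python
from statistics import mean, stdev

def login_gap_anomaly(login_days, threshold=2.0):
    """Flag whether the most recent gap between logins is unusually long
    compared to this customer's own history (a simple z-score check)."""
    # Gaps between consecutive login days, e.g. weekly logins -> gaps of 7.
    gaps = [b - a for a, b in zip(login_days, login_days[1:])]
    if len(gaps) < 3:
        return False  # not enough history to judge
    *history, latest = gaps
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest > mu  # perfectly regular history: any longer gap stands out
    return (latest - mu) / sigma > threshold

# A customer who logged in weekly, then went silent for a month:
print(login_gap_anomaly([0, 7, 14, 21, 28]))      # False
print(login_gap_anomaly([0, 7, 14, 21, 28, 58]))  # True
```

A real model would learn this pattern per segment and per customer; the point is that the baseline comes from the customer's own trajectory, not a fixed rule.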
Okay, so the next few slides, I'm actually gonna touch upon the core technologies
and orchestration that's needed to implement the system that I'm proposing.
So starting off with the different data sources and feature engineering. Data is an essential element of any machine learning model, right?
So when we talk about data, what data are we truly looking at?
We did reference transactional data in the prior slides, but we also want to understand session-based data: session duration, frequency, feature adoption, cart abandonment and add-to-cart behaviors, interactions with support tickets, payment timings. All of these, I would say, are important data elements that need to be captured to make sure we really get a holistic understanding of customer behavior.
Crucially, we also need to analyze micro-interactions, such as click-through rates and navigation patterns, because these granular signals can often precede very visible churn indicators by weeks or months, and this really gives us that intervention window to make sure we're able to act at the right time.
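As an illustration of the kind of feature engineering described here, a small sketch that turns a raw event log into model features. The event schema and the 30-minute session cutoff are assumptions, not a prescribed format.

```python
from collections import Counter

def engineer_features(events):
    """Turn a customer's raw event log into churn-model features.
    Each event is a dict like {"type": "click", "ts": 1000} (illustrative schema)."""
    by_type = Counter(e["type"] for e in events)
    timestamps = sorted(e["ts"] for e in events)
    sessions = 1
    for a, b in zip(timestamps, timestamps[1:]):
        if b - a > 1800:  # a 30-minute gap of inactivity starts a new session
            sessions += 1
    carts = by_type["add_to_cart"]
    purchases = by_type["purchase"]
    return {
        "sessions": sessions,
        "events_per_session": len(events) / sessions,
        "support_tickets": by_type["support_ticket"],
        "cart_abandonment_rate": (carts - purchases) / carts if carts else 0.0,
    }
```

In production these features would be computed continuously over the event stream rather than in batch, but the derived signals are the same.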
Talking about the system or the proposed architecture itself, I would
say there's three major components that I'm going to touch upon.
Here we're gonna look at Apache Kafka. We'll talk about Kubeflow next, and then Knative after that. But let's start here with Apache Kafka.
So Kafka is the heart of the real-time system, and it helps to stream all customer interaction events, whether from mobile apps, web applications, or backend services, and it helps create a unified event log. The Kafka Streams API is a real-time feature extraction and aggregation system that helps detect anomalies and enables pattern recognition based on how the data is flowing through. These processed events and patterns then trigger model inference requests that enable instant churn-risk scoring and automated intervention workflows.
So again, putting it all together in the example of the conductor: Kafka is basically the system that picks up everything that's happening all around and makes sure that information is available to the conductor and the musicians when needed.
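As an illustration of the kind of windowed aggregation the Kafka Streams API performs, here is a small pure-Python stand-in (no broker involved; the one-hour window and the per-customer keying are assumptions for the sketch):

```python
from collections import defaultdict, deque

class ChurnSignalAggregator:
    """Toy stand-in for a Kafka Streams windowed aggregation: counts each
    customer's events in a sliding time window, a basic engagement feature."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = defaultdict(deque)  # customer_id -> recent event timestamps

    def consume(self, customer_id, ts):
        q = self.events[customer_id]
        q.append(ts)
        while q and ts - q[0] > self.window:  # evict events outside the window
            q.popleft()
        return len(q)  # windowed event count for downstream churn scoring
```

A real Streams topology would key the aggregation by customer, run it across partitions, and emit the result to an output topic that the inference service consumes.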
All right.
The next part of this system is Kubeflow for machine learning pipeline orchestration. Kubeflow is an essential orchestration framework for managing the complex machine learning workflows that we talked about earlier with the RNN and LSTM models. Basically, think about this as the brain of the engine. It enables us to build very complex and specific algorithms that are able to understand and adapt to different customer behavioral patterns.
The advantage that Kubeflow offers is that it's able to automate model training pipelines that retrain churn detection models as new data arrives, making sure that predictions remain accurate as customer behaviors evolve. It also supports experiment tracking, hyperparameter tuning, and even model versioning, all of which are critical aspects needed to maintain a production-ready system. The other thing it enables is that data scientists can work on testing and building new algorithms or frameworks in the backend while DevOps can manage deployment.
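To illustrate the retrain-validate-deploy loop described here, a minimal pure-Python sketch; this is not actual Kubeflow pipeline DSL, and the step functions `train`, `validate`, and `deploy` are injected placeholders assumed for illustration:

```python
def retraining_pipeline(new_events, old_model, train, validate, deploy):
    """Sketch of the automated retraining loop an orchestrator like Kubeflow
    would run: train on fresh data, validate against the current model,
    and deploy only if the challenger scores better."""
    candidate = train(new_events)
    if validate(candidate) > validate(old_model):
        return deploy(candidate)
    return old_model  # keep serving the champion model
```

In Kubeflow each step would be a containerized pipeline component with its own resources, and the comparison gate would be a validation step recorded alongside the experiment metadata.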
And then there's the third part of the system, which is Knative, which enables serverless AI deployment.
Again, what does this truly mean? In the example of our orchestra, although the conductor is the person responsible for making sure everything works as expected, you also need a stage manager, right?
A stage manager is looking at how the audience is reacting. Are people leaving? Do we need the lights to be on or off? Do we need to make sure a certain musician goes on or off the stage? Handling these aspects, and just the ability to scale based on behavior patterns, is what a stage manager does.
So in this case, Knative is that kind of stage manager, right? It enables automation, deployment, and even scaling of these workflows in a very dynamic manner, without human interaction.
So in an e-commerce kind of environment or a digital marketplace setup, what this means is, say on a Black Friday or a Cyber Monday, you need to make sure all systems are up and running and able to scale up, and this needs to be really agile. But at the same time, on certain days where there is really less customer interaction, it should also be able to scale down and basically be very efficient at it.
So this is where Knative comes in handy. Knative actually integrates pretty seamlessly with Kafka, and it can trigger model inference functions only when specific behavior events occur. This truly makes the churn prevention system very robust.
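To illustrate the scale-up/scale-to-zero behavior described here, a toy sizing function; the capacity numbers are assumptions, and Knative's real autoscaler is considerably more sophisticated (concurrency-based, with windows and panic modes):

```python
def desired_replicas(events_per_second, per_replica_capacity=50, max_replicas=20):
    """Toy version of the scale-to-zero behavior Knative provides: size the
    inference service to current traffic, and drop to zero when idle."""
    if events_per_second == 0:
        return 0  # scale to zero: no pods, no cost, until the next event arrives
    needed = -(-events_per_second // per_replica_capacity)  # ceiling division
    return min(needed, max_replicas)

print(desired_replicas(0))     # 0 on a quiet day
print(desired_replicas(5000))  # capped at max_replicas on Black Friday
```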
Alright, so before we go to the next slide, I just wanted to summarize the few things we talked about. We looked at Kafka, Kubeflow, and Knative. So again, Kafka is the system that enables collecting signals across different data sources: clicks, payments, logins, transaction data. Kafka Streams, which integrates with Kafka, helps process patterns and cleans and aggregates the data. Kubeflow is what enables us to run the AI models, the LSTMs, RNNs, and ensemble models, on Kubernetes. And then finally, Knative enables serverless deployment, the ability to interact with CRM systems, and makes sure that the whole system works in a very automated manner.
Okay, so let's talk about what this means in terms of production
infrastructure requirements.
I would say there are three aspects that are essential. First is having a unified data platform, something that enables both batch processing as well as real-time processing, so we need a good unified customer data platform.
Second is having APIs for CRM integration. RESTful and GraphQL APIs enable real-time synchronization and ensure that these intervention workflows have complete customer context.
And then the third thing is having scalable compute pipelines.
So this helps containers have automatic resource allocation and scale up or down
as the workload requirements change.
All three of these, I would say, are essential elements to make this a production-ready system.
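As a sketch of what the CRM integration over a REST API might exchange, here is an illustrative payload builder. All field names and the 0.7 threshold are assumptions, not a real CRM schema:

```python
import json

def churn_alert_payload(customer_id, risk_score, top_signals):
    """Illustrative shape of the request an inference service might POST to a
    CRM's REST endpoint to open a retention case (field names are assumed)."""
    return json.dumps({
        "customer_id": customer_id,
        "churn_risk": round(risk_score, 3),
        "signals": top_signals,  # which behaviors drove the score
        "action": "open_retention_case" if risk_score > 0.7 else "monitor",
    })
```

The real-time synchronization mentioned here means this payload would be sent at the moment the score crosses the threshold, so the customer success team sees fresh context, not yesterday's batch.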
All right.
So putting this into what it means in terms of model performance and monitoring.
Although the system should work in an automated manner once it's set up, we need to ensure that we're able to monitor model accuracy, prediction latency, and even data drift. The advantage is that Kubernetes-native tools like Prometheus and Grafana actually provide these insights out of the box.
And to maintain reliability, we also need to make sure that any updates happen in a safe and accurate manner.
Again, what Kubernetes allows us to do is validate any new model versions against production traffic without impacting customer experience, and that's truly where it sets itself apart.
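As a minimal illustration of the data-drift monitoring mentioned here, a simple mean-shift check. Real systems use richer statistics (PSI, KS tests); the 25% tolerance is an assumption:

```python
from statistics import mean

def feature_drift(training_values, live_values, tolerance=0.25):
    """Minimal data-drift check: flag a feature whose live mean has shifted
    more than `tolerance` (as a fraction) from its training-time mean."""
    base, current = mean(training_values), mean(live_values)
    if base == 0:
        return current != 0
    return abs(current - base) / abs(base) > tolerance
```

A drift alert like this, exported as a Prometheus metric, is what would trigger the Kubeflow retraining pipeline discussed earlier.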
All right, so bringing it all together, let's talk about what this means in terms of an implementation roadmap.
So there's four steps to implement the system in a production environment.
First is the foundation setup: deploying the Kubernetes clusters with Kubeflow, establishing the data pipeline architecture, and having basic event streaming with Kafka to make sure that real-time data is getting into the system.
The second aspect is model development, having the machine learning models constructed, since the system would eventually run with these models and automatically update them as it goes along. Developing and training these models using historical data, implementing feature engineering pipelines, and establishing model validation frameworks are the essential aspects of the model development stage.
Once the models are developed, the production deployment is done using Knative. As I mentioned, serverless allows for scalability and automation of these workflows. It also helps integrate with CRM systems and implements monitoring and alerting for production workflows.
And finally, over the longer term, I would say optimization and scaling is an essential component of the system. Implementing A/B testing for model improvements, enhancing real-time processing capabilities, and scaling infrastructure based on production testing are very critical aspects to make sure this system continues to operate as expected over the longer term.
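To make the A/B testing step concrete, here is a small two-proportion z-test sketch for comparing retention between the current model (control) and a challenger (variant). This is a standard statistical test, shown purely as an illustration of how a model improvement would be judged:

```python
from math import sqrt

def ab_retention_lift(control_retained, control_n, variant_retained, variant_n):
    """Two-proportion z-test for an A/B test on retention rates.
    Returns (absolute lift, z-score); |z| > 1.96 ~ significant at 95%."""
    p1 = control_retained / control_n
    p2 = variant_retained / variant_n
    pooled = (control_retained + variant_retained) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    return p2 - p1, (p2 - p1) / se
```

In practice the split itself would be handled at the serving layer (e.g. weighted traffic between two model revisions), and this computation would run over the logged outcomes.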
Alright, let's talk about model fairness and transparency.
A truly efficient system must also be an ethical one.
So bias detection, and the ability to mitigate it, is an important aspect.
So we need to regularly audit predictions across different customer segments
to make sure that there's equitable treatment and no discriminatory outcomes.
What could this mean? Say there are certain customers who don't necessarily log in very frequently. We don't want to exclude these customers just because of certain behavioral patterns. We wanna make sure the system is all-encompassing and able to address different kinds of behavior.
Second, having an explainable AI implementation, right? Integrating tools like SHAP and LIME basically provides interpretable explanations for churn predictions. Why is this critical? It helps customer success teams understand why a customer is at risk and how to intervene effectively. And finally, having good documentation.
So algorithms can run in the backend, but we need to make sure that there's
enough logging of model decisions and intervention triggers, both for compliance
purposes as well as to make sure that we can continuously improve the system and
our retention strategies as we go along.
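To illustrate explanation and decision logging together, here is a toy linear-model attribution, a stand-in for the SHAP/LIME-style output mentioned above, plus the structured audit record each prediction should leave behind. All field names are assumptions:

```python
import json

def explain_and_log(weights, features, customer_id, model_version="v1"):
    """Toy per-feature attribution for a linear scorer (weight * value),
    bundled into an audit record for compliance and later review."""
    contributions = {name: weights[name] * value
                     for name, value in features.items()}
    record = {
        "customer_id": customer_id,
        "model_version": model_version,
        "churn_risk": sum(contributions.values()),
        "explanation": contributions,  # which features drove the score, by how much
    }
    return json.dumps(record)  # ready to ship to a log store
```

SHAP and LIME produce exactly this kind of per-feature attribution for non-linear models; the linear case just makes the arithmetic transparent.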
All right.
Talking further on the aspects of ethics, in terms of user consent and privacy management, retention systems must adhere, I would say, to privacy-by-design principles and comply with regulations like GDPR and CCPA.
Retention is not just about keeping customers, it's about
actually keeping their trust.
So the ML models need to be flexible enough to adapt feature selection based on individual customer consent preferences.
compliance workflows like consent verification, data retention policies
as part of the ML pipeline in Cube flow.
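As a sketch of that consent-aware feature selection, a small filter that drops any feature whose data category the customer has not consented to. The category tags and consent keys are assumptions for illustration:

```python
def consented_features(features, consent):
    """Keep only features whose data category the customer consented to.
    `consent` maps categories like "behavioral" or "support" to True/False."""
    categories = {  # illustrative mapping of feature -> data category
        "login_frequency": "behavioral",
        "cart_abandonment_rate": "behavioral",
        "support_tickets": "support",
        "payment_history": "financial",
    }
    return {name: value for name, value in features.items()
            if consent.get(categories.get(name, "behavioral"), False)}
```

Running this as a verification step in the pipeline means a consent change takes effect on the very next prediction, which is the privacy-by-design behavior described above.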
So what does this all mean when we talk about the future?
Before we talk about next steps, I just wanted to recap what we
talked about and give the roadmap of how this all would come together.
I would say there are four steps to building the system: having the foundation set up using Kubernetes, Kafka, and Kubeflow; training and deploying our models, which could be RNNs, LSTMs, or ensemble models, again on Kubeflow; then production deployment using Knative, making sure there's CRM integration there; and then optimizing and scaling these models using A/B testing and monitoring over the longer period of time.
So in summary, I would say intelligent retention engines are a convergence
of advanced machine learning techniques or algorithms, cloud native
infrastructure and real time data.
And by leveraging this Kubernetes-native AI architecture that I just proposed, I believe organizations can actually build systems that are not just predictive but truly intelligent.
They are scalable, ethical, and also responsive, so they're able to
detect churn signals much earlier before they become irreversible.
So what I just reviewed is essentially a blueprint for building a super smart, proactive customer retention engine that can run on the cloud. Its main job is to make sure that customers don't churn, that they don't leave before we even know they're unhappy. And this can truly help transform customer retention strategies using cloud native AI.
So again, thank you so much for your time.
I hope you found this helpful.
It's been a pleasure talking to you all.
If any of you want to connect with me, feel free to reach out on LinkedIn at linkedin.com/in/ker, or I'd be happy to connect to discuss anything further at all.
Thanks again and have a great day.