Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good afternoon.
My name is Mi Riz and I'm Chief Data Officer at More TV.
Today we'll talk about how we approach the uplift modeling challenge at More TV: how we frame the problem, the data we use to solve it, and how we work with that data.
Next, we'll focus on the process of training the model and the challenges we tackled.
Finally, we'll review the results we obtained. First, a brief overview of our company.
More TV is an online video service where users can buy a subscription and renew it monthly, or choose to watch ads instead of subscribing. In the ad-supported scenario, we earn revenue as an advertising platform.
Our goal is to grow our subscription base, and one way to do this is by offering users a discount.
The logic is simple.
It's better to earn somewhat less from a user if that secures a recurring payment, which is still more valuable to us than advertising revenue.
However, the challenge is to identify the users who would have subscribed without a discount, so we don't lose profit unnecessarily. Uplift modeling helps us solve this problem.
The goal of applying uplift modeling is to predict how a user's target variable, the subscription purchase, will differ if they are targeted, for example offered a discount, versus if they are not targeted.
We can't both target and not target the same person, so we rely on the average treatment effect: the difference in subscription purchases between the group that received the discount (test) and the group that didn't (control), because we randomly split users into test and control.
This metric matters to us in addition to prediction accuracy.
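As a minimal sketch of what this looks like on the experiment log, here is how the average treatment effect can be computed from a randomized split; the column names (`treated`, `converted`) are hypothetical, not our actual schema.

```python
import pandas as pd

# Hypothetical experiment log: one row per user, with a random treatment
# assignment and the observed subscription outcome.
df = pd.DataFrame({
    "treated":   [1, 1, 0, 0, 1, 0, 1, 0],   # 1 = offered the discount (test group)
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],   # 1 = bought a subscription
})

# Average treatment effect: conversion rate in the test group
# minus conversion rate in the control group.
ate = (df.loc[df["treated"] == 1, "converted"].mean()
       - df.loc[df["treated"] == 0, "converted"].mean())
print(f"ATE = {ate:.3f}")
```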
Our main non-functional requirements are stability and adaptability.
They follow from the high cost of user acquisition, and hence of user communication data, and from the elastic nature of demand for movies and TV shows.
Since there are frequent new releases, the model must not overfit to specific shows but should focus on more timeless indicators.
For training, we use data from an experiment in which the test group was offered a free six-day trial extension, while the control group received no offer. A banner was shown on the website so that the offer couldn't be missed.
After the bonus trial period ended, we checked whether those users converted to a paid subscription or not. Hence, we know whether each user was offered the extension and whether they ultimately subscribed.
We derive a set of features per user based on how they interact with Smart TV, the player, the search function, and project pages.
For example, for the player, it's how often the user watched projects: TV shows, movies, et cetera. For the search function, it's the number of queries that yielded no results and the number that yielded results. The features from the project pages are similar in spirit to the player ones, and activity in the player is where most features come from.
Of course, we log many more data points, but we select the features most indicative of eventual subscription.
Our selection criterion was the Gini score relating each feature to conversion.
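To illustrate this kind of screening, here is a minimal sketch that scores each feature against conversion with a normalized Gini, computed as 2 * AUC - 1; the feature names and the synthetic data are hypothetical, not our real feature set.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical per-user feature table with the observed conversion label.
df = pd.DataFrame({
    "active_days":     rng.integers(0, 30, n),
    "player_sessions": rng.integers(0, 100, n),
    "empty_searches":  rng.integers(0, 10, n),
})
df["converted"] = (rng.random(n) < 0.05 + 0.10 * (df["active_days"] > 14)).astype(int)

def feature_gini(feature: pd.Series, target: pd.Series) -> float:
    """Normalized Gini of a single feature with respect to conversion."""
    return 2 * roc_auc_score(target, feature) - 1

scores = {col: feature_gini(df[col], df["converted"])
          for col in ["active_days", "player_sessions", "empty_searches"]}
print(sorted(scores.items(), key=lambda kv: -abs(kv[1])))
```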
It's a standard flow: we split the data into train and test sets, train the model, then validate it on a hold-out set. We measure the score on both sets and make sure there is neither overfitting nor underfitting.
That's it, our model is ready.
What is the score we used?
Gini here is the difference in the share of successful conversions between the test and control groups. For the entire sample, that difference is always the same; it's static.
To incorporate the model's predictions, we sort users in descending order of predicted uplift, from high to low, and for each cut-off we calculate the Gini among the users with a higher score. Plotting this Gini against the number of users produces something like the solid line on the chart: the Gini curve, which represents the additional revenue, the potential extra gain from targeting.
We compare it with random assignment, the straight baseline, and the area between the two is the Gini score we seek to maximize. That is our key metric.
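Here is a minimal sketch of that curve construction as just described: sort users by predicted uplift, accumulate the test-minus-control conversion gap at each cut-off, and take the area between that curve and the straight random-targeting baseline. The function and variable names are hypothetical, and this is a simplified stand-in for the production metric.

```python
import numpy as np

def uplift_gini(uplift_pred, treated, converted):
    """Area between the cumulative uplift ("Gini") curve and the random baseline."""
    order = np.argsort(-uplift_pred)             # users sorted by predicted uplift, descending
    treated, converted = treated[order], converted[order]

    n = len(treated)
    cum_t = np.cumsum(treated)                   # treated users covered so far
    cum_c = np.cumsum(1 - treated)               # control users covered so far
    conv_t = np.cumsum(converted * treated)      # conversions among treated
    conv_c = np.cumsum(converted * (1 - treated))

    # Conversion-rate gap at each cut-off, scaled by the number of users covered.
    rate_gap = conv_t / np.maximum(cum_t, 1) - conv_c / np.maximum(cum_c, 1)
    curve = rate_gap * np.arange(1, n + 1)

    # Random targeting grows linearly towards the overall (static) uplift.
    baseline = np.linspace(curve[-1] / n, curve[-1], n)
    return float(np.mean(curve - baseline))      # area between curve and baseline, normalized

# Toy usage with synthetic labels; replace with real predictions and experiment data.
rng = np.random.default_rng(1)
n = 1_000
treated = rng.integers(0, 2, n)
converted = (rng.random(n) < 0.10 + 0.05 * treated).astype(int)
print(uplift_gini(rng.random(n), treated, converted))
```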
Now, about how we split the data. Initially we tried random splits: each user had an equal probability of ending up in the training or the test set.
However, we noticed that depending on how the split was done, model performance could swing to opposite extremes: either close to ideal or no better than random ranking.
To confirm that the issue was not merely unlucky validation sets, we fixed the validation set and assembled the training set from the remaining samples randomly, with replacement, similar to a bootstrap approach.
The validation score still showed a wide variance and remained unstable even after 400 experiment iterations. This indicated that we needed a different splitting approach to achieve stable results.
You can see here that the 95% interval of the score stays stubbornly wide.
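A minimal sketch of that stability check: fix one validation set, resample the training pool with replacement, and watch how the validation score varies. Here `train_and_score` is a hypothetical callback that trains a model and returns its validation Gini.

```python
import numpy as np

def bootstrap_stability(df, train_and_score, valid_frac=0.2, n_iter=400, seed=0):
    """Fixed hold-out set + bootstrap training sets; returns the spread of the score."""
    rng = np.random.default_rng(seed)
    valid = df.sample(frac=valid_frac, random_state=seed)   # validation set stays fixed
    pool = df.drop(valid.index)

    scores = []
    for _ in range(n_iter):
        # Draw a training set of the same size as the pool, with replacement (bootstrap).
        train = pool.sample(n=len(pool), replace=True,
                            random_state=int(rng.integers(1 << 31)))
        scores.append(train_and_score(train, valid))

    return float(np.std(scores)), np.percentile(scores, [2.5, 97.5])
```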
Our solution was to split by user registration date. We put the more experienced, older users into the training set and the newest users into the test set.
This also nicely mirrors how the model is used in production, where it predicts for new arrivals. We used an 80% to 20% train/validation split and trained a random forest from a causal uplift library.
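A minimal sketch of the date-based split, assuming a DataFrame with a `registration_date` column (a hypothetical name); the plain scikit-learn random forest here is just a stand-in for the actual uplift forest.

```python
from sklearn.ensemble import RandomForestClassifier

def split_by_registration_date(df, train_frac=0.8):
    """Oldest users go to the training set, the newest users to validation."""
    df = df.sort_values("registration_date")
    cut = int(len(df) * train_frac)
    return df.iloc[:cut], df.iloc[cut:]

# Hypothetical usage, where `features` lists the feature columns and `target` is conversion:
# train_df, valid_df = split_by_registration_date(df)
# model = RandomForestClassifier(n_estimators=300, random_state=0)
# model.fit(train_df[features], train_df[target])
```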
Looking at the graph, however, the users spread across a few distinct peaks. You can't isolate the most promising users with just one threshold: there are three spikes, a first, a second, and a third, so unfortunately a single threshold doesn't work.
So we tried another approach as well.
The second approach was to stratify the data. If the features in the training and test sets are distributed similarly, and similarly to the population distribution, the model's results should be more stable.
But which features do we use to stratify? Including them all produced too many strata, so we settled on three: the user's number of active days, whether they converted, and whether they were offered the discount.
We chose the number of active days because it has the highest Gini score for predicting subscription conversion.
Our goal is to end up with an 80% to 20% split into train and test sets.
After stratifying, we select 50% of the data as the initial training set. From the remaining half we repeatedly sample 1,000 examples, preserving the stratum proportions, add them to the training set, train a model, and then test it on the leftover data.
If the Gini on train and test differs by less than 5% while outperforming random ranking, we keep these 1,000 samples in the training set; otherwise, we discard them. We continue until the training set reaches 80% of the whole data.
This reduced the number of peaks on the graph somewhat, to two peaks, but we still didn't see the desired tight clustering at the top of the ranking, so we still can't choose a threshold.
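Here is a rough sketch of that incremental, stratum-preserving growth of the training set, under my reading of the procedure; `gini_score` and the strata columns are hypothetical placeholders, and the constants follow the numbers above (batches of 1,000, 5% agreement, stop at 80%).

```python
import numpy as np
import pandas as pd

def grow_training_set(df, strata_cols, gini_score, batch=1_000,
                      target_frac=0.8, tol=0.05, max_iter=10_000, seed=0):
    """Start from a stratified 50% training set and add stratified batches,
    keeping a batch only if the train and test Gini agree within `tol`."""
    rng = np.random.default_rng(seed)
    train = df.groupby(strata_cols, group_keys=False).sample(frac=0.5, random_state=seed)
    pool = df.drop(train.index)

    for _ in range(max_iter):
        if len(train) >= target_frac * len(df) or len(pool) < batch:
            break
        # Sample a batch that preserves the stratum proportions of the remaining pool.
        candidate = pool.groupby(strata_cols, group_keys=False).sample(
            frac=batch / len(pool), random_state=int(rng.integers(1 << 31)))
        new_train = pd.concat([train, candidate])
        leftover = df.drop(new_train.index)

        g_train, g_test = gini_score(new_train), gini_score(leftover)
        relative_gap = abs(g_train - g_test) / max(abs(g_train), 1e-9)
        if relative_gap < tol and g_test > 0:            # stable and better than random
            train, pool = new_train, pool.drop(candidate.index)
        # otherwise the candidate batch is discarded and a new one is drawn
    return train
```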
Our third approach was Thompson sampling. The idea is that the remaining 50% of the data, the part not in the training set, is split into clusters, and we pick data from the cluster with the higher rate of successful training attempts. This is a way to incorporate prior iteration outcomes and any environmental changes.
For each cluster, we assign a Beta distribution and update its parameters based on whether the training attempt was successful or not. This algorithm is similar to the previous one, except that instead of pulling data from the entire pool, we pull it from just one of the four clusters, and after training we adjust the Beta distribution parameters for that cluster.
In theory, it should converge to better outcomes. Unfortunately, in about half of the cases it didn't converge within 10,000 iterations. So now we don't have any peaks, but it's still hard to choose a threshold.
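A minimal sketch of the Thompson sampling loop over clusters, assuming a hypothetical callback `attempt_is_successful(cluster_id)` that tries to add a batch from that cluster to the training set and reports whether the train/test Gini check passed.

```python
import numpy as np

def run_thompson_sampling(n_clusters, attempt_is_successful, n_iter=10_000, seed=0):
    """Pick clusters via Beta-distributed success rates and update them online."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_clusters)   # 1 + successful training attempts per cluster
    beta = np.ones(n_clusters)    # 1 + failed training attempts per cluster

    for _ in range(n_iter):
        # Sample a plausible success rate for each cluster and pick the best draw.
        draws = rng.beta(alpha, beta)
        c = int(np.argmax(draws))

        if attempt_is_successful(c):
            alpha[c] += 1         # reward the cluster that produced a stable model
        else:
            beta[c] += 1          # penalize it otherwise

    return alpha, beta            # posterior parameters after all iterations
```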
Our final method combines date-based splitting by registration date with Thompson sampling. We first pick the oldest 50% of users for training and then draw from the other half, cluster by cluster, via Thompson sampling.
This approach almost always converges and learns to rank users more tightly near the top of the list. So now we can choose a threshold, and we choose the top 20%.
How do we evaluate the model's performance? We take a cohort of new users who start a trial. Five percent are offered no discount, and five percent receive the discount in any case. This provides our gold-standard control and test groups, so we can compare conversion and see whether there is a genuine effect.
For the remaining 90%, our model selects the top 20% by predicted uplift, and we give them the offer. Then we compare subscription conversion.
We expect higher conversion among those offered the discount versus the no-discount group, while the group that didn't receive the offer should perform on par with the no-offer control group, meaning we are not losing potential subscribers who should have been given a discount.
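A minimal sketch of how such an evaluation cohort could be assigned to groups, following the percentages above; the group labels and column names are hypothetical.

```python
import numpy as np
import pandas as pd

def assign_evaluation_groups(cohort: pd.DataFrame, uplift_pred: np.ndarray, seed: int = 0):
    """5% forced control, 5% forced offer, and the model targets the
    top 20% of the remaining 90% by predicted uplift."""
    cohort = cohort.copy()
    cohort["uplift_pred"] = uplift_pred
    cohort["group"] = "model_no_offer"

    shuffled = cohort.sample(frac=1.0, random_state=seed).index
    n = len(cohort)
    cohort.loc[shuffled[: int(0.05 * n)], "group"] = "control_no_offer"           # gold-standard control
    cohort.loc[shuffled[int(0.05 * n): int(0.10 * n)], "group"] = "forced_offer"  # gold-standard test

    rest = cohort[cohort["group"] == "model_no_offer"]
    top = rest.nlargest(int(0.20 * len(rest)), "uplift_pred").index
    cohort.loc[top, "group"] = "model_offer"   # model-selected top 20% receive the discount
    return cohort
```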
Our next plan is to test different discount amounts and lengths, and to personalize the offer even more.
We also intend to find the optimal timing of the discount: should it appear in the first, second, or third week after the initial period. And we want to experiment with the offer for users who already have multiple subscriptions, for example an offer for the first subscription, the second subscription, and the third.
So that's all.
Let me show an overview of what we have done. We used uplift modeling to improve our subscription base.
For that, we went through a few steps of working with the data: first we used random splits into test and control; after that we used stratification, splitting by registration date, and Thompson sampling; and as a result, our final solution combines splitting by registration date with Thompson sampling.
That's all.
Thank you for watching.
Bye.