Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good afternoon.
My name is Mi Riz and I'm Chief Data Officer at More TV.
Today we'll talk about how we approach the uplift modeling challenge at More TV: how we frame the problem, the data we use to solve it, and how we work with that data.
Next, we'll focus on the process of training the model and the challenges we tackled.
Finally, we'll review the results we obtained. First, a brief overview of our company.
More TV is an online video service where users can buy a subscription and renew it monthly, or choose to watch ads instead of subscribing. In the ad-supported scenario, we earn revenue as an advertising platform.
Our goal is to grow our subscription base, and one way to do this is by offering users a discount.
The logic is simple.
It's better to earn somewhat less from a user if that secures a recurring payment, which is still more valuable to us than advertising revenue.
However, the challenge is to identify the users who would have subscribed without a discount, so we don't lose profit unnecessarily. Uplift modeling helps us solve this problem.
The goal of applying uplift modeling is to predict how a user's target variable, the subscription purchase, will differ if they are targeted, for example offered a discount, versus if they are not targeted.
We can't both target and not target the same person, so we rely on the average treatment effect: the difference in subscription purchases between the group that received the discount (test) and the group that didn't (control), because we randomly split users into test and control.
This metric matters to us in addition to prediction accuracy.
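As a minimal sketch of what this looks like on the experiment log, here is how the average treatment effect can be computed from a randomized split; the column names (`treated`, `converted`) are hypothetical, not our actual schema.

```python
import pandas as pd

# Hypothetical experiment log: one row per user, with a random treatment
# assignment and the observed subscription outcome.
df = pd.DataFrame({
    "treated":   [1, 1, 0, 0, 1, 0, 1, 0],   # 1 = offered the discount (test group)
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],   # 1 = bought a subscription
})

# Average treatment effect: conversion rate in the test group
# minus conversion rate in the control group.
ate = (df.loc[df["treated"] == 1, "converted"].mean()
       - df.loc[df["treated"] == 0, "converted"].mean())
print(f"ATE = {ate:.3f}")
```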
Our main non-functional requirements are stability and adaptability.
They follow from the high cost of user acquisition, and hence of user communication data, and from the elastic nature of demand for movies and TV shows.
Since there are frequent new releases, the model must not overfit to specific shows but should focus on more timeless indicators.
For training, we use data from an experiment in which the test group was offered a free six-day trial extension, while the control group received no offer. A banner was shown on the website so that the offer couldn't be missed.
After the bonus trial period ended, we checked whether those users converted to a paid subscription or not. Hence, we know whether each user was offered the extension and whether they ultimately subscribed.
We derive a set of features per user based on how they interact with Smart TV, the player, the search function, and project pages.
For example, for the player, it's how often the user watched projects: TV shows, movies, et cetera. For the search function, it's the number of queries that yielded no results and the number that yielded results. The features from the project pages are similar in spirit to the player ones, and activity in the player is where most features come from.
Of course, we log many more data points, but we select the features most indicative of eventual subscription.
Our selection criterion was the Gini score relating each feature to conversion.
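To illustrate this kind of screening, here is a minimal sketch that scores each feature against conversion with a normalized Gini, computed as 2 * AUC - 1; the feature names and the synthetic data are hypothetical, not our real feature set.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical per-user feature table with the observed conversion label.
df = pd.DataFrame({
    "active_days":     rng.integers(0, 30, n),
    "player_sessions": rng.integers(0, 100, n),
    "empty_searches":  rng.integers(0, 10, n),
})
df["converted"] = (rng.random(n) < 0.05 + 0.10 * (df["active_days"] > 14)).astype(int)

def feature_gini(feature: pd.Series, target: pd.Series) -> float:
    """Normalized Gini of a single feature with respect to conversion."""
    return 2 * roc_auc_score(target, feature) - 1

scores = {col: feature_gini(df[col], df["converted"])
          for col in ["active_days", "player_sessions", "empty_searches"]}
print(sorted(scores.items(), key=lambda kv: -abs(kv[1])))
```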
It's a standard flow: we split the data into train and test sets, train the model, then validate it on a hold-out set. We measure the score on both sets and make sure there is neither overfitting nor underfitting.
That's it, our model is ready.
What is the score we used?
Gini here is the difference in the share of successful conversions between the test and control groups. For the entire sample, that difference is always the same; it's static.
To incorporate the model's predictions, we sort users in descending order of predicted uplift, from high to low, and for each cut-off we calculate the Gini among the users with a higher score. Plotting this Gini against the number of users produces something like the solid line on the chart: the Gini curve, which represents the additional revenue, the potential extra gain from targeting.
We compare it with random assignment, the straight baseline, and the area between the two is the Gini score we seek to maximize. That is our key metric.
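Here is a minimal sketch of that curve construction as just described: sort users by predicted uplift, accumulate the test-minus-control conversion gap at each cut-off, and take the area between that curve and the straight random-targeting baseline. The function and variable names are hypothetical, and this is a simplified stand-in for the production metric.

```python
import numpy as np

def uplift_gini(uplift_pred, treated, converted):
    """Area between the cumulative uplift ("Gini") curve and the random baseline."""
    order = np.argsort(-uplift_pred)             # users sorted by predicted uplift, descending
    treated, converted = treated[order], converted[order]

    n = len(treated)
    cum_t = np.cumsum(treated)                   # treated users covered so far
    cum_c = np.cumsum(1 - treated)               # control users covered so far
    conv_t = np.cumsum(converted * treated)      # conversions among treated
    conv_c = np.cumsum(converted * (1 - treated))

    # Conversion-rate gap at each cut-off, scaled by the number of users covered.
    rate_gap = conv_t / np.maximum(cum_t, 1) - conv_c / np.maximum(cum_c, 1)
    curve = rate_gap * np.arange(1, n + 1)

    # Random targeting grows linearly towards the overall (static) uplift.
    baseline = np.linspace(curve[-1] / n, curve[-1], n)
    return float(np.mean(curve - baseline))      # area between curve and baseline, normalized

# Toy usage with synthetic labels; replace with real predictions and experiment data.
rng = np.random.default_rng(1)
n = 1_000
treated = rng.integers(0, 2, n)
converted = (rng.random(n) < 0.10 + 0.05 * treated).astype(int)
print(uplift_gini(rng.random(n), treated, converted))
```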
Now, about how we split the data. Initially we tried random splits: each user had an equal probability of ending up in the training or the test set.
However, we noticed that depending on how the split was done, model performance could swing to opposite extremes: either close to ideal or no better than random ranking.
To confirm that the issue was not merely unlucky validation sets, we fixed the validation set and assembled the training set from the remaining samples randomly, with replacement, similar to a bootstrap approach.
The validation score still showed a wide variance and remained unstable even after 400 experiment iterations. This indicated that we needed a different splitting approach to achieve stable results.
You can see here that the 95% interval of the score stays stubbornly wide.
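A minimal sketch of that stability check: fix one validation set, resample the training pool with replacement, and watch how the validation score varies. Here `train_and_score` is a hypothetical callback that trains a model and returns its validation Gini.

```python
import numpy as np

def bootstrap_stability(df, train_and_score, valid_frac=0.2, n_iter=400, seed=0):
    """Fixed hold-out set + bootstrap training sets; returns the spread of the score."""
    rng = np.random.default_rng(seed)
    valid = df.sample(frac=valid_frac, random_state=seed)   # validation set stays fixed
    pool = df.drop(valid.index)

    scores = []
    for _ in range(n_iter):
        # Draw a training set of the same size as the pool, with replacement (bootstrap).
        train = pool.sample(n=len(pool), replace=True,
                            random_state=int(rng.integers(1 << 31)))
        scores.append(train_and_score(train, valid))

    return float(np.std(scores)), np.percentile(scores, [2.5, 97.5])
```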
Our solution was to split by user registration date. We put the more experienced, older users into the training set and the newest users into the test set.
This also nicely mirrors how the model is used in production, where it predicts for new arrivals. We used an 80% to 20% train/validation split and trained a random forest from a causal uplift library.
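A minimal sketch of the date-based split, assuming a DataFrame with a `registration_date` column (a hypothetical name); the plain scikit-learn random forest here is just a stand-in for the actual uplift forest.

```python
from sklearn.ensemble import RandomForestClassifier

def split_by_registration_date(df, train_frac=0.8):
    """Oldest users go to the training set, the newest users to validation."""
    df = df.sort_values("registration_date")
    cut = int(len(df) * train_frac)
    return df.iloc[:cut], df.iloc[cut:]

# Hypothetical usage, where `features` lists the feature columns and `target` is conversion:
# train_df, valid_df = split_by_registration_date(df)
# model = RandomForestClassifier(n_estimators=300, random_state=0)
# model.fit(train_df[features], train_df[target])
```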
Looking at the graph, however, the users spread across a few distinct peaks. You can't isolate the most promising users with just one threshold: there are three spikes, a first, a second, and a third, so unfortunately a single threshold doesn't work.
So we tried another approach as well.
The second approach was to stratify the data. If the features in the training and test sets are distributed similarly, and similarly to the population distribution, the model's results should be more stable.
But which features do we use to stratify? Including them all produced too many strata, so we settled on three: the user's number of active days, whether they converted, and whether they were offered the discount.
We chose the number of active days because it has the highest Gini score for predicting subscription conversion.
Our goal is to end up with an 80% to 20% split into train and test sets.
After stratifying, we select 50% of the data as the initial training set. From the remaining half we repeatedly sample 1,000 examples, preserving the stratum proportions, add them to the training set, train a model, and then test it on the leftover data.
If the Gini on train and test differs by less than 5% while outperforming random ranking, we keep these 1,000 samples in the training set; otherwise, we discard them. We continue until the training set reaches 80% of the whole data.
This reduced the number of peaks on the graph somewhat, to two peaks, but we still didn't see the desired tight clustering at the top of the ranking, so we still can't choose a threshold.
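Here is a rough sketch of that incremental, stratum-preserving growth of the training set, under my reading of the procedure; `gini_score` and the strata columns are hypothetical placeholders, and the constants follow the numbers above (batches of 1,000, 5% agreement, stop at 80%).

```python
import numpy as np
import pandas as pd

def grow_training_set(df, strata_cols, gini_score, batch=1_000,
                      target_frac=0.8, tol=0.05, max_iter=10_000, seed=0):
    """Start from a stratified 50% training set and add stratified batches,
    keeping a batch only if the train and test Gini agree within `tol`."""
    rng = np.random.default_rng(seed)
    train = df.groupby(strata_cols, group_keys=False).sample(frac=0.5, random_state=seed)
    pool = df.drop(train.index)

    for _ in range(max_iter):
        if len(train) >= target_frac * len(df) or len(pool) < batch:
            break
        # Sample a batch that preserves the stratum proportions of the remaining pool.
        candidate = pool.groupby(strata_cols, group_keys=False).sample(
            frac=batch / len(pool), random_state=int(rng.integers(1 << 31)))
        new_train = pd.concat([train, candidate])
        leftover = df.drop(new_train.index)

        g_train, g_test = gini_score(new_train), gini_score(leftover)
        relative_gap = abs(g_train - g_test) / max(abs(g_train), 1e-9)
        if relative_gap < tol and g_test > 0:            # stable and better than random
            train, pool = new_train, pool.drop(candidate.index)
        # otherwise the candidate batch is discarded and a new one is drawn
    return train
```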
Our third approach was Thompson sampling. The idea is that the remaining 50% of the data, the part not in the training set, is split into clusters, and we pick data from the cluster with the higher rate of successful training attempts. This is a way to incorporate prior iteration outcomes and any environmental changes.
For each cluster, we assign a Beta distribution and update its parameters based on whether the training attempt was successful or not. This algorithm is similar to the previous one, except that instead of pulling data from the entire pool, we pull it from just one of the four clusters, and after training we adjust the Beta distribution parameters for that cluster.
In theory, it should converge to better outcomes. Unfortunately, in about half of the cases it didn't converge within 10,000 iterations. So now we don't have any peaks, but it's still hard to choose a threshold.
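A minimal sketch of the Thompson sampling loop over clusters, assuming a hypothetical callback `attempt_is_successful(cluster_id)` that tries to add a batch from that cluster to the training set and reports whether the train/test Gini check passed.

```python
import numpy as np

def run_thompson_sampling(n_clusters, attempt_is_successful, n_iter=10_000, seed=0):
    """Pick clusters via Beta-distributed success rates and update them online."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_clusters)   # 1 + successful training attempts per cluster
    beta = np.ones(n_clusters)    # 1 + failed training attempts per cluster

    for _ in range(n_iter):
        # Sample a plausible success rate for each cluster and pick the best draw.
        draws = rng.beta(alpha, beta)
        c = int(np.argmax(draws))

        if attempt_is_successful(c):
            alpha[c] += 1         # reward the cluster that produced a stable model
        else:
            beta[c] += 1          # penalize it otherwise

    return alpha, beta            # posterior parameters after all iterations
```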
Our final method combines date-based splitting by registration date with Thompson sampling. We first pick the oldest 50% of users for training and then draw from the other half, cluster by cluster, via Thompson sampling.
This approach almost always converges and learns to rank users more tightly near the top of the list. So now we can choose a threshold, and we choose the top 20%.
How do we evaluate the model's performance? We take a cohort of new users who start a trial. Five percent are offered no discount, and five percent receive the discount in any case. This provides our gold-standard control and test groups, so we can compare conversion and see whether there is a genuine effect.
For the remaining 90%, our model selects the top 20% by predicted uplift, and we give them the offer. Then we compare subscription conversion.
We expect higher conversion among those offered the discount versus the no-discount group, while the group that didn't receive the offer should perform on par with the no-offer control group, meaning we are not losing potential subscribers who should have been given a discount.
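A minimal sketch of how such an evaluation cohort could be assigned to groups, following the percentages above; the group labels and column names are hypothetical.

```python
import numpy as np
import pandas as pd

def assign_evaluation_groups(cohort: pd.DataFrame, uplift_pred: np.ndarray, seed: int = 0):
    """5% forced control, 5% forced offer, and the model targets the
    top 20% of the remaining 90% by predicted uplift."""
    cohort = cohort.copy()
    cohort["uplift_pred"] = uplift_pred
    cohort["group"] = "model_no_offer"

    shuffled = cohort.sample(frac=1.0, random_state=seed).index
    n = len(cohort)
    cohort.loc[shuffled[: int(0.05 * n)], "group"] = "control_no_offer"           # gold-standard control
    cohort.loc[shuffled[int(0.05 * n): int(0.10 * n)], "group"] = "forced_offer"  # gold-standard test

    rest = cohort[cohort["group"] == "model_no_offer"]
    top = rest.nlargest(int(0.20 * len(rest)), "uplift_pred").index
    cohort.loc[top, "group"] = "model_offer"   # model-selected top 20% receive the discount
    return cohort
```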
Our next plan is to test different discount amounts and lengths, and to personalize the offer even more.
We also intend to find the optimal timing of the discount: should it appear in the first, second, or third week after the initial period. And we want to experiment with the offer for users who already have multiple subscriptions, for example an offer for the first subscription, the second subscription, and the third.
So that's all.
Let me show an overview of what we have done. We used uplift modeling to improve our subscription base.
For that, we went through a few steps of working with the data: first we used random splits into test and control; after that we used stratification, splitting by registration date, and Thompson sampling; and as a result, our final solution combines splitting by registration date with Thompson sampling.
That's all.
Thank you for watching.
Bye.