Conf42 Chaos Engineering 2023 - Online

A/B testing platform at scale


Abstract

In this talk, I will touch upon how we built the experimentation platform for GoTo and how we changed the backend architecture over time to cater to throughput growth from 100k to 2 million RPM.

Learnings for the audience:

  • Designing high-throughput, reliable systems
  • Scaling CPU-intensive applications

Summary

  • Hi everyone. First of all, I would like to give a big shout out to all the organizers of Conf42. It has been a fantastic experience for me. Amazing job, guys.
  • In this talk, I'll be talking about an experimentation platform at scale. I'll walk you through my experience of building an A/B testing platform at Gojek Tech. The current scale is around 2.1 to 2.2 million RPM.
  • A/B testing is a randomized experimentation process wherein two or more versions of a feature are shown to different segments of users at the same time. Data helps us determine which version has the maximum impact and drives the business metrics. We will also talk about the current scale: how the number of experiments, the number of clients, and the adoption of experimentation have increased.
  • These days a lot of companies are investing in experimentation. Not only the big giants; even small startups run experiments on a daily basis. The apps you are using right now are likely running at least a few hundred experiments each. These decisions are backed by data.
  • Litmus is the name of the experimentation platform at Gojek Tech. It has two core features: A/B testing, which we call an experiment, and staggered rollouts, which we call releases. It also supports targeting rules.
  • The Litmus server is the A/B testing application. Clients talk to the Litmus server, and the server fetches all the valid experiments and returns them to the clients. There are a couple of limitations here, which we discuss in the next slide.
  • The read throughput is around six to seven times larger than the write throughput. With decoupled deployments, deployments are faster. The setup is not horizontally scalable because the underlying database is Postgres.
  • The number of clients, experiments, and releases has changed over the course of time. Each client can have different requirements. Response time is proportional to the number of experiments running for a client. Changes in the architecture were needed to support the new scale.
  • Every year Litmus receives a new scale. We adopted a sidecar pattern to optimize latency. By running the sidecars on the client servers, the resources are distributed. It is designed with availability and consistency trade-offs in mind.
  • This brings us to the last section of the talk: how the throughput and response times look right now. Currently we are focusing mostly on building the analytics systems. Please reach out to me on Twitter or LinkedIn with any questions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. First of all, I would like to give a big shout out to all the organizers of Conf42. Thanks a lot for the smooth conduct; it has been a fantastic experience for me. Amazing job, guys. Thanks for joining me. In this talk, I'll be talking about an experimentation platform at scale. I'm Riteek Srivastava. I work as a lead software engineer at Gojek in the data engineering team, currently in a stream called experimentation. I'll be walking you through my experience of building an A/B testing platform at Gojek Tech, the challenges we faced, and the changes in architecture we made to cater to the new throughput, which is right now around 2.1 to 2.2 million RPM. Let's proceed. This is the title of my talk: A/B testing platform at scale. Let's move to the next slide. The agenda of this talk: I'll explain what A/B testing is, then we will talk about the objective, what the audience can learn from this. Then I'll explain Litmus, the name of the A/B testing platform at Gojek Tech, and shed some light on the old architecture from where we started building the platform. Then we will talk about the current scale: how the number of experiments, the number of clients, and the adoption of experimentation have increased at Gojek. And then we will talk about the new architecture and how we changed it to cater to the new scale. So what is A/B testing? A/B testing is basically a randomized experimentation process wherein two or more versions of a feature are shown to different segments of users at the same time. After running the experiment for some days or weeks, we analyze the data, and this data helps us determine which version leaves the maximum impact and drives the business metrics. Let's understand this concept with an example: the allocation algorithm, which is used for allocating drivers to customers. Suppose you have an existing algorithm allocating drivers to customers. Now you have a hypothesis that by changing a few of the parameters in your allocation algorithm, you will be able to allocate drivers to customers more efficiently, which will improve your business metric, the booking conversion metric. So instead of directly rolling this new algorithm out to production, which can be very critical and dangerous because you do not know how it will perform there, you run an experiment with the existing algorithm and the new algorithm you are proposing. You run this experiment for a few weeks, collect the data, analyze it, and see for which variation the booking conversion rate has increased. If it is the new variation, you roll it out to all the users in a staggered manner. If it is not, you know that your hypothesis was not right and the current algorithm performs better than the newly suggested one. Let's move to the next slide. These days a lot of companies are investing in experimentation. Not only the big giants; even small startups are doing experimentation on a daily basis. Whatever apps you are using right now must be running at least a few hundred experiments each.
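To make the randomized assignment in the allocation example above concrete, here is a minimal sketch of deterministic variant bucketing. The talk does not describe Litmus's actual assignment logic, so the hashing scheme, names, and traffic split below are assumptions for illustration only.

```go
// Minimal sketch of deterministic variant assignment for an A/B test.
// Illustration only; the hashing scheme and names are assumptions, not
// Litmus's real implementation.
package main

import (
	"fmt"
	"hash/fnv"
)

type Variant struct {
	Name    string
	Percent uint32 // share of traffic; all variants should sum to 100
}

// assignVariant hashes (experimentID, userID) into a bucket in [0, 100)
// and walks the cumulative traffic split. The same user always lands in
// the same variant for a given experiment, which keeps the experiment stable.
func assignVariant(experimentID, userID string, variants []Variant) string {
	h := fnv.New32a()
	h.Write([]byte(experimentID + ":" + userID))
	bucket := h.Sum32() % 100

	var cumulative uint32
	for _, v := range variants {
		cumulative += v.Percent
		if bucket < cumulative {
			return v.Name
		}
	}
	return variants[len(variants)-1].Name // fallback for rounding gaps
}

func main() {
	variants := []Variant{
		{Name: "existing-allocation-algo", Percent: 50},
		{Name: "new-allocation-algo", Percent: 50},
	}
	fmt.Println(assignVariant("driver-allocation-exp", "customer-42", variants))
}
```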
This makes sense, because it is the data that is leading you to make the decision, not just your manager or some highly paid people coming and saying this is how we should do it, this will perform better, and we are rolling it out to all the users. The decision is backed by data. I'll quote Jeff Bezos here: if you double the number of experiments per year, you are going to double your inventiveness. He also knew how critical experimentation is for a company, and Amazon must be running a pretty huge number of experiments across its different services. Let's move to the next slide. What is Litmus? Litmus is basically the name of the experimentation platform at Gojek Tech. It has two core features. One is A/B testing, which we call an experiment, where you come and define your experiment with two or more variants and do the A/B testing. The second one is a staggered rollout: once you have the result of your experiment, you can promote your winning variant to a release, rolling it out to all the users in a staggered way. These are the two features currently supported by Litmus. The third thing is that it also supports targeting rules. I am purposefully mentioning this here because, if you look, we have rules like app version newer than 1.23.12, so all the users who have an app version newer than this will be part of the experiment. These are some of the rules we support. I mention this because for each experiment we need to parse its rules, which makes this application CPU-intensive. I want to highlight this word because it will come up in the system design; it helps us design the system better because we know the expectation from this application. So these are the key features supported by Litmus, the A/B testing platform of Gojek. What is the objective of this talk? I'll be walking you through my experience of building Litmus from scratch, and also the changes we made to the system to cater to the high throughput. So what are the non-functional requirements we need to build this system? First is low latency. Low latency is required because experimentation is a new call in your flow. Suppose you are running an algorithm for driver allocation: right now you have only one algorithm, so you know which algorithm to use. If you are running an experiment, you have added one more branch there; you talk to the experimentation service, get to know which algorithm to use, and then you use that algorithm. So low latency is very critical here because it impacts the overall flow. The second is high availability, which I think is a given for all systems. The system has to be highly available because it is being used in a lot of flows. Third, high throughput: in Gojek pretty much every service is using this platform for running experiments, so we are expected to handle high throughput. Fourth, weak consistency: basically the data should be eventually consistent. If anyone makes changes to the experiments, those changes should eventually be reflected on the client side. These are the objectives.
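To make the CPU-intensive point concrete, here is a small sketch of evaluating a targeting rule like "app version newer than 1.23.12". Litmus's real rule engine is not shown in the talk; this is only an illustration of the per-experiment parsing work that happens on every request.

```go
// Hedged sketch of a targeting rule check such as "app version newer than 1.23.12".
// Illustrative only: it shows why evaluating rules for every experiment on every
// request is CPU-bound rather than I/O-bound work.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "1.23.12" into numeric components.
func parseVersion(v string) []int {
	parts := strings.Split(v, ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, _ := strconv.Atoi(p)
		nums[i] = n
	}
	return nums
}

// newerThan reports whether version a is strictly newer than version b.
func newerThan(a, b string) bool {
	av, bv := parseVersion(a), parseVersion(b)
	for i := 0; i < len(av) && i < len(bv); i++ {
		if av[i] != bv[i] {
			return av[i] > bv[i]
		}
	}
	return len(av) > len(bv)
}

func main() {
	// A user on app version 1.24.0 qualifies for the rule
	// "app_version newer than 1.23.12".
	fmt.Println(newerThan("1.24.0", "1.23.12")) // true
}
```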
Let's go to the architecture and see how we started: the initial architecture, how the idea of Litmus began, how we built version one of Litmus. We have this Litmus server, which is basically the A/B testing application with all the functional requirements, like supporting experiments, releases, rules, et cetera, and it is backed by Postgres; we have a primary Postgres database. The initial flow was like this. We had a portal, so the experimenters can come to the portal, create experiments, and update their created experiments; basically they can do any kind of CRUD on their experiments. Whenever they make changes on the portal, the portal talks to the Litmus server and the Litmus server reflects those changes in the DB. And we have clients. Clients are basically microservices or mobile clients, any client which needs to talk to the experimentation platform to decide which variant of an experiment to use. How does this flow work? Clients talk to the Litmus server, the Litmus server talks to the DB, fetches all the valid experiments, and returns them to the clients. So this is the initial architecture. You must be thinking that this is pretty naive, and there are a couple of limitations here, which we will talk about next. The first limitation is a single cluster handling both the read and the write requests. What is the problem here? The read throughput and the write throughput are very different for Litmus; the read throughput is around six to seven times larger than the write throughput. So the tuning and the scaling required for the reads are different from the writes. If we use a single cluster and the reads are using a lot of resources, they might affect the writes, and if some outage happens on that cluster, both reads and writes are affected. So this is one potential problem in this architecture. The second one is the primary database being used for both reads and writes. We have a primary and a replica database. The primary is the one handling all the writes (and, in this architecture, the reads too); the replicas are where you can read the data. If the read clients instead read through the replica database via the Litmus application, performance would be better, and even if the primary database is down they can still read through the replica. Reading the data is the critical flow here, because a lot of microservices talk to the experimentation platform at the API level; so even if the primary database is not performing well or is undergoing maintenance, they can read from the replica database. The third one is a single deployment. Single deployment means that if you are making changes related to the write APIs, the deployment affects both read and write. If you are making changes related to the read APIs, like some optimization you have done, you will have to make a deployment, and that deployment again affects both read and write. The separation of read and write is not there, which we will talk about in the next slide.
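Before moving on, here is a rough sketch of what that v1 read path could look like: the client calls the Litmus server, and the server queries Postgres for the client's active experiments. The table and column names, the endpoint path, and the connection string are placeholders, not the real Litmus schema or API.

```go
// Sketch of the v1 read path: every client request goes to the Litmus server,
// which reads active experiments straight from the primary Postgres database.
// Schema, endpoint, and DSN are assumptions for illustration.
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://litmus@localhost/litmus?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/v1/experiments", func(w http.ResponseWriter, r *http.Request) {
		client := r.URL.Query().Get("client")
		// In this version, every read request hits the primary database.
		rows, err := db.Query(
			`SELECT name, variants FROM experiments WHERE client = $1 AND status = 'active'`, client)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer rows.Close()

		type experiment struct {
			Name     string `json:"name"`
			Variants string `json:"variants"`
		}
		var out []experiment
		for rows.Next() {
			var e experiment
			if err := rows.Scan(&e.Name, &e.Variants); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			out = append(out, e)
		}
		json.NewEncoder(w).Encode(out)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```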
So what we did next: we separated the read and write clusters to scale them better, because the throughput difference was quite large. Let's go to that architecture. We have Litmus write servers, we have Litmus read servers, we have the primary database, and we have the replica, which replicates all the data from the primary on its side. How does the portal talk now? The portal talks to the write servers, and as in the previous setup, the write servers make those changes in the DB. All the clients now talk to the read servers, and the read servers fetch the data from the replica and return it to the clients. This is how the system looks once we segregate the read and write clusters. Let's go to the pros and cons of this architecture. Pros: the read and write clusters can scale independently. If your read throughput is increasing insanely, you can just scale your read cluster, and even if your read cluster is affecting response times for a while because of the high throughput, your write cluster and your portal will not face any issue unless the issue is on the database side. Decoupled deployments: the deployments are faster, because the read changes and the write changes are now independent; a deployment on read does not affect write and vice versa. On a high-throughput service, a deployment during the day affects all the clients at that time, but since the read and write clusters are separate, we can deploy the write cluster at any time because it is a low-throughput service. Third is the primary/replica setup: the write cluster talks to the primary database and the read cluster talks to the replica database, so you can tune the replica database to increase the number of connections for your applications, and there won't be much load on the primary because it only serves the write throughput. There are some cons as well, which we will understand later in the talk. The first con is that all the clients are sharing the resources. There are a lot of clients talking to the experimentation platform and they all share a common resource: the read cluster has a certain amount of CPU, and that CPU is shared by all the clients' requests. This is a problem, and I'll explain it in a later section. The second con, or limitation, is that it is not horizontally scalable, because the underlying database is Postgres. You can only have a certain number of connections there, and if you do horizontal scaling, each server makes its own connections to the database, so you need to keep that in mind; it is not easily horizontally scalable. Whenever you reach the max connections configured on the DB side, your servers will start crashing because they will not be able to get idle connections with the DB. So this is one of the limitations of the read/write segregation architecture.
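A small sketch of what the read/write split means at the connection level: the write cluster opens a pool against the primary and the read cluster against the replica, with each pool capped so the fleet stays under Postgres's max_connections. The DSNs and limits are placeholders, not Litmus's actual configuration.

```go
// Sketch of the read/write split at the database connection level.
// Hosts, pool sizes, and lifetimes are illustrative assumptions.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func openPool(dsn string, maxConns int) *sql.DB {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	// Each server holds at most maxConns connections, so
	// (number of servers x maxConns) must stay below Postgres's max_connections.
	db.SetMaxOpenConns(maxConns)
	db.SetMaxIdleConns(maxConns / 2)
	db.SetConnMaxLifetime(30 * time.Minute)
	return db
}

func main() {
	writeDB := openPool("postgres://litmus@primary:5432/litmus?sslmode=disable", 10)
	readDB := openPool("postgres://litmus@replica:5432/litmus?sslmode=disable", 50)
	_ = writeDB // used by the write cluster (portal CRUD)
	_ = readDB  // used by the read cluster (client experiment fetches)
}
```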
Now let's talk about the scale over time, which is important to understand why the architecture needed to change to support the new scale. Let's see how the number of clients, the number of experiments, and the number of releases have changed over the course of time. First, the number of clients over time: from 2019 to 2022, you can see how the number of clients has increased. Every quarter it grows; sometimes slowly, sometimes by a big margin, but the trend is always upward, never flat. This is one of the major parts of the scale: different clients will be talking to your systems now, and each client can have different requirements. Some clients run hundreds of experiments, some run two or three, so there is diversity in the clients as well; we'll talk about that later. Let's see the number of active experiments at a time over the period from 2019 to 2022. We started with 178, we reached 739, and by the end of 2022 we were at about 1.8k. These are the active experiments, i.e. running experiments; we have excluded the completed or stopped experiments. Next is the distribution of experiments per client, the number of experiments running for a particular client. You can see it is pretty uneven, so the expectation from each client is also very different. A few clients run hundreds of experiments and are okay with a larger response time; a few clients run only two or three experiments, and they are not okay with large latencies. The response time is proportional to the number of experiments, because we have rules configured for the experiments and we have to parse the rules for each experiment, so the response time is proportional to the number of experiments running for a client. So it is not only the overall scale, it is also the diversity among the clients: what is the expectation from each client if you are sharing the resources? Suppose some client is okay with a large latency, say 50 ms at the 99th percentile, but other clients are not okay with that because they are only running one or two experiments; why would they want their 99th percentile to be that high? So this was another concern. Let's look at the similar graph for releases. This is the trend for active releases over the course of time, and this is the distribution of the number of releases per client. It is similar to what we saw for experiments, and the problem is also similar. The numbers are similar too: we had around 1.8k experiments at the end of 2022, while for releases we had 1.6k, and from 2022 to 2023 we have already increased the active releases by 2k. So you see the scale and how it is increasing. Now let's see what changes we made in the architecture to cater to this scale. Let's look at some numbers: every year Litmus receives a new scale. At the end of 2019, the max throughput was around 100k RPM (all these numbers are RPM; I forgot to mention that on the slide). At the end of 2020, it was around 500k RPM. At the end of 2021, it was around 1.1 million RPM. At the end of 2022, it reached 2 million RPM. The last two architectures we saw, the very basic and naive one with a single server and the second one with segregated read and write clusters, were able to handle around 500k RPM within the committed SLA. Beyond that, the response times degraded and we were seeing errors as well because of resource scarcity. So, looking at the adoption and the growth of experimentation per year, which we saw in the scale slides, how the number of experiments and releases keeps increasing over time, we adopted a sidecar pattern to optimize the latency. Why a sidecar pattern?
A sidecar pattern, because if you are running the sidecars on the client servers, you are distributing the resources. So if some client wants to run, say, a hundred experiments, they will configure that much resource for their sidecar. If some clients are running only one or two experiments, why would they bother to give it large resources? They will just give one core, or less, for running the sidecar. This came to mind because of the diversity in the clients; we thought, let's make the resources distributed and build the platform so that it can handle the scale as well as the diversity in that scale. Let's look at the architecture diagram now. We have the write servers and we have the primary database. The portal flow is similar to what it was: all the portal requests go to the write servers, and the write servers reflect those changes in the primary database. Now, how do the clients connect? The yellow boxes are client servers and the gray ones are the client processes. We deploy a sidecar process alongside the client's main process. These sidecars sync the data from the write servers periodically, let's say every 5 seconds, which is a configurable value. Every t seconds they fetch all the experiments for that particular client and store them in Badger DB; the sidecar uses Badger DB. So periodically they sync the data from the write servers, which expose an API that basically fetches all the data and returns it to the sidecar processes, and the sidecars store that in the Badger DB. Whenever a client makes a request to the sidecar, the sidecar reads the data from the Badger DB, does all the rule parsing and everything, and returns the valid experiments to the client. So if you see, all the computation is distributed across the client servers rather than being done on one particular set of servers. With this architecture it can scale to any value, because the sidecars are not making any direct connections to the database; we sync the data via an API, the connections to the database are maintained by the write servers, and the response time of one client does not get affected by the others. Let's see the key points of the sidecar pattern. Given that the sidecar is distributed, it is designed around availability and consistency using the PACELC theorem: if there is a partition, we have to choose between availability and consistency; else, we have to choose between latency and consistency. The sidecar has been built choosing availability over consistency in case of a network partition, and consistency over latency when there is no partition. We decided to go ahead with this after a survey with our clients. Right now we do not have an option to configure this trade-off; the sidecar simply chooses availability in case of a network partition, else consistency.
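Here is a hedged sketch of the sidecar's sync loop as described above: every t seconds it pulls all experiments for its client from the write servers' export API and caches them in a local Badger DB, and client reads are then served entirely from that local store. The endpoint path, key layout, and interval value are assumptions for illustration.

```go
// Sketch of the sidecar sync loop: periodic pull from the write servers into a
// local Badger DB; request-path reads never touch the network. Endpoint, keys,
// and interval are assumptions, not the real Litmus sidecar implementation.
package main

import (
	"io"
	"log"
	"net/http"
	"time"

	badger "github.com/dgraph-io/badger/v3"
)

const (
	client       = "driver-allocation-service"
	syncInterval = 5 * time.Second // the configurable "t seconds" from the talk
	exportURL    = "http://litmus-write.internal/v1/export?client=" + client
)

// syncOnce fetches the full experiment payload and overwrites the local cache.
func syncOnce(db *badger.DB) error {
	resp, err := http.Get(exportURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	payload, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("experiments/"+client), payload)
	})
}

// readExperiments serves a client request purely from the local Badger cache;
// the sidecar's local API for the client process would call this.
func readExperiments(db *badger.DB) ([]byte, error) {
	var out []byte
	err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("experiments/" + client))
		if err != nil {
			return err
		}
		out, err = item.ValueCopy(nil)
		return err
	})
	return out, err
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/litmus-sidecar"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Periodic sync keeps the cache eventually consistent with the write servers.
	go func() {
		for range time.Tick(syncInterval) {
			if err := syncOnce(db); err != nil {
				log.Printf("sync failed, serving stale data: %v", err) // availability over consistency
			}
		}
	}()

	select {} // the sidecar would also expose its local read API here
}
```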
The second key point is that it is horizontally scalable, which I already touched on. Since it is not making direct connections to the database, you can literally spin up hundreds or thousands of servers which talk to the Litmus servers periodically, because the throughput on the sidecar is not directly proportional to the throughput on the write application. The call between the sidecar process and the write servers happens periodically, every t seconds, so irrespective of the client-process-to-sidecar throughput, the write servers are not affected. So it is horizontally scalable. The third part is that the resources are distributed, which I already explained: the response time of a client will not be affected by the experiments of other clients, because the resources are distributed and clients run the sidecar on their own resources. And we also have central monitoring for all the sidecars on New Relic and Grafana, because as an experimentation platform, if we are providing a sidecar, we need to check the health of all the sidecars as well. If someone complains that their sidecar is not working, you need the observability: what is the issue, is it working fine, since when has it not been working? So we have monitoring for all those sidecars as well. This brings us to the last section of the talk: how the throughput and response times look right now. This is a screenshot I took at a particular time while making this slide. The throughput is reaching up to 1.61 million RPM, and if you look at the 99th percentile, it is pretty efficient: 9.52 ms. You see the scale and the response time; this is the most optimized result we were targeting, and using this sidecar pattern we were able to deliver it and make the system more available and more flexible to support the requirements of different clients. So yeah, this is the overall learning for me while building the experimentation platform. Currently we are focusing mostly on building the analytics systems, but that is out of the scope of this talk. That's all from my side. Thanks a lot. Please reach out to me on Twitter or LinkedIn for any questions you have. I have written a few blogs on this as well; if you are interested, you can find them on Medium. Thanks a lot. Have a nice day.
...

Riteek Srivastava

Lead Software Engineer @ Gojek Tech



