Conf42 Chaos Engineering 2024 - Online

Pushing Your Streaming Platform to the Limit

Abstract

Join us in this talk to learn the practical art of stress testing streaming platforms, such as Kafka, Pulsar, NATS, or RabbitMQ, and discover how to optimize performance and scalability for your real-time data needs.

Summary

  • Elad Leev will talk about how to push your streaming platform to the limit. We're going to talk about performance, about benchmarking and how to measure it. All the links from this talk will later be posted as a thread on his account.
  • Dojo is one of the fastest growing fintechs in Europe. We power mostly everything related to the face-to-face economy. We need to know the limits of our system and understand what the limiting factor is while running benchmarks. These metrics are crucial for the success of the benchmark.
  • One of the best methods for defining a system's performance is the USE method. OMB allows us to run benchmarks against most of the common systems that we all use, in a simple way. You can easily deploy OMB on any Kubernetes cluster or even deploy it outside of Kubernetes.
  • Benchmarks are also a unique opportunity to stress test our monitoring services and dashboards. Next, you might find potential bottlenecks. And if you run enough benchmarks, with different use cases, extreme cases, and so on, you can easily create some kind of recommendation system.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, and thank you for choosing this talk. Today we are going to talk about an interesting subject: how to push your streaming platform to the limit. We're going to talk about performance, about benchmarking and how to measure it. Now, before we begin, let me introduce myself - my name is Elad Leev, and I'm a big advocate of everything related to distributed systems, data streaming, data as a whole, the whole concept of data mesh, and products like Kafka, Flink, Pinot and so on. So if you want to hear my opinions about those subjects, go ahead to my Twitter account and follow me. All the links from this talk will later be posted as a thread on my account, so definitely check it out. Now, before we begin with the actual content, let's start with a little bit of marketing, because that's life and we have to do it ;) So what is Dojo? Dojo is one of the fastest growing fintechs in Europe. We power mostly everything related to the face-to-face economy, whether it's bars, pubs, restaurants and so on. So if you are UK based, you probably saw our devices around those areas. And as you can imagine, we are dealing with tons of data, which is quite awesome. I want to start with a quote from a book by this guy, a computer scientist named Jim Gray. Now, I'm a big sucker for those kinds of books from the 80s and 90s that are still relevant even today. I mean, almost 40 years later on - it's kind of amazing. The book itself, or the handbook, contains a lot of gems regarding everything from system performance to how to measure it, and so on. And those gems are actually relevant even today in the "cloud native" era of Kubernetes and so on. Now, we need to understand that measuring systems is *hard*. There are no magic bullets. We can't have a single metric that will tell us whether our system is behaving well or not, right? It's a really hard task. So even if, for example, we take the biggest streaming platform in the world, which is of course Kafka - you all know it - the use cases may vary, because we might have a different message size, we might have a different traffic pattern, we might have different configurations between the different services and the different consumers and producers in our systems. And as a result of these different aspects, we might get completely different performance from the same machines, from the same cluster. So it's crucial to understand that, and crucial to understand that when we evaluate a system, most people just look at the TCO, the total cost of ownership. And it's okay, it makes sense in a way, but we can't just rely on the performance benchmarks that the vendors are giving us. Because eventually, if you think about it, those benchmarks are a marketing job, right? There are almost no engineers involved in those processes. Now, it's okay, but as data driven professionals, this is not something we can trust, right? We need to actually run it on our own to understand those limits. We can't just believe that the system will scale as we grow our business and so on. This is something that we actually need to test and see with our bare eyes. Now, we need to know those limits.
We need to know the limits of our system, because knowing how much our system can handle - whether it's, I don't know, the biggest message size we can process, or how many RPS our database can handle, how many messages a second and so on - and especially knowing what our limiting factor is, might later assist us in different aspects: for example, in preventive maintenance, in better and more accurate capacity planning, and even in eliminating toil, because eventually toil is a cost, right? So we need to know the limits of our system. Now, when you look at those systems in general - at servers, computers and performance - we can understand that we usually have only four pillars of failure: it's either going to be the disk, the memory, the CPU, or the network, nothing else. So we need to actually understand which of them is the limiting factor while running those benchmarks. Now, understanding those limits is nice, but what are the key criteria for our benchmarks? Before we actually run them, we need to build them properly. First and foremost, it might sound obvious, but when running those benchmarks, we can't do any tricks. We can't use, I don't know, faster machines, we can't use a different JVM, we can't use anything like that, right? Even if it costs us a bit more to run those benchmarks using the same hardware and the same systems as we have in production (or any other environment that you are using), it is crucial for the benchmark, because remember - it's not a game. We are not trying to aim for a better result or better numbers or anything like that. We actually aim for an accurate result in the benchmarks, right? So even if it costs us more, it's better to mimic our production environment in our benchmark and use the same machine types, the same number of clusters and nodes and so on. Now, the second bit is that we should aim for the peak, right? Because if you think about it, metrics are crucial for the success of the benchmark. So to begin with, if you don't collect those metrics - the system performance metrics and of course the application metrics - you should start by doing that, because it will be crucial later on. But you also need to understand what your SLAs and SLOs are - for example, what the acceptable end-to-end latency of your system is - because it might change between different clusters, right? Because you have different services and different use cases and so on. And also, what do we consider as downtime? If the latency is spiking, is that considered downtime? Is downtime only a full outage, or also when the system is not performing well? And one of the most crucial metrics to find is your peak traffic, because you should aim for that traffic, right? You should aim for your peak traffic and, better yet, add some buffer, because you need to expect the unexpected. There is a great post by AWS CTO Werner Vogels where he mentions that failures are a given - everything will fail over time, whether it's the disk, the CPU or anything else. But in those cases we will still need to serve our users during peak time. So we actually need to understand what our peak time is, what our peak traffic is. And maybe we should aim for N-1, right? Because we still want to serve successfully even when we lose one or two of the machines. So in your benchmarks, aim for that point.
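To make the "aim for the peak, add a buffer, plan for N-1" idea concrete, here is a small, purely hypothetical sizing sketch. All numbers below are invented for illustration and are not from the talk; the point is only the arithmetic behind the benchmark target.

```yaml
# Hypothetical benchmark sizing sketch - every value here is made up for illustration.
peak_observed_rate_msgs_per_sec: 400000   # measured peak production traffic
headroom_buffer: 1.25                     # "expect the unexpected" buffer on top of peak
brokers_total: 6
brokers_assumed_lost: 1                   # plan for N-1: survive losing one broker
# Target the benchmark at peak * buffer, served by N-1 brokers:
#   400000 * 1.25 = 500000 msgs/s across 5 brokers, i.e. ~100000 msgs/s per broker.
target_rate_msgs_per_sec: 500000
target_rate_per_broker_msgs_per_sec: 100000
```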
Now, the benchmark should be scalable and portable, because for now we might run it on one system, let's say Kafka or Redpanda or anything else, but in the future we might decide to move to a different system, right? We might decide to use RabbitMQ or Pulsar or NATS or whatever the future will bring. So the benchmark should apply to every other system that we use - it should be portable. And we also want to test our benchmark in different use cases, so our benchmark should be able to scale up and scale down the same as our services, right? The same idea. So it should be possible to scale our benchmark worker nodes up and down, which, as we'll see later on, is a reflection of the actual performance of the cluster. The next bit is that simplicity is key. Don't try to overcomplicate things. The benchmark must be understandable, and the benchmark must be reproducible, because otherwise it loses the credibility of the test. You want to document everything in the process and you want to document the key results. And you want your users - whether they are internal or external users, depending on your case - to be able to mimic and reproduce the same test that you ran, because later on they might test it on their end, if you know what I mean. So it's crucial that your test is as simple as possible. Now we understand why it's super important to run those tests and how we should build them. But what should we look for when doing those tests? Because this is something that's important to understand. One of the best methods that I know for defining a system's performance is the USE method. It was originally created by an engineer called Brendan Gregg - amazing, definitely check his blog, it's an amazing blog. Brendan created a method to identify 80% of a server's issues with 5% of the effort. So the same way that, for example, flight attendants have some kind of manual, a tiny emergency checklist that is basically idiot-proof about what to do when there is an emergency, with the USE method we have a straightforward, complete and fast manual for how to check our system. For each one of the resources that we already mentioned - the CPU, disk, network and so on - we want to look at the utilization, the saturation, and the errors in the logs or wherever. This will help us identify what our problem is as fast as possible. For example, looking at the same pillars: for the CPU we can look at the system time, the user time, the idle time and so on, and check the load average. If we are talking about memory, we might want to use the metrics related to the buffers and the cache, or look at the JVM heap - if it's a JVM based system, we might want to see the GC time. When we talk about the network, we might want to look at the bytes in and out, and see how many packets, if any, were dropped, and so on. So how do we benchmark those systems? Like I mentioned, today in the streaming platform area there are like billions of products already, right? And we have more and more products launching every day. Now, if it's a JVM product, maybe we can use the old - I don't want to say rusty, but the old, nice - JMeter that everyone used to run in the past. But again, we are in an era where not all of the data systems are JVM based.
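One way to keep that checklist handy is to write the USE mapping down as data, for example as the starting point for a dashboard or a runbook. The sketch below simply mirrors the resources and metrics mentioned above; the exact metric names are illustrative, not a prescribed list.

```yaml
# Illustrative USE-method checklist for a streaming cluster (metric names are examples).
resources:
  cpu:
    utilization: [system time, user time, idle time]
    saturation: [load average, run-queue length]
    errors: [CPU/kernel errors in logs]
  memory:
    utilization: [buffers, page cache, JVM heap usage]
    saturation: [swapping, GC time]
    errors: [OOM kills in logs]
  network:
    utilization: [bytes in, bytes out]
    saturation: [interface backlog, retransmits]
    errors: [dropped packets, interface errors]
  disk:
    utilization: [disk busy percentage]
    saturation: [I/O queue depth, I/O wait]
    errors: [read/write errors in logs]
```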
So we want to use something else, and we could use system-specific benchmark tools. For example, Kafka ships with its own performance test shell scripts that you can run against your cluster. The same goes for RabbitMQ, and NATS, for example, has its own "bench" utility. But as I mentioned, our goal is a benchmark that is easy to move between different systems, right? So we don't want to use a system-specific benchmark tool. Exactly for that use case we have a nice project from the Linux Foundation called the OpenMessaging Benchmark, or OMB for short. This system is a cloud native, vendor neutral, open source distributed messaging benchmark - a lot of buzzwords, yeah - and it basically allows us to run benchmarks against most of the common systems that we all use, in a simple way. Now, the system itself is built out of two components, which are easy to understand: you have the driver and you have the OMB workers. Again, OMB - OpenMessaging Benchmark. The driver is responsible for assigning tasks to the workers. It's also responsible for everything related to the metadata itself, so for example it is the one that actually creates the topics in Kafka, creates the consumers and so on. And we have the OMB workers, which are the benchmark executors. The driver communicates over the network with the workers and assigns tasks, and the workers actually execute the test against the cluster. Now, it's super easy to use. You can just install it using the provided Helm chart. You can easily deploy OMB on any Kubernetes cluster, or even deploy it outside of Kubernetes if you want. In the Kubernetes case, the driver will read all the configuration from a config file, for example, and will distribute the load, as I said, to the workers, which are eventually pods. So you might want to spawn the same number of workers as your biggest service or something like that, like I mentioned in the criteria. It's super easy to scale the system: you can scale the number of workers up and down to match any number that you want. And like I said, it's a good practice to run them against the same number of pods that you have in your most crucial system or service. And again I will say it: use the same Kubernetes machines, use the same instance types and so on, because it's crucial for the test. Now, here you can see an example of the configuration. On the left side you have the Kafka configuration, on the right side you have the Pulsar configuration. You can see that we start with the name of the test and a driver class - again, makes sense: Kafka is Kafka, Pulsar is Pulsar. Next we assign the basic configurations for the test. For example, we have the connection string to Pulsar or to Kafka and the port, we have the number of I/O threads, we have the request timeout and so on. Next we have the producer configurations. We might want to use different producer configurations, like I mentioned, based on our services, but it allows you, in the Kafka use case for example, to set the acks to all or minus one or whatever you want; you can change the linger, the batch size and so on. The same goes for Pulsar: you can change the producer settings. And lastly, you define your consumers. Now, in the project repo itself on GitHub - and later on I will post a link, like I mentioned - you have many examples.
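The slide itself isn't reproduced in this transcript, so as a rough stand-in, here is a minimal sketch of what a Kafka driver file for OMB typically looks like, loosely modeled on the examples shipped in the OMB repository. Treat the exact keys and values as illustrative; check the repo for the current format of each driver.

```yaml
# Sketch of an OMB Kafka driver definition (values are illustrative, not a recommendation).
name: Kafka
driverClass: io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkDriver

# Topic defaults used when the driver creates topics for the test
replicationFactor: 3
topicConfig: |
  min.insync.replicas=2

# Shared client settings (connection string, timeouts, ...)
commonConfig: |
  bootstrap.servers=kafka-broker:9092
  request.timeout.ms=30000

# Producer settings - mirror what your real services actually use
producerConfig: |
  acks=all
  linger.ms=1
  batch.size=131072

# Consumer settings
consumerConfig: |
  auto.offset.reset=earliest
  enable.auto.commit=false
```

The Pulsar driver file follows the same shape (name, driver class, client settings, producer and consumer sections), which is exactly what makes the same workload portable across systems.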
So make sure to do your research: identify your biggest producers and consumers and the configurations they are using, because later on you might want to use those configurations in your tests. Based on this information, make sure to generate different use cases and edge cases and test them against your cluster. This is an example of a workload file. Eventually, the payload, the message that we send, is just a binary file, but the workload definition looks like what you can see here: you have the message size, you have some randomized payload configuration and so on, things that you can change, you have the producer rate and many other options. Running the test is quite easy, and with OMB you can run the same test against different systems. After you run the test, the results will be printed to your screen, but they will also be saved to a file which you can later share. The project also contains a nice Python script that allows you to generate charts from the results, which you can then put in your documentation so everyone can see the results for the system. Now, after we understand how to do it, and that it's better to do it with this kind of system, what kinds of insights might we get from these tests? First of all, we can find our average latency, right, the end-to-end latency, which is important. We can use that latency to find any kind of problem, because if we have services for which latency is crucial, maybe we can play a little bit with the configuration - lower the linger.ms, lower the batch size (again, in the world of Kafka) - and then get a better end-to-end latency. Maybe we can find better fine-tuning for our services and for our configuration, right? So it's really important to find it. The next bit is that we have a unique opportunity to stress test our monitoring services and dashboards, because usually - I hope so - we don't have a lot of those extreme cases where the system is super overloaded. So we have a unique opportunity to see that our Grafana dashboards, or Datadog, or whatever you use, can actually handle these kinds of values and these kinds of extreme cases. Because one of the things you don't want is your monitoring service to be broken when shit hits the fan - sorry for my language. Next, you might find potential bottlenecks, because as we mentioned, it's super important to identify our limiting factor, because then we can actually address it in advance, right? We can, I don't know, put more alerts on our systems. If, for example, it's the disk type, maybe we can lower the threshold of the alert so we will be notified before the disk gets into I/O starvation or something like that. Maybe we can change the type of the disk, and yada yada yada. So it's super important to do it. Next, we might get some kind of scale-up ballpark, as I call it. Because when running those tests, you can actually add more brokers to your Kafka cluster and see the impact on latency during that time, but also the impact on the overall cluster performance after you added that node. So you will have some kind of ballpark for how many nodes you need to add to your cluster in the future in order to sustain the growth of the system.
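For reference, a workload definition along these lines looks roughly like the sketch below. It is modeled on the examples in the OMB repository, but field names and availability (for instance the randomized-payload settings) can differ between versions, so treat it as a sketch rather than a definitive template.

```yaml
# Sketch of an OMB workload definition (values are illustrative).
name: 1-topic-16-partitions-1kb
topics: 1
partitionsPerTopic: 16
messageSize: 1024              # bytes per message

# Randomized payload settings, so the payload is not trivially compressible
useRandomizedPayloads: true
randomBytesRatio: 0.5
randomizedPayloadPoolSize: 1000

producersPerTopic: 1
producerRate: 50000            # target messages per second
subscriptionsPerTopic: 1
consumerPerSubscription: 1
consumerBacklogSizeGB: 0
testDurationMinutes: 15
```

The nice part is that the same workload file can be re-run, unchanged, against a different driver file (Pulsar, RabbitMQ and so on), which is the portability criterion from earlier in the talk.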
Last but not least, if you run enough benchmarks - like I said, different use cases, extreme cases and so on - you can easily create some kind of recommendation system. Because if you have enough stress test cases, and if you test all your producer and consumer configurations, you might be able to collect them and put them in some kind of backend with a nice UI, where your developers can actually change the different parameters and see the result based on your tests. So maybe as a developer I want to see if I can raise my batch size or lower my batch size, but I want to see the impact on the latency. Based on your configurations, you might be able to give your developers the ability to see the result in advance, without them spending time and money testing it. Yeah. And this is it! First of all, I hope that you liked it. And second of all, I hope that this talk gives you enough reasons to run these kinds of benchmarks. If you have any questions, feel free to reach out to me on LinkedIn or Twitter or wherever, and I hope you enjoy the conference. Thank you again and see you later.
...

Elad Leev

Senior Data Platform Engineer @ Dojo

Elad Leev's LinkedIn account · Elad Leev's Twitter account


