Conf42 Golang 2024 - Online

System Design: Simple But Common Mistakes

Abstract

Uncover the intricacies of system design with my talk on “System Design: Simple But Common Mistakes” at the Golang conference. Learn to set up services, establish effective communication, and steer clear of pitfalls in distributed systems.

Summary

  • The first issue we will cover is connected with idempotency. It can happen when two services communicate to create data, and it can lead to duplicated data in the second service. We will look at how we can fix it.
  • The next issue is an external request inside a transaction. It can happen when we open a database transaction and, inside that transaction, call an external service. It can lead to an exhausted database connection pool. How can we fix it? We can use a queue.
  • The next issue is simultaneous requests. It can happen when multiple clients request a service at the same time, and it can lead to an overloaded service. To fix that issue, we could add some random delay so the load is distributed throughout the day.
  • The next issue is the lack of a rate limiter, and a related one: what if the payload of a particular request is too big? How can we fix it? The last topic is retries, covered a bit more deeply. If we don't use a retry policy, we could end up with a high error rate and poor user experience.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Kirill, and today we will talk about simple but common mistakes in system design. Let's get started.

The first issue we will cover is connected with idempotency. It can happen when two services communicate to create data, and it can lead to duplicated data in the second service. Let's consider this example. We have an advertisement service in our scheme. We also have a user who sends us data about an impression of an ad. The payload has an ad ID, which is the ID of the advertisement whose impression we record. The service also has a downstream dependency, an external ad service, and that external ad service has its own database. The final goal of the whole scheme is to increase the impression counter.

Now let's imagine this: sometimes there is high latency between the external ad service and its database. Let's also imagine that right after we have successfully increased the counter, the user breaks the connection because of the high overall latency. In that case we have successfully increased the counter, but from the user's perspective the request failed. So the user decides to retry and sends the same request again. This time there is no high latency between the external service and the database, so we successfully increase the counter once more, and from the user's perspective everything is fine too. But now we have increased the counter twice: there was only one impression, yet the counter is two. The counter should be one, but it is two. That's the idempotency issue.

Let's look at how we can fix it. We could add an event ID, one more ID, to our payload. Unlike the ad ID, which identifies the advertisement, the event ID identifies the particular event: a unique ID for every impression, every event. Having that ID, we should also pass it along in the payload between our service and the external ad service. That ID lets us deduplicate data at the database level: even if we receive two different payloads for the same impression, we can deduplicate them by event ID in the database. And now our data is consistent.
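As a minimal sketch of that fix, assuming PostgreSQL and a hypothetical `impressions` table with a unique constraint on `event_id` (the table and column names are my own, not from the talk):

```go
package main

import (
	"context"
	"database/sql"
)

// RecordImpression stores at most one row per event ID. The impressions
// table is assumed to have a UNIQUE constraint on event_id, so replaying
// the same event inserts nothing instead of counting the impression twice.
func RecordImpression(ctx context.Context, db *sql.DB, eventID, adID string) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO impressions (event_id, ad_id)
		 VALUES ($1, $2)
		 ON CONFLICT (event_id) DO NOTHING`,
		eventID, adID)
	return err
}
```

The counter can then be derived from the deduplicated rows, so a retry from the user can no longer inflate it.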
Okay, let's move on. The next issue is an external request inside a transaction. It can happen when we open a database transaction and, inside that transaction, call an external service. It can lead to an exhausted database connection pool, and let's consider why. We have the advertisement service as the example again. We also have a user who creates an ad campaign. Under the hood the service starts a transaction, inserts data into the database, and sends a POST request to the external ad service. At first glance it can work, and the transaction gives us data consistency. But let's imagine there is high latency between the service and the external ad service, plus the service has a lot of users. Eventually that will exhaust the database connection pool, because we don't release our database connection until the external request is finished. So with high latency and high load, we end up with an exhausted database connection pool.

How can we fix it? We can use a queue. It basically doesn't matter what type of queue, but the eventual scheme will depend on the type. In this scheme I use a database queue, which is in fact just a table inside the same database. Now it's pretty safe for us to do these two operations inside the same transaction, because the insert and the job creation are just two SQL queries to the same database, as sketched below. We fixed the database connection pool issue, but we still need to create the data in the external ad service. To do that, we can add a worker that takes a job from the queue and sends the data to the external ad service. In this scheme we got rid of the connection pool issue, plus we have eventual consistency, so our data is consistent. Okay, that's fine. Let's move on.

The next issue is requesting a service at the same time. It can happen when multiple clients request a service at the same moment, and it can lead to an overloaded service. Let's consider an example: a client app requests a service to get new plugin versions. Everything is fine if we have just a few clients. But what if we have, for instance, 2 million clients? Usually the load from 2 million clients is distributed throughout the day, because some users use the app in the morning, some in the evening, and so on and so forth. But what if the client works like a cron job with a specified time, and the time is specified so that all the apps request our service at exactly the same moment? It can lead to an overloaded service and a temporary outage. To fix that issue, we could add some random time: instead of requesting our service at the same moment, client apps could request it at a random time, as sketched below. The load will then be distributed throughout the day, and we will not end up with an overloaded service. Okay, let's move on.

The next issue is the lack of a rate limiter. Let's consider the previous example, with the bad design on the client side, but let's also imagine we added a rate limiter in front of our service. In this case, even with bad design on the client side, the rate limiter won't let client apps overload our service, because on the rate limiter side we can specify a rule, for instance no more than 2,000 RPS. Plus we can restrict a particular user from sending us too many requests. So we won't end up with these issues.

A rate limiter is a very good thing, but now let's talk about a memory limiter. Imagine we have restricted the number of requests, but the payload of a particular request is too big. Let's consider an example in Go: a POST handler in which we read all the data from the body. What if the size of the body is too big, for instance 5 GB? Most likely we'll get an out-of-memory error in that case. So how can we fix it? We can restrict the client from sending us too many bytes, and in Go we can do that with just one line of code. With that line we specify that the body shouldn't be more than 500 KB, so instead of crashing our application, the client gets an error. So we have fixed it.
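A minimal sketch of that database-queue (outbox) approach, again with hypothetical table names and `database/sql`:

```go
package main

import (
	"context"
	"database/sql"
)

type Campaign struct {
	ID   string
	Name string
}

// CreateCampaign writes the campaign row and a pending job row in one
// transaction. Both statements go to the same database, so the transaction
// stays short and no external HTTP call holds a pooled connection open.
// A separate worker later picks up the job and calls the external ad service.
func CreateCampaign(ctx context.Context, db *sql.DB, c Campaign) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO campaigns (id, name) VALUES ($1, $2)`, c.ID, c.Name); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO jobs (campaign_id, status) VALUES ($1, 'pending')`, c.ID); err != nil {
		return err
	}
	return tx.Commit()
}
```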
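For the simultaneous-requests issue, a sketch of the client-side jitter fix; the 24-hour base interval and the one-hour jitter window are assumptions, not values from the talk:

```go
package main

import (
	"math/rand"
	"time"
)

// pollWithJitter checks for new plugin versions on a schedule, adding a
// random delay each cycle so millions of clients don't all hit the
// service at the same instant.
func pollWithJitter(check func()) {
	for {
		jitter := time.Duration(rand.Int63n(int64(time.Hour)))
		time.Sleep(24*time.Hour + jitter)
		check()
	}
}
```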
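For the rate limiter, a sketch using `golang.org/x/time/rate` with the talk's example rule of no more than 2,000 RPS; the burst size is an assumption, and a real setup would also keep per-user limiters:

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// rateLimit rejects requests above roughly 2,000 per second with HTTP 429
// instead of letting misbehaving clients overload the service behind it.
func rateLimit(next http.Handler) http.Handler {
	limiter := rate.NewLimiter(rate.Limit(2000), 2000)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```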
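And for the memory limiter, a sketch of the POST handler with the one-line body cap the talk refers to, `http.MaxBytesReader`, set to 500 KB:

```go
package main

import (
	"io"
	"net/http"
)

// handler caps the request body at 500 KB: an oversized payload makes
// io.ReadAll return an error instead of exhausting the server's memory.
func handler(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, 500*1024) // the one-line fix
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	_ = body // process the payload here
}
```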
Okay, now let's talk about retries, and let's consider a very simple example: just a client, a service, and an external dependency. A request fails, which obviously can happen because of network issues, a temporary outage, or something like that. We should handle the cases when a request fails. How can we do that? We can retry: send the same request again. And if we don't do that, if we don't use a retry policy, we could end up with a high error rate and poor user experience.

Okay, let's move on. The next issue is connected with retries too, but a bit deeper. What if we have added retries but haven't added backoff? For instance, we send a request to the external service and the request fails. Then we retry, and the next request fails too, and the third one fails too. We have retries, which is fine, but what if the service is overloaded? In that case we are making things worse, not better, because we don't let the service recover. So instead of just sending requests one after another, we can use a backoff strategy.

Here are the backoff strategies. The first one is linear: a linear strategy is about waiting for some constant time, for instance 1 second between the first and second requests. The next one is linear with jitter; it's pretty much the same as linear, but instead of waiting for 1 second, we wait for 1 second plus a random time, like 1 second plus a random delay between 10 and 20 milliseconds, for instance. The next strategy is exponential. Exponential is about waiting not just a static time: we wait 1 second between the first and second requests, 2 seconds between the second and third, then 4 seconds, then 8, and so on and so forth, doubling our waiting time after every try. And the last one is exponential with jitter, which is the same as exponential but with a random time added, like in the second approach. So that's pretty much it. Thank you for your attention.
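A minimal sketch of those four strategies in Go; the 1-second base and the 10 to 20 millisecond jitter window follow the talk's examples, while the function shape is my own:

```go
package main

import (
	"math/rand"
	"time"
)

// backoff returns how long to wait before retry number attempt (0-based).
func backoff(strategy string, attempt int) time.Duration {
	const base = time.Second
	jitter := func() time.Duration {
		// A random extra delay between 10 and 20 milliseconds.
		return 10*time.Millisecond + time.Duration(rand.Int63n(int64(10*time.Millisecond)))
	}
	switch strategy {
	case "linear":
		return base // constant 1s between attempts
	case "linear-jitter":
		return base + jitter() // 1s plus a little randomness
	case "exponential":
		return base << attempt // 1s, 2s, 4s, 8s, ...
	case "exponential-jitter":
		return base<<attempt + jitter() // doubling plus randomness
	default:
		return base
	}
}
```

A retry loop would then sleep for `backoff(strategy, attempt)` between attempts, giving an overloaded dependency time to recover.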
...

Kirill Parasotchenko

Senior Software Engineer @ Delivery Hero

Kirill Parasotchenko's LinkedIn account


