Conf42 Site Reliability Engineering (SRE) 2024 - Online

Resilience in Fintech: Scala-Powered Strategies for Building Fault-Tolerant Systems

Abstract

Explore the forefront of fintech SRE with Dmitry Pakhomov from Tinkoff, as he shares Scala-based methodologies for crafting resilient, high-performance APIs. Learn how to leverage observability, resilience, and fault tolerance patterns to ensure your systems thrive under pressure.

Summary

  • Resilience patterns are practices that help software survive when things go wrong. Bulkhead can be used to isolate any kind of resource. Bulkheads can help you isolate failures and teach your application to gracefully degrade in case of emergencies.
  • The next pattern is caching. A cache is a high-speed data buffer that stores the most frequently used data. Caches stored in memory have virtually zero chance of error, unlike fetching data over the network, and caching also reduces network traffic. Why would we need fast startup in the first place?
  • The next pattern is fallback. The idea is to have a backup data source, probably with lower quality of data, from which the data can be retrieved when the primary source is unavailable. A fallback cache stores the last successful response from the service, so the user will be able to see some data instead of an error.
  • What if, instead of caches, in case of error we simply send the request again? This is a pattern called retries. The circuit breaker pattern is a mechanism that detects an increase in error rate. By implementing these patterns, we can fortify our application against emergencies.
  • That's it. Thank you for watching. Cheers. Bye.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Good day. Let's talk about resilience. My name is Dima. For the past five years I've been developing high-load fintech applications that process a huge amount of user money and have the highest requirements for stability and availability. During this time, my team and I have encountered a lot of problems and corner cases, and we've come up with our own recipes for resilience patterns: which ones work best for us, and how we prefer to cook them so we can sleep peacefully at night, knowing that our systems are equipped to survive any disaster. What is resilience? Resilience patterns are practices that help software survive when things go wrong. They are like safety nets that make sure your application keeps running even when it is facing problems. So let's take a look at them and see if they can help your software be better.

Starting with bulkhead. Imagine we have a simple, ordinary application. We have several backends behind us to get data from, and inside our application we have several HTTP clients connected to those backends. What can go wrong? A simple application, right? But it turns out they all share the same connection pool, and they share other resources like CPU and RAM. What will happen if one of the backends experiences some sort of problem resulting in high request latency? Due to the high response time, the entire connection pool will be completely filled by requests awaiting responses from backend one, and requests to the healthy backend two or backend three won't be able to be sent, because the pool is exhausted. So a failure of one of our backends will result in a failure across the entire functionality of our application. But we don't want that. We would like only the functionality associated with the failing backend to experience degradation, while the rest of the application continues operating normally.

To protect our application from this problem, we can use the bulkhead pattern. This pattern originates from shipbuilding, where a hull is divided into several compartments isolated from each other. If a leak happens in one compartment, it fills with water, but the other compartments remain undamaged. How can we apply this idea to our example? We can introduce individual limits on the number of concurrent requests for each HTTP client. Therefore, if one backend starts to slow down, it will lead to degraded functionality related only to that HTTP client, but the rest of the application will continue to operate normally.

Bulkhead can be used to isolate any kind of resource. For instance, you can limit the resources consumed by background activities in your application, or you can even set restrictions on the number of incoming requests coming to your application. An upstream service or frontend may also experience some kind of failure, for example a reset of its caches, and that may lead to a critical increase in traffic coming to your API. An input limit ensures that your application won't crash due to out-of-memory issues and will continue to function, even though it will be responding with errors to requests exceeding the limit. You can implement a simple bulkhead just by using a semaphore. Here's an example of Scala code that uses one semaphore to limit concurrent requests and another semaphore to create a queue of pending requests. Such a queue can be useful for smoothing out short-term spikes in traffic and avoiding error spikes. So that was bulkhead. Bulkheads can help you isolate failures and teach your application to gracefully degrade in case of emergencies.
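The slide code itself is not reproduced in this transcript, so here is a minimal sketch of that two-semaphore bulkhead, assuming plain java.util.concurrent.Semaphore and a synchronous call; the class name and limits are illustrative, not taken from the talk.

```scala
import java.util.concurrent.Semaphore

final class Bulkhead(maxConcurrent: Int, maxQueued: Int) {
  // Permits for requests actually in flight to the backend.
  private val inFlight = new Semaphore(maxConcurrent)
  // Permits for in-flight requests plus a bounded waiting queue.
  private val queue = new Semaphore(maxConcurrent + maxQueued)

  def run[A](call: => A): Either[String, A] =
    if (!queue.tryAcquire()) Left("bulkhead full, rejecting request")
    else
      try {
        inFlight.acquire() // wait in the bounded queue for a free slot
        try Right(call)
        finally inFlight.release()
      } finally queue.release()
}

object BulkheadDemo extends App {
  // One bulkhead per HTTP client, so a slow backend only exhausts its own limit.
  val backend1 = new Bulkhead(maxConcurrent = 10, maxQueued = 20)
  println(backend1.run("response from backend 1"))
}
```

The bounded queue (the second semaphore) is what smooths out short traffic spikes: requests beyond the concurrency limit wait for a slot instead of failing immediately, and only requests beyond the queue size are rejected outright.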
The next pattern is caching. Caching is a vast topic in software engineering, so today we will focus only on application-level caching. Let's return to our example. We still have an application that communicates with several other applications. Let's assume we have a sufficiently high SLA in our backends and they have a very low probability of error. Now consider a scenario where an operation requires querying all these backends to be completed. Each of the backends can independently respond with an error, and because those errors happen independently, the resulting probability of error in our application will be higher than the probability of error in each individual backend. We can increase the reliability of our application by adding a cache. A cache is a high-speed data buffer that stores the most frequently used data, allowing us to avoid fetching it from a potentially slow source every time. Caches stored in memory have virtually zero chance of error, unlike fetching data over the network, and caching also reduces network traffic, lowering the chance of error even more. As a result, we can achieve an even lower error rate in our application than in our backends. Additionally, in-memory caches are much faster than the network, which reduces the latency of our application; it's a small bonus.

Such caches are excellent for non-personalized data, such as a news feed or other data that is the same for all our users. But what if we want to use in-memory caches for personalized data, for user profiles, for example, or personalized recommendations or something like that? In that case, we need some sort of sticky sessions to ensure that all requests coming from a user always go to the same instance of our application, so that it can use the cached personalized data of that user. The good news is that for this scenario we don't need any complex sticky session mechanism; we can tolerate minor corner cases and minor traffic rebalancing. Therefore it will suffice to use, for example, a stable load balancing algorithm at your balancer, without the need for any complex systems to manage sticky sessions. For example, we can use consistent hashing: in the event of a node failure, consistent hashing ensures that only the users who were tied to the failed node will be rebalanced, rather than rebalancing all users. That's it. Now we can use our caches for all types of data, personalized and non-personalized.

But let's take a look at another scenario. What if the data we want to cache is used in every request we handle? It could be information about access policies or subscription plans, or any other crucial entity in our domain. In this case, the source of this data can become a major point of failure in our system, even if we fetch it through the cache. If the volume of data in the source is not very large, we can consider fully replicating this data directly into the memory of our applications. At the start of our application, we download a snapshot of this data and then we receive updates from some sort of topic. This way we can increase the reliability of accessing this data, because every retrieval will be done from memory (zero error probability), and it is still very fast because it is memory.
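As a rough illustration of that snapshot-plus-updates replica, here is a minimal sketch; the Policy type, the snapshot loader, and the update feed are hypothetical stand-ins for whatever data source and topic a real service would use.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical domain entity that every request needs.
final case class Policy(userId: String, allowed: Boolean)

final class PolicyReplica(loadSnapshot: () => Map[String, Policy]) {
  // Full snapshot is loaded once at startup and kept in local memory.
  private val state = new AtomicReference[Map[String, Policy]](loadSnapshot())

  // Called by the consumer of the updates topic for every change event.
  def applyUpdate(p: Policy): Unit =
    state.updateAndGet(m => m.updated(p.userId, p))

  // Every read is served from local memory: no network hop, no error to handle.
  def policyFor(userId: String): Option[Policy] = state.get().get(userId)
}

object ReplicaDemo extends App {
  val replica = new PolicyReplica(() => Map("alice" -> Policy("alice", allowed = true)))
  replica.applyUpdate(Policy("bob", allowed = false)) // e.g. driven by a topic consumer
  println(replica.policyFor("bob"))
}
```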
However, since our application will need to download some data at startup, we violate one of the principles of the twelve-factor application, which states that applications should start up quickly. But we don't want to give up the advantages we gain from using this type of cache, so let's think whether there is anything we can do about this issue. Why would we need fast startup in the first place? Fast startup is needed so that platforms like Kubernetes can quickly migrate your application to another physical node, for example. But platforms like Kubernetes can already handle slow-starting applications, by using startup probes, for example.

Another issue we may encounter is updating the configuration of running applications. Often, in order to fix some problem, we need to change cache lifetimes or request timeouts or some other configuration properties. Let's say we know how to quickly deliver updated configuration files to our application; we still need to restart the application to apply the new configuration, and the rolling update can now take a very long time. We don't want to make our users wait for a fix to be applied, so what can we do? What if we could teach our application to reload configuration on the fly, without restarting it? It turns out it is not so hard to do. We can store the configuration in some concurrent variable and have a background thread periodically update this variable (sketched below). Sounds simple, right? However, to ensure that certain parameters, such as timeouts for HTTP clients, take effect, we also need to reinitialize our HTTP clients or database clients when the corresponding config changes, and that may be a challenge. But some clients, like the Cassandra driver for Java, already support reloading their configuration on the fly, so a reloadable config can mitigate the negative impact of long application startup times. And as a small bonus, it has other use cases: we can use it to implement feature flags, for example.
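A minimal sketch of such a reloadable config, assuming a plain AtomicReference refreshed by a scheduled background thread; the AppConfig fields, the loader, and the refresh interval are illustrative, not from the talk. As noted above, parameters like HTTP client timeouts additionally require reinitializing the corresponding clients when the value changes.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference
import scala.concurrent.duration._

// Hypothetical config shape; loadConfig stands in for however updated files reach the instance.
final case class AppConfig(requestTimeout: FiniteDuration, cacheTtl: FiniteDuration)

final class ReloadableConfig(loadConfig: () => AppConfig, refreshEvery: FiniteDuration) {
  private val current   = new AtomicReference[AppConfig](loadConfig())
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Background thread periodically refreshes the shared variable.
  scheduler.scheduleAtFixedRate(
    () => current.set(loadConfig()),
    refreshEvery.toMillis, refreshEvery.toMillis, TimeUnit.MILLISECONDS
  )

  // Readers always see the latest snapshot without restarting the application.
  def get: AppConfig = current.get()
}

object ConfigDemo extends App {
  val cfg = new ReloadableConfig(() => AppConfig(2.seconds, 5.minutes), refreshEvery = 30.seconds)
  println(cfg.get.requestTimeout) // re-read on every access, so changes apply without a restart
}
```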
The next pattern is fallback. Let's take a look at another scenario. We receive a request, send a request to a backend or to a database, and get an error in return, so we respond to the user with an error message as well. However, it is often better to show outdated data with a message like "we are experiencing delays in updating information" or something like that, rather than displaying a big red error box to the user, right? To improve our behavior in this case we can use the fallback pattern. The idea is to have a backup data source, probably with a lower quality of data, from which the data can be retrieved when the primary source is unavailable. Often a fallback cache is used as the backup data source. This cache stores the last successful response from the service, and when the main service is unavailable we can return this last successful response to the user. So now the user will be able to see some data instead of an error, and the team responsible for the broken backend will have more time to fix the issue.

Let's talk about retries. Let's rewind a little and go back to our example where we were trying to reduce the probability of errors using caches. What if, instead of caches, in case of an error we simply send the request again? This is actually a pattern called retry, and it can also help us reduce the likelihood of errors in our application. Retries are often easier to implement, because when you use caches you often need to invalidate them when the data changes, and cache invalidation can become a very complex task for your system; it is considered one of the most challenging tasks in software engineering. That's why sometimes it's simpler to just retry a failed request.

However, what happens if one of our backends experiences a failure? We will start retrying requests to that backend, which will multiply its traffic several times over, and it is very likely that this backend wasn't designed to handle such a sudden spike in traffic, so we will probably make the failure even worse. Therefore, along with retries, the circuit breaker pattern must be used. It's a mechanism that detects an increase in error rate, and if the error rate exceeds a certain threshold, requests to the downstream service are blocked for a period of time. After this time, we let one or more requests through and check whether the service has recovered. If it has, we start allowing traffic again; otherwise we block requests for another period of time. Often retries and circuit breakers are implemented at the infrastructure level, at load balancers for example. However, infrastructure usually doesn't have the full error context. For instance, it's not always possible to generically determine whether an error can be retried, or whether it should be counted as an expected error or a bad one toward the circuit breaker threshold. Therefore, sometimes it is necessary to move retries and circuit breakers inside the application to have the full context for making decisions about error classification (see the sketch after the transcript).

So, to summarize: by implementing these patterns, we can fortify our application against emergencies, maintain high availability, and deliver a seamless experience to our users. Additionally, I would like to remind you not to overlook telemetry. Good logs and metrics can enhance the quality of your services by a lot. That's it. Thank you for watching. Thank you for your time. Cheers.
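Rounding out the last two patterns above, here is a minimal retry-plus-circuit-breaker sketch; the consecutive-failure counter and the thresholds are simplifications of the error-rate logic described in the talk, and all names are illustrative.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}
import scala.util.{Failure, Success, Try}

// After maxFailures consecutive failures the breaker opens for openMillis;
// the first call after the cooldown is the trial request that can close it again.
final class CircuitBreaker(maxFailures: Int, openMillis: Long) {
  private val failures = new AtomicInteger(0)
  private val openedAt = new AtomicLong(0L)

  def protect[A](call: => A): Try[A] = {
    val now = System.currentTimeMillis()
    if (failures.get() >= maxFailures && now - openedAt.get() < openMillis)
      Failure(new RuntimeException("circuit open, request blocked"))
    else
      Try(call) match {
        case ok @ Success(_) =>
          failures.set(0) // success closes the circuit again
          ok
        case err @ Failure(_) =>
          if (failures.incrementAndGet() >= maxFailures) openedAt.set(now) // (re)open for another period
          err
      }
  }
}

object Retries {
  // Retries.retry(3)(breaker.protect(callBackend())) stops adding traffic once the circuit opens.
  def retry[A](attempts: Int)(call: => Try[A]): Try[A] =
    call.recoverWith { case _ if attempts > 1 => retry(attempts - 1)(call) }
}
```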
...

Dmitrii Pakhomov

Architect & Scala TechLead @ Tinkoff

Dmitrii Pakhomov's LinkedIn account


