Conf42 DevOps 2024 - Online

Enhancing a Distributed SQL Database Engine: A Case Study on Performance Optimization


Abstract

Learn how we optimized a distributed SQL database engine, focusing on benchmark-driven improvements and pivotal testing strategies.

Summary

  • Alexey Ozeritskiy talks about performance optimization of a distributed SQL engine. He gives background on the YDB engine itself and where it is used. The final part of his talk is about containerization and performance.
  • The distributed SQL engine is used by Yandex Cloud, handling about 600,000 queries and eight petabytes of data per day. Even a small improvement of the engine, even by 1%, could give big value.
  • YQL consists of four main components. The parser parses a query and constructs an abstract syntax tree. The execution plan is a graph, and each of its stages is divided into tasks. To see improvements in the performance of the engine, continuous integration (CI) has to be set up.
  • Performance metrics were collected from running processes with the perf profiler and the stackcount and memleak utilities. One join-heavy TPC-H query improved from 15 seconds to 7 seconds, which is a very big improvement.
  • The fastest IPC mechanism in Linux is memory-mapped files, and with the help of the vmsplice call you can read from and write to pipes very fast. Next steps are TPC-H at terabyte scale and the TPC-DS benchmark, for which plan-level optimizations will be important.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Alexey Ozeritskiy and today I'm going to talk about performance optimization of the distributed SQL engine which is used by the YDB project. First, a few words about me. I'm a software engineer, I have been working with distributed systems for many years, and since the beginning of last year I have been deeply involved in the YDB project, working on optimization of its distributed SQL engine. Here is my agenda for today. First, I'll give some background information about the YQL engine itself and where it is used. Then I'll discuss my testing methodology and my investigations. The final part of my talk will be about containerization and performance. Let's get started.

First, a few words about the distributed SQL engine. The distributed SQL engine used by YDB is called YQL, or YDB Query Language. YQL is a library which was designed to parse and execute SQL queries. Currently it is used by four projects. The first is YDB itself: YDB is a distributed open source SQL database. YQL is also used by YTsaurus, an open source big data platform similar to Apache Hadoop; with the YQL engine, YTsaurus can provide functionality similar to Apache Hive. There are also two projects similar to Google's BigQuery. The first is an internal Yandex service which is also called YQL, and the second is Yandex Query. Yandex Query is likewise similar to Google BigQuery and is used by Yandex Cloud for external customers; it is an open source project and part of the YDB project.

Now let's have a look at these numbers. I got these statistics from the internal Yandex YQL service, and they show that we process a lot of queries: about 600,000 queries per day and eight petabytes of data per day. These numbers are huge, so even a small improvement of our distributed SQL engine, even by 1%, could give big value.

Now I'll give a brief introduction to the YQL architecture. YQL consists of four main components: the parser, the execution plan builder, the execution layer and the compute layer. The parser parses a query and constructs an abstract syntax tree, or AST. The plan builder takes the AST and constructs an execution plan, and it can also optimize this plan. The plan is then executed by the execution layer, and the execution itself is carried out by the compute layer. The compute layer handles the execution of individual plan nodes and is responsible for computations such as SQL filters, projections, expressions, SQL functions, joins and so on.

Now let's have a look at a very basic example. In this example we want to join two tables, filter the result and get the top rows. Let's look at the YQL execution plan for this query. The execution plan is a graph; the graph consists of nodes, or stages, and each stage is divided into tasks. As you can see, the leaves of this graph read the customers and orders tables, and some read stages can contain a filter, like this orders read stage. After the read stages we have a join stage, then an aggregate and sort stage, and the final stage.
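To make the stage and task structure a bit more concrete, here is a small, purely hypothetical C++ sketch of how such a plan could be modeled; the type names, field names and stage names are invented for illustration and are not the actual YQL or YDB data structures.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative model of the structure described above: the plan is a
    // graph of stages, and each stage is split into parallel tasks.
    struct Task {
        std::uint32_t Id;                  // task instance within its stage
    };

    struct Stage {
        std::string Name;                  // e.g. "ReadOrders", "Join"
        std::vector<std::size_t> Inputs;   // indices of upstream stages
        std::vector<Task> Tasks;           // the stage is divided into tasks
    };

    struct ExecutionPlan {
        std::vector<Stage> Stages;         // the whole plan is a graph of stages
    };

    // The example query (join two tables, filter, take the top rows) could
    // roughly correspond to a plan like this:
    ExecutionPlan MakeExamplePlan() {
        return ExecutionPlan{{
            {"ReadCustomers", {},     {{0}, {1}}},  // leaf stage scanning customers
            {"ReadOrders",    {},     {{0}, {1}}},  // leaf stage scanning orders, filter pushed down
            {"Join",          {0, 1}, {{0}, {1}}},  // joins the two read stages
            {"AggregateSort", {2},    {{0}}},       // aggregation and sorting
            {"Result",        {3},    {{0}}}        // final stage returning the top rows
        }};
    }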
Now let's have a look at my testing methodology. For testing I use a benchmark-driven approach. This approach has several advantages. First of all, it gives us metrics, for example the execution times of our benchmark queries. With the help of a benchmark we can find bottlenecks. A benchmark is also a good scalability test: we can tune its scale parameter and see how well our system scales with the amount of data. A benchmark also provides some real-world simulation, so if we improve the benchmark results we will improve real users' tasks. And benchmarks are vendor neutral, so we can run the same benchmark on different systems and compare our system with its competitors.

This is the benchmark that I used: TPC-H. It is a very famous benchmark for OLAP systems, that is, analytical databases. The benchmark consists of 22 SQL queries and eight tables, and it comes with a data generator. On the right side you can see the TPC-H database schema.

Now let's consider the TPC-H data generator, the DBGen tool. With DBGen you can generate data of any size: it has a -s, or scale, parameter, and for instance scale 100 means generating roughly 100 GB of data. There are also two very useful keys, -C and -S, which split the generation into parts, and with their help you can generate a really big amount of data on a MapReduce system. Of course, I generated everything on MapReduce, then converted the generated data to Parquet format and uploaded it to S3 storage. Here you can see the properties of my Parquet files, such as compression, row group size and table split parts.

Now let's talk about continuous integration. To see improvements in the performance of our engine we have to set up continuous integration, or CI. For CI I used virtual machines and the TPC-H benchmark at a small scale of ten. I run this CI daily on the Parquet files, and I also set up per-commit runs and commit-to-commit comparison. As you can see on this graph, when I started we could not pass some TPC-H tests; for example, we had a lot of problems with queries 9 and 21, and we had a lot of issues at scale 100. This graph was built at scale ten.

For running this CI pipeline I used a utility which can execute the whole engine in one process. This is possible in our architecture because we use the so-called actor approach: the tasks in the execution plan are actually actors, and these actors can run in a single process or on a large distributed system. Let's look at this utility, which was used for testing, for CI and, in fact, for everything. It is called dqrun, or distributed query run. It can run all components of the distributed engine in a single process and is designed to execute SQL queries on Parquet files. It does not contain many of the layers of the big YDB project; for example, it has no transactional layer, replication layer, storage layer and so on.

For running benchmarks with network interaction I implemented two more utilities, the service node and the worker node. To run a test with network interaction you start one or more worker node instances and exactly one service node instance. Worker nodes are responsible for the compute part; the service node is the execution part and controls the compute layer running in the worker nodes. To run a test in this distributed configuration you also need an execution plan, and the plan is constructed with the help of the dqrun utility: you run dqrun, dqrun constructs the execution plan, then it sends the plan to the service node, and so on. To achieve this, you pass two additional parameters to dqrun, the host and the port of the service node.

Now let's consider what we actually measure. For measurements I use Unix-bench-style measures: I execute each test N times, discard the lowest third of the results and calculate the final value as the geometric mean of the remaining results. This is a very effective method for getting a reliable performance measure.
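As a rough sketch of this aggregation scheme, the following hypothetical C++ snippet keeps two thirds of the timings and reduces them with a geometric mean; whether the discarded third is the slowest runs or the fastest ones is an assumption made here for illustration, not something stated in the talk.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // UnixBench-style aggregation as described above: run a test N times,
    // drop a third of the runs and take the geometric mean of the rest.
    // Dropping the slowest runs is this sketch's assumption.
    double AggregateRunTimes(std::vector<double> seconds) {
        if (seconds.empty()) {
            return 0.0;
        }
        std::sort(seconds.begin(), seconds.end());                 // fastest first
        const std::size_t keep = seconds.size() - seconds.size() / 3;
        double logSum = 0.0;
        for (std::size_t i = 0; i < keep; ++i) {
            logSum += std::log(seconds[i]);
        }
        return std::exp(logSum / static_cast<double>(keep));       // geometric mean
    }

    // Example: AggregateRunTimes({15.2, 14.8, 15.0, 21.3, 15.1}) drops the
    // slowest of the five runs and averages the remaining four geometrically.

Using a geometric mean rather than an arithmetic one makes the aggregate less sensitive to a single outlier run.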
Let's move on. These are our target values. When you are improving something, you need to compare your numbers against something, and I think the best approach is to compare them with your competitors' numbers. I came across an article with a benchmark of three databases, YDB, Greenplum and Apache Spark, on TPC-H 100, and it provides the following numbers, so I used them as my target values. Since that benchmark was run on 120 cores, I decided to use similar hardware. This is my hardware, which I used both for the final results and for debugging: a big machine with two Xeon processors, a total of 64 cores or 128 threads, and 512 GB of RAM.

Okay, let's move on to my investigations. I'll focus only on the most meaningful low-level improvements, because I find them especially interesting; of course, I worked on low-level improvements as well as high-level plan improvements.

First of all, let's consider the tools I used. There are three tools. The first one is the perf utility, a well-known Linux profiler; with it you can collect performance metrics from running processes. I also used two more utilities, stackcount and memleak. stackcount is very useful, especially when used in a pair with the perf top command: in perf top you can see the hottest functions, and with stackcount you can find out who calls these hottest functions. The memleak utility is also very useful: with it you can see the memory consumption of different parts of your code, which makes it very easy to debug problems such as incorrect behaviour of the back pressure mechanism that is often used during communication between tasks.

There are also more Linux performance tools. First of all, there is the BCC project; the picture you can see here was taken from the BCC project, and I think it is a well-known picture. BCC uses the eBPF functionality of newer Linux kernels. It provides a C library and Python bindings, and with their help you can collect any performance metrics of your program. Of course, this project is very low level, so a lot of useful utilities have been created on top of BCC; for example, in the BCC repository you can find special utilities for performance analysis of databases such as MySQL and PostgreSQL. There is also a similar tool, bpftrace. bpftrace is more high level, because it is implemented as a programming language that looks a lot like awk. And there is a very useful script by Brendan Gregg called flame graph, with which you can visualize the output of the BCC utilities and bpftrace.

Let's move on to my first investigation. I ran a TPC-H query which contains a join, collected perf counters and built this flame graph, and I saw that our GraceJoin algorithm consumed a lot of CPU time. If you zoom in, you will see that there is nothing interesting there, just the AddTuple function.
It's very difficult to see anything more on this flame graph. You could, of course, open your code and read it line by line, but a better solution is to use perf report to look at the raw perf data. Let's do it. This is perf report, and with perf report you can zoom into your code. Let's zoom into the AddTuple function. Here it is, and you can see that an atomic fetch_add consumes a lot of CPU. This is very strange; when I first looked at it I felt that something was wrong, because it looked like a mistake, and it actually was a mistake: when this code was written, someone added these atomic counters for debugging purposes and forgot to remove them. Here is the patch: the atomic counters were simply removed, and I got this impressive performance improvement. As you can see, query 9 improved from 15 seconds to 7 seconds, and all the other queries with joins improved by about half. I think this is a very big improvement.

Let's move on to my second investigation. When I run benchmarks I like to run the perf top command in parallel, to see the hottest functions in real time. Once, when I ran perf top, I saw a kernel symbol, osq_lock, shown in perf top, and that was very strange. To investigate who was calling this osq_lock I used the stackcount utility, and it showed me that the osq_lock comes from the mmap system call. But why do we call mmap so often? The answer was very simple: because we use our own memory allocator in our compute layer. Why do we do this? First of all, our memory allocator is optimized for concurrency and for running in a multithreaded environment; in theory it should work very fast, although in practice it wasn't that fast. Second, we create an allocator instance per query. This approach has some advantages: we isolate queries from each other, and it becomes much easier to allocate and release memory. With this approach you can even write exception-unsafe code in the compute layer, because at the end of the query all its memory will be released automatically.

Now let's have a look at our problem, the high frequency of mmap calls. I solved it in a very simple way: I just started to allocate 32 memory pages per allocation call. Before this we allocated one memory page on each call; now I allocate 32 pages, return one page to the caller and store the remaining pages in a cache, and the next time the caller asks the allocator for memory, it gets a page from the cache. This is a very simple patch, and here you can see the performance improvement; I think this is a very big improvement as well.
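The idea behind this patch can be sketched roughly as follows; this is a simplified, hypothetical illustration of page batching on top of mmap, not the engine's actual allocator, and all names are invented.

    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <new>
    #include <vector>

    // Sketch of the page-batching idea described above: instead of issuing
    // one mmap syscall per requested page, grab a batch of 32 pages at once
    // and serve later requests from a local cache.
    class BatchingPageAllocator {
    public:
        void* AllocatePage() {
            if (Cache_.empty()) {
                Refill();                      // one syscall covers the next 32 pages
            }
            void* page = Cache_.back();
            Cache_.pop_back();
            return page;
        }

    private:
        static constexpr std::size_t PagesPerBatch = 32;

        void Refill() {
            const std::size_t pageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
            void* mem = mmap(nullptr, PagesPerBatch * pageSize,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (mem == MAP_FAILED) {
                throw std::bad_alloc();        // error handling kept minimal in this sketch
            }
            char* base = static_cast<char*>(mem);
            for (std::size_t i = 0; i < PagesPerBatch; ++i) {
                Cache_.push_back(base + i * pageSize);
            }
        }

        std::vector<void*> Cache_;             // pages waiting to be handed out
    };

The point of the sketch is only that a single mmap call now covers the next 32 page requests, so the syscall rate, and with it the contention visible as osq_lock, drops by roughly that factor.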
Here is the final execution time. With YQL I got 154 seconds; this run was on Parquet files. With YDB I got 209 seconds, where YDB means running the benchmark on a YDB cluster. As you can see, we outperformed our competitors.

Now let's move on to the final part of this talk. First of all, let's have a look at a very interesting feature of our engine: an SQL query with an embedded Python script. You can switch this feature on in the engine's configuration. As you can see, a user can execute any Python code; we don't put any limitations on this, so, for example, a user can use the ctypes library to call any C code. In this example the user uses ctypes to dereference an invalid pointer and, as a result, gets a segmentation fault. The problem is that one binary of our engine can execute many queries of different users, and if one query crashes, the other queries will crash too. This is very bad and we wanted to resolve it.

To resolve it, we use the following execution scheme. Recall that our execution plan consists of stages and each stage is divided into tasks. Now let's divide each task into two components: the first component is responsible for network interactions, and the second component is responsible for the computation itself. For the second component we start a container. So we have a lot of containers, one container per task. These containers hold the compute part, and the tasks communicate with the containers through Unix pipes. We use bidirectional communication between tasks and containers, so each task uses two pipes per container, one pipe for input and the other for output. I think this is a very simple scheme.

Of course, I ran the TPC-H benchmark with this feature switched on, and I got the following number: 561 seconds. I thought that this was very slow, and my first thought was that the problem is the pipe itself. Why did I think that? Let's have a look at this well-known picture, taken from the ipc-bench project by Peter Goldsborough. As you can see, pipes are quite slow in Linux, and the fastest IPC mechanism in Linux is memory-mapped files. There is also a very interesting article by Francesco Mazzoli about how to write to pipes fast, where he shows that with the help of the vmsplice call you can read from and write to pipes very fast. So I decided to try these two techniques: first I tried to replace the pipes with memory-mapped files, with an in-memory queue on top of them, and then I tried to use the vmsplice syscall. I spent a day on it and achieved nothing, no improvements at all.

Only after that did I decide to try perf, and perf showed me the following. Let's have a look at the second square: it shows that we call some statistics code very often, and as you can see, these statistics use hash maps. It turned out that the statistics were designed to be collected once per query, at the end of the query, but they were being touched on every pipe message, and that was very slow, because the statistics are very inefficient: they use string-based hash maps, which are slow. I simply removed these statistics and got the following numbers: after the optimizations I got 223 seconds, and this was achieved with the whole benchmark running in containers, one container per query plan task. I think this is an excellent number.

What's next? First of all, we are going to work on TPC-H at terabyte scale. I think we will run into a lot of issues there, but who knows. Second, we are going to work on the TPC-DS benchmark. TPC-DS is also an OLAP benchmark, a benchmark for analytical databases, but it is more modern: it contains 99 queries, most of the queries contain joins, and a typical join involves ten tables. So for this benchmark, plan-level optimizations will be important.

That's it. Thank you for watching and listening. If you liked my talk, please hit the like button, and feel free to connect with me on social media. Thank you very much.
...

Alexey Ozeritskiy

Lead Software Engineer @ YDB



