Conf42 Python 2021 - Online

Reproducible Builds with Bazel


Abstract

If you run two builds with the same source code and the same commit but on two different machines, do you expect to get the same result? Well, in most cases you will not!

In this talk, we’ll identify sources of non-determinism in most build processes and look at how Bazel can be used to create reproducible, hermetic builds. We’ll then create a reproducible Flask application that can be built with Bazel so that the Python interpreter and all dependencies are hermetic.

Summary

  • A build is reproducible if, given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts. To achieve a reproducible build, you must remove all sources of nondeterminism. Internal randomness is an issue you have to tackle before you can achieve a reproducible build.
  • Today we will apply the concept of reproducible builds in Python. We will use Python rules to tell Bazel how to create an executable Python program. Later on we will create a Flask application, and this will allow us to understand how to manage dependencies in Python in a hermetic way.
  • Use Bazel to build Python from scratch: the http_archive rule fetches the Python sources and builds them. For writing tests in Python, we will need pytest. You don't need to create any virtual environment. Dependency management in Bazel is very straightforward.
  • Bazel allows you to build a Flask application in a reproducible way. The build shown in the talk is not fully reproducible, but there are solutions for that. From now on, don't take for granted that your build is reproducible.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. How are you doing? I hope you're enjoying Conf42 Python and all its good content. My name is Gaspare and I work on CI/CD in an autonomous driving project. Today I'm going to talk about reproducible builds with Python and Bazel.

If you run two builds with the same source code and the same commit, but on two different machines, do you expect to get the same result? Well, in most cases you will not. Today we will go through sources of nondeterminism in most build processes, and we will look at how Bazel can be used to create reproducible, hermetic builds. Then we will apply these concepts to create a reproducible Python environment and a Flask application that can be built with Bazel so that the Python interpreter and all dependencies are hermetic.

According to the Reproducible Builds project, a build is reproducible if, given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts. This means that to achieve a reproducible build, you must remove all sources of nondeterminism. Although this can be difficult, there are several benefits. A reproducible build is more secure and reduces the attack surface. You can determine the origin of a binary artifact, like what sources it was built from. And it can drastically speed up build time thanks to caching of intermediate build artifacts in large build graphs. This is not a trivial gain for big projects, and if your build is reproducible, you can safely reuse cached artifacts across machines.

To obtain a reproducible build, you need to tackle sources of nondeterminism. One of the most common causes of nondeterminism is the inputs to the build. By that, I mean everything that is not source code: compilers, build tools, third-party libraries, and any other inputs that might influence the build. For your build to be hermetic, all references must be unambiguous, either as fully resolved version numbers or as hashes.
When you get to such a situation, you can say you have a hermetic build. Your build is insensitive to the libraries and other software installed on the build machine. To be hermetic, you can start by checking in all the information needed by your build as part of the source code.

Hermetic builds also enable cherry-picking. Let's say you want to fix a bug in an older release that's running in production. If you have a hermetic build process, you can check out the old revision, fix the bug, and then rebuild the code. Thanks to hermeticity, all the build tools are versioned in the source code repository, so a project built two months ago will not use today's version of the compiler, which might be incompatible with two-month-old source code. This is very important. So you are now thinking: why be so strict with my build? Well, it may sound painful, but knowing what depends on what will pay off in the long term, and we will see that later on.

Internal randomness is an issue you have to tackle before you can achieve a reproducible build, and it can be a sneaky thing to fix. There are many sources of internal randomness, but timestamps are a common one. They are often used to keep track of when the build was done. Get rid of them. With reproducible builds, timestamps are irrelevant, since you are already tracking your build environment with source control. For languages that don't initialize values, you need to do it explicitly, to avoid randomness in your build due to capturing random bytes from memory. There's no easy way around it: you must inspect your code.

All this may sound a bit overwhelming, I know, but it's actually not as complex as it sounds. Bazel makes this process much easier. Bazel is a fast, scalable, multi-language and extensible build system, as its official website puts it. Bazel can help you achieve a reproducible build by providing off-the-shelf support for hermeticity.
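The advice about timestamps can be illustrated with a small packaging step. This is a minimal sketch and not something shown in the talk: `build_archive`, `digest` and `FIXED_DATE_TIME` are hypothetical names, and the ZIP format is only an assumed example of an output that embeds modification times. Fixing the timestamp, the file order and the file attributes makes the output bytes depend only on file names and contents.

```python
import hashlib
import io
import zipfile
from typing import Mapping

# Assumption for this sketch: every archive member gets the same fixed
# timestamp (the earliest date the ZIP format can represent).
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def build_archive(files: Mapping[str, bytes]) -> bytes:
    """Pack files into a ZIP whose bytes depend only on names and contents."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Sort the names so that insertion order cannot change the output.
        for name in sorted(files):
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE_TIME)
            info.create_system = 3            # same value on every host OS
            info.external_attr = 0o644 << 16  # fixed permissions
            archive.writestr(info, files[name])
    return buffer.getvalue()


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used to compare two build outputs."""
    return hashlib.sha256(data).hexdigest()
```

Two calls with the same files produce identical bytes, whatever the wall-clock time or the order in which the files were supplied.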
One of the key concepts behind Bazel is sandboxing. Bazel's file system sandbox runs processes in a working directory that contains only known inputs, so that compilers and other tools can't even see source files they should not access. This means that you must specify all the inputs or your build will fail. As a consequence of hermeticity, Bazel allows you to encapsulate your build targets, meaning that you can hide the internals and be sure that no one can implicitly depend on your target. Another great feature of Bazel is its caching system, which can make your repeated builds 50 times faster.

There are a few key concepts in Bazel that we need to cover before jumping to the code. The directory that contains the source files of the project is called the workspace, and it must contain a text file called WORKSPACE as well. The WORKSPACE file is where I define all the references to the external dependencies required by the build. Here, external dependencies can be anything: kernel libraries, git repositories, Bazel rules, or any other thing you may require in your build. A Bazel rule specifies the relationship between inputs, outputs, and the steps needed to build the outputs, and it is specific to the programming language you use. In our case, we will use Python rules to tell Bazel how to create an executable Python program starting from some .py files.

The code is organized in packages, and each package is a collection of targets. A package is defined as a directory containing a file named BUILD. BUILD files describe how source code can be built. Basically, when you want to build your code, you specify the package and which target you want to build, like in the example here.

As I mentioned before, today we will apply the concept of reproducible builds in Python. We will create a reproducible local environment using Python 3.8.3, which we will build from scratch. We will write a test to make sure we are using the right Python binary to build our code.
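The slide example referred to above is not captured in the transcript. As a rough sketch of how a label names a package and a target, the hypothetical helper `parse_label` below splits workspace-relative labels of the form `//package:target`. It is an illustration only, not Bazel's own parser, and it ignores labels that point into external repositories.

```python
from typing import NamedTuple


class Label(NamedTuple):
    package: str  # directory that contains the BUILD file
    target: str   # target name declared inside that BUILD file


def parse_label(label: str) -> Label:
    """Split a workspace-relative label such as //src:flask_app."""
    if not label.startswith("//"):
        raise ValueError(f"expected a label starting with '//': {label!r}")
    package, separator, target = label[2:].partition(":")
    if not separator:
        # Shorthand form: //src/app refers to the target //src/app:app.
        target = package.rsplit("/", 1)[-1]
    if not target:
        raise ValueError(f"label has no target name: {label!r}")
    return Label(package, target)
```

With the names used later in the talk, `parse_label("//test:compiler_version_test")` yields the package `test` and the target `compiler_version_test`.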
We'll be able to reuse the local environment as the foundation to develop our next Python project. Later on we will create a Flask application, and this will allow us to understand how to manage dependencies in Python in a hermetic way. So let's jump to the code.

So this is what our WORKSPACE file looks like. You need to assign a name to the workspace; here we just call it my_flask_app. We define a new variable called configure_python_based_on_os, which holds the command we need to execute to compile Python from scratch. We will use it later on. Note that here we need to make a distinction if we are running this example on macOS.

So here we can see our first Bazel rule. This is http_archive, which is a basic rule that allows us to download a compressed archive file, decompress it and use it. In our project we use http_archive to fetch Python and build it from scratch. With this we can be sure to have control over the Python binary and its version. Remember, you don't want to use the Python version installed on the host machine, or your build will not be reproducible. The hermeticity here is ensured by the urls field, which tells Bazel where to find the dependency, and the sha256 field, which is the unique identifier for it. Every build will use the same unambiguous Python version. Another important field is patch_cmds, which we use to define a sequence of Bash commands to execute. We use it to run the build command for Python using the configure_python_based_on_os variable that we defined earlier. Once we run this http_archive rule, it will fetch the pinned version of Python and build it, and we will have our 3.8.3 version of Python to use in our project.

Next, we need the Python Bazel rules to create the build and test targets. Since those rules don't come with Bazel, we need to fetch them using http_archive like this.
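The role of the sha256 field can be illustrated outside Bazel. The following is a minimal Python sketch, assuming the downloaded archive is already in memory as bytes; `verify_archive` is a hypothetical helper and not Bazel's implementation. The idea is that a fetched input is accepted only if its digest equals the pinned value, so a changed or tampered download fails loudly instead of silently changing the build.

```python
import hashlib


def verify_archive(data: bytes, expected_sha256: str) -> bytes:
    """Return data unchanged if its SHA-256 digest matches the pinned one."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )
    return data
```

Pinning by digest rather than by URL alone matters because the content behind a URL can change, while a digest identifies exactly one sequence of bytes.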
And here again we use the SHA-256 as the identifier of the version of the rules.

So, we said that we want to compile code written in Python using the Python binary we defined earlier. To do that, we need to define a new Bazel toolchain. Bazel toolchains are defined in BUILD files. Here py_runtime is used from the Python rules that we fetched before. We define the Python 3 runtime using the Python interpreter we built before, and then we use py_runtime_pair, which wraps up to two Python runtimes, one for Python 3 and one for Python 2. Since we only want to support version 3.8.3, we don't define any Python 2 runtime. We then use the py_runtime_pair to define our toolchain, the Python 3 toolchain that we can use in our project. But to use toolchains in Bazel, you need to register them, and you do that at the end of the WORKSPACE file with this line. Remember, the register_toolchains call must always be at the end of the WORKSPACE file.

Nice. You now have a hermetic Bazel build environment set up, but don't just take my word for it. Let's write a test. For writing tests in Python, we will need pytest. So let's add the requirements.txt file like this, and along with pytest we need all its child dependencies. This is a normal requirements file like the ones you use daily in Python. But since we want to be hermetic, we need to pin the versions and add a hash as an identifier for hermeticity. This means that when Bazel tries to build the test, it will look for the exact version of the dependency we want to use. In this example, if Bazel can't find a library called pytest with version 5.4.1 and this exact hash, the build will fail due to a missing dependency.

Okay, now we can modify the WORKSPACE file again. We add pip_install. pip_install is a rule for Python dependencies: it allows importing pip dependencies from a requirements.txt file, but by default it uses the Python interpreter that is on the host machine.
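The pinning requirement on requirements.txt described above can be checked mechanically. The sketch below is an assumption-laden illustration, not part of the talk's code: `unpinned_requirements` is a hypothetical helper that flags every entry lacking an exact `==` version or a `--hash=` option, and it handles only plain entries with backslash continuations and whole-line comments.

```python
import re
from typing import List

# A requirement counts as pinned here when it has the form name==version,
# optionally with extras such as name[extra]==version.
_PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^\s;]+")


def unpinned_requirements(text: str) -> List[str]:
    """Return the entries that lack an exact version or a --hash option."""
    # Join backslash-continued lines so each requirement is one logical line.
    logical = text.replace("\\\n", " ")
    problems = []
    for raw in logical.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not _PINNED.match(line) or "--hash=" not in line:
            problems.append(line.split()[0])
    return problems
```

An empty result means every entry names one exact version and carries at least one hash, which is the property the talk relies on for hermeticity.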
We can override this behavior by passing python_interpreter_target, the interpreter that we just built from scratch before. Cool. Now everything is ready to write the test.

Let's create a new folder called test and a file called compiler_version_test.py. This is a very simple test that checks that the Python executable is present and that the version is correct. To include the test in the build process, we need to add a BUILD file, so we add it to the test folder. Here we define a py_test target; py_test is just a way to tell Bazel that we want to create a test that uses Python. We need to specify a name for it, and we use compiler_version_test, along with the source files needed to compile and execute the test. In this case it's just compiler_version_test.py. We also need to define the dependencies that are needed for the test. We load dependencies using the requirement function, which maps a pip package name to a label and saves us from hard-coding the dependency's label. Dependency management in Bazel is very straightforward. You don't need to create any virtual environment, you don't need to run pip install or any other hacky thing. Just list the dependencies under the deps field and you're done. Note that up to this point everything is explicit, so this will ensure reproducibility of the build.

Okay, we can now run our first Bazelized Python test. From the project root, this is the way we use Bazel to run tests, and we need to specify which test we want. So we run bazel test on the package test; the target is called compiler_version_test. Okay, so this runs the test. As you can see, the test is passing. This means that we are using the right version of the Python executable. You can notice here that it says cached. This is because I executed this test before, so now it doesn't execute it again. Since I didn't modify anything in the test, it's just using the cached result.
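The test file itself is not reproduced in the transcript. A minimal sketch of what such a check could look like follows; `check_interpreter` is a hypothetical helper, and the real compiler_version_test.py may be written differently. It verifies that the running interpreter exists on disk and reports the expected version.

```python
import os
import platform
import sys


def check_interpreter(expected_version, executable=None, actual_version=None):
    """Return True if the interpreter exists and reports the expected version."""
    # Default to the interpreter that is running this code.
    executable = sys.executable if executable is None else executable
    actual_version = (
        platform.python_version() if actual_version is None else actual_version
    )
    return (
        bool(executable)
        and os.path.isfile(executable)
        and actual_version == expected_version
    )
```

In the setup described in the talk, a pytest test would then assert `check_interpreter("3.8.3")`, which only passes when Bazel runs the test with the interpreter built from the pinned sources.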
Up to this point we went through the foundation of a Bazelized environment using Python, but let's see something more complex and closer to a real use case. We can create a new folder called src with a new file, flask_app.py. This is a simple Flask application that shows the binary path and the Python version of the host machine along with the ones used by Bazel. We can then check that the two paths are different.

To build it, we need a BUILD file. So let's add a BUILD file under the src folder, and this time we are creating a binary, so we just say py_binary. Again, we need to specify a name, flask_app, and the sources that are needed, flask_app.py, and we load the dependencies here. We need to extend requirements.txt with the Flask dependencies and all the child dependencies as well, and just load them using the requirement function again.

Okay, so now we can run the application. This time we use bazel run, which first compiles and then executes the application. So let's do it: bazel run //src:flask_app. Okay, so this is compiling and executing the application. As you can see, it runs, and it's running on localhost. So if we open the browser and navigate to localhost, we can see that, as expected, Bazel is using the Python version 3.8.3 that we compiled from scratch and not the Python 3.8.5 that I have on my host machine.

Are we sure that the build is reproducible? We can do a quick test. We run a build two times and check the output binaries for any differences by comparing the MD5 hashes. Here we compute the hash of the binary that we just built, clean all the build artifacts and dependencies with bazel clean, and then run the build again. The new binary is identical to the old one. So we have a reproducible build, right? Well, actually it's not fully reproducible, and let me show you why. If we go back to the WORKSPACE file, we are trying to build Python inside Bazel to achieve full reproducibility.
However, using http_archive's patch_cmds means that Python is built using the compiler of the host machine that runs the build. The Python interpreter, which is pinned to a precise version, will depend on the machine's GCC and system libraries, which are not pinned or controlled in any way. In other words, the build is not fully reproducible, but there are solutions for that. You can run the Bazel build from a Docker container with a pinned GCC version and then check in the Docker information within your project. This is a common approach in CI systems. Instead of compiling Python from scratch, you can use a precompiled binary executable, check it into source control, and pin it in the build. Or you can use a different approach and use a tool like Nix, which allows hermetically importing external dependencies, like system packages. You can find a link in the presentation.

To summarize the biggest takeaways: from now on, don't take for granted that your build is reproducible, since most probably it is not. Hermeticity enables cherry-picking and can save you from uncomfortable situations. Inputs to the build must be versioned with the source code or you will not have any control over them. Internal randomness can be sneaky, but it must be removed. You now have a working Python environment that is hermetic thanks to Bazel, and that you can reuse for your next Python project. You have seen how to build a Flask application in a reproducible way, and how to manage dependencies hermetically.

At the following link you will find the code I presented today. Feel free to reuse it. Thank you very much. If you want to connect, here you can find my contacts. Reach out and let me know your thoughts. Enjoy the rest of the conference and talk to you soon.
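The quick reproducibility check from the talk, building twice and comparing hashes of the output binary, can be sketched in Python as follows. The helpers `file_digest` and `same_artifact` are hypothetical names. The talk compared MD5 hashes; this sketch defaults to SHA-256 as an assumption, since either algorithm serves for detecting differences between two outputs.

```python
import hashlib


def file_digest(path, algorithm="sha256"):
    """Hash a file in chunks so large binaries do not need to fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def same_artifact(first_path, second_path, algorithm="sha256"):
    """Return True if two build outputs are bit-for-bit identical."""
    return file_digest(first_path, algorithm) == file_digest(second_path, algorithm)
```

As the talk points out, matching digests on a single machine do not prove full reproducibility: the same comparison across machines with different compilers and system libraries is the stronger test.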

Gaspare Vitta

CI/CD Engineer for Autonomous Driving @ FCA Fiat Chrysler Automobiles



