Conf42 Site Reliability Engineering (SRE) 2024 - Online

Building Secure and Flexible Multi-Cloud Images with Multi-Boot Mode: A Comprehensive CI/CD Approach

Abstract

Step into the world of innovation as I unveil our SRE team’s strategy for Cisco’s ‘Resource Connector.’ Witness how we integrate GitHub Actions, Packer, Ansible, Python, Golang, and bash, revolutionizing multi-cloud delivery. Real-world examples highlight our commitment to efficiency and reliability.

Summary

  • Building secure and flexible multi-cloud images with multi-boot mode: a comprehensive CI/CD approach. In this session we will see why we chose this approach, our solution, an overview of the CI, and the reasoning behind it.
  • We chose GitHub Actions because it gives us efficient CI/CD pipelines: developers can easily extend it or add new features, and the whole end-to-end automation lives in one place. Packer builds images very quickly with parallelization. We support different clouds and different boot modes.
  • For each hardened image in each cloud we create a product base image, and for every image that gets built we trigger a deploy, a sanity test, and then a teardown of the machine. We need to guarantee that any change in any stage is fully tested. Once we merge, the image is used for further testing.
  • Three different components: setup, test, and teardown. Why three? Because we want separate components that we can use at different times. In a reusable workflow, each job has steps or groups of steps, and those steps can call composite actions. They are very simple to implement.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Thank you for joining this session. I am pleased to be here, and today we are going to look at building secure and flexible multi-cloud images with multi-boot mode, using a comprehensive CI/CD approach. Before starting I'd like to introduce myself. I am a Brazilian expat living in Poland, with many years of experience in IT, mainly as a developer specialist focused on automation, infrastructure, and cloud. In this session we will see hints of the code, why we chose this approach, our solution, an overview of the CI, and the reasoning behind all of it.

First, a CI/CD overview. CI stands for continuous integration, and CD stands for continuous delivery or continuous deployment, depending on the kind of product you have. CI usually means automated builds, automated testing, early detection of problems, and code quality checks. Continuous delivery means automated deployment with the goal of reducing time to market: you build and promote the result, for example to a marketplace, which helps you deliver faster. Continuous deployment means you commit, push, build, and deploy to production services, so you get an efficient feedback loop. In our approach we only need CI, and our CD is delivery, because with Resource Connector we are not deploying anything to the customer.

What do users expect? They expect a reliable and available system with frequent updates, where bugs are easy and fast to fix; a consistent user experience; security and privacy; and transparency, because you can see the logs and what is happening in the jobs, plus nice features on top. That is what users usually expect from a CI/CD system.

Our context is the Resource Connector product. Resource Connector is a piece of the zero trust network access suite: Cisco has this product to help customers guarantee zero trust in their network, securing access to internal websites, systems, or any endpoint. The Resource Connector works as a kind of tunnel or proxy/VPN and is deployed in the same subnet as the private app the customer wants to expose to users on their smartphones or computers. Access does not go directly to the app but through the zero trust suite, which then uses the Resource Connector to reach it.

Because customers can be in different clouds and different data centers, we want to support different clouds: Azure, Google, AWS, VMware, OpenStack, and so on. Wherever the customer is, we want to be there, because that is where their interest is; that is why we want a multi-cloud product. And because the clouds may support TPM and secure boot, we want those security features available in each cloud, which is why we need different boot modes. Internally we also need to disable secure boot for troubleshooting and debugging. So we need to build different boot modes, and we have a strategy of building in different stages.
We start with the hardened image: we take Ubuntu from the supplier, Canonical, and harden it, installing and configuring the OS with security checks, mandatory regulations, and everything necessary to create a secure, hardened OS. We use that OS in the product and also in our own internal builds, runners, internal services, anything we need. Then, from the hardened image, we create a baseline product image, where we set up partitions, run OS upgrades, update some software internally, and pick up the latest versions with CVEs fixed; that is the base image. Then we use the base image as the input to create the boot-mode images: we install Cisco proprietary code, such as AnyConnect and the other software that makes it the actual product, and we enable, disable, or select the boot modes for that virtual machine.

We also had a code overlap problem. We don't want to copy and paste code all over the place; we want to share it. For example, the code to build the product is similar across clouds, so instead of one file for Azure, one for VMware, one for AWS, and so on, we have one build script that does similar work across all the clouds, which is easier to maintain. On the other hand it is tricky, because if you change one line you may break something you are not even thinking about: you make a change for AWS and only later find out you broke Azure. So we need to double-check every cloud we support, and every time we add a new boot mode or a new cloud the pain increases. The more we support, the more painful it becomes for our developers.

Because of that, we came up with an approach that puts the different stages in different repos. Every stage is a repo: the hardened image is a repo, the base image is a repo, and the product image is another repo. We use GitHub Actions because we are already on GitHub, and we use the actions as functions: with arguments and values we have a contract, where I say, for a build, these are the inputs it receives and these are the outputs I expect. What it does internally I don't care about, because that concern is self-contained inside the component. We use Packer to build the images, since its infrastructure-as-code approach is nice for building across different cloud providers, and we use Terraform to deploy, because we need to build and test: we deploy the machine and everything around it, the environment necessary to test the VM. We use Ansible both to run the command lines that build the image and to test that things are working properly inside the machines. And we need a regression test strategy: every time we change a critical part of the infra code, in any stage, we run a full regression test. If we change one line of a script shared in the hardened image, we test the hardened build, then the base build, then the product build, then the deployment and everything else. If everything is fine, then we can say the build is safe and we can merge that code without any problem. That is how we structure the project.
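To make the "GitHub Actions as functions" idea concrete, here is a minimal sketch of what such a build contract can look like as a reusable workflow. The file name, input names, and build script are hypothetical, not the actual Resource Connector code.

```yaml
# Hypothetical reusable workflow; names and the build script are illustrative.
name: Reusable image build

on:
  workflow_call:
    inputs:
      cloud:
        description: Target cloud (aws, azure, gcp, vmware, openstack)
        required: true
        type: string
      boot_mode:
        description: Boot mode to enable (bios, secure-boot, tpm)
        required: true
        type: string
      source_image:
        description: Image produced by the previous stage
        required: true
        type: string
    outputs:
      image_id:
        description: Identifier of the image that was built
        value: ${{ jobs.build.outputs.image_id }}

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image_id: ${{ steps.packer.outputs.image_id }}
    steps:
      - uses: actions/checkout@v4
      - id: packer
        # The script is expected to write "image_id=..." to $GITHUB_OUTPUT.
        run: ./scripts/build.sh "${{ inputs.cloud }}" "${{ inputs.boot_mode }}" "${{ inputs.source_image }}"
```

The caller only sees this contract; whatever the build script does internally stays inside the component.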
That gives us four different repos, one for each stage: the hardened image, the product base image, the product itself, and a tester repo, where we keep sanity tests, integration tests, performance tests, vulnerability scans, anything that belongs to the different stages of testing. We call the tester repo from the other build repos in a cross-pipeline strategy.

The build suite has a similar folder structure across all the repos, so it is easier to maintain. When a developer reads a repo, the code will look familiar: the details differ a bit because the context differs, building a hardened image is different from building the product, but the structure is similar, so if you know how to maintain one repo you know how to maintain the others. We use reusable workflows from GitHub, and we use composite actions, which are, in high-level terms, components or functions that take inputs, return outputs, and keep the logic self-contained for each cloud. If a cloud needs some secret or some specific configuration, we don't care, because it is self-contained: we just pass what we need, like build this version with these parameters and give me an image with this name, and we get the result back through the reusable workflow. Ansible files are used to build: we have playbooks with roles and tasks, which give us the freedom to install and configure the VM as we want, and with those Ansible tasks it is very easy to provision. Packer scripts build the cloud images, with multi-cloud support, and infrastructure as code is the stack we use throughout. The dependency file is how we manage pinned versions: we pin versions across all the repos, so when we change something it is committed and reviewed in the PR code review, we know exactly what changed, and we can trace back by the commit hash when a change was made and how a bug was introduced. That way we can find the bug and fix it very fast.

The test suite has a structure similar to the build suite, but instead of Packer we use Terraform: we use it to deploy the VM and the environment around it, any resource that is necessary in that cloud. We also use common Ansible files, but in this case not to provision the VM; we use them to run tests. Each Ansible role is one type of test. For example, let's assume our product needs to create some files and folders, and we need to guarantee that those files and folders are in place inside the machine, because they are critical for the running software. One type of test could open an SSH session to the machine, check whether the files and folders exist, and check whether each file has the format and content expected for it to execute successfully. If everything is fine, the Ansible tasks run with success. If something is wrong, we ignore it (we set ignore errors), and later, when everything has executed, we summarize the successful, failed, and ignored tasks. If there are errors, that means we ran all the tests but we mark the run as a failure, because we should not really ignore anything: it means we have a failing case. That is how we use this strategy to build and test.
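A minimal sketch of that "ignore the errors, then summarize and fail" pattern in Ansible; the paths and the content marker are made up for illustration.

```yaml
# Hypothetical test tasks: run every check, ignore individual failures,
# then summarize and fail the play at the end if anything went wrong.
- name: Check that required product files exist with the expected content
  ansible.builtin.command: grep -q "expected-marker" {{ item }}
  loop:
    - /opt/product/config.yaml
    - /opt/product/service.env
  register: content_checks
  changed_when: false
  ignore_errors: true

- name: Summarize the results and fail if any check was ignored
  ansible.builtin.fail:
    msg: "Failed checks: {{ content_checks.results | selectattr('failed') | map(attribute='item') | list }}"
  when: content_checks.results | selectattr('failed') | list | length > 0
```

This way every check runs and the report shows all failures at once, instead of stopping at the first broken assertion.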
But why this stack? I have already given some hints, but let's go deeper and look at some of the nice features, starting with GitHub. Because we use GitHub to host our repos, it was very natural to use GitHub Actions: it has seamless integration, so you don't need an external CI/CD system like Jenkins or the other tools commonly used in the community. We chose GitHub Actions because it gives us efficient CI/CD pipelines. You can customize it, because it is written in YAML and is human readable, so developers can easily extend it or create new features. You can build the whole end-to-end automation: build, test, and deploy everything in one place, instead of building in one place, testing in another, and deploying in a third. That is very handy and covers many aspects of the software development lifecycle. That is why we chose GitHub Actions.

And why Terraform? We chose Terraform because it is infrastructure as code: we can define the infrastructure in a very consistent way, so if we need to change something in what we deploy, we can always converge on the desired state in that cloud. We need to support on-prem as well as public clouds, and Terraform supports that; you can even create your own provider if your environment is not officially supported by the community or by HashiCorp. It has resource graphs, which means resources are created in the order that makes sense and makes your environment work: if you need to deploy the VM and then, after the VM, run some scripts that call some APIs, you can define that order with depends_on properties, run one thing after another, and guarantee that everything works as it should. You can reuse components: if you need to register something in our builds for every cloud, you can create a register module and reuse it across the different clouds, which reduces copy-pasted scripts. And you have a state management file. Because we build in one job and deploy in another, and the deployment can run for a few minutes, a few hours, or we may even leave the machine up for a few days for longevity tests, we want the build, the test, and the teardown in different, independent jobs. But when we need to destroy, we need to know what we are destroying, so we need a state to know what to destroy. That is why we use Terraform: we just pass the state file and say, please tear this down, and it has everything it needs to know about what was deployed in AWS, in Azure, or in any other cloud. And let's assume the deployment did not succeed: something was deployed and something failed to apply, so you need a cleanup. Because Terraform keeps a state of what was and was not deployed, you can run a cleanup that destroys everything that was created, and you end up with a clean lab, leaving nothing behind, no trash, nothing that should have been deleted. Terraform also has many integrations with the ecosystem.
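As a rough sketch of how the Terraform state can travel between the independent deploy and teardown jobs, here is one way to do it with a workflow artifact; whether the team passes the state as an artifact or uses a remote backend is an assumption on my part, and the paths are invented.

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
        working-directory: terraform/aws
      # Keep the state so a later, independent job knows what to destroy.
      - uses: actions/upload-artifact@v4
        with:
          name: tfstate
          path: terraform/aws/terraform.tfstate

  test:
    needs: deploy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes Ansible is available on the runner; playbook path is hypothetical.
      - run: ansible-playbook tests/sanity.yml

  teardown:
    needs: test
    if: always()   # clean up even when the tests fail
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: actions/download-artifact@v4
        with:
          name: tfstate
          path: terraform/aws
      - run: terraform init && terraform destroy -auto-approve
        working-directory: terraform/aws
```

The `if: always()` on the teardown job is what keeps the lab clean even when a test or partial deployment fails.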
So why did we choose Packer? Packer builds images very quickly and with parallelization. We need the builds to be fast, and we want to build different clouds and different types of boot in parallel. We also want to support different platforms, as we already saw: we want to be where the customer is, so we need to build for different platforms. Packer has seamless integration, so we can plug it into all our tools; it supports the security we want, such as building from trusted certificates; and it has many plugins, widely supported by the community, which is also a nice feature. Like Terraform, it is infrastructure as code, and that matters for building images: for the product we have a mandatory requirement that the same inputs, the same files, must generate the same output. That is why we use pinned versions and infrastructure as code, because we can guarantee that the same commit generates the same result over and over. That gives us trust for an audit or for anything where we need to prove that this binary, this image, corresponds to this code: if you rebuild, you get the same output.

And Ansible, why do we use Ansible? Because it is agentless: we don't need to install anything in the products to build or test, we just work over SSH. It is very simple, we can run commands as if we were doing SSH on the command line manually, but automated and programmatic, using YAML files. Like GitHub Actions it is human readable, so it is easy to understand and easy to change, the documentation is good, and it has many plugins: if you want to use third-party plugins, for example to talk to AWS through a module instead of calling the AWS CLI yourself, there is a plugin for that. The automation is predictable and idempotent, the community is very wide, and there are many collections you can install from the Galaxy repository.

Now let's talk about the workflows. Here we have the build pipeline workflow: when we change the code we build the hardened image, then we use it as the input to create the base product image, and the base is used to create the product image, with or without secure boot in this example. Now imagine that we need to support Azure and more and more clouds, which is our case, and that for each one we need to run tests: we need to deploy the infrastructure, run the tests, and tear the infrastructure down again. So this is how the full pipeline looks: there is one orchestrator that knows what to trigger. It builds the hardened image; then, for each hardened image in each cloud, we create a product base image, and then the product image with the proper boot mode; and for each product image that was built we trigger a deploy, a sanity test, and then a teardown of the machine.
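A simplified sketch of how such a chain can be wired with reusable workflows and `needs`, each stage consuming the previous stage's image. The workflow file names, inputs, and outputs are invented, and in the real setup each stage lives in its own repo, so the `uses:` references would point at `owner/repo/.github/workflows/...@ref` instead of local paths.

```yaml
jobs:
  build-harden:
    uses: ./.github/workflows/build-harden.yml

  build-base:
    needs: build-harden
    uses: ./.github/workflows/build-base.yml
    with:
      source_image: ${{ needs.build-harden.outputs.image_id }}

  build-product-secure-boot:
    needs: build-base
    uses: ./.github/workflows/build-product.yml
    with:
      source_image: ${{ needs.build-base.outputs.image_id }}
      boot_mode: secure-boot

  build-product-bios:
    needs: build-base
    uses: ./.github/workflows/build-product.yml
    with:
      source_image: ${{ needs.build-base.outputs.image_id }}
      boot_mode: bios

  # Deploy, sanity test, and teardown jobs would hang off each product job
  # in the same way; omitted here for brevity.
```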
So how can we guarantee regression tests in this scenario? Let me give you an example: a user changes a shared Ansible task in the hardened suite via a PR, adding a new security check or security configuration to the hardened image. This may affect the whole chain, because adding something that changes the OS may affect the base image, which may affect the product, which affects the product and its tests. We need to guarantee that any change in any stage is fully tested. In this case we build a temporary hardened image; with that temporary hardened image we create a temporary base product; we use the result to create a temporary product for each affected boot mode; then we deploy that temporary product, run the tests, and tear everything down. If everything passes on all the clouds and boot modes affected by the change, then we are fine to merge, and we merge.

Once it is merged, we build the final hardened image and a PR is automatically created to bring that latest hardened build into the base OS repo. The same thing applies there: because there is a new version, you need to check that everything still works, and in the base image not everything is pinned. We run OS upgrades, we update Python, we run Ubuntu updates and things like that, which cannot be pinned unless we have a cache. So every time we build a final hardened image, we create a PR, the PR runs the regression tests to guarantee that everything works, and when the regression tests pass we merge. Then we create the final base image and, same thing, we create a PR to the product repo and guarantee that everything is fine there. But because in the product image we do not run any OS upgrades, and everything is pinned, the base image, the Cisco proprietary code, and all the versions, so that we can reproduce exactly the same build, we do not need to run regression tests in that PR: when we create the PR with the new versions, we just create a build. We will see a bit more on that later.

About promotion testing: we create a PR that changes a version, the version of the base OS or of AnyConnect or of any other Cisco proprietary code, since we have different repos producing binaries and versions. When that file changes, we check whether the change affects all clouds and all boot modes, or only one specific boot mode in one specific cloud. Based on the permutation of the changes we made, the pipeline runs a script that defines which pipelines to execute: all pipelines for all clouds, or only one boot mode and one cloud, or a subset of clouds, and so on. This creates real product images; with those images we deploy, run the tests, and tear down, so we guarantee it is fine, it is working. Once we merge, the image created in the PR is used for further testing: we scan for vulnerabilities, CVEs, and other issues, we run performance and longevity tests, and then we put it in staging for a few days, where we run all kinds of manual tests and other checks, including provider scans and vulnerability scans, to guarantee everything is fine. And then, when we are ready, we are ready for GA. That does not necessarily mean we are going to deliver it; it just means it is ready. If we want to deliver a new version, it is ready: we only have to write the change log for the customer, saying what we changed and why, and then it can go to the marketplaces across all the clouds. That is how we go end to end, from one image to the marketplace.
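Going back to the script that decides which pipelines a change affects, one way to express that step is a small job that diffs the pinned-versions file against main and exposes the result as an output for the downstream pipeline jobs to gate on; the file name, the helper script, and the job names are assumptions, not the team's actual code.

```yaml
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      affected: ${{ steps.diff.outputs.affected }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # needed so origin/main is available for the diff
      - id: diff
        run: |
          # Compare the pinned-versions file against main and let a
          # (hypothetical) helper decide which cloud/boot-mode pipelines to run.
          git diff origin/main -- versions.yaml > versions.diff
          echo "affected=$(./scripts/affected-pipelines.sh versions.diff)" >> "$GITHUB_OUTPUT"

  pipeline-aws-secure-boot:
    needs: detect-changes
    if: contains(needs.detect-changes.outputs.affected, 'aws-secure-boot')
    uses: ./.github/workflows/build-product.yml
    with:
      cloud: aws
      boot_mode: secure-boot
```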
Now let's talk about dev builds. We also need to guarantee productivity and make developers' lives easier, so they don't have a pile of toil just to get a build or a machine. In one case, for example, a user wants to pick some permutation of real or dev versions of everything and create an image using that as the input. The developer can then run tests or experiments on code that is not yet mature enough to be merged, but they still need an artifact they can test, deploy, troubleshoot, or analyze on the dev side. Another case is when you have an image, either one built in the previous step or a real image, and you want to run manual tests. For example, you are changing the Ansible tests, adding new cases to get more coverage. If you just commit and push, that triggers the regression tests for all clouds, which is not very effective: if you have a bug you have to push over and over, environments are created and destroyed each time, the feedback loop is slow, and it becomes a time-consuming task. A better way is to use a deployment: you choose exactly which version you want to deploy, we create the environment, and you get ephemeral SSH access for the time you decide. I need a one-hour deployment, I need one day; the developer decides how much time they need, and after that it is torn down automatically. During that access you can troubleshoot, or you can run an Ansible playbook: you change the playbook locally, you don't need to push or commit, you just run it and see the results. OK, it works, now I can push; no, it still does not work, I have a typo, I have a bug, so you fix the code and run the playbook again. It is a very fast feedback loop, and these features are very handy for the developers using our CI.

About the code samples, I have some screenshots here of how we organize the repos. In the first column is the build repo. This is how we structure it: the .github folder with the actions and the workflows, the Ansible code, and the Packer code with a Packer script per boot mode, because for secure boot we need to code it one way, for BIOS another way, and for TPM another way. It is just a sample of how we build for the different cloud providers using Packer. For testing we have a similar structure, as you can see: actions, and inside the actions we have the Ansible test, the setup, and the teardown. Why three? Because we want separate components that we can use at different times: I want to set up at one moment, run the Ansible tests at another, and destroy at yet another. That is why there are three different components.
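Returning to the on-demand debug deployments described a moment ago, a manual trigger for them might look roughly like this; the input names, the options, and the reusable deploy workflow are illustrative, and the automatic teardown side is sketched later on.

```yaml
name: Debug VM deployment

on:
  workflow_dispatch:
    inputs:
      cloud:
        description: Cloud to deploy into
        type: choice
        options: [aws, azure, gcp, vmware]
      boot_mode:
        description: Boot mode of the image
        type: choice
        options: [bios, secure-boot]
      image_version:
        description: Image version to deploy
        type: string
        required: true
      access_hours:
        description: How long to keep SSH access before teardown
        type: choice
        options: ["1", "8", "24"]
        default: "1"

jobs:
  deploy:
    uses: ./.github/workflows/deploy.yml   # hypothetical reusable deploy workflow
    with:
      cloud: ${{ inputs.cloud }}
      boot_mode: ${{ inputs.boot_mode }}
      image_version: ${{ inputs.image_version }}
      access_hours: ${{ inputs.access_hours }}
```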
And why is the testing split per cloud? Because when we need to deploy to AWS we need to know exactly the specifics: the subnet, the VPC, and the other AWS-specific settings for access and everything else. For Azure we need to know the subscription, the resource group, and other details; for VMware we need to know where the lab is and what the vSphere configuration is. All of that is self-contained, a concern that stays inside the component, so we have these self-contained actions. Then we have the workflows, which call the composite actions based on what we are building. And, as you can see, we also have a callback plugin: this is how we do the testing I mentioned before, where we collect all the states from Ansible, check what passed and what did not, and then break the pipeline or not based on how many tests passed.

A bit about composite actions, for those who do not know them: a composite action looks more or less like this. It has a name, a description, inputs, and outputs. The inputs are what you pass when you call it, and the outputs are what you expect back after the call; `runs` with `using: composite` is the syntax. If you want to know more about the syntax, I recommend the GitHub documentation: it is well documented, very simple to implement, has some very nice examples, and you can try it out yourself on GitHub. In the second column we have a step. The composite action is a grouping of steps, and a step is basically one group of commands that you run inside your job. In this case, with Packer, we are initializing Packer, validating, and then building the machine for AWS; that is how we build an AMI. Instead of a fixed AMI it could be the secure boot variant we are saving: we receive the boot mode as an input, so we can make it customizable rather than hard-coded, building from the base or seed image, which is why I chose this example.

Then the reusable workflow. A workflow is basically a script where you have jobs, and each job has steps or groups of steps; those steps can be many composite actions, or you can specify the steps directly in the job, and you can have many jobs in the same workflow. In this case we have a reusable build pipeline with one job, build, which forms the contract to call the composite action. As you can see in the last line of the second column, we say: this is the platform, this is the tag, the version of the code we want to use, and here are my inputs, including the version. With those inputs the composite action knows how to handle the information and build a base image from the seed and the version we want, using the version we pass as the suffix.
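Pulling those pieces together, a composite action of the kind described here, with inputs, outputs, and the packer init/validate/build steps, looks roughly like this; the template path, variable names, and output handling are placeholders rather than the actual code.

```yaml
# action.yml of a hypothetical "build AWS image" composite action
name: Build AWS image
description: Build a product image on AWS with Packer

inputs:
  boot_mode:
    description: Boot mode to enable in the image
    required: true
  source_image:
    description: Image from the previous stage to build on top of
    required: true

outputs:
  image_id:
    description: AMI produced by the build
    value: ${{ steps.build.outputs.image_id }}

runs:
  using: composite
  steps:
    - name: Initialize and validate the Packer template
      shell: bash
      run: |
        packer init packer/aws
        packer validate -var "boot_mode=${{ inputs.boot_mode }}" packer/aws
    - name: Build the image
      id: build
      shell: bash
      run: |
        packer build -var "boot_mode=${{ inputs.boot_mode }}" \
                     -var "source_image=${{ inputs.source_image }}" packer/aws
        # The build is expected to publish the resulting AMI id, e.g.:
        # echo "image_id=ami-..." >> "$GITHUB_OUTPUT"
```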
And how do we call this reusable workflow? We have another workflow that is executed on PR changes and has different jobs. One job generates the pipelines: we compare the changed spec file with the one on main, take the diff, and based on that we create a matrix of pipelines; based on that matrix we call the jobs. So the workflow has jobs like pipeline AWS BIOS, pipeline AWS secure boot, and so on. We could use a matrix to generate this automatically, but today that does not work well with chains of pipelines, a build that calls another build: in the pipeline graph it becomes one single block, you cannot see the flow, and it is hard to follow. So although it could be done automatically, we decided to hard-code this block, this small piece, to guarantee that we can see everything in the graph, and we will see that in another screenshot.

In this slide you can see a job that generates the pipelines. In this case we are generating everything: the base image for AWS, for Azure, and for VMware. The base image for AWS then generates the BIOS and secure boot variants, and each one triggers a setup of the environment, then the tests, then the teardown. Azure in this simple example only supports BIOS, and VMware only BIOS as well, just to show that we can have a different number of boot modes per cloud, and each one, after its base image is built, triggers the following stage. In the debug VM pipeline we have an example of how a developer can set up a machine and run it for one hour: because we use the environment rules from GitHub, it automatically waits for one hour and then triggers the teardown. So you have a one-hour window to do whatever you want inside that machine; after that you lose access and the machine is destroyed, together with everything that was deployed by the first job.

Some crucial takeaways. We learned that with this setup the developers do not need to deploy the ephemeral infrastructure themselves: for integration tests it is done by the pipeline, and if they need it for manual or dev reasons they can deploy just by passing some parameters, and it is applied automatically. The worst-case scenario we identified is three hours: if we change the hardened image, building and testing everything takes three hours. So if a CVE or a security issue is found in the hardened image, we can create a PR, merge it, and within three hours have everything tested, working, and ready to be pushed to the marketplace if we need to. We can easily add more providers without affecting the build and testing time, because they run in parallel and do not depend on each other. Because we use infrastructure-as-code principles along with pinned versions, we can find bugs faster: if we bump one version and something starts to break for some reason, we know it was working with the old version and not with the new one, so we can identify the cause of the break and fix the pipeline. And the differences between the cloud providers are self-contained, so they do not affect the rest of the pipeline or anything outside their components.
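As a footnote on that debug VM pipeline, the automatic teardown after the access window can be expressed with a GitHub environment whose protection rule is a wait timer. A rough sketch, where the environment name is invented and the timer itself is configured in the repository settings rather than in the YAML; the Terraform state handoff is omitted here (see the earlier sketch).

```yaml
jobs:
  deploy-debug-vm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
        working-directory: terraform/aws

  teardown:
    needs: deploy-debug-vm
    # "debug-teardown" is a GitHub environment whose protection rule defines a
    # wait timer (for example 60 minutes); this job only starts after the timer.
    environment: debug-teardown
    if: always()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform destroy -auto-approve
        working-directory: terraform/aws
```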
Here I have placed some links where you can learn how to use this stack. I think this stack is pretty nice, and these are very good getting-started resources, so if you want to build a similar implementation in your own pipelines, these links are a good place to start. And here are my social networks: you can find me on LinkedIn and send me a DM, and I will be happy to answer any questions you have about this talk, or go into more detail about the implementation or any other subject. If you want to follow me on GitHub, here is my profile as well. I appreciate you taking the time to listen to my presentation. I hope you have enjoyed this session, and have a good conference.
...

Ederson Brilhante

SRE Technical Leader @ Cisco

Ederson Brilhante's LinkedIn account


