Conf42 Incident Management 2022 - Online

How to build low-cost CI/CD solution on top of AWS

Abstract

A typical start-up builds its initial infrastructure quick and dirty to become relevant and grow fast. That's awesome, but the bill comes later, in the form of tools set up without best practices that consume a lot of time and money to manage. In this session we will show how to switch from one huge single-node Jenkins server to a high-performance Jenkins fleet based on spot-instance agents.

Summary

  • How to build a low-cost CI/CD solution on top of AWS. The talk describes my own story and my own experience. You can take it from here to wherever you want.
  • Valera Bronshtein is Head of DevOps at Memphis.dev. Memphis.dev is an open source real-time data processing platform. We are building a full ecosystem for in-app streaming use cases.
  • A startup needs to run fast and do as much as possible in a small amount of time. With only one Jenkins instance, we cannot run builds in parallel. We need monitoring, and we need to monitor everything, because this Jenkins instance is production for us.
  • First of all, I want to be highly available. I want to run in parallel. I want faster builds. I also want to reduce time to market. Eventually we achieved this by running a dedicated compute node per pipeline type, per pipeline logic.
  • We use auto scaling groups, launch templates and EC2 instances instead of only one Jenkins. On a day-to-day basis I have zero instances up, and you can see all the additional value that you get from this process.
  • The first fleet is for my big workloads, for my massive pipelines. The second is for the cron jobs and the backup processing. What I care about is the difference in cost between those two instance types.
  • The big EC2 pipeline runs on an existing node, and the small one triggers a new machine, a new node in my auto scaling group. While we wait for it to start, I show how it's configured. If you have any questions, feel free to contact me by email.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm really happy to be here, really happy to talk to you all. Let's enjoy this session and this conference, and let's start. In this session I would like to talk with you about the following topic: how to build a low-cost CI/CD solution on top of AWS. This talk will describe my own story and my own experience, and you can take it from here to wherever you want. You will see what I achieved here and what my issues were along the journey.
Before we start, I would like to introduce myself. My name is Valera Bronshtein. I am Head of DevOps at Memphis.dev. Memphis.dev is an open source real-time data processing platform. In short, we are building a full ecosystem for in-app streaming use cases. You can read more on our website. I am 31 years old, married with two wonderful kids, with more than ten years in the IT world. Through my career I started as a Linux administrator, then storage administrator, consultant, senior consultant, solution architect, and where I am today, Head of DevOps at Memphis.dev.
Okay, enough about myself, let's start with the story. This is my Jenkins story, and I will try to tell it in the best way I can. I hope it will be clear. So how did it start? We are a startup, and when we started we wanted to do things quick and dirty. That's the exact explanation: quick. We need to run fast and do as much as possible in a small amount of time. So at some point we looked for an automation tool, some CI/CD tool, and given our specific requirements we understood that Jenkins could be the best quick-and-dirty solution. So we wanted Jenkins, we installed it, and we started to use it without any deep dive and without any deep configuration. Just as it is, let's use it. And at first it worked well. But along our journey we understood that the initial setup was not working so well: the builds were taking much more time, and the disk was getting full too quickly.
So we started to upgrade our Jenkins instance. And then we came to what I call the Jenkins ninja: a really, really powerful instance with a lot of CPU and a lot of RAM that can handle a lot of heavy workloads at the same time. But over time we ran into new issues, new obstacles that we needed to solve. So here is what we found, what was bothering me as a DevOps engineer in the company, and what I wanted to achieve in the future.
The first issue is of course the single point of failure. You can understand it by yourself. I have only one machine, only one instance, and it can be as powerful as I want, but still, it's one. When something unexpected happens during the build of one of the pipelines, it affects all the rest. That's okay when it happens in day-to-day work, but it's less okay when it happens in the middle of a release, when you are pushing your commits to production, to the customers. We don't want this scenario to happen. We don't want this single point of failure. That's the point.
And when we have only one Jenkins instance, we cannot run builds in parallel. No parallelism. It's obvious: if one build takes all our resources, there will be no resources for other builds. And you can apply it to any scenario you want. Say one build runs a MongoDB that takes a port; another build that uses another MongoDB instance cannot run on the same node, because the port is already in use, and so on.
As you can understand, the first two points already lead us to the third one. We need monitoring, we need to monitor everything, and we need to keep our eyes on the product every day, every single moment, because this Jenkins instance is production for us.
For me, as a DevOps engineer in the company, I need to have it up and running all the time, and I don't want to do this kind of babysitting on a daily basis. I want to be sure that my Jenkins is running all the time, no matter what. And of course, we are growing, and the bill is growing with us.
So now that we understand what the problems are, let's see how we can solve them and what we want to achieve. First of all, I want to be highly available. I want to be HA all the time, no matter what. If one of the pipelines is killing my node, I want to kill this node and run the pipeline on another node without any regrets, without wondering whether I have something on that particular node that I need for my next build. I don't care. I want to destroy it and create another one. This is a change in my perspective.
In this specific scenario, I want to run in parallel, I want faster builds, and I want to reduce time to market. I want to run as much in parallel as my setup can afford. If I need to run four builds, I want to run four builds at the same time. I don't want to wait until each one of them ends and then run the second one, the third one, the fourth one. I don't want that. I want them in parallel.
Another goal that I wanted to achieve, and eventually we achieved it: I want to run a dedicated compute node per pipeline type, per pipeline logic. I'll explain. Imagine I have one pipeline that performs all the massive workloads, with a lot of throughput, a lot of CPU and RAM usage, and so on. On the other hand, on a daily basis I have some cron jobs that push data to databases or take backups of my GitHub repositories. I don't need the high-performance machine from the first pipeline to be involved in the second pipeline. The second pipeline can run on something from the free tier, maybe a t2.micro, or a t3.medium, or something like that.
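The idea of a dedicated node class per pipeline type can be pictured as a simple lookup. The sketch below is only an illustration, not the configuration used in the talk: the pipeline names and agent labels are hypothetical, and in Jenkins the same effect would normally come from giving each pipeline an agent label that matches a fleet.

```python
# Hypothetical agent labels, one per class of compute node.
FLEET_LABEL_BY_TYPE = {
    "heavy": "jenkins-fleet-big",    # massive builds: lots of CPU and RAM
    "light": "jenkins-fleet-small",  # cron jobs, backups, small chores
}

# Hypothetical pipelines and the class of workload each one represents.
PIPELINE_TYPE = {
    "release-build": "heavy",
    "nightly-backup": "light",
}


def fleet_label_for(pipeline_name: str) -> str:
    """Return the agent label that a pipeline should request.

    Unknown pipelines fall back to the cheap class, so a new job never
    starts an expensive machine by accident.
    """
    pipeline_type = PIPELINE_TYPE.get(pipeline_name, "light")
    return FLEET_LABEL_BY_TYPE[pipeline_type]


print(fleet_label_for("release-build"))
print(fleet_label_for("nightly-backup"))
```

Defaulting unknown pipelines to the small class is a design choice made for this sketch; the talk does not say how unclassified jobs are handled.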
So I want some kind of logic that will see which pipeline I run and then trigger the proper instance for that pipeline, and you will see how we achieved this.
And of course, the bill. Let's talk a little more about the bill. You can make all the calculations I did here by yourself. There is the AWS Pricing Calculator, and you can use it and see the numbers for yourself. But let's take my scenario, scenario number one, before the optimization. My instance type eventually got to 32 vCPUs and 64 GB of RAM, and the next step was already 64 vCPUs. I needed this machine because my pipelines took all the CPU and all the RAM of the server on each run, on each build. This machine was on-demand, and I will explain why. The Jenkins ninja ran on it, and it's monthly usage. As you can see, this is my estimated cost at that time, before jumping to the next level, to the next instance type with 64 vCPUs. I didn't want that. So this is why we started looking for another solution.
You can ask: why not spot instances? Okay, these are the numbers for spot instances. But you know Murphy's law: when you need it, it will happen. I mean, when you are in the middle of a release, something will happen and your machine will be destroyed, because for some reason that spot instance is needed by someone else. It will happen in the middle of your release. Believe me, this is how Murphy's law works. I was there. You don't want to be there. So this is my scenario number one, before the optimization, and these are the numbers.
Okay, let's go on. Now that we have some background and we understand what the issues are and what goals we want to achieve, let's talk about the solution itself. Before we dig into the architecture, I want to show you a diagram of, let's say it like this, how I changed the Jenkins ninja into the Jenkins ninjas.
So now I have only one instance, the Jenkins ninja, that coordinates all the other agents, all the other ninjas. The master Jenkins tells them what to do and which pipeline to run at each and every minute. Let's see how to get there and how it works.
Imagine you are starting your day. You log into your Jenkins UI, choose the pipeline you need to run, and click Build Now. In a regular scenario, Build Now triggers the pipeline and it starts running on the same instance. My scenario works like this: Build Now triggers the relevant fleet group. I will show you later in a short demo how it looks and what I mean, but in a few words: I have a fleet for every group of pipelines that I want to separate from the others. And you will see right now how it works.
The relevant fleet group triggers the EC2 Fleet plugin. The EC2 Fleet plugin knows how to connect to AWS and how to run the auto scaling group. There I have a number of auto scaling groups, and each one that is triggered runs the relevant launch template. In this template we can configure the instance type, networking, security groups, and so on.
But there is something important to mention: you don't need to choose one specific instance type. This is the beauty of this solution. In the auto scaling group you can configure a group of instance types that suit you, that can perform the workload you need, and the auto scaling group will choose among them automatically if one of them is not available at that specific time. Since we are using spot instances, it can happen that some instance type is not available at the specific time you need it. The auto scaling group will choose another one and you will not feel it. Your pipeline will start. So we choose the launch template, we choose the instance type, and we start the user data scripts.
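To make the "group of instance types" idea concrete, here is a minimal sketch that builds the request for such an auto scaling group as a plain dictionary. The keys follow the parameter names of the AWS Auto Scaling `CreateAutoScalingGroup` API (as exposed, for example, by boto3's `create_auto_scaling_group`); the helper name, group name, launch template name, subnet IDs, instance types and allocation strategy are all assumptions made for illustration, not values from the talk.

```python
def spot_asg_request(name, launch_template, instance_types, subnet_ids, max_size):
    """Build CreateAutoScalingGroup parameters for an all-spot fleet.

    The group idles at zero instances and may pick any of the listed
    instance types when a pipeline asks for capacity.
    """
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,            # nothing runs while no pipeline needs it
        "DesiredCapacity": 0,
        "MaxSize": max_size,
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template,
                    "Version": "$Latest",
                },
                # One override per instance type the workload can live with.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 0,  # spot only
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


# Placeholder values; replace with your own names, subnets and types.
request = spot_asg_request(
    name="jenkins-fleet-big",
    launch_template="jenkins-agent-big",
    instance_types=["c6i.8xlarge", "c5a.8xlarge", "c6a.8xlarge"],
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    max_size=5,
)
print(request["MixedInstancesPolicy"]["LaunchTemplate"]["Overrides"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`; the same settings can also be entered in the AWS console.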
The user data scripts are part of the launch template. The user data scripts are, as simple as that, the prerequisites for our build. If during my build I need libraries for Node.js, or I need to install some specific version of Java, I do all of these prerequisites in the user data script. So when the EC2 node comes up, it already has all the prerequisites I want to be there, and the pipeline can start immediately.
After we finish the user data script, with all the prerequisites, we raise a flag that the status is OK. And our EC2 instances, sorry, the Jenkins agents, are up and running. When our master, our coordinator, sees that this flag is raised, it can start the pipeline. So we take a somewhat longer path to get the pipeline started, but you can see all the additional value that you get from this process.
So you will ask: okay, now we are using auto scaling groups, launch templates and EC2 instances instead of only one Jenkins, so what are the numbers? Where is the bill? What we have now is scenario number two, after the optimization. In this particular example, I use the same instance type as I used before, 32 vCPUs and 64 GB of RAM. But this time it's a spot instance launched by the ASG, the auto scaling group. And the interesting part is that on a day-to-day basis I have zero instances up. This is important. Before this optimization we had one fat Jenkins, a huge machine with a lot of power running all the time, 24/7, all month, all year. Now we have zero instances up day to day, and you will see how it is reflected in the numbers.
I'm making some assumptions here for the calculations, but they are from the real world, as you can understand. Let's say I have four peaks in a month. I have a release build every week, and I have some massive workloads at that time.
Let's say every peak like this uses all of the instances in this auto scaling group. For this example, I chose five spot instances. Let's make this assumption for a second. In my real world I use only two or three, maybe, and not for 4 hours, but I'm taking you to the limit here. So each instance has 4 hours of intensive workload during the peak. You can see this is the estimated cost. You can multiply it by the number of instances, but it's still much lower than the previous one, than the $1,000 for one Jenkins machine.
And yes, that's a lot. Imagine yourself as a growing company, where every month you multiply the workload on your pipelines and builds. The $1,000 that we started with can, after six months or a year, be multiplied by two or three or five or ten. And then you understand why this number is so big and so important to us.
Whoever is still listening will say: it's unfair. You are talking about the spot machines, but all this time you still have the Jenkins instance up and running. And yes, you are right. But now my Jenkins coordinator is a different instance type. It's not free, but it costs me much less. It's a t3.medium. And honestly, I could take a t3.micro if I wanted, because on a daily basis the only thing this machine does is run the plugins and act as a coordinator: point to the right fleet and redirect the pipeline to the right agent in that fleet. That's all. And the estimated cost is about $45 a month.
Once again, these are theoretical numbers, but I can say from my own experience, from the numbers I saw before and after the optimization, that our bill was reduced significantly. So after all of this theoretical, let's say, information, let's go to Jenkins itself and you will see how it works. This is my Jenkins, and for this session I prepared two pipelines.
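The arithmetic behind this comparison can be reproduced in a few lines. The sketch below uses the assumptions stated in the talk (four peaks a month, five instances, four hours each, roughly $1,000 a month for the always-on machine, about $45 a month for the coordinator). The 730-hour month and the spot discount are assumptions added here for illustration; real spot prices vary by region, instance type and time.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def always_on_cost(hourly_rate):
    """Monthly cost of one instance that never stops."""
    return hourly_rate * HOURS_PER_MONTH


def peak_instance_hours(peaks, instances, hours_per_peak):
    """Total instance-hours consumed when machines exist only during peaks."""
    return peaks * instances * hours_per_peak


# Figures quoted in the talk.
on_demand_monthly = 1000.0    # always-on Jenkins machine, per month
coordinator_monthly = 45.0    # small coordinator instance, per month

# Hourly rate implied by the quoted monthly cost.
on_demand_hourly = on_demand_monthly / HOURS_PER_MONTH

# ASSUMPTION: spot capacity at 40% of the on-demand rate (illustrative only).
spot_hourly = on_demand_hourly * 0.40

hours = peak_instance_hours(peaks=4, instances=5, hours_per_peak=4)
fleet_monthly = spot_hourly * hours + coordinator_monthly

print(f"instance-hours per month: {hours} vs {HOURS_PER_MONTH} always-on")
print(f"estimated fleet cost: ${fleet_monthly:.2f} vs ${on_demand_monthly:.2f}")
```

Even before any spot discount, the machines run for 80 instance-hours instead of 730, which is where most of the saving comes from in this worst-case estimate.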
One pipeline is named big EC2, the second one small EC2. They do the same thing: they take a GitHub repository and back it up. But, as you can see over here, the small one will use the Jenkins small-footprint ASG, and the big one will use the Jenkins fleet ASG. The names are not so aligned, but it's important to understand the difference.
The first one, the Jenkins fleet ASG, is for my big workloads, for my massive pipelines that run all the builds and all the e2e tests, create images, destroy images, and so on. These builds use the auto scaling group whose launch template has the huge instance type. As I said before, 32 CPUs, or it can be 64 CPUs, whatever I want.
The second one is for the cron jobs and the backup processing. I don't want to trigger these massive instances for those small jobs. I want to use something like a t2.micro or a t3.medium. Small instances with one or two CPUs are enough for me. I don't care if this particular job takes two or three minutes. That's okay for me. But I do care about the money, about the difference in cost between those two instance types. And there is a difference, as you can see.
So now we have the big one up, I assume, and I want to show you how it works: the big EC2 pipeline will run on this particular node, and the small one will trigger a new machine, a new node in my auto scaling group. So let's run them and see. As you can understand, the big one takes much more time, which is why it's already here. But for the small one, I believe we can see it start in a minute or two. So I will run both of them. I repeat myself: they do the same thing, but one of them will trigger the small group as its fleet and the second one will trigger the big one. While we're waiting for it to start, I want to show you how it's configured.
It's a really simple process. You need to install the plugin. This is the EC2 Fleet plugin we saw before. Then you go to Manage Jenkins, Configure Clouds, and from here you can see your Amazon configuration. In this setup I use AWS, so you see the Amazon EC2 Fleet. You can check with your cloud provider how to create those fleets. The configuration is really simple: the name of the fleet, then the credentials, and after that all the basic information you need to enter. Which region I want to run in, what the name of the auto scaling group is.
One specific section I want to show you: if you remember from the diagram, I had this OK flag saying that everything is okay and we can start running. This is how it is implemented. It's a Prefix Start Agent Command that runs before Jenkins starts the pipeline. It repeats itself every 5 seconds, if I remember right, and checks whether this flag is raised. It's as simple as that. This is the first fleet, and this is the second one, the small one. The same credentials, the same configuration, but the auto scaling group is different.
Now I want to show you the auto scaling groups in AWS. Here I've already filtered the two ASGs that I use, the small one and the big one, and you can see that the configuration is different. For the small one I need a maximum of two instances, but for the big one I want to go up to five instances. On a daily basis, the desired capacity and the minimum capacity are zero. Remember it: zero. That's the catch in this story.
So now we can see that one instance is already up, and our build, I assume, is starting to run right now. And yes, the big one has already finished and the small one has just started. When it finishes, the instance that was triggered will destroy itself in two or three minutes. Just like that. I have other cron jobs and they run the same way. They trigger the auto scaling group.
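The readiness check described in this part of the demo, where a command is repeated every few seconds until the flag is raised, is ordinary polling. In the talk it is a shell command configured in Jenkins; the Python sketch below only illustrates the logic. The flag path, the timeout and the function name are hypothetical.

```python
import os
import time


def wait_for_flag(flag_path, interval_s=5.0, timeout_s=600.0):
    """Poll until the flag file exists or the timeout expires.

    Returns True as soon as the flag is found, False if the deadline
    passes first. The flag is assumed to be a file that the user data
    script creates as its last step.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(flag_path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (hypothetical path written at the end of the user data script):
#   wait_for_flag("/var/run/jenkins-agent-ready", interval_s=5.0)
```

A timeout is included here so that an agent whose setup script failed does not block the pipeline forever; the talk does not say whether the original command has one.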
The EC2 instance will come up, process all the logic, and go down. That's it. Okay guys, I hope you enjoyed it, and thank you very much for attending my session. If you have any questions or you need some additional information, feel free to contact me by email, valera at memphis.dev, or on any social network you are using, and enjoy the conference. Thank you very much.

Valera Bronshtein

Head of DevOps @ Memphis{dev}
