Conf42 DevOps 2024 - Online

Declarative everything: a GitOps/automation-based approach to building efficient developer platforms

Abstract

In this session, we’ll uncover the power of GitOps and automation as we navigate the landscape of modern development, simplifying processes and enhancing efficiency for a seamless developer experience.

Summary

  • This talk is called Declarative Everything: a GitOps and automation-based approach to building efficient developer platforms. On the production side, there's the true production environment. There might also be staging, which might get a small percentage of the production traffic; the traffic split depends on how that's configured.
  • A dev workflow (I'll use the terms SDLC and software development lifecycle interchangeably) runs from development into CI/CD, then pre-production, and then prod. Which parts of this process can benefit from automation?
  • GitOps is just trying to drive a configuration-backed approach to codifying various DevOps best practices. The ultimate rationale behind all of this is that standardization will fundamentally improve the developer experience, because it fundamentally helps us improve our development and deployment times.
  • An environment needs to have whatever runtime requirements or dependencies an application needs. If it's a production environment, it obviously shouldn't be isolated from prod; a non-production environment needs some isolation. Testing at its core exists for one core reason: acceptance criteria.
  • A GitOps-based approach can get us to a world where ephemeral pre-production environments just come up and each of our changes gets tested in isolation. When all of the changes land in the main branch, similar tests can be run again. This is where GitOps can add further superpowers to DevOps teams' capabilities.
  • GitOps can help engineers gain a lot more confidence a lot faster in the SDLC, while they are in the inner loop, instead of discovering issues only after changes hit the outer loop. The fastest place to catch an issue before production is when we are writing the code in the first place.
  • Thanks for listening to my talk. I work at a company called DevZero, where we are trying to operationalize various things discussed in this talk. If you're interested in checking out how any of this can be applied to your day-to-day engineering workflows, check us out at devzero.com.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
This talk is called Declarative Everything: a GitOps and automation-based approach to building efficient developer platforms. Before diving in too deep, let's understand what a developer platform even is. On the right-hand side over here, we have production-ish things. On the production side, obviously, there's the true production environment; 100% of the traffic it receives is usually from external users. There might also be staging, which might get a small percentage of the production traffic. All of this depends on how the organization has implemented various blue-green types of environments, so the traffic split depends on how that's configured. On the non-production side of the house, we have various different environments. Development is usually one of the biggest ones that engineers get to interact with on a day-to-day basis. It might be a local development environment on a local workstation, or a remote machine that an engineer normally SSHes into, and so on. Then there's CI/CD; that environment normally serves two types of usage. One is from the various authors of all of the features and changes being proposed. The other is from our colleagues, our teammates, who get to review the code and see the outputs of the various test executions. Then we have our classes of pre-production environments, which aren't really production but are production-ish: it might be something for our QA teams to work off of, or a dev environment shared by multiple engineers to perform various types of end-to-end testing. A user acceptance testing environment might also be in that bucket; that's a place where a product manager, for example, might go to make sure the definition of done, as outlined in whatever product document, has been implemented appropriately.

We talked a little bit about a developer platform. Now let's talk about developer workflows, and then we'll try to converge the two. I'll use the terms SDLC and software development lifecycle interchangeably; they are essentially the same thing. The SDLC includes two parts: the inner loop and the outer loop. The inner loop is when the engineer is actively building something, the dev happening inside of an IDE. Some local testing might happen. Then, as an engineer, we might push this branch up somewhere to get it deployed into some environment where we can do our end-to-end testing, and out there we can start to identify some issues. If there are any issues, it goes back into dev, and that loop continues till the engineer is happy with their implementation and it meets the definition of done. Once all of that's good, it goes into code review. If everything looks fine over there, we move forward into the CI/CD stage; if not, it goes back into dev, back into the inner loop of the SDLC, and changes keep getting implemented, so on and so forth. On the other side, code review is the start of the outer loop. CI/CD comes after that, then pre-production, and then prod. In every single one of these stages, if something goes wrong, we go back into dev, which is the start of the inner loop of the SDLC. Otherwise, we just keep going forward into the next stage till we hit production, which is when our software is serving live humans.
On the right-hand side, it's just a different way of looking at the inner and the outer loop of the SDLC. The inner loop, as I covered, is the REPL of software development, the read-eval-print loop. Once things are fine, we end up using git as our context-transfer mechanism: we push a branch for code review, for example, and then it hits the outer loop of the SDLC, goes through that entire loop, and then it goes into some sort of pre-production environment. Staging might be called a pre-prod environment, or be a part of the production infrastructure, depending on how the organization is set up.

Next: I'm an engineer, I've been given a ticket, and I need to ultimately get this change into production. What are the workflows that happen underneath? I'll get the latest version of the source code, pull main, do a git checkout -b new-user-table if that's the ticket I'm working on, T-1234, and then I pull up my IDE and get my local dev env set up. After that, I might wait for my IDE to index a little bit, and then I start coding. Once some code has gotten written, I might do some form of local verification, and might push it to an environment of some sort to be able to perform some level of end-to-end testing. One theme across today: a lot of the content I cover will be more applicable to the microservice side of the house, although some of these constructs should be applicable to monolith architectures as well. So now I've done some testing and the feature seems to be working. I'll write some unit and some integration tests, and then, once I'm satisfied with the definition of done, send it out for a code review. My colleague, my teammate, will read the code. They'll check how CI ran for the tests I just implemented. They'll also make sure that older tests haven't broken as a result of my change. Then a diligent code reviewer might also try to get this into a preview environment of some sort, right? To check how the functionality, et cetera, is working. If everything looks good, approval; if there's feedback, you address it, and again it's the same inner loop of the SDLC till things look good. After things look good, the engineer can merge the change into main. As soon as that merge action happens, tests run. If things are looking good, the engineer can move it into pre-prod, do some more testing; things are good, deploy to staging, more testing; things are good, move it to prod. That's normally the day-to-day workflow of getting a change into production.

Now, which parts of this can benefit from automation? I have it bolded in this image over here. The code and IDE process, right? Pulling the latest main, getting into a new branch, making sure the IDE has indexed, all the dev environment stuff is set up, all of my dependencies are there: that's a bunch of developer toil and friction that can probably benefit from automation. Testing in an environment: can an engineer automatically get an environment to test in that is just based on whatever branch they are operating on? Those are bits where automation can directly help. For unit and integration testing, I know there's a lot of work happening in the AI realm today, still very early, but maybe computers can also help us write some tests, and also help with the code review process. Usually we all have automated tests running in CI already.
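To make that concrete, here's a rough sketch of what such an automated test run in CI can look like. This uses GitHub Actions syntax purely as an example; the talk doesn't name a CI system, and the make test target is a placeholder for whatever test entry point the repo exposes.

    # Hypothetical CI workflow: run the test suite on every pushed branch.
    name: tests
    on:
      push:
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4     # pull the latest source for this branch
          - name: Run unit and integration tests
            run: make test                # placeholder test entry point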
As soon as a certain branch is put up for review, or there's a request to merge to main, checking the functionality in your preview environment is another area where automation can help significantly. And when tests are passing and approval has been given by a teammate, we should be able to automatically merge things into main, run those tests in an automated fashion, automatically provision a pre-production environment, and run tests in there; if things are good, get it into staging, and so on and so forth.

So now we've talked about dev platforms, and we've talked about developer workflows and how changes get into production. Now, GitOps. What is GitOps? At its core, you might have noticed we were talking about creating a bunch of infrastructure at various stages as we were trying to get a change into prod. GitOps is just trying to drive a configuration-backed approach to codifying various DevOps best practices, which can then be applied to infrastructure automation. Many companies call these golden paths. The ultimate rationale behind all of this is that standardization will fundamentally improve the developer experience by removing humans from having to make a variety of these almost menial decisions, to go and tear down environments, create new environments, et cetera. Automation can also help us reduce costs, for example by killing environments when they are not actively being used. How can this be achieved? Some sort of configuration that lives right alongside your source code. Your app logic, your business logic, is living in source; why can't all of the supporting logic codifying stages of the SDLC also live alongside it? And ultimately, why this matters: the concept of CI exists for continuous integration, right? How can we have repeated automation happening every time we merge a new change, tests getting run, et cetera? Because it fundamentally helps us improve our development and deployment times. So that's why GitOps is important.

Now, we've talked about environments multiple times in the last three or four slides. What is an environment at its core? An environment needs to have whatever runtime requirements or dependencies an application needs. It could be something as simple as having a Go compiler available, or something that can execute a binary as a process. It might need access to some sort of dev or test data. It might need access to some form of downstream services, and might need upstream services as well, if I'm trying to, for example, test an API gateway change and make sure that the website still functions correctly. It also needs to be accessible to a human so that they can either test the environment or actually access it to figure out what's going wrong. Depending on the use case, if it's a production environment, obviously it shouldn't be isolated from prod; but if it's a non-production environment, some level of isolation from production is important. Depending again on how the environment is being used, it might need some form of source code, build tools, your IDE, and various other developer-specific tooling. And depending on when you're using it: if it's a prod environment or a staging environment, that's usually a shared-tenancy construct, but if it's my local development workstation, that's a dedicated-tenancy construct.
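To illustrate, here's a hypothetical sketch of what such an environment definition, stored right alongside the source code, might look like. The schema and every field name below are invented for illustration; they don't correspond to any specific tool.

    # Hypothetical declarative environment spec living in the repo (invented schema).
    environment:
      name: new-user-table-t1234
      tenancy: dedicated              # dedicated for a dev workstation; shared for staging/prod
      runtime:
        - go@1.21                     # toolchain the application needs
      data:
        - mysql: fixtures/users.sql   # dev/test data the feature reads
      services:
        downstream: [billing-api]     # services this change calls
        upstream: [api-gateway]       # services that call into this change
      access:
        human: true                   # reachable by an engineer for end-to-end testing
      isolation: non-prod             # walled off from production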
So ultimately, we'd like to get to a world where whatever configuration we are storing alongside our source code can give us environments that are well tuned and configured appropriately for whatever stage of the SDLC we are currently in. Testing, debugging, et cetera, are all angles at verifying that the software change we are proposing is fundamentally not causing regressions anywhere and is meeting the appropriate definitions of done. Testing at its core exists for one core reason: acceptance criteria. Right? The product document that I received as an engineer has some form of features and functionality that it aims to establish. Has my implementation achieved them all? And again, as I mentioned, I'm going to focus more on the microservices side of the house. This is also applicable to monoliths, but I'm putting more emphasis on services, not libraries and modules. The reason for that is that oftentimes a database doesn't live on the same machine as where our application is running, so a network call is happening, either to talk to a database or to a downstream service, or from upstream services talking to me. Functionality, we covered that. On the interoperability side: I'm calling various downstream services, usually as part of any feature. Am I calling those systems appropriately? Am I breaking any API patterns? Are the latencies I experience acceptable to my end users, or my target end users? And ultimately, the confidence side of the house, which again is why testing exists: will this change that I'm introducing right now cause future production deployments to be more error-prone? Will it cause more alerts to go to our on-call? Has the feature really been implemented in the simplest and most resilient way possible?

So normally, we have source code on the left-hand side. I've tried to represent this as a monorepo structure, but all of these primitives apply just as cleanly if a company has multiple smaller repos, micro-repos where every microservice comes out of its own repo; it doesn't matter. You can see various libraries, modules, et cetera, all living together. There's some software that runs in the middle which will ultimately cause some form of artifact to get built, for example a Docker container, an OCI image of some sort, that then gets sent to our production workload management system, which goes and runs this latest version of a certain artifact, and then all of our systems go and talk to each other. So how does this software actually end up in production? I'll just use a couple of Kubernetes examples, but this is applicable to any sort of runtime. Kubernetes has these deployment YAMLs or Helm charts, et cetera. Ultimately, I push a new branch somewhere, or if it's a new change in the main branch, a new Docker image might get built. This image gets pushed into a container registry somewhere, and a field is updated in my Helm chart, which normally lives in the source code. This field is probably going to say: this is the latest version of this image. Then the Helm chart gets applied against my production Kubernetes cluster. At that point in time, the Kubernetes cluster knows: okay, I need to pull this image down from the appropriate container registry. It pulls that down, gets it deployed, and stuff works.
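As a minimal sketch of that image-bump flow, assuming a Helm-based setup like the one described above (the registry, chart path, and tag below are made up):

    # Hypothetical values.yaml fragment stored in source control.
    image:
      repository: registry.example.com/acme/api-gateway   # where CI pushes new images
      tag: "sha-3f9c2ab"                                  # CI rewrites this field on each build
    # A CD step then applies the chart against the cluster, for example:
    #   helm upgrade --install api-gateway ./charts/api-gateway --set image.tag=sha-3f9c2ab
    # at which point Kubernetes pulls that image from the registry and rolls it out.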
Now that we've covered the prod deployment process, what is pre-prod? Pre-prod normally involves engineers, as we were saying, pushing their branches somewhere and having the CD system go and deploy that change into a shared-tenancy pre-production environment of some sort. These are great for the most part, till... I'll give an example. I am working on a front-end change, Bob is working on an API gateway update, and Sally is running a database migration. When I push my front-end change into the pre-prod environment, I suddenly see an issue related to a database migration pop up in the error logs. That didn't happen because of me. A GitOps-based approach, as we can see on the right-hand side over here, can get us into a world where ephemeral pre-production environments just come up and each of our changes gets tested in isolation. And finally, when all of the changes land in the main branch, similar tests can also be run to make sure the current state of the main branch is healthy; it's green. During our non-production dev processes, being forced to share a pre-production environment with all of our colleagues leads to a lot of developer friction.

Now, images. Normally in CI we don't regularly go and spin up our own version of a full-tenancy cluster of some sort where proper end-to-end testing happens. There can be some cases where images get built, and this is just a little spec of how something like that might happen, but ultimately it boils down to: check out the latest source code, make sure all of your relevant credentials are set up for whatever container registry you're using, build the image, push it there, and then finally call whatever your workload management system is. In this case it's a Kubernetes example: call it with the latest version of this image that's in this registry, and Kubernetes is instructed to go and apply these changes. In the CI/CD process that I just talked through, normally in our tests we might have configuration to set up some form of environment wherein tests are run, and every engineer then has to verify that the test failures are not environment-related. If there are test failures, of course, they need to make sure that they're only related to the changes they have introduced, because in this world the DevOps team is the one that's normally responsible for making sure the environment is left in a pristine state. In the GitOps world that we're talking about, which is what you can see on the right-hand side, the test suite gets instantiated and an ephemeral environment comes up, all backed by configuration that lives right alongside my source code. So if I didn't change any of my environment config in source, the ephemeral environment that comes up is a golden path. In that environment, we know for sure that the only changes going in at that point in time are the changes I've implemented for my feature. So when I run the tests now, all failures should only be happening because of my changes and nothing related to the underlying infra. This is where GitOps can essentially add further superpowers to DevOps teams' capabilities: just by having whatever golden path they want to attain codified in source code, every engineer can go and spin up their own version of it.
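A rough sketch of that check-out, build, push, and deploy spec, again assuming GitHub Actions syntax (the registry, chart path, and secret names are all invented):

    # Hypothetical CI job that builds an image and spins up an ephemeral preview
    # environment per pull request; all names are placeholders.
    name: preview-environment
    on:
      pull_request:
    jobs:
      build-and-preview:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4        # check out the latest source code
          - name: Log in to the container registry
            run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login registry.example.com -u ci --password-stdin
          - name: Build and push the image
            run: |
              docker build -t registry.example.com/acme/api-gateway:${{ github.sha }} .
              docker push registry.example.com/acme/api-gateway:${{ github.sha }}
          - name: Deploy an ephemeral preview environment
            run: |
              helm upgrade --install preview-${{ github.event.number }} ./charts/api-gateway \
                --namespace preview-${{ github.event.number }} --create-namespace \
                --set image.tag=${{ github.sha }}

When the pull request closes, a matching cleanup job could run helm uninstall against that namespace, which is what makes the environment ephemeral.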
In the code review process, again, it's very similar. Oftentimes the reviewer needs to make sure that, for the changes under review, nothing is failing because of underlying infra issues, and nothing is spuriously succeeding because of underlying infra issues either. With ephemeral preview environments, everything just comes up, I make my changes, and when the changes get merged into main, all of the infrastructure that was spun up gets deleted automatically.

Dev and CI stages will normally resort to using Docker for many of these functionalities. You can see over here (the "r" has gotten cut off in my screenshot, I apologize for that) that these are all just Docker Compose files. If a developer, for example, needs a MySQL database, we'll normally not go to AWS and spin up an RDS database just for that; we'll just use Docker to do our basic feature functionality testing, and this usually suffices for dev. But as we get closer to CI and the other stages of the SDLC, that's where having more production-symmetric infrastructure might be more helpful in weeding out bugs and getting a better sense of latencies, et cetera, as a user would experience when interacting with a prod environment.

So why is all of this automation important? How exactly does it improve our developer experience? On the left-hand side over here, let's take a normal dev loop: I write code, I build and compile, I run it, I inspect to see if everything works fine, make a commit, and move on. Let's say as an engineer I'm working 6 hours a day; that's about 360 minutes. This entire loop takes about five minutes to run, so out of 360 minutes I can do this about 72 times in a day: 360 over 5. On the right-hand side, as soon as we go into this microservice, containerized world, we do our code, we do our builds, and then there's the container build, et cetera; all of the container ergonomics come in, and that's pretty time-consuming sometimes. Ultimately, you can see that the five-minute loop for getting the same task done is now taking nine minutes. And because we were working 360 minutes a day, which was our assumption, 360 over 9 is about 40 iterations. So on the left-hand side I could do 72 iterations and on the right only 40, which is about a 45% degradation, and that can take a significant amount of time away from the inner loop of the development process.

So this dev workflow we talked about earlier, the one you see on the left, is the traditional dev workflow for all of us: pull the latest code, set up the local env, wait for the IDE to index, write code, do all of that testing, et cetera, et cetera. It's a long loop. On the right-hand side, taking this GitOps-based model, when we want to do dev we can just get an environment and immediately hook our IDE into it. Everything has already been set up for us; I don't have to wait around for anything. I just do all of my testing, and I know this is a completely dedicated environment where Bob's and Sally's changes, et cetera, aren't going to influence me. All of the issues I discover have to be fundamentally related to the code I just wrote. So now, when I'm sending my code for a code review, I have much more confidence that things are going to work to spec. And in the CI/CD to pre-production workflows, if there are some changes on the main branch that need to get tested, some of that stuff is getting pushed later in the pipeline, so it's not impacting every single engineer every time they are trying to write code.
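Going back to the Docker Compose files mentioned a moment ago, a minimal sketch of that dev-stage pattern might look like this; the image tag and credentials are placeholders, fine for local dev but never for production:

    # Hypothetical docker-compose.yml for basic local feature testing.
    services:
      mysql:
        image: mysql:8.0
        environment:
          MYSQL_ROOT_PASSWORD: devpassword   # throwaway local credential
          MYSQL_DATABASE: users
        ports:
          - "3306:3306"                      # service under test connects to localhost:3306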
Again, for dev purposes: normally, whenever we are developing a feature, service dependencies are pretty isolated. Sure, we have our IDE running somewhere. But wouldn't it be great if, as on the right-hand side, where the red circle calls the green circle, which calls the pink-looking circle over there, we could get a workflow replicated wherein that red was actually calling green, which is currently running inside my IDE with my debugger connected to it? That would let me have much greater confidence in the software I was building, because I know what the longer request chains are going to look like, and I know exactly how my changes are going to function with respect to them. There are various ways of looking at it, right? The green one I was just showing you in the previous slide is an example where my service or my feature exists in the middle of a long microservice call chain. Similarly, if I'm testing something at the start of a call chain or at the end of a call chain, again, having access to a production-like environment, wherein all of the network calls, et cetera, are seamlessly handled for me underneath, and wherein the only thing under test is the change I'm working on, would help engineers get to a lot more confidence a lot faster in the SDLC, right? While they are in the inner loop, instead of having to discover issues after it hits the outer loop and then go back into the inner loop. Because that context switching, right, that's the most important and expensive part for us software engineers.

So ultimately, to wrap up the takeaways: as engineers, we all know that issues will always be easier to fix when they're caught way before production. The fastest would be when we are writing the code in the first place, the first time we are writing that line of code. One way in which this can be achieved is by using a GitOps-backed approach to having production-symmetric, or production-like, environments for every stage of the SDLC as I'm writing my code. And why this is important: when different stages of the SDLC are configured differently, it all adds up to different bits of developer friction and ruins the dev ergonomics. So at its core, taking a GitOps-based approach would also remove the drift between these various stages of the SDLC. So, yeah, thank you. Thanks for listening to my talk. I work at a company called DevZero, where we are trying to operationalize various things that we discussed in this talk today. If you're interested in checking out how any of these things can be applied to your day-to-day engineering workflows, check us out at devzero.com. Thank you.

Debosmit Ray

Founder @ DevZero



