Conf42 Cloud Native 2021 - Online

How we used Open-Source tools to create Puerta, a gating service for Flagger

Abstract

At StackPulse we use a full CI/CD pipeline with FluxCD + Flagger, in order to support our CD culture needs. We also developed Puerta — a homegrown gating service that implements the Flagger webhook phases to support time-based approvals, triggering E2E jobs (CircleCI etc.), load testing, and auto-release to production once the canary passes on staging.

In this session, we’ll discuss why and how we built Puerta and dig into three key areas:
  • How to customize your CD workflow to fit your needs and culture
  • How to empower developers so they can quickly deploy their code securely
  • How we dedicated time and resources to developing this internal CD service

Summary

  • Today we are going to discuss Puerta, a gating service we created for our Kubernetes-native continuous delivery. After that we will discuss our delivery pipeline in a bit more detail, and then dig into our custom gates.
  • At StackPulse we use fairly common infrastructure: GitHub hosts our code repositories and CircleCI runs the CI pipelines. Custom gates are implemented in Puerta to support internal tooling and organizational culture. Why do we need custom gates?
  • We believe in small code changes, small PRs, and merging constantly. We use the confirm-rollout webhook to simulate release trains, queuing releases so they only reach production during certain hours. The next gate is E2E execution, which adds another layer of protection and makes developers feel much more comfortable when pushing code.
  • We added an event webhook in Puerta that takes every Flagger event and pushes it verbosely to a central channel. We decided to keep the channel and copy the messages there for auditing and transparency in the organization. The feedback loop is much shorter, and each developer gets a curated DM.
  • At StackPulse we create a SaaS platform for SREs and for reliability. We rely heavily on our continuous deployment pipeline to do that safely and to deliver value to our customers quickly. Puerta is a gating service for Kubernetes-native CD.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, my name is Eldad Rudich, I'm the Director of Engineering at StackPulse, and I'm here with Or Elimelech, our SRE Lead. Today we are going to discuss Puerta, a gating service that we created for our Kubernetes-native continuous delivery. Here is what we are going to cover: first, a quick introduction to StackPulse. After that we will discuss our delivery pipeline, and then we will dig into our custom gates, explaining how they were created and how they support our culture. Finally, we will describe Puerta, the service that handles those gates and supports our pipeline.
Okay, so about StackPulse. StackPulse is a fairly new startup; we create a SaaS platform for SREs and for reliability in general. We call that reliability as code. We ingest many events coming from monitoring systems and enable SREs to automatically respond to those events by executing automations that we call playbooks. Playbooks help investigate and remediate events and resolve incidents automatically, without any manual intervention, and therefore reach a faster resolution, a safer resolution, and a quicker response.
A bit about our tech stack: at StackPulse we leverage Google Cloud Platform as our cloud provider, and we rely heavily on Kubernetes to deploy our services, specifically GKE, the managed solution in GCP. We strongly believe in immutable infrastructure, so we have Terraform code that describes all our infrastructure as code. And, as you probably guessed, we have a cloud-native architecture: microservices that communicate over gRPC. As context for this talk, we have full CI/CD, from a merge to the main branch all the way to production, automatically, without any human intervention in between. That's the context of this talk, and we'll dig into it in the next slides. Okay, so with that, I'll let Or explain and discuss our pipeline.
Thank you, Rudich. I'm Or, the SRE Lead at StackPulse, and I'll take it from here. Let's talk about the CD pipeline. At StackPulse we use fairly common infrastructure: GitHub hosts our code repositories and CircleCI runs the CI pipelines. We build with Lego bricks, FluxCD, Flagger, and CircleCI, connecting all of it together with Puerta, which I'll discuss in a bit. For the CI part, GitHub kicks off a CircleCI job for each commit, and a successful job ends with a Docker/OCI image pushed to a registry. In the CD part we have Flux which, for those who don't know, is a GitOps toolkit that watches two sources: Git, for configuration, and a container registry, in our case Google Container Registry, for new artifacts and images. We push the image to the registry, Flux refreshes its cache, discovers the image, and applies it to the canary workload; Flagger then recognizes that and triggers a new canary pipeline. The way we extend Flagger is through webhooks: listed here are all the webhook stages that Flagger supports, and we leverage and implement those in Puerta. The custom gates are implemented in Puerta, like I said, and this is how we extend Flagger to support the internal tooling, organizational culture, and structure we use.
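To make the webhook contract concrete, here is a minimal sketch in Go of a gate server sitting behind Flagger's hooks. It is illustrative rather than Puerta's actual code: the route paths, the port, and the gate helper are assumptions, while the payload fields follow Flagger's documented webhook body, where a 2xx response lets the canary advance and any other status holds or fails that stage.

```go
// A minimal sketch of a Flagger gating service in the spirit of Puerta
// (illustrative, not the real implementation). Flagger POSTs a small JSON
// payload to each configured webhook; a 2xx response lets the canary advance,
// any other status holds or fails that stage.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// payload mirrors the JSON body Flagger sends to its webhooks.
type payload struct {
	Name      string            `json:"name"`      // canary name
	Namespace string            `json:"namespace"` // canary namespace
	Phase     string            `json:"phase"`     // e.g. Progressing, Promoting
	Metadata  map[string]string `json:"metadata"`  // free-form values from the Canary spec
}

// gate adapts a decision function into an HTTP handler Flagger can call.
func gate(decide func(payload) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var p payload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload: "+err.Error(), http.StatusBadRequest)
			return
		}
		if err := decide(p); err != nil {
			log.Printf("gate closed for %s/%s: %v", p.Namespace, p.Name, err)
			http.Error(w, err.Error(), http.StatusForbidden)
			return
		}
		w.WriteHeader(http.StatusOK) // gate open: Flagger may proceed
	}
}

func main() {
	mux := http.NewServeMux()
	// One route per Flagger stage we want to gate; the real decision logic
	// plugs into the decide functions.
	mux.HandleFunc("/gate/confirm-rollout", gate(func(p payload) error { return nil }))
	mux.HandleFunc("/gate/pre-rollout", gate(func(p payload) error { return nil }))
	mux.HandleFunc("/gate/event", gate(func(p payload) error { return nil }))
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

On the Flagger side, each stage would point at the matching URL from the Canary resource's webhooks list (type confirm-rollout, pre-rollout, event, and so on).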
Let's dig a bit deeper into what happens to a commit. A PR gets merged to the main branch, then the CI system triggers a build of the main branch for staging. Flux then updates the workload for the canary, and Flagger triggers a new canary pipeline. The pre-rollout webhook triggers our E2E suite and waits for the E2E job to finish. During rollout we use Flagger's built-in metrics: it queries our Prometheus monitoring system for success rate and latency. We fine-tune each canary's metrics depending on its SLOs, and if we stray from the success rate and latency targets we set for the service, we fail the rollout, roll back the canary deployment, and shift all the traffic back to the primary. Then we have the post-rollout step: on a successful rollout, Flagger calls the post-rollout webhook we implement, which creates a new release on GitHub, and the whole thing happens again the same way on prod: a new CI build triggers on the new release, Flux updates the workload on production, and Flagger triggers a new canary pipeline on production, and so on.
So why do we need custom gates? Let's discuss. We want to deliver value to our customers: since we are a new startup, we want to impact customers as fast and as reliably as we can. The way we do this is with gating. We make developers feel comfortable pushing all day, every day, relying on a gating service to fence in their failures and keep bad releases from reaching production. We also want to support our organizational culture. We strongly believe in a great engineering culture throughout our organization, and CD is not an exception; we want everything to support that culture. We do this by gating on E2E-tested flows, where the E2E tests exercise the product from the point of view of the user. That means we can catch bugs that developers may have missed in unit and integration tests, catch them at the gate, and prevent them from reaching production. Visibility is a crucial part of the pipeline: developers want to know the state and phase of their deployment and where it stands. Is it in prod yet or not? Can I reach the code on prod? Can I test it? Can I check that everything I tested in dev is actually working as expected? We strongly believe in full ownership by developers, from dev to production. You build it, you run it: you are the owner of the commit, you make sure it reaches staging and behaves correctly, you test it, you write the E2E, integration, and unit tests, and you make sure everything reaches production in a safe manner. Another benefit of having a gate is that we can gather all the events happening in Flagger, keep an audit trail in logs, store them for longer, and have them in a central channel. We can follow when and where a release happened and use that for compliance reasons.
So let's talk about the gates a bit. We have the confirm-rollout webhook implemented in Puerta. We strongly believe in reliability, since we are a reliability platform, and what we want from developers is to actually work when they feel comfortable, remotely or in the office or at night, whenever they reach peak performance. We don't want to block them from merging their code when it's ready. We believe in small code changes and small PRs, merging constantly; the PRs are a crucial part. The thing is, everything is automated, so we don't want someone's midnight merge to reach prod. So we actually gate releases from reaching production outside certain hours, when people cannot attend to their code reaching production without waking up the on-call engineer. We make sure the rollout happens during work hours: we have 12 hours during the day when everyone can attend and actually answer an on-call page about a bad deployment. It guards developers from unintentional mistakes at night, and we keep the code chunks smaller, more reliable, and easier to review, instead of having a big chunk of code reaching production at any time of day. So we use the confirm-rollout webhook to simulate release trains, queuing releases so they only reach production during certain hours. Here is an example: the canary is waiting. This is Friday, which is the weekend here in Israel. People don't want to be woken up by a release happening during the weekend, so this release will queue up until Sunday morning and let hard-working devs have their weekend with their families, et cetera.
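As a rough illustration of that release-train gate, here is a sketch of the time-window check a confirm-rollout handler could run. The exact window (Sunday to Thursday, 08:00 to 20:00 Israel time) is an assumption derived from the 12-hour workday and Friday/Saturday weekend described above, not Puerta's real configuration; while the check fails, the gate returns a non-200 status and Flagger keeps the canary waiting.

```go
// Sketch of a confirm-rollout "release train" check (assumed window, not
// Puerta's real rules). Returning an error closes the gate, so Flagger keeps
// the canary waiting; returning nil lets the rollout proceed.
package puerta

import (
	"errors"
	"time"
)

// insideReleaseWindow reports whether a rollout may proceed right now.
// Assumed window: Sunday-Thursday, 08:00-20:00, Israel time.
func insideReleaseWindow(now time.Time) error {
	loc, err := time.LoadLocation("Asia/Jerusalem")
	if err != nil {
		return err
	}
	t := now.In(loc)
	if t.Weekday() == time.Friday || t.Weekday() == time.Saturday {
		return errors.New("weekend: release queued until Sunday morning")
	}
	if t.Hour() < 8 || t.Hour() >= 20 {
		return errors.New("outside work hours: release queued until the window opens")
	}
	return nil // gate open: confirm-rollout returns 200 and the canary advances
}
```

Wired into the server sketched earlier, the confirm-rollout route would simply call gate(func(p payload) error { return insideReleaseWindow(time.Now()) }).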
So the next gate is the E2E execution. We use the pre-rollout webhook to trigger a CI job for each newly deployed canary. If the job fails, we fail the entire canary without actually impacting prod: the pre-rollout gate prevents traffic from getting to the new canary, stopping it before even a single percent of traffic is propagated. We use Playwright as our E2E infrastructure for UI tests, and our APIs are entirely gRPC based, so we enjoy the fact that we get generated clients for each API and leverage those generated clients in the E2E to run both UI and API tests in the same job. This helps us simulate a user accessing our systems from the CLI, the API, or the UI, and catch bugs which we couldn't find or catch beforehand. It adds another layer of protection and makes developers feel much more comfortable when pushing code.
The next gate is crucial. It seems trivial to have Slack notifications for each phase, but at first we used the built-in notifications from Flagger, which are lacking; they're not that verbose. So we added an event webhook in Puerta, which takes every event verbosely and pushes it to a central channel. What we then realized is that developers were complaining about the noise in the central channel: very verbose messages about all the microservices we deploy at StackPulse, for dev and staging, all concentrated in the same channel. We decided to keep the channel and copy the messages there in order to have auditing and transparency in the organization and let everyone know what happened when, so we can correlate incidents and deployments. But we also wanted developers to know where they stand: in which phase their code and commit is, whether it reached staging or production, and whether it failed. And if it failed, they can go back and fix it, knowing why it failed: maybe a 500, maybe something in the server, maybe something in the E2E broke because they broke the contract, so they go and fix it. The feedback loop is much shorter, and the DM is curated for the developer, helping them identify bugs earlier in the pipeline.
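As a sketch of that event fan-out, here is roughly what an event-webhook handler could do, assuming Slack incoming-webhook URLs for the central audit channel and for a per-developer DM route. The owner lookup and webhook URLs are hypothetical; the eventMessage and eventType fields come from the metadata Flagger's event webhook is documented to send.

```go
// Sketch of the event webhook fan-out: every Flagger event goes verbosely to a
// central Slack channel, and a copy is sent as a DM to the owning developer.
// The webhook URLs and the owner lookup are hypothetical placeholders.
package puerta

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// event mirrors the body Flagger's event webhook sends; metadata carries the
// event message and type.
type event struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Phase     string            `json:"phase"`
	Metadata  map[string]string `json:"metadata"`
}

// postToSlack sends a plain-text message to a Slack incoming-webhook URL.
func postToSlack(webhookURL, text string) error {
	body, _ := json.Marshal(map[string]string{"text": text})
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("slack returned %s", resp.Status)
	}
	return nil
}

// handleEvent fans a Flagger event out to the audit channel and the owner's DM.
// centralWebhook and ownerWebhookFor stand in for real configuration.
func handleEvent(centralWebhook string, ownerWebhookFor func(service string) string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var e event
		if err := json.NewDecoder(r.Body).Decode(&e); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		msg := fmt.Sprintf("[%s/%s] %s: %s",
			e.Namespace, e.Name, e.Metadata["eventType"], e.Metadata["eventMessage"])
		_ = postToSlack(centralWebhook, msg) // verbose copy for audit and transparency
		if dm := ownerWebhookFor(e.Name); dm != "" {
			_ = postToSlack(dm, msg) // curated, shorter feedback loop for the developer
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

In practice the DM copy would be filtered and formatted per service owner; the point is that a single event hook can feed both a long-lived audit trail and a short, curated feedback loop.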
So I'll give it back to Rudich. I think we're done; if anyone has questions regarding Flagger and Puerta, they can reach me. So let's sum up. At StackPulse we create a SaaS platform for SREs and for reliability in general. We rely heavily on our continuous deployment pipeline to do that safely and to deliver fast value for our customers. To do that, we use Flagger, which is a very common solution in that field.
But we had to extend the built-in functionality of Flagger using a custom tool that we created, which we call Puerta. Puerta has custom gates that support our own needs and our own organizational culture. As Or mentioned, we have E2E there, we have notifications there and many other things, and that's what helps us achieve that fast value and a fast, safe feedback loop. It helps us a lot to extend that functionality and get a state-of-the-art continuous deployment pipeline. Thank you so much for watching our session on Puerta, a gating service for Kubernetes-native CD. Please feel free to reach out on Twitter and ask us additional questions. We'd love to hear from you.
...

Or Elimelech

SRE Lead @ StackPulse

Eldad Rudich

Director of Engineering @ StackPulse
