Conf42 Cloud Native 2021 - Online

CI/CD in the serverless era

Abstract

Serverless development introduces a new methodology for building real “Cloud Native” applications and workloads. It is impossible to duplicate an exact serverless environment locally, and this influences the tools we use and the way we need to manage our CI/CD pipelines.

The main goal of serverless is to release software faster, so practicing CI/CD is the logical thing to do, but some adaptations need to be made. In this session, we will share how we are doing CI/CD on the pure serverless platform we are developing. We will discuss:

  • Dev environment
  • Testing methodology
  • Deployment pipeline, combining Bash, AWS CLI, and the Serverless Framework to create a seamless CI/CD pipeline
  • Monitoring

Let’s discuss good serverless practices.

Summary

  • Lumigo is 100% serverless in production, with hundreds of millions of Lambda invocations per month. In this session you will learn how we drive our CI/CD flow at Lumigo, and Efi will share tips on how we cut our CI/CD flow time by two thirds.
  • Serverless is different, and it affects the way CI/CD is conducted. It's distributed, with maybe hundreds of components, so orchestrating the build of a serverless environment from scratch each time is difficult. Many of the components can't run on a regular Linux machine.
  • Each of our developers has an AWS environment in their name, plus a shared environment for CI/CD-driven integration tests. We are serverless first and prefer to outsource everything that is not at the core of our product. No QA and no Ops means we invest heavily in automation.
  • When Lumigo started and we were small, we used Kanban to drive our workflow. As we grew, upper management wanted more visibility, so we moved to a more detailed Scrum. A lot of responsibility falls on the developers' shoulders; we are very continuous-delivery oriented.
  • We believe in automated testing without QA. Each of our developers has their own AWS environment. We have three types of tests: unit tests, integration tests, and end-to-end tests. By using parallelism we managed to reduce the integration test time from an hour and a half to around 40 minutes.
  • Another flow we developed internally is a staging flow for our SDK. Integration metrics are important: they give us the ability to pinpoint potential issues. Another metric we gather is deployment failures. Everything is in production now.
  • Lumigo is an observability and troubleshooting platform aimed mainly at serverless workloads. It fuses data from multiple sources and combines it into a coherent view. Use the power of serverless to parallelize your testing and reduce costs.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, thank you very much for attending the webinar. My name is Efi, and I'm going to share Lumigo's experience with serverless CI/CD. At Lumigo we are 100% serverless in production, with hundreds of millions of Lambda invocations per month. And of course, serverless is not only Lambdas: we are heavy users of DynamoDB, SQS, SNS, S3, Athena, Kinesis, EventBridge, and I'm sure I missed something. We are really the poster child of serverless on AWS. Not only that, we also have a 100% serverless CI/CD flow with tens of deployments per week. We don't have a single server that we manage to drive our flow; everything is outsourced either to AWS or to an external service provider.
In this session you will learn how we drive our CI/CD flow at Lumigo and how we make the entire flow robust enough. I'll share some tips on how we cut our CI/CD flow time by two thirds, and we'll talk about a unique flow for phased deployments we implemented internally. Across the session you'll see many links to blog posts we wrote on the subject.
As preparation for this session I gathered some metrics so you'll better understand the CI/CD numbers Lumigo is facing. We run around 20,000 individual integration tests per month and do around 30 deployments to production per week. The pain points, which I'll also talk about during the talk, are the time it takes us to deploy for the tests and to run them. By the way, we try to be metrics driven, and we gather quite a lot of information about the quality and performance of our continuous deployment and development; I'll share with you some of our dashboards.
A couple of words about me: I'm an AWS Serverless Hero and I'm leading Lumigo's R&D. I've been using serverless for the past four years, mostly with Python, and I'm obsessed with serverless testing and serverless development flows. I wrote a couple of blogs and have given a couple of sessions on the subject. I have 15 years in the industry, working mainly on backend and mobile applications in various verticals. In my spare time I like to play card games, although most of the time I lose. A couple of words about Lumigo: it is a SaaS platform for AWS monitoring and observability, heavily focused on serverless workloads.
So, the agenda: we are going to talk about the famous infinity loop, see how to adapt it to a serverless CI/CD flow, and talk about our best practices. Unfortunately we won't have any time for Q&A, but you are more than welcome to contact me either through Twitter or through my email address, which is going to appear at the end.
Serverless is different, and it affects the way CI/CD is conducted. It's distributed, with maybe hundreds of components, so the orchestration of building a serverless environment from scratch each time is difficult. Many components usually means that frequent deployments are the norm, and many of the components can't run on a regular Linux machine. SQS, SNS, and Kinesis are AWS services; we don't have their containers, only their mocks, so tests have to run in a dedicated AWS environment in order to exercise the real behavior.
The infinity loop starts with a plan, goes to code, then to a build process. By the way, in Lumigo we use Python, therefore the build process is skipped, but in other runtimes, like Java or .NET, you actually have to create a runnable artifact. From there we move to testing: unit tests, integration tests, and end-to-end tests.
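Since services like SQS, SNS, and Kinesis can't run on a local machine, integration tests end up exercising real deployed resources. Here is a minimal sketch of what such a test could look like, assuming boto3 and pytest; the queue URL, table name, and field names are hypothetical and only illustrate the pattern of driving an asynchronous serverless flow and polling for its result:

```python
# A minimal sketch (not Lumigo's actual test code) of an integration test that
# runs against real AWS resources instead of mocks. It assumes the service under
# test reads from an SQS queue and writes a result to a DynamoDB table.
import json
import time
import uuid

import boto3
import pytest

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"  # hypothetical
TABLE_NAME = "processed-events"  # hypothetical


def test_event_is_processed_end_to_end():
    sqs = boto3.client("sqs")
    table = boto3.resource("dynamodb").Table(TABLE_NAME)

    # Send a uniquely identifiable event into the real queue.
    event_id = str(uuid.uuid4())
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"id": event_id}))

    # The flow is asynchronous, so poll DynamoDB until the item appears or we time out.
    deadline = time.time() + 60
    while time.time() < deadline:
        item = table.get_item(Key={"id": event_id}).get("Item")
        if item:
            assert item["status"] == "processed"
            return
        time.sleep(2)
    pytest.fail(f"event {event_id} was not processed within 60 seconds")
```

Polling with a deadline is what makes asynchronous flows testable without mocks; the trade-off is the slower feedback loop discussed later in the talk.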
When everything passes (I'll elaborate later on what "everything" means), we release our artifacts, deploy them to production, and start monitoring them.
Before going through the flow, I want to go over a couple of guidelines we have internally at Lumigo that affect our process. Each of our developers has an AWS environment in their name; we use AWS Organizations to manage them and consolidate the billing. If the code is not tested on a real AWS environment, it's not considered ready. Originally we had a shared environment with different name prefixes for each developer's resources, but very quickly we started stepping on each other's toes and decided to separate the environments. We still have a single environment for the CI/CD-driven integration tests. We are serverless first, which means we will always prefer to choose a serverless service instead of managing it ourselves. And by serverless I mean true serverless, not just managed: we don't want to handle sizing and operations, or pay for unused capacity. We mean serverless across the technology stack, things like CI/CD, code quality, and so on; I'm not talking only about serverless in production. We don't want servers. We prefer to outsource everything that is not at the core of our product. We have no dedicated QA and no Ops, which means everything is done by the developers from start to finish: in the infinity loop you saw earlier, the same developer takes the ticket across all stages. No QA and no Ops means we invest heavily in automation, across the board.
We've talked about environments, and these are the environments we have in Lumigo. Each developer has their own personal laptop and a personal AWS environment where they can run their code. We have a couple of shared integration environments that are part of the automated CI/CD process. And we have the production environments, which are composed of the environment our customers use and a monitoring environment that runs our own product to monitor our production. We are eating our own dog food extensively, which helps us both find potential issues ahead of time and sharpen our product.
Our technology stack to drive the CI/CD: we're using CircleCI (we even wrote a joint blog post on how we use them for our deployments) and the Serverless Framework. Most of our code is written in Python, but we do have some services written in Node.js.
So let's start with the infinity loop. When Lumigo started and we were small, we used Kanban to drive our workflow: we had a long, prioritized list of tasks, and each developer picked the top one. But as we grew, upper management wanted more visibility, so we moved to a more detailed Scrum. Each of our sprints is a week long; we keep them short on purpose to create the feeling of things moving fast, but we don't want to wait for the end of the sprint to deliver. We are very continuous-delivery oriented: when a piece of code passes through all of our gates, it's pushed to production. Again, a lot of responsibility falls on the developer's shoulders. Originally we used Trello as a ticket tracker, but as the team grew and the complexity of the tasks grew as well, we moved to Jira. I can't say that I'm satisfied with the move, but that's what we have and we live with it.
We're using GitHub to store our code, and we follow the GitHub flow, in which you have only master and feature branches. Each merge from a feature branch to master means a deployment to production. Again, and this is going to come back over and over, a lot of responsibility is put on the developer.
At the beginning we had a very heated discussion regarding mono-repo versus multi-repo. We chose multi-repo because it was the most suitable for service deployments: each change in a repo means a deployment, and it coerces developers to think in a more service-oriented way. You don't read directly from a DynamoDB table that does not belong to your service only because you can import a DAL or call functions directly; instead, you use common practices to access remote resources: API Gateway, Lambdas, queues, and so on. We wrote a lengthy post on the never-ending battle between mono-repo and multi-repo; you can find it on our blog.
In one of the earlier slides I mentioned that each one of our developers has their own AWS environment. Right now we have around 20 services, and we need to orchestrate the deployment of these services so our developers can update or install them into their environment. We've created an internal orchestration tool in Python which we call the uber deploy (not related to Uber, by the way). It does the following: pulls the relevant code from git depending on the branch you choose, installs the relevant requirements for each service, and deploys the various services in parallel according to a predefined order. The uber deploy tool enables our developers to easily install the services in their environment, so no one needs to know the various dependencies and the order of deployment, and it does it faster than doing it manually. By the way, we use this tool only in the developer and integration environments; in production, each service is deployed on update. This is purely a development tool.
We believe in automated testing without QA. It's mandatory; we can't skip it. We have three types of tests: unit tests, which the developer runs locally; integration tests, which the developer runs on their AWS environment and which also run as part of the CI/CD flow in a dedicated environment; and end-to-end tests with Cypress, which again run on an AWS environment. Because testing in the cloud is slower than testing locally, we prefer to detect as many issues as possible before pushing remotely. We use git pre-commit hooks to automatically run tests and linting: for Python we use pre-commit, and for Node we use Husky.
One of the hardest things when running tests is handling external services. People usually ask me: why are you running your integration tests in the cloud? Use mocks, there are plenty of mocks that mimic the behavior of the various AWS services. Well, we tried it and it didn't work well, for a couple of reasons. Some services that we use don't have good mocks: they don't really mimic the true behavior of the service and don't always include the latest API. Some mock infrastructures, like LocalStack, are complicated, with a lot of moving parts around them, and I prefer to spend my time on real testing. And some of these mocks have bugs: it happened to us more than once that things didn't work, only to find out later that they worked perfectly when running on AWS. I prefer not to waste my time debugging mocks. So, as a rule of thumb, we don't use service mocks for integration testing; we always run things in an AWS environment.
So this is our testing stack. We use Black and flake8 for Python, and Prettier and ESLint for Node. We use static type analysis for Python, although, I'll be honest, it helps us more with readability and less with actually catching type errors, though it does sometimes succeed in catching issues. And for unit testing we use pytest, Mocha, and Jest.
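Before getting to the integration tests themselves, here is a rough sketch of the kind of orchestration an uber-deploy-style tool performs. This is not the actual internal tool; it assumes each service lives in its own repo, is deployed with the Serverless Framework, and belongs to a predefined deployment wave, and the service and wave names are made up:

```python
# A rough sketch of multi-service orchestration: pull each repo, install its
# requirements, then deploy wave by wave, with services inside a wave deployed
# in parallel. Names and layout are hypothetical.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Each wave must finish before the next one starts (dependency order).
DEPLOY_WAVES = [
    ["shared-infra"],
    ["ingestion-service", "parsing-service"],
    ["api-service", "dashboard-service"],
]


def run(cmd: list, cwd: Path) -> None:
    subprocess.run(cmd, cwd=cwd, check=True)


def prepare(service: str, branch: str) -> None:
    # Pull the relevant code for the chosen branch and install its requirements.
    path = Path(service)
    run(["git", "fetch", "origin"], path)
    run(["git", "checkout", branch], path)
    run(["git", "pull", "origin", branch], path)
    run(["pip", "install", "-r", "requirements.txt"], path)


def deploy(service: str, stage: str) -> None:
    run(["sls", "deploy", "--stage", stage], Path(service))


def uber_deploy(branch: str, stage: str) -> None:
    for wave in DEPLOY_WAVES:
        with ThreadPoolExecutor(max_workers=len(wave)) as pool:
            list(pool.map(lambda s: prepare(s, branch), wave))
            list(pool.map(lambda s: deploy(s, stage), wave))


if __name__ == "__main__":
    uber_deploy(branch="my-feature-branch", stage="dev-efi")  # hypothetical stage name
```

Deploying wave by wave preserves the dependency order, while services inside a wave deploy in parallel, which is where the speed-up over a manual, sequential deployment comes from.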
We have two types of integration tests: very thorough API-based tests, in which the test is driven only by API calls with no UI interaction, using Node.js and Mocha to drive the flow; and very specific critical end-to-end flows that also include the UI, like onboarding and login, written with Cypress and Jest.
We have two main problems when running our tests. We have 20 services with around 200 Lambdas, multiple DynamoDB tables, and Kinesis streams, so deployment takes a lot of time: around one hour. The tests themselves are also slow; there are many synchronous and asynchronous flows, and the tests originally took around an hour and a half to run. These numbers are not good. They don't allow us to deliver features quickly and, above all, to get quick feedback in case of a failure.
So first we tackled the second problem, running the tests, by using the power of parallelism. We duplicated our integration-test AWS environments: each time we do a deployment for testing, we actually deploy our code to three environments, and we run each of our tests in a different environment so the tests don't interfere with one another and we can run them in parallel. The nice part is that because we are serverless, it does not cost us any extra money, except for Kinesis, which is quite a painful point: Kinesis requires at least one shard to operate, so we do pay for it. By using this parallel flow, we managed to reduce the integration test time from an hour and a half to around 40 minutes, roughly half the time. The nice part about it is that it's fully scalable: more tests just means adding more AWS environments.
Another change we made just recently is the ability to reuse existing stacks between deployments. We pin the latest hash of the repo as a tag on the CloudFormation stack, so when the uber deploy runs it checks whether the code changed, and if it didn't, it skips the deployment of that specific service. This reduced the redeployment time from around 30 minutes to 5 minutes. Right now our biggest obstacle is the initial deployment, which unfortunately still takes quite a lot of time, and we are trying to tackle that issue.
As well, another flow that we've been developing internally is a staging flow. One of our components is an SDK which is embedded by our customers. The SDK is very delicate, and a bug there means a Lambda will crash, so we wanted to deploy the SDK on our own system first. As I mentioned, we are dogfooding our own platform, so we created a flow that first releases an alpha version to npm (the alpha is not seen by our customers). It then deploys the alpha version to our environment and triggers a Step Functions workflow which, in case no issues are found in the staging environment, automatically releases the final version. At any moment we are able to stop the step function, like a red button, in case we find an issue in the SDK.
Integration metrics are important, and they give us the ability to pinpoint potential issues. We feed the results of our tests into Elasticsearch. At the top we can aggregate metrics like the number of failed or succeeded tests. The really interesting part is the distribution of failed tests: we can see a breakdown of the various tests according to the branches that ran them. If we see a test that fails in multiple branches, it's a hint for us that something is not working well in the test itself and it requires a fix.
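Going back to the stack-reuse optimization mentioned a moment ago: the idea is to pin the repo's git hash as a tag on the CloudFormation stack and skip services whose code did not change. A minimal sketch of that check, assuming boto3 and a hypothetical tag name (code_hash), could look like this:

```python
# A minimal sketch of the "skip unchanged services" idea: compare the current git
# hash of a service's repo with the hash pinned on its deployed CloudFormation
# stack, and only redeploy when they differ. The tag name is hypothetical.
import subprocess
from typing import Optional

import boto3
from botocore.exceptions import ClientError


def current_git_hash(repo_path: str) -> str:
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def deployed_hash(stack_name: str) -> Optional[str]:
    cfn = boto3.client("cloudformation")
    try:
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    except ClientError:
        return None  # stack not deployed yet
    tags = {t["Key"]: t["Value"] for t in stack.get("Tags", [])}
    return tags.get("code_hash")


def needs_deploy(repo_path: str, stack_name: str) -> bool:
    return current_git_hash(repo_path) != deployed_hash(stack_name)
```

An orchestrator could call needs_deploy for each service and redeploy only the ones that return True, which is what turns a full redeployment into a much shorter incremental one.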
The idea is that some branches might have bugs in them, which is why some tests might fail; but if the same test fails many times across multiple branches, it probably means the test itself has an issue. Another metric we gather is deployment problems. Deployments are not bulletproof: for example, a deployment might fail because we are trying to provision multiple EventBridge resources at the same time, and AWS has a limit on that. Although we do have internal retries, they don't always work, so this overview gives us visibility into the number of failures that happen due to deployment issues.
At the end of the day we clean our environments, for two main reasons: some services, like Kinesis, cost money, and an AWS account has limits on the number of resources you are allowed to provision. So cleaning is mandatory, and it's hard; we still haven't found a good process for it. We're using a combination of aws-nuke and the lumigo-cli right now, and it's very slow: it takes a couple of hours to clean an environment. Unfortunately, right now I don't have a good and fast solution for this problem.
So we are ready to release. We have a couple of release gates; some are manual, but most are automatic. Code review is manual, and the rest of the steps are automated. When the gates pass, the developer clicks merge and the deployment begins. Here you can see the various gates; we're using GitHub checks for gating. So that's it, everything is in production now. What's next? Monitoring.
Monitoring is hard in serverless for a couple of reasons. There are many microservices, and you need to see the full story, because the root cause might be somewhere down the stack in a different service. There is no server to SSH into, so it's hard to collect the relevant details that help you understand what's going on. And there are a lot of new technologies, where suddenly disk space and CPU don't play a role, so there are a lot of new metrics and new jargon to learn.
So we are using our own product, Lumigo, to monitor and to do root cause analysis. As I mentioned earlier, we are eating our own dog food, and I want to show you a quick demo of what Lumigo looks like and how we use it on a daily and weekly basis. Lumigo is an observability and troubleshooting platform aimed mainly at serverless workloads. It fuses data from multiple sources and combines it into a coherent view. Right now we use logs, specifically AWS CloudWatch, the AWS API, and an SDK that the user can embed with zero effort in their code, which collects more telemetry data from the compute instances.
We have two major production flows. The first is Slack alerts, which indicate that we have an issue we should handle immediately and which are monitored by an R&D developer. We define the alerts in the alert configuration, and when an alert is received you can route it to Slack, PagerDuty, or email; there are also a lot of other integrations you can work with. The second is a weekly meeting in which we go over the list of issues that occurred during the last seven days. So we have a real-time flow, where an issue arrives through Slack, and a weekly meeting where we go over the list of issues and try to better understand what happened and whether it affected specific customers we need to take care of. One of the nice things the Lumigo platform enables you to do is to drill down into each issue and better understand the entire story of what happened and why something ran the way it ran.
You're able to zoom into each one of your Lambdas and the resources it uses, and actually read the return value and the event details. You can see the external requests you are making against external resources: for example, here you can see a DynamoDB request, with the actual query and the response. And above all, we can detect issues in your Lambda and show the stack trace, so you can see the various variables that were defined and better debug the issues you've encountered.
So this is Lumigo. I want to quickly summarize what we covered. To summarize: use the power of serverless to parallelize your testing, which also reduces costs; prefer integration tests with real resources over mocks; and provide easy orchestration for your developers. Before finishing, I want to tell you about a nice open source tool developed by Lumigo, the lumigo-cli. It's a Swiss army knife for serverless with many useful commands, for example, commands that let you switch between AWS environments, clear your account, tail EventBridge, and the list goes on. Give it a try. That's it, thank you very much. Again, if you have any questions, you're more than welcome to email me or ping me on Twitter. Thank you very much, and bye.
...

Efi Merdler-Kravitz

VP R&D @ Lumigo



