Conf42 DevOps 2023 - Online

AI-driven DevOps CI/CD pipelines


Abstract

Ensuring reliability within deployment pipelines for complex systems dealing with massive flows of real-time data can be a challenge. However, by enabling Observability across our E2E infrastructure and combining AI techniques with SRE best practices, we can successfully prevent faulty deployments. Everyone knows that automated testing is at the core of any DevOps pipeline in order to enable early detection of CI/CD platform overload. However, this does not apply to large systems dealing with massive flows of real-time data, where failures are often extremely hard to identify due to the infrastructure's complexity, especially in multi-hybrid cloud environments.

Imagine this: you're a DevOps Engineer at a major tech giant and you are responsible for the overall system health, which is running in prod. Numerous alerts, server crashes, Jira tickets, incidents and an avalanche of responsibilities, which sometimes simply feel like a ticking time bomb. Furthermore, your production environment is constantly being updated with various deployments of new code. While automated tests in CI/CD pipelines make sure that the new code is functioning, these do not cover all the dark corners of your end-to-end deployment platform. Imagine, for instance, that the building/testing/deployment service of your CI/CD platform is experiencing performance issues (or worse) at the exact same time when you are deploying a brand new feature into live code. How do we prevent a disaster? And what if we are in a rush for a hot-fix?

The first step is to establish observability, which focuses on enabling full-stack visibility of your environment. This enables end-to-end insights, which provide transparency of all your internal components and help you understand how these interact with each other, as well as how they affect the overall system. Unlike traditional monitoring approaches, the aim is to understand why something is broken, instead of merely focusing on what is broken. Ideally, we want to shift the paradigm from reactive to predictive. Once observability has been enabled across our end-to-end environment, we employ AI/ML techniques in order to pre-emptively alert, as well as prevent any form of system degradation. We combine these strategies with SRE methodologies (e.g. SLO, error budget) when measuring our system's overall health. This type of approach provides an additional layer of reliability to DevOps pipelines. For this session, we have prepared and analyzed several use cases, followed main principles, summarized best practices and built a live demo through a combination of Observability and continuous deployment tools.

Summary

  • Video was prepared by Francesco and myself. We're both part of an Accenture group of highly motivated SREs specializing in state-of-the-art SRE and DevOps practices. Our goal is to promote a growth mindset that embraces agility, DevOps and SRE as the new normal.
  • Francesco Sbaraglia is working as SRE and AIOps tech lead EMEA at Accenture. He has over 20 years of experience solving production problems in corporate, startup and government environments. We will have a look at the classic continuous delivery pipeline architecture, then move to the AI-driven approach.
  • SRE and DevOps need a CI/CD platform to deploy hotfixes and bugfixes during outages. Most of the time the CI/CD platform is a black box. We wanted to understand what is happening inside and how we can improve all processes.
  • This is the architecture for our CI/CD flow integrated with the observability platform. In the upcoming slides we will explain more about the observability platform and how we're using the OpenTelemetry standard to collect data from our Jenkins pipelines.
  • These are the KPIs that we want to derive from the data we're collecting with OpenTelemetry. By measuring deployment speed, we also improve our mean time to recovery. The next KPI is the build/test success rate. The last one we see on this slide is the availability of our CI/CD platform.
  • OpenTelemetry is an open-source standard for generating and capturing traces, metrics and logs. The ultimate goal is to derive actionable insights from the data, which basically translates to understanding how the various components in our systems are connected to each other.
  • We are using the OpenTelemetry collector agent to collect all the logs, metrics and traces from the CI/CD pipeline. We are going to run a machine learning prediction. Correlation and prediction will try to understand if there will be a disruption and to predict the service health for the next half hour.
  • We are running Jenkins inside our Kubernetes cluster. Using OpenTelemetry, we capture this complexity dynamically. Here you can see a deep dive into all stages: we can click inside and understand what is going on.
  • All of the metrics that you saw in Splunk Observability have also been integrated into Splunk IT Service Intelligence. The idea is that all of these KPIs contribute to the score shown at the top, which is currently 100 and therefore healthy.
  • We're talking about a critical platform that requires end-to-end monitoring. We applied AI to all this data in order to create predictions and derive actionable insights. Start simple and scale fast.
  • Well, it seems it's time to close the curtains. I really hope you got something from this session. Until next time. Cheers.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and thank you for watching today's session on AI-driven DevOps CI/CD pipelines. This video was prepared by Francesco and myself. We're both part of an Accenture group of highly motivated SREs specializing in state-of-the-art SRE and DevOps practices, with the goal of promoting a growth mindset that embraces agility, DevOps and SRE as the new normal. I myself have a background in software engineering, AI and industrial automation, and I specialize in several SRE-related topics such as AIOps, observability and chaos engineering.

Thank you very much, Michele, and welcome everybody. I'm Francesco Sbaraglia and I'm based in Germany. I'm working as SRE and AIOps tech lead EMEA at Accenture. I have over 20 years of experience solving production problems in corporate, startup and government environments. Furthermore, I have deep experience in automation, observability, multicloud and chaos engineering. I'm currently growing the SRE DevOps capability at Accenture. Let's have a look at the agenda for today. First we will look at why we need to monitor and observe a CI/CD delivery pipeline. Then we will look at the classic continuous delivery pipeline architecture. We will refresh observability and how to use OpenTelemetry in this case. Then we move to the AI-driven approach: we will explain what we are doing and how we are using AI. We will have a demo, and Michele will close with the conclusions and takeaways.

Let's move now to the first point: why monitor a CI/CD pipeline and platform? You can imagine that the CI/CD platform is really critical. Let's see our key challenges. SRE and DevOps need a CI/CD platform that is reliable and has predictable failures. SRE and DevOps need a data-driven approach to run proactive capacity management, knowing when and how to increase resources. SRE and DevOps need a CI/CD platform to deploy hotfixes and bugfixes during outages. Now you can imagine how critical a CI/CD platform becomes. Most of the time the CI/CD platform is a black box, so we wanted to understand what is happening inside, how we can improve all processes and also how we can improve each step of the CI/CD pipeline.

Thanks, Francesco. Now that we have introduced the key motivation for today's talk, I suggest we dive into the first major topic, which is CI/CD. What we see here is the architecture for our CI/CD flow integrated with the observability platform. Let's start from the top left. Imagine a DevOps engineer who is writing some new code and, when he's done, commits it to the Git repository. Subsequently, this commit will trigger our CI/CD pipeline. As you can see, we are using Jenkins, which after the commit will automatically initialize the pipeline. Specifically, Jenkins will initialize the dynamic agent pod, which we also see here and which depends on the resources that Jenkins takes from the Kubernetes cluster. In the next step, the changes will be deployed to the production environment, ultimately reaching our end users. So far, everything I've talked about is pretty much standard, I would say. But what we're really interested in is how we integrated the pipeline with our observability platform. In the upcoming slides we will explain more about the observability platform and how we're using the OpenTelemetry standard to collect data from our Jenkins pipelines.
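As a brief illustrative aside on that instrumentation: the talk relies on Jenkins' own OpenTelemetry integration, but the span-per-stage idea it builds on can be sketched in a few lines of Python. Everything in the sketch below (the stage names, the commit attribute key, the console exporter) is an assumption for illustration, not the speakers' setup.

```python
# Illustrative sketch only: the talk instruments Jenkins itself via OpenTelemetry;
# this Python example just shows the span-per-stage concept with assumed names.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# In a real setup an OTLP exporter would send spans to the OpenTelemetry Collector;
# a console exporter keeps the sketch self-contained.
provider = TracerProvider(resource=Resource.create({"service.name": "ci-pipeline"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ci.pipeline")

def run_pipeline(commit_sha: str) -> None:
    # One parent span per pipeline run, one child span per stage.
    with tracer.start_as_current_span("pipeline-run") as run_span:
        run_span.set_attribute("vcs.commit.sha", commit_sha)  # assumed attribute key
        for stage in ("allocate-agent", "checkout", "build", "test", "deploy"):
            with tracer.start_as_current_span(stage):
                pass  # placeholder for the real stage work

if __name__ == "__main__":
    run_pipeline("abc1234")
```

In a real deployment the console exporter would simply be swapped for an OTLP exporter pointing at the collector described later in the talk.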
But right now what we're interested in are the KPIs that we see on the right-hand side, which we want to derive from the data we're collecting with OpenTelemetry. So I suggest we analyze them one by one. Let's start with the first one, which is the speed of the CD pipeline. I want to stress the CD part here, as we're only talking about the deployments. Now consider the situation in which an outage occurs and the DevOps engineer needs to deploy a hotfix as soon as possible. We are talking about a critical situation, because the system is unstable and the end user is expecting a fix; otherwise there is a risk that they might abandon it, and that's something we want to avoid. So as you can see, by measuring the deployment speed we also improve our mean time to recovery: the quicker I manage to deploy the hotfix, the lower my MTTR will be. The next KPI that we're deriving is the build/test success rate. This one gives us an indication of how many core tests or unit tests are successfully completed once deployed to production. Furthermore, we also want to count the total amount of deployments we have each month, per pipeline and possibly per application. We want to know the lead time for a change or a deployment. This one is relevant for identifying the time between the moment in which the DevOps engineer executed a commit and when the deployment actually took place. Furthermore, we're also measuring the change success rate, which corresponds to the total number of successful deployments divided by the total. Speaking of which, some of these might fail, which is why we want to identify the success rate. And the last one we see on this slide is the availability of our CI/CD platform, which is obviously a critical KPI. Now, coming back to the big picture of this architecture: the benefit of collecting this kind of data and feeding it into the observability platform is that the DevOps engineer who does the deployments doesn't even need to use Jenkins to monitor the state of his deployments, since all of this data will be readily processed and available in the observability platform. So it's there that our DevOps engineer will go and check the status.
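A quick aside on the KPIs just listed: most of them reduce to simple ratios and durations over deployment records. Here is a minimal sketch, assuming a made-up record format rather than the speakers' actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Deployment:
    committed_at: datetime   # when the developer pushed the commit
    deployed_at: datetime    # when the change reached production
    succeeded: bool

def lead_time_for_change(deploys: list[Deployment]) -> timedelta:
    """Average time from commit to production deployment."""
    return timedelta(seconds=mean(
        (d.deployed_at - d.committed_at).total_seconds() for d in deploys))

def change_success_rate(deploys: list[Deployment]) -> float:
    """Successful deployments divided by total deployments."""
    return sum(d.succeeded for d in deploys) / len(deploys)

# Tiny usage example with fabricated timestamps.
now = datetime(2023, 1, 1, 12, 0)
history = [
    Deployment(now, now + timedelta(minutes=18), True),
    Deployment(now + timedelta(hours=2), now + timedelta(hours=2, minutes=35), False),
]
print(lead_time_for_change(history))   # ~0:26:30
print(change_success_rate(history))    # 0.5
```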
Okay, speaking of observability, to introduce this concept: observability is the measure of how well the internal states of a system can be inferred from knowledge of its external outputs. This basically translates to understanding how the various components in our systems are connected to each other, and we pose questions such as: what are these dependencies? What are their dependencies? How do they work together? In other words, we are introducing transparency and visibility across our entire end-to-end system. Why? Because our ultimate goal is to derive actionable insights from it. Here are a couple of notions relevant for observability. We have different sources of data, illustrated in the center figure, that we need to observe. As we mentioned, observability is the ability to measure a system's current state based on the data that it generates, such as logs, metrics and traces, the so-called golden triangle of observability, on the left-hand side, while on the right-hand side we see the golden signals, which consist of latency, traffic, errors and saturation. Okay, before we move on to the AI side of things, it's also really important to understand how we collect all of this data that we just mentioned. We are using OpenTelemetry, which is an open-source standard for generating and capturing what we see here: traces, metrics and logs, so basically our golden triangle. On the right-hand side you see a diagram illustrating how exactly OpenTelemetry works, so I suggest we quickly analyze it.

Let's start from the top, where logs, traces and metrics are generated, in our specific case by the pod and the container. From the diagram we see that at this point we're talking about raw data. In the middle we have the OpenTelemetry Collector, which enriches and processes all of this data. Specifically, this enrichment process occurs in a completely uniform manner for all three signals: basically, the collector guarantees that the signals have exactly the same attribute names and values describing the Kubernetes pod that they come from. So what happens in our case is that the exporter takes all of this data from Jenkins, so we're talking about system logs, app logs, traces, metrics, et cetera, correlates all of it and forwards it to our backend. The key benefit of this approach is that we don't have to install anything on each single Jenkins agent. Instead we simply attach OpenTelemetry to the Jenkins master, and this ensures that we receive all three signals with only one agent.

Thank you very much, Michele. Now let's have a look at the AI-driven approach. What you can see is that we have three different stages: observe, engage and act. In observe, we are using the OpenTelemetry collector agent to collect all the logs, metrics and traces from the CI/CD pipeline, but also from the underlying platform. In engage, we are going to run a machine learning prediction, so we are going to try to understand and correlate whether there is a disruption. You see here that we have two different KPIs: the first one is CI/CD reliability, the second one is CI/CD pipeline end-to-end. So we are running a smoke test every 15 minutes that exercises all stages of a fake pipeline. Then we move on to correlation and prediction, which will try to understand if there will be a disruption and to predict the service health for the next half hour. This then lands in predictive alerting, so we send an alert before something happens, or in zero-touch operations, also called zero-touch automation. In this case we have self-healing: scripts run automatically to fix the CI/CD pipeline and platform. In this case we can also increase performance, increasing the resources that we have in our CI/CD platform to prevent any disruption that could happen.

Okay, this is a funny slide. What we did was try to understand whether our SLIs are correct. We asked ChatGPT if it could give us a couple of SLIs, and you can see here what we selected at the end. The first one is the build success rate, really interesting because it's also one that we are using. Then we have the build lead time; you can imagine what kind of measure this is, and it's really important for our SRE and DevOps teams. We have the test success rate, the deployment success rate and the deployment lead time, a really interesting one because it's also the one we use in our prediction: the machine learning will run and try to understand if there are correlations with the other events. And then what we are using here is also the change failure rate. We will see this running automatically in the next part, because we will see it in a demo.
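The CI/CD pipeline end-to-end KPI that Francesco describes comes from exactly this kind of recurring smoke test. Below is a hedged sketch of that pattern in Python: a synthetic check runs on a fixed interval and records each outcome as an OpenTelemetry counter. The metric name, the scheduling loop and the console exporter are illustrative assumptions, not the demo's implementation.

```python
# Sketch of a recurring smoke test that records results as OpenTelemetry metrics.
# The 15-minute interval mirrors the talk's description; the rest is assumed.
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("cicd.smoke")

smoke_runs = meter.create_counter(
    "cicd.smoke_test.runs", description="Smoke test executions by outcome")

def run_smoke_test() -> bool:
    """Placeholder for a fake pipeline exercising all stages."""
    return True  # in reality: trigger the fake pipeline and check its result

if __name__ == "__main__":
    while True:
        ok = run_smoke_test()
        smoke_runs.add(1, {"outcome": "success" if ok else "failure"})
        time.sleep(15 * 60)  # one end-to-end check every 15 minutes
```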
Now let's move on. Our demo has two parts. In the first part I will explain what we are doing with observability, using OpenTelemetry in a Jenkins pipeline, what kind of insights we get, and how to use them for the machine learning part. Then, in the second part of the demo, Michele will explain what we can do with AI.

Okay, let's first have a look at our pipeline. As you can see, we have our smoke test pipeline. It runs every 15 minutes and covers all stages. Here everything is green, so let's have a look at something that is broken. At the end we have one run that was broken. We click inside and have a look at our observability. As I mentioned, we are using OpenTelemetry, so we capture this complexity dynamically; let's have a look inside. All of this is collected automatically. This is a trace. For the first stage of our pipeline we see the name of the pipeline and we see the running time, which is around 1.4 minutes. We can also click on the span and look at all stages in the waterfall view, what kind of progression they had and when they stopped. It's really interesting because you see from the start that the first stage is about the agent. We are running Jenkins inside our Kubernetes cluster, so you can imagine that every time the agent starts, a new pod is allocated, and this new pod will run the Jenkins agent. The first thing the Jenkins agent does is this allocation: we request resources from Kubernetes and start a new pod. Of course, you can imagine that if we have a problem with our Kubernetes cluster, if the cluster is saturated, then this will take longer. So here we can already have a look, we can jump inside and try to understand if there are problems. In this case we see a checkout: the download of our Jenkinsfile, which is the description of all steps. We see that it is getting downloaded and consuming around 16 seconds, so we can already make some improvements. Then the script compiles and deploys our YAML file, the one we want to deploy to our Kubernetes cluster in prod; it's taking 47 seconds. And here you can see the deep dive into all stages: we can click inside and understand what is going on. Then there is the build of a new version. We are building a new version for production that will contain our source code, and in this case we get a build at the end. Then we placed this gate because, as SREs, we want to understand if we already have a problem in production and have already consumed our error budget. If there is no budget left, we have a block, and that is the case here: we cannot deploy to production because our error budget has already been consumed. Another interesting part is the performance summary that you can see on the right side. We see the wall time required to run this pipeline, and there is only one contributor, which is the application. Imagine that we had a database or other third-party software that we needed to connect to here; then you would have seen something like network, compute and database. We can jump now to the graphical overview. Using OpenTelemetry, we also get some interesting metrics from the Jenkins pipeline out of the box. The first one is the request rate. It's really important for SRE because we want to understand when there is a peak in requests and whether we need to increase some resources. The second one is the request latency, which tells us what kind of problem we might have on this Jenkins master.
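A short aside on the error-budget gate shown a moment ago: conceptually it only needs the SLO target and the failures observed in the current window, and it blocks the deployment once the budget is gone. The following is a minimal sketch with made-up numbers, not the gate used in the demo.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still available in the current window."""
    allowed_failures = (1.0 - slo_target) * total_requests   # budget expressed in requests
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / allowed_failures)

def deployment_gate(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Return True if the pipeline may deploy to production."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0

# Example with made-up numbers: a 99.9% SLO over 1,000,000 requests allows
# 1,000 failed requests; 1,200 observed failures exhaust the budget and block the deploy.
print(deployment_gate(0.999, 1_000_000, 1_200))  # False -> block, as in the demo
print(deployment_gate(0.999, 1_000_000, 300))    # True  -> deploy allowed
```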
Further down we can also see the error rate: if we have errors on the API, we can immediately create an alert. Another interesting part is the overview that we have with the APM. Here you see the whole overview of our application. We are going to select only Jenkins; in this case we filter only on Jenkins, as we are interested in this service, and we select a window of one day because we also want to catch the problem that we had. Another view that we see here shows the problem on the timeline, and in this case there is one of these problems in this run of the pipeline. As I mentioned before, we can also do a deep dive into another run; you see that in this case the run took 1.86 seconds. We can also try to understand from a different run whether something changed, and we can compare all of them. This is also really interesting about observability: our metrics, of course, are not standalone. They will be used mainly for debugging, but Michele will tell us what we can do with AI.

Thanks, Francesco, for the observability preview. Now it's time to jump to the AI part. All of these metrics that you saw in Splunk Observability, which Francesco showed, have also been integrated into Splunk IT Service Intelligence, which is what you're seeing here and which we're using as our AI platform. What you're seeing on this page is the service tree overview, which gives us insights into the structure of the CI/CD platform service. I suggest we look at a couple of these services just to get an idea. We have Vault here, which is used for secret management. We've got the Kubernetes cluster service mapped which, if you remember, we actually saw in the CI/CD architecture slide. We've also got GitLab CI/CD; however, this is out of scope for today's demo. And we finally get to our Jenkins CI/CD service node. If we look under it, we have two other nodes, which are the Jenkins end-to-end and the Jenkins reliability services. If I click into one of these, a drill-down will open on the right-hand side with a list of KPIs related to this service. And if you look at these KPIs, you'll actually notice that these are some of the ones we already saw when we were analyzing CI/CD, specifically when we were looking at its architecture. We listed some of the KPIs, and for this demo we implemented a few of them. Keep in mind that this data comes from the Splunk Observability platform via the OpenTelemetry collector. The idea is that all of these KPIs will contribute to the score that we see up here, which is currently 100, therefore healthy. So this was a short overview of the service decomposition, but I suggest we move now to the actual AI part.

So I go here into the predictive analytics section. This is, as I said, the predictive analytics feature, and I will use it to train a model based on my service and its KPIs. I will be using data from the last 14 days, and since what I want to predict is the service health, that is, whether it will be low, medium, high or critical, I will use the random forest regressor. So I choose random forest regressor here. The split is 70/30, which is fine, and I click on train. This is now training the model based on my Jenkins reliability service, which I've selected, and its KPIs. This might take a couple of minutes.
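While the model trains, it may help to see what the same idea looks like outside of ITSI. The scikit-learn sketch below trains a random forest regressor on synthetic KPI samples with the same 70/30 split and then predicts a health score for the next window; the feature columns and data are assumptions for illustration, not the ITSI model from the demo.

```python
# Hedged sketch: a RandomForestRegressor predicting a service health score from
# CI/CD KPIs, mirroring the 70/30 split used in the demo. The KPI columns and the
# synthetic data are assumptions for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 14 * 24 * 4  # roughly 14 days of 15-minute samples, as in the talk's training window

# Assumed features: build success rate, deployment lead time (minutes), error rate.
X = np.column_stack([
    rng.uniform(0.8, 1.0, n),
    rng.uniform(5, 60, n),
    rng.uniform(0.0, 0.05, n),
])
# Synthetic target: a health score (0-100) loosely driven by the KPIs above.
y = np.clip(100 * X[:, 0] - 0.2 * X[:, 1] - 400 * X[:, 2] + rng.normal(0, 2, n), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("MAE on the 30% test split:", mean_absolute_error(y_test, model.predict(X_test)))
# "Predict" the health score for the next 30 minutes from the latest KPI sample.
print("Predicted next health score:", model.predict(X[-1:]))
```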
Now we see that our model is ready; it has been trained but also tested. Let's quickly look at the test results. Specifically, I want to see how my model performed on the test set. If I scroll here to the bottom I see different analyses, but what I'm really interested in is the predicted average versus the predicted worst-case health score. We see that in both cases it's 100%, which is okay; but if this were to drop below a threshold, alerts would fire and an action would have to take place. This is something we also implemented, but for today's showcase it is out of scope. I will save this model, and now I can finally use it on actual real data. So I again choose the Jenkins reliability service, and now it's loading the model for this service; I again select the random forest regressor. While I'm waiting for the results, let's just understand what I'm trying to calculate here: the service health score for the next 30 minutes. And this is the output that we get. Basically, from this point on, the prediction runs automatically.

Okay, let's summarize everything we learned today. We identified different challenges that come with CI/CD pipelines and therefore concluded that we're talking about a critical platform that requires end-to-end monitoring. Talking about end-to-end monitoring, we also introduced the concept of observability, which is necessary in order to bring full transparency and visibility across our entire infrastructure. In order to bring the data from our CI/CD platform into our observability platform, we made use of OpenTelemetry, an open-source standard used to collect telemetry data, which is relevant in order to ensure reliability. Furthermore, we identified some relevant KPIs that help us derive the state of our CI/CD platform. And finally, we applied AI to all this data in order to create predictions and derive actionable insights. Why? Because our ultimate goal is failure prediction, which refers to the use of historical data in order to preempt a failure before it actually occurs. And a final takeaway I would like to point out from today's session: start simple and scale fast. Perhaps you don't know where to start. Well, maybe start with a simple experiment, see how the system reacts, see how it goes, and as you proceed you can scale, building more and more on top of that. Well, it seems it's time to close the curtains. Thanks a lot for watching. I really hope you got something from this session. Until next time.

Michele Dodic

SRE DevOps Specialist @ Accenture

Michele Dodic's LinkedIn account

Francesco Sbaraglia

SRE Tech Lead ASG @ Accenture

Francesco Sbaraglia's LinkedIn account


