Conf42 Machine Learning 2021 - Online

Monitoring AI Pipelines Output As Product


Abstract

I am part of a squad that is responsible for taking the AI engine insights and distributing them to our customers and in-house analysts. Our insights are the core of our product, so we need good visibility to identify patterns and to take action when we are not performing as expected.

In this talk I will share how we improved visibility into our products, and our quality, by monitoring the output of our ML pipelines. This was an iterative process, carried out by me and the Algo team, in which we added metrics, dashboards and alerts.

Summary

  • Hila Fox is a squad leader and a backend developer at Augury. Augury is a ten-year-old startup in the machine health business. It gives customers a full SaaS solution that helps them avoid unplanned downtime, which helps our customers reach resilient production lines.
  • We have over 50 million hours of machine monitoring and 80,000 machines diagnosed. We also have dozens of machine learning algorithms. 99.87% of the detections are not passed on from the detection management layer to the customers. This helps us give our customers only the relevant information they need to handle their machines.
  • We have two main usages for detections. Green is good, red is danger. Each time a detection is propagated out of the detection management layer, it reaches our vibration analysts, who decide whether it actually caused a change in the machine's health.
  • The detection management layer connects the AI engine to the customer-facing product. It's important for two main reasons: we have expected changes and unexpected changes, and our motivation is to avoid product issues, from simple bugs to bad deployments.
  • Using Graphite and Grafana we're able to visualize aggregated views of the state of our AI engine as a whole, and now we have a full view of the entire engine. We are currently considering changing our strategy by alerting on the percentage change in these numbers instead of monitoring absolute values.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, welcome to Monitoring AI Pipelines Output as Product. My name is Hila Fox and I'm a squad leader and a backend developer at Augury, currently leading a squad that is responsible for taking the AI insights from the engine and distributing them to different end products. Let's talk about the agenda. We're going to talk about Augury as a company and the product, about machine health AI and how we do it, about the detection management layer and why it's important, about monitoring and what we want to achieve with it, about the hybrid approach we took and about being proactive. And in the end, the conclusions.

First of all, let's talk about Augury. Augury is a ten-year-old startup in the machine health business, in the manufacturing industry. Like it says on the slide, the world runs on machines and we're on a mission to make them reliable. We do this by giving our customers a full SaaS solution that includes avoiding unplanned downtime, aggregated insights and a line of expert vibration analysts. This helps our customers reach resilient production lines, something we saw is very important during COVID: with the increase in demand, and also having no or fewer people on site, this became very, very important. We have a lot of customers which are enterprise companies, like P&G, Frito-Lay, Pfizer and Roseburg, and even if it's not written here on the slide, we also have Heineken and Sat, which keep our beer and our toilet paper coming even in these hard days. We are operating in the US, in Israel and in Europe, and we are expanding.

So how does it work? Augury's main flow starts from our IoT devices. We build our own IoT devices, and these are sensors. We have three types of sensors in these devices: vibration, temperature and magnetic field. We monitor our machines 24/7. Once the data is recorded, we pass it on to the cloud, into our AI engine, to get diagnosed. Alongside the AI engine, we also have a line of expert vibration analysts to give an even more precise diagnosis. Once we have the diagnosis, we need to visualize it and communicate it to the customers, and we do this via web, mobile, email, SMS and more.

So let's see how it looks. We can see here, in some manufacturing plant, three and a half pumps; the one at the back is a bit cut off, but we have four pumps here. Each of the pumps has our sensors installed on it, and the sensors communicate with what you can see at the upper left corner. This is also a device we develop; it's called a node. What it does is communicate with the sensors over Bluetooth, aggregate the information and send it to the cloud. This is a snapshot of one of the machines, and we can see that on each machine we have four sensors, two sensors for each component. In this picture we can see one component, which is a motor, and another component, which is a driven pump.

So, machine health AI. We talked about how we install on the machines and how we collect the data; now let's talk about the AI and the complexity in it. First of all, some numbers, because numbers help us understand the amount of data that we have and the complexity we are tackling. We have over 50 million hours of machine monitoring, 80,000 machines diagnosed, multiple customers with global expansion, tens of thousands of machines. We also have dozens of machine learning algorithms, which are based on time series, deep neural networks, MLPs, decision trees and more.
On top of that, we have three product squads, which develop the customer-facing products that use these insights, and also three algo squads and a data engineering team that work on the AI engine itself.

So, let's take a deeper dive into the whole AI flow. I'm not going to bother you with the specific calculations, but we collect 1.3 data points per hour per machine and send them to the cloud. In the cloud we first reach the transformation process. The transformation process is built from a validity algorithm, and afterwards it calibrates the data from electrical units to real physical units like acceleration, velocity and more. After we have calibrated our data, we pass it on to our feature extraction pipeline, which is also a model. This is actually a dimensionality reduction technique to capture the essential parameters for machine health. We collect roughly around 1,000 features per hour per machine, save them and pass them on to, of course, another machine learning algorithm, which is time series based. In these time series algorithms, we calculate features which are relative to themselves. Once we have this information, it is passed on to our ML platform.

The ML platform is designed with two major layers. The first one is for high recall, meaning we never want to miss a machine health issue; we want to always alert our customers when there is something going on with their machine. These are called anomaly detection algorithms. The other type of detectors we have is fault detectors, which detect specific faults that can happen on a machine. The anomaly detector is a semi-supervised machine learning algorithm which calculates a relative baseline for each machine. This is very important because it compares the data against the same machine's states, meaning we are finding anomalies per machine and not across our whole machine pool. The other detections, like we said, are fault detections, and these are specific faults identified by a specific signature which is correlated with the fault.

Each hour all of the detectors generate detections and output them to the detection management layer. Each detection has a confidence, and the confidence is pretty similar to the probability from the machine learning algorithm. Once it reaches the detection management layer, it gets handled there. What's amazing is that actually 99.87% of the detections are not passed on from the detection management layer to the customers, and this is amazing because it helps us give our customers only the relevant information they need to actually handle their machines. Some of the detections are passed on directly to the customers and some of them are passed on to our analysts. When our analysts add labels to the detections, we can afterwards use this to retrain our algorithms. And this is the big picture of the whole flow together.
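To make that handling step a bit more concrete, here is a minimal sketch, with made-up names and an assumed threshold rather than Augury's actual code, of a detection record carrying a confidence and a simple confidence-based filter of the kind the detection management layer applies:

```python
# A minimal sketch, not Augury's actual code: a detection record with a
# confidence score, plus a simple confidence-based filter. The threshold
# value is an assumption for illustration only.
from dataclasses import dataclass

MIN_CONFIDENCE = 0.8  # illustrative threshold, not a real production value


@dataclass
class Detection:
    machine_id: str
    detector: str       # e.g. "anomaly" or "bearing_wear"
    confidence: float   # roughly comparable to the model's output probability


def should_propagate(detection: Detection) -> bool:
    """Only detections we are confident enough in leave this layer."""
    return detection.confidence >= MIN_CONFIDENCE


detections = [
    Detection("machine-1", "bearing_wear", 0.97),
    Detection("machine-2", "anomaly", 0.12),
]
propagated = [d for d in detections if should_propagate(d)]
# In practice the vast majority of detections never make it past this layer.
```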
So we talked about the AI engine from the inside, but now let's see how we use it in the product, because this is very correlated with what I'm going to talk about as we move on with the presentation. We have two main usages for detections, and this is a real snapshot from the website, from a specific machine. We have two graphs here. The graph at the top, like its name says, is machine health events: events happening on the machine's lifeline as we go along, and it's pretty self-explanatory. Green is good, red is danger. We can see the gray circles on the graph, and each gray circle means something happened; in our case, those gray circles are detections. Each time a detection is propagated out of the detection management layer, it reaches our vibration analysts, who decide if it actually caused a change in the machine's health. Once the analyst says it does, it notifies our customers, and the customers can choose to take action accordingly. We can see here that a customer decided to perform a repair on the machine, and after the repair the machine health went back to green, so this is very good. The second graph we are seeing here is detector confidences over time. This is very interesting because I chose to show the bearing wear confidence output over time, and we can see it's very correlated with what's going on in the graph above: on each detection that is actually propagated to our customers, we can see an increase in the confidence of this specific fault of bearing wear, which means that with high probability this is what's going on in their machine. So we can notify them that there is something going on, but also give them a specific confidence for what the specific fault is.

So, the detection management layer. We talked about the AI engine and the overall flows, and we talked about how we use it in the product; now let's do a deeper dive into the detection management layer. It's important for two main reasons. The first one is that it connects the AI engine to the customer-facing product. Like this diagram shows us, we have the AI engine on the right, and it generates all of the detections going downstream to the detection management layer, which then distributes them to different end products. So we can even call it a single point of failure, and being confident that the detection management layer is working as we would expect it to is very important. It's also a very delicate area because it consumes from multiple producers and produces for multiple consumers. Yes, you got me, it's complicated. That's the point. Another important point is that it contains logic and makes decisions about where to propagate to. This makes the component very important in our flow, so we need to be confident in the changes we make.

As I said, we have two types of changes: expected changes and unexpected changes. Expected changes are new features, and we can mitigate the risks there by testing in a staging environment, or even running in dry run in production, using feature switches and writing to logs without making real changes that would affect our customers. The other type of changes we have, which is in my opinion a little bit more interesting, is the unexpected changes, all sorts of bugs. In the AI engine itself, we can enhance our visibility to see if there are things going on in the engine, but we also have the detection management layer, which runs logic and also consumes the information from the AI engine, so we can add metrics here as well.
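As one way to picture the "dry run in production" idea mentioned above, here is a minimal sketch, with invented flag names rather than Augury's actual implementation, of a feature switch whose new code path only logs what it would have done:

```python
# A minimal sketch of a feature switch with a dry-run mode: behind the
# switch, the new code path only logs its decision instead of changing
# real detection state. Flag names and function are illustrative.
import logging
import os

logger = logging.getLogger("detection_management")

# Simple environment-variable switches; real systems often use a flag service.
NEW_FILTERING_ENABLED = os.getenv("NEW_FILTERING_ENABLED", "false") == "true"
NEW_FILTERING_DRY_RUN = os.getenv("NEW_FILTERING_DRY_RUN", "true") == "true"


def apply_new_filtering_rule(detection_id: str, would_filter: bool) -> bool:
    """Return True if the detection should be filtered by the new rule."""
    if not NEW_FILTERING_ENABLED:
        return False
    if NEW_FILTERING_DRY_RUN:
        # Record what would have happened without affecting customers.
        logger.info("dry-run: would_filter=%s detection=%s",
                    would_filter, detection_id)
        return False
    return would_filter
```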
But what are we trying to achieve? Our motivation is to avoid product issues, from simple bugs to bad deployments. When I say bad deployments, I mean someone made a change and the change is valid, but something in the technicality of performing the deployment to production failed, and for some reason a detector stopped generating detections. This happens every now and then and just needs to be handled, so it's something we would want to know about. Changes in interfaces between squads are also a very important point, because as I said, we have three product squads, three algo squads and a data engineering team. This is a lot of people and a lot of communication that needs to happen, and it's natural that sometimes things won't be perfect. So we want to be on top of that and figure out changes before they make a big impact.

Another thing we would like to avoid is negative effects from configuration changes, and I'm going to explain this one using an example that actually happened to us after we started the monitoring initiative, which actually caught the issue. What happened is that our DevOps team made security changes, something that needed to be done, and two of our detectors stopped generating detections. Now, it's all good, stuff happens, right? But we need to figure it out very quickly. The detections stopped being generated, we got an alert, and then we just told them, hey, can you revert this change and please investigate how we can make this change again in the proper manner. That's what happened, and we figured it out very quickly. Another type of common production issue is making changes which you think are correct but which have effects you can't even imagine; especially in complicated systems like this, it's very hard to understand how a change is going to propagate. All of this can happen, and due to the nature of downstream flows, an error at the top of the funnel can cause major issues for several consumers, so it can affect a lot of products and a lot of customers.

So, monitoring. It's the moment you've all been waiting for. What do we want to achieve? First of all, we want good service and good support. It's the core of our product and we want to catch issues before our customers do. It can go either way: even if they didn't know an issue happened, we caught it beforehand, and even if a customer did notice the issue, we can already tell them we are handling it, which makes us look very professional. Also, we want to find issues as fast as possible. Sometimes nobody notices an issue until it's too late, right? So we want to be on top of things, because of how important it is to our product. We want to have consistent AI insights. The quality of our insights is very important; it's about giving our customers the consistency they expect. We want to find machine health issues, but also minimize the amount of false alerts we give them. We want to improve the collaboration between our teams. I've already mentioned this, but we've grown from eight people working on the diagnosis flow to seven squads. That is a lot of people and a lot of teams, and we need a way to improve our communication and enable fast response. Our top goal is actually to retain the trust of our customers. We want to give them a product where they know that when we raise an alert it's valid, and when we don't, it's all good.

According to the Google SRE book, there are two types of monitoring. We have white-box monitoring, which is based on metrics of internal state, CPU, memory usage and more. And we also have black-box monitoring, which tests externally visible behavior as a user would see it.
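A toy illustration of the difference, not taken from Augury's setup: a white-box check reads an internal metric of the process, while a black-box check probes the system the way a user would. The URL below is a placeholder:

```python
# Toy contrast between white-box and black-box checks (illustrative only).
import resource
import urllib.request


def white_box_memory_mb() -> float:
    # Internal view: peak resident memory of this process.
    # On Linux ru_maxrss is reported in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def black_box_probe(url: str = "https://example.com/health") -> bool:
    # External view: does the user-facing endpoint answer successfully?
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```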
So let's look at our use case. The title already gives it away, and so does the drawing: we are talking about a hybrid approach. Why is that? Because on one side, I don't necessarily want to know about each component in my system, whether its CPU is running low or it is out of memory. But on the other side, monitoring each end product by itself is just a piece of the puzzle, not the whole picture. So what can I do? The detection management layer is actually a consumer, or a customer, of the AI engine, so this is pretty similar to black-box monitoring, right? But it also executes product logic and decides on detection state, and this is very interesting too, because it affects what external users are seeing, so this is very similar to white-box monitoring. What we decided is to merge the two ideas together and monitor an internal product process that also makes decisions about how external customers get this information.

This led us to believe that there are patterns we can commit to, and they are very related to the product. We saw this example earlier with the two graphs, the machine health events and the detection confidences over time: we can commit to the amount of detections that are going to be propagated to users, not for a specific machine, but statistically propagated to users from our pool of machines, also taking into consideration the ways they can be filtered in the detection management layer. Another thing we can commit to is the amount of detections being generated and sent to the detection management layer, per detector and in general.

This led us to understand that we actually have a detection lifecycle. The detection lifecycle is what a detection goes through in the detection management layer. It first reaches the detection management layer, and afterwards it's either filtered by the detection confidence, meaning we are not confident enough in this specific detection and don't need to propagate it to our users, or, even if the detection confidence is high enough, we might want to filter it due to the machine state, maybe because we already alerted the user on this machine and don't need to put another alert on it, and in the end it can be propagated to the customers. These are the states we have for a detection, and this led us to add metrics on the detection lifecycle. Using Graphite and Grafana we're able to visualize a lot of aggregated views on the state of our AI engine as a whole. In this graph we can see the amount of detections coming into the detection management layer daily, per detector, and this really gives us a full understanding of the differences between them. This is another very interesting graph, because we can see here the differences between each step in the detection lifecycle, again not for a specific machine and not for a specific detection, but in general, showing how our system behaves.
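One common way to get lifecycle counts like these into Graphite is to emit counters to a StatsD daemon that feeds it; the sketch below assumes that setup and uses invented metric names and hosts, not Augury's real ones:

```python
# A minimal sketch of emitting detection-lifecycle counters so they can be
# graphed and alerted on in Graphite/Grafana. Uses the `statsd` Python
# client; host, prefix and metric names are assumptions for illustration.
import statsd

metrics = statsd.StatsClient("statsd.internal", 8125, prefix="detections")


def record_lifecycle_event(detector: str, stage: str) -> None:
    """stage: received, filtered_confidence, filtered_state or propagated."""
    # Produces series such as detections.bearing_wear.propagated
    metrics.incr(f"{detector}.{stage}")


# Example usage inside the detection management layer:
record_lifecycle_event("bearing_wear", "received")
record_lifecycle_event("bearing_wear", "filtered_confidence")
```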
So let's get proactive. Once we had all of these aggregated views and knew what our data looks like, we could use Grafana alerts to set up alerts and know when something is not working as expected. What we did is decide on the first four alerts. I chose a detector, the bearing wear detector, and decided on four alerts that we would like to monitor, meaning four patterns we would like to commit to. The first one is the amount of detections arriving at the detection management layer, not too many and not too few. The other alerts were about the amount of detections being filtered, and I also added not too many detections being propagated and not too few. Once we had all of this running, I set up a Slack channel called detections-monitoring and started getting these alerts. Now, it took some time, because we needed to tweak the values: we chose really simple absolute values to base our alerts on, so it was really noisy while we were understanding the different behaviors of the detector, but it did mellow down. And just as a sort of FYI, we are talking right now about changing our strategy here, maybe moving to alerting on the percentage change in these numbers instead of just monitoring absolute values. This is also very interesting, but out of scope for this talk.

After the bearing wear alerts were stabilized, I created a workshop, and together with the whole algo team we added dashboards and alerts for all of our detectors. Now we have a very full view of our entire AI engine, including the detection management layer, all of the detectors, all of the pipelines and everything you can imagine. Everything is in there in one place, because we have alerts that indicate a working or not-working state at a very high level, in terms of how our customers would expect to get this as a product. This is an example of one of the graphs. It's a consistent detections generation graph and it's pretty straightforward. We have the red line which indicates the alert threshold; there's also one at zero, so we can't see it, but it's there, I promise you. Another very interesting point here is the purple, barely visible dotted line we have over here, which carries a deployment tag. Tags are a feature that Grafana enables: you can use its open API and each time send a tag with extra information on it, and it gives you a point in time that you can overlay on your graphs. So what you actually see in this purple line is a deployment tag: for each service and each component we have in our system, we added a deployment tag that is created when we deploy to production. On this deployment tag we have a git hash, the name of the person who did the deployment and the name of the service they deployed. So when you have a very complicated system that keeps on having different components deployed to it, all of which can affect the downstream flow, we can use this to have a quick way to identify what change was made and just ping the person: hey, I saw you made this change, I see this detector stopped generating detections, can you please take a look? This is very powerful.

So, in conclusion: keep the customers in the center, whether they're internal or external; internal teams consume products from each other too. It's not about having a zero-bugs product, it's about fast response. To move fast, we need high confidence in our process, and having an easy way to communicate across teams is crucial. Thank you, I hope you enjoyed it. And if you have anything to add or say or ask, feel free to contact me.
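As a rough sketch of the deployment-tag idea described above: Grafana's annotations HTTP API accepts a POST with a timestamp, tags and text, so a deployment script can mark each deploy on the dashboards. The URL, token and field values here are placeholders, not Augury's real setup:

```python
# Sketch: post a deployment annotation to Grafana so deploys show up as
# tagged markers on the monitoring dashboards. Values are placeholders.
import time
import requests

GRAFANA_URL = "https://grafana.internal/api/annotations"
API_TOKEN = "..."  # a Grafana API key or service-account token


def tag_deployment(service: str, git_hash: str, deployed_by: str) -> None:
    payload = {
        "time": int(time.time() * 1000),  # epoch milliseconds
        "tags": ["deployment", service],
        "text": f"{service} deployed by {deployed_by} ({git_hash})",
    }
    resp = requests.post(
        GRAFANA_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()


# Example (hypothetical values):
# tag_deployment("detection-management", "a1b2c3d", "hila")
```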
...

Hila Fox

Squad Leader @ Augury

Hila Fox's LinkedIn account


