Conf42 Cloud Native 2023 - Online

Anomaly Detection with Apache Pinot & ThirdEye

Abstract

Do you remember the last time you faced an anomaly in your data and could not explain why it happened? Even worse, it took a long time until you found out. In this session, I will talk about the problems of anomaly detection and how you can mitigate them by implementing Apache Pinot with the ThirdEye add-on. ThirdEye monitors your data in near real-time, detects anomalies, and gives insights and interactive root-cause analysis into why they are happening. I will explain the technology and what makes it unique, and show a quick demo at the end of the lecture using live data. I wish to show other professionals in the data engineering field how to leverage this open source technology to their advantage in order to achieve a high standard of data integrity and validation.

Summary

  • Yoav Nordmann talks about anomaly detection with Apache Pinot and ThirdEye. At a previous company, a problem went unrecognized for 12 hours, after which the team lost about $10,000 of billing data. The technology presented in this talk could have caught the problem immediately.
  • StarTree offers Apache Pinot and ThirdEye as a SaaS solution. ThirdEye is an anomaly detection, monitoring and interactive root-cause analysis platform. Apache Pinot is a real-time distributed OLAP data store, purpose-built to provide ultra-low-latency analytics.
  • ThirdEye uses alerting templates to create alerts; an alert is an anomaly detection rule configuration. There are multiple detector algorithms which can be used. ThirdEye is also a root-cause analysis platform.
  • In the demo, Apache Pinot and StarTree ThirdEye run on a Kubernetes cluster. ThirdEye queries the data ingested into Pinot every minute and checks whether there are anomalies in the incoming data.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to my lecture, Anomaly Detection with Apache Pinot and ThirdEye. By the end of this lecture, I want you to be able to solve this, well, this, or at least recognize it as soon as possible. So what happened was I was working at a big data company, an ad tech company, and we were dealing with 1 million events per second. All I needed to do was add a new parameter to the data. The problem was that there was a misconfiguration, or a misunderstanding, about the data type. See, it was sent as a string, but I implemented it as a long value, and this was only relevant for mobile phones of type iPhone v7. Needless to say, this problem went unrecognized for 12 hours, after which we lost about $10,000 of billing data, and we then wasted five days trying to remedy the lost data, to no avail. What I'm talking to you about today could have saved all of this mess, because we would have recognized the problem immediately.
Let me introduce myself. My name is Yoav Nordmann. I am a technology enthusiast, and I love working with new and emerging technologies. You can call me either a nerd or a geek, or both; it is definitely a compliment. At clients, I usually work as a tech lead or architect, and at Tikal, the company I work for, I am a group leader and mentor for fellow workers in the backend.
So let's start and talk about anomaly detection. But in order to understand what anomaly detection is, let's first try to define what an anomaly is. An anomaly is defined as a deviation from the common rule: something different, abnormal, peculiar, and even not easily classified. Now that we have an understanding of what an anomaly is, we can define anomaly detection: anomaly detection is understood to be the identification and/or observation of data points and events that deviate from a data set's normal behavior. Simple enough, right?
So what is the problem? Well, the problem is quite complex. Let's say an issue occurs, just as one occurred at the company I was working for. Many, many times, an issue just ends up in a black hole, because nobody even recognizes it. Let's say somebody actually recognizes there is an issue and invests time and effort to research the problem. He may be able to identify the issue but still not find the root cause, so again it will end up in the black hole, and it might occur again. Only once people identify the root cause can we actually fix the issue.
So what we are trying to do today with Apache Pinot and ThirdEye is, first of all, eliminate the black hole: no more issues occurring undetected, and no more identifying an issue without being able to get to the root cause. Second of all, we are trying to reduce the time to detect, and we are also reducing the time to resolution.
So let me take you on this journey through the twilight zone, where the third eye rules, and let's talk about StarTree's ThirdEye. What is ThirdEye? ThirdEye is an anomaly detection, monitoring and interactive root-cause analysis platform. Remember I said StarTree ThirdEye. So what is StarTree, or rather, who is StarTree? StarTree is a company, and they are the ones offering Apache Pinot and ThirdEye as a SaaS solution. But Apache Pinot, of course, is open source and can be used freely.
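Before we go deeper into the platform, it may help to make that definition concrete. Below is a minimal sketch in plain Python of the idea behind detecting "deviation from normal behavior": flag any point that strays too far from the recent baseline of a series. The window size and threshold here are arbitrary choices of mine, not anything ThirdEye prescribes.

```python
import statistics

def detect_anomalies(values, window=24, num_stddevs=3.0):
    """Flag points deviating from the trailing window's mean
    by more than num_stddevs standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.fmean(baseline)
        stddev = statistics.pstdev(baseline)
        if abs(values[i] - mean) > num_stddevs * stddev:
            anomalies.append((i, values[i]))
    return anomalies

# A steady hourly metric with one sudden spike:
series = [100, 102, 99, 101, 98, 103, 100, 97, 250, 101]
print(detect_anomalies(series, window=5))  # -> [(8, 250)]
```

Real detectors also have to handle seasonality, trends and missing data, which is exactly what the platform's detector algorithms, described later, take care of.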
ThirdEye itself, by contrast, is not open source; it is given out with a community license, which means you cannot take ThirdEye and build a SaaS product with it. Other than that, you are allowed to use it at your own discretion and enjoy the full benefits of this great platform.
If we look at the architecture of ThirdEye, we can see that this platform is not just a simple tool or a simple UI; there is more to this application. The interesting part is that it works against a data source. It is always querying this data source, so the data source has to be very, very efficient for all of these queries to run and return with a sub-second response time. And this data source is none other than Apache Pinot.
So let's talk a little bit about Apache Pinot. What exactly is Apache Pinot? Apache Pinot is a real-time distributed OLAP data store, purpose-built to provide ultra-low-latency analytics even at extremely high throughput.
Let us try to put all of this into context and explain the problem. In the analytics world, we usually talk about dimensions and metrics. Dimensions are the labels used to describe data, and metrics are the quantitative measurements of the data. For example, a dimension could be device type, which might be Android or iPhone, or country, which might be Israel, the US, Mexico, or any other country. The metrics, on the other hand, would be, for instance, temperature or views, each of which carries a value. And all we want to do with the dimensions and metrics is slice and dice. Slicing and dicing might be easy with three dimensions: as you can see here, that is seven combinations. What happens if we have five dimensions? That is already a little bit harder, because those are 31 combinations. And what about seven dimensions, with 127 combinations? Of course, we would like many, many more dimensions, and the count grows as 2^n - 1.
To understand the problem, let's see how data is usually kept. Data usually starts out raw and is processed further down the line: after the raw stage it might be joined and aggregated, and in the end it might be cubed. Why would we do this to data? The longer we keep data in raw format, the more flexibility we have to do certain calculations; the more we preprocess it, the less flexibility we have. On the flip side, keeping data in raw format means very high latency, because those computations take a long time, whereas preprocessed data gives us low latency. Apache Pinot sits right in the middle, between joined and aggregated data, trying to offer maximum flexibility at minimum latency.
But why would Apache Pinot be better than its competitors, of which there are a few? Well, this has to do with the history of Apache Pinot. See, Apache Pinot was actually invented and written at LinkedIn, and at first it was used as an internal analytics database for the business users to see what is happening. As soon as those business users saw the immense potential of Apache Pinot, they said it should be expanded to serve all 500 million users on LinkedIn. And if you go on LinkedIn today, you might even see all the queries which are being sent to Apache Pinot at any given time.
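As a quick aside before we look at LinkedIn's numbers: the slice-and-dice counts from a moment ago really do follow 2^n - 1, since every non-empty subset of dimensions is a possible grouping. A small check, with example dimension names of my own choosing:

```python
from itertools import combinations

def dimension_groupings(dimensions):
    """All non-empty subsets of dimensions a query could group by:
    2^n - 1 groupings for n dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]

for dims in (["device", "country", "os"],
             ["device", "country", "os", "browser", "carrier"]):
    print(f"{len(dims)} dimensions -> {len(dimension_groupings(dims))} groupings")
# 3 dimensions -> 7 groupings
# 5 dimensions -> 31 groupings
```

Pre-aggregating every one of those groupings (cubing) is what buys low latency at the cost of flexibility, which is exactly the trade-off described above.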
Back to that LinkedIn page: as you can see, just on this page I have seven queries, and there might be more, being run for each user with low latency, most of the time a sub-second response time, for any given user on LinkedIn. Some statistics from LinkedIn: as you can see, there are 200,000 queries per second. Those statistics are a year old, so there might be more by now. They have a maximum ingestion rate of about 1 million events per second, and when querying, about 20 billion records are scanned each second. So as you can see, Apache Pinot is a database built for speed and efficiency. This speed and efficiency is delivered in particular by a pluggable indexing technology, and as you can see, they implemented a lot of indexes to help achieve a minimum-latency response time.
So let's get back to what I'm trying to solve. At the same company there was another problem. We had a lot of data, and there was this one odd user who every day would enter the system and download a lot of data. The problem was that when this person initiated his query to download the data, there was heavy usage on the system, so all the other queries had a higher response time. Basically, other people had to wait longer for their data to arrive. And the funny thing was, do you know when we found out about this problem? Only a day later, because there was a batch job which would go over the log files of the day before and extract all the query latency response times. Only a day later would we know that certain queries at a certain point had a high response time. So again, the problem we are trying to solve is this: what happens if at a certain point there is a sudden degradation of performance and we do not know about it in real time?
The way ThirdEye works is by using an alerting template. The alerting template is the detection logic or boilerplate that can be used to create an alert. An example could be: raise an anomaly if a certain metric is bigger than a certain maximum value. Using this alert template, we can then create an alert. An alert is an anomaly detection rule configuration; that is our anomaly detection. An example would be: create an anomaly if revenue, our metric, is bigger than 20,000, checking every hour. An anomaly occurs when the alert is triggered: at a certain time when querying the data, we would see that revenue is 30,000, above the threshold of 20,000, on Thursday the 3rd between 9:00 p.m. and 10:00 p.m.
The interface looks as follows. As you can see, we have a view of certain metrics, and those with great eyesight can see that on February 28th there is a dotted line and a solid line. The solid line is the actual data and the dotted line is the expected data. So as you can see here, this is an anomaly which can be traced in ThirdEye. There are multiple detector algorithms which can be used. In the example you saw, the threshold rule; there is also a mean-variance rule, a percentage rule, and an absolute-change rule. And if you take the services from StarTree's ThirdEye, there is also a Holt-Winters rule. This one is proprietary to StarTree: if you are using the free, non-commercial license, you will not have the Holt-Winters tool.
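To give a feel for what such an alert looks like in practice, here is a sketch of the revenue example as a ThirdEye-style alert configuration, written as a Python dict mirroring the JSON payload. The field names follow the StarTree threshold template as I remember it, so treat them as illustrative rather than authoritative, and check the alert templates bundled with your ThirdEye version for the exact schema.

```python
# Illustrative sketch only: field names and values are assumptions
# modeled on ThirdEye's threshold template, not a verified schema.
revenue_alert = {
    "name": "revenue-above-threshold",
    "template": {"name": "startree-threshold"},   # built-in threshold detector
    "templateProperties": {
        "dataSource": "pinot",                    # the Pinot data source
        "dataset": "revenue_events",              # hypothetical table name
        "aggregationFunction": "SUM",
        "aggregationColumn": "revenue",           # the metric being watched
        "monitoringGranularity": "PT1H",          # evaluate hour by hour
        "max": "20000",                           # anomaly if SUM(revenue) > 20000
    },
    "cron": "0 0 * * * ? *",                      # Quartz cron: top of every hour
}
```

Creating the alert is then a matter of submitting a payload like this through ThirdEye's advanced (JSON) alert view, which the demo shows later.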
But if you want to write your own, this platform is pluggable, so you can write your own detector algorithms based on your needs.
Now, this is all great and nice, and we would know when there is an anomaly. But as I said before, ThirdEye is also a root-cause analysis platform. When there is an anomaly, you can go into the root-cause analysis and see what exactly the problem is. Those with great eyesight might again see the difference: on the left we have the current date range, and on the right side we have the baseline, so we can see what exactly the difference is and why there is an anomaly. As you can see, there are different colors, blue and reddish, and it's pretty simple: if a certain metric is deep red, that is a big change down, and if it is intense blue, that is a big change up. Looking back at this root-cause analysis in ThirdEye, we can see that certain values are higher than the baseline and certain values are lower, all of which helps us with our root-cause analysis.
Now what about alerts? I mean, any company has its own alerting system, so how could I integrate these anomalies with my alerting system? Well, there is the possibility of a subscription group. If we create a subscription group in ThirdEye, the channels at the moment are email or Slack, and there is also an option for a webhook, so we can definitely integrate the anomalies which occur in ThirdEye with our alerting system; a sketch of such a webhook receiver follows below.
But still there are a few skeptics. The baseline, remember? What if the baseline was a holiday? Or even worse, what if the baseline fell on a day where we had a change in the system or a new version of the product? This is one of the greatest issues, and for it we can create events. We would create an event on certain dates: for instance, each time we deploy a new version of the product, that could be an event, and if there are holidays, those would be events. It is a very simple tool for creating events, and they are integrated into our baseline and into our anomaly detection, so in our root-cause analysis we will be able to see that there were special circumstances at any given point.
So let's remember why we are here. As I said, at the end of this session I want you to be able to, if not prevent this, then at least find it as soon as possible. And when I say as soon as possible, I mean within minutes, if not seconds. I really hope I gave you an option to do just that.
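Since the webhook channel is the most flexible way to hook ThirdEye into an existing alerting system, here is a minimal sketch of a receiver in Python. The payload fields read here ("anomalies", "metric", "startTime") are assumptions of mine; inspect a real notification from your ThirdEye instance for the actual schema.

```python
# Minimal webhook receiver forwarding ThirdEye notifications onward.
# Payload field names below are assumptions, not a documented schema.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ThirdEyeWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for anomaly in payload.get("anomalies", []):   # assumed field name
            # Replace the print with a call into your paging/alerting system.
            print("anomaly:", anomaly.get("metric"), "at", anomaly.get("startTime"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Point the subscription group's webhook URL at http://<host>:8090/
    HTTPServer(("0.0.0.0", 8090), ThirdEyeWebhook).serve_forever()
```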
Now I would like to demo Apache Pinot and especially ThirdEye. Something short about the demo: on my computer I've set up a Kubernetes cluster using k3s, and I am running Kafka, Pinot and ThirdEye on this Kubernetes cluster from my own laptop. Via Telegraf, I'm sending metrics of my CPU performance to Kafka, which are ingested into Apache Pinot, and ThirdEye is going to query this data every minute. Every minute is actually the lowest granularity you can get with ThirdEye, and it will try to see whether there is an anomaly in the data which I'm sending. So let me show you the demo of Apache Pinot and StarTree ThirdEye.
First of all, I'm going to spin up k9s so we can have a look at Kubernetes. Everything is in the pinot-quickstart namespace. As you can see, I have a small Kafka cluster, and then I have the StarTree MySQL instance, all the different Pinot servers, and the different StarTree components at the end. ZooKeeper is also being used. As you can see, Pinot itself has many components, and StarTree as well. If you want, you can run Apache Pinot straight away; there is a Helm chart for the open source Apache Pinot. I just used the quick start guide from StarTree, which includes Pinot as well, for ease of use.
Going ahead and looking at the UI, I have now entered the UI of Apache Pinot. This is what you get; as you can see, again, a lot of components. Let's go to the tables. There is this one table, host metrics cpu real time. This is the table I configured in Apache Pinot to receive all the events I'm sending via Telegraf to Kafka, and it is ingesting straight from Kafka, so we can have a look at the data. As you can see, at the moment I have over 10,000 data points, and I can run different queries. Now, if I run a query, it doesn't matter what I put in here; as you can see, the total number of documents keeps growing, because every five to ten seconds more documents are entered into this table. This is a real-time table, meaning it receives data from Kafka and is updated in real time.
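You can run the same kind of query outside the UI as well. Below is a sketch using the community pinotdb Python client against the broker's SQL endpoint; the table and column names mirror the Telegraf CPU metrics from the demo but are assumptions of mine, so adjust them to whatever your table config defines.

```python
# Querying the demo's real-time CPU table from Python.
# pip install pinotdb  (the community DB-API client for Pinot)
# Table/column names below are assumptions based on the demo.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("""
    SELECT SUM(usage_system) AS total_usage_system,
           COUNT(*) AS docs
    FROM hostmetrics_cpu
""")
for row in cur:
    print(row)
```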
So let's go and have a look at ThirdEye. This is StarTree ThirdEye, and this is what you get when you enter. As you can see, I have this one alert and a number of anomalies. Looking at the alerts, this is a CPU alert, showing what's going on on my laptop as we speak.
Let's go first into the configurations. I have configured the data source, which is my Apache Pinot, and a dataset, host metrics cpu, with all its parameters; this can also be seen in the table in Apache Pinot. Then there are all the alert templates that exist in StarTree; many come out of the box, and you can always add your own. Here are the subscription groups; I didn't add any, because there is nothing for me to add. And here are events; I didn't add anything here either.
So let's take a look at the anomalies. As you can see, I have a lot of anomalies going on on my computer at the moment. But first, let's look at the alert I configured. There are two views: a simple view, and an advanced view, which is basically all the JSON. I will take the simple view. I configured the name to be cpu alert, and I would like it to run every minute of every hour of every day. It is based on the template type startree-threshold, doing a sum aggregation on usage_system, and I defined a threshold of 170. If I reload the preview, I can actually see the data and the rule being applied to the data I have. So I have this already configured; let's cancel this and go back to the dashboard.
At the moment I have 14 anomalies, and I can go into them. Let's take a look at the last anomaly, over the last hour; I think that's enough. And the anomaly just happened: right now I'm already at the threshold of 170. Again you can see the dotted line, which is the expected data, and the solid line, which is what's actually happening. So let's go and investigate our anomaly. This is the anomaly, right here on the right side, and here we have the heat map, as shown before. You can look at the top contributors, though there's not much to see here, since I don't have a lot of data, and I am also able to look at the events.
I can also change the baseline. The default here is a week; let's look at a baseline of one day, which again is not something I have, because I don't have a day's worth of data; I've only collected data for the past two hours or so. Let's go back to my anomalies and take one from a little before that, again over the last hour. Okay, so this is the anomaly, and I can also say: no, this is not an anomaly, and this feedback is recorded. Going back, you can see that where the count was 16 before, it is now 15.
I can also view all the anomalies here and preview them, and there are a lot more parameters I can change. For instance, I am able to say that I would like two anomalies to be counted as one. This is all within the configuration here: the merge max duration, for how long anomalies should be merged, and the merge gap. Well, this is not a new alert, so it's not going to change on the fly, but if I had created the alert like this to begin with, it would count these two as one.
Thank you very much for joining this lecture. I hope you've learned something today, and I hope I have helped you achieve better data consistency and data integrity.
...

Yoav Nordmann

Tikal - Fullstack as a Service @ Tikal Knowledge



