Conf42 Python 2023 - Online

Data science and machine learning in wastewater intelligence

Abstract

To protect our primary resource, water, we need to maximize its usage. Kando will discuss how, through methods like proprietary sensors and the utilization of Data Science & Machine Learning, it can continuously monitor the quality of wastewater and encourage safe reuse processes.

Summary

  • Kando is presenting what it does today with data science, algorithms, and machine learning in order to generate intelligence in the wastewater. Industries everywhere use water just like we do, and some of them use water in their processes. The pollutants these processes introduce can be very detrimental to wastewater treatment.
  • Kando collects data from wastewater and applies machine learning to it. The goal is to tell wastewater treatment facilities what is happening and what to do to mitigate pollution. With global warming, climate change, et cetera, water is important.
  • To plan a deployment, we combine information about the wastewater network and a lot of open source data. Once we figure out where we need to be and trade off the resolution versus the cost of deployment, et cetera, we start monitoring. In real time, we're able to classify events as belonging to different types of pollution.
  • The next point is localization. We typically distribute our sensors such that we have very broad coverage closer to the wastewater treatment facility. From there we climb upstream and get to the point where we're able to point our user to the specific or most likely source of the pollution.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, everybody. I'm Alex. I lead the data science and analytics team here at Kando, and I'm very happy to be with you today at this year's edition of Python Web conference, presenting what we do today with data science, algorithms, and machine learning in order to generate intelligence in the wastewater. So I want to begin by breaking these words down a little bit. So what's wastewater intelligence? Right. So, wastewater, I think most of us know, right? We use the restroom, we use the sink, we use a shower. Water goes down the drain. Boom, success. Wastewater created. That's one part of it. The other part, maybe less familiar to some of us, is what's called industrial wastewater. Industries everywhere use water just like we do, and some of them use water in their processes. And after these processes are complete, water finds its way into the wastewater collection system, where it needs to get treated. Now, these processes can be such that they introduce various pollutants into the wastewater, which is typically okay. The wastewater facilities know those pollutants and are able to deal with them, but every now and then, pollutants go beyond the expected types or exceed various limits, and then they are potentially very detrimental to the treatment processes. We'll talk about that also in a few words soon enough, so we understand wastewater a little better. Now, what's intelligence? What information are we looking to get from our wastewater? So one piece of information is to understand what's going on, right? The whole system is underground. Right. We're mostly oblivious to it. We have no idea what's going on there. So initially, we want to be able to understand what's taking place. Then we want to be able to understand whether or not something out of the ordinary is taking place. And once we understand points one and two, we would very much like to be able.
By we, I mean wastewater treatment facilities that need to do something with whatever it is that's coming their way. They need to be able to understand what steps they can and should take in order to mitigate whatever potential damage is coming right now or has happened sometime in the past. Okay, so why is the whole thing interesting? Right? You might ask, legitimately, do we really need to know this? So, first of all, I think that specifically now, with global warming, climate change, et cetera, I don't really need to convince anyone that water is important. So we are all aware that water is a very finite resource. And specifically, drinking water, or water that we as humans can use, is rare throughout the world and relatively scarce. And at least I've learned, over my relatively short time so far at Kando, that treating wastewater is a very complicated process. It typically consists of multiple steps, each one responsible for addressing one type or another of either pollution or contaminant that is present in the wastewater. So getting water from being a waste to actually being usable again, either for irrigation or just for reintroduction into the water cycle, is a very, very delicate and nontrivial process. So a lot of the events that can be introduced into the wastewater, as I mentioned previously, by various industrial facilities, are able to damage this delicate process and either render the wastewater treatment facilities themselves, in extreme cases, inoperable, or just cause degradation in the effluent water quality. So the water that is later used to irrigate our crops, or that is introduced into reservoirs, rivers, oceans, et cetera, is polluted, basically hurting all of us and our day-to-day lives. So what do our users need to know when Kando is providing this platform? What is this platform telling the wastewater treatment facility?
So, first of all, we're informing, or we're ideally informing them, that something is happening, right? So be aware that there is an event taking place, or there is a disturbance in the force somewhere throughout the collection system, either somewhere close to the producer facilities or somewhere further downstream. Then we need to be able to tell them what it is that's happening, right? So the event is of such and such category, type, and intensity, what the pollution is, and ideally, what the potential damage of this pollution is. And then, just as importantly, if we're able to, we need to tell them where the pollution is coming from. And this is important for two reasons. First of all, they need to be able to kind of verify that this pollution makes sense in terms of the pollution source. And the other, much clearer, is that this is their way to prevent this from happening in the future, right? So once they know which facility is causing such or other pollution event, they're able to contact the facility and make sure that the processes that take place in that facility are correct, and that the treatment or pretreatment that needs to happen to whatever water is being discharged into the waste is taking place. And lastly, if there is something illicit going on, or some malfunction, or something that needs to be addressed more severely, then there are, of course, legal and regulatory approaches that can be taken. But in order for them to be applied, we need to know, or they need to know, who to apply them to. Okay? So fine. I gave a very nice elevator pitch of what Kando is and what it does, and it took me a good several minutes, and now you all know, and now you all want to buy one of our systems and install them in your homes and be happy with them, of course. But this is a machine learning and data science track. So what are our machines learning? What is the data science here? So let's start with addressing our data sources, right?
The first, and probably easiest to understand, is just open source data, right? So when we deploy our system at a specific location, we need to be aware of where it is we're deployed, what the combination of sanitary to industrial facilities is, how many people live there, and what the topology of the ground or the surface is, so that we are able to understand how the sewage network is built. Once we understand where it is we're located, we start collecting information. I'll go into the deployment process also in a few steps, a little bit further. Then we start gathering our own information, right? We have a bunch of sensors, some of them customary, off the shelf. Others are designed and built specifically to our specifications. Those sensors generate a lot of signals. These signals need to be processed and understood over time. And then lastly, the very nontrivial bit of information, which is partially proprietary and partially open source, is data about lab samples. Right? So we've built our own database, as well as accessing externally available data sources. Through a sampling process, which basically constitutes taking a little bit of wastewater and sending it for analysis in a lab (you can see a sample of such an analysis on the right here), we are able to collect a lot of information about what pollutants are found, where and when. Okay, so these are kind of the main data sources that we're working with day to day. Okay, great. So now we understand what data goes into our bellies, but what do we do with that information? So the first question we need to answer is how our system needs to be deployed. Right? So we need to understand, going into a new region, a new area, a new wastewater treatment facility, which locations need to be monitored, and how. Right. The second question we need to understand is what constitutes an event, right? So we have a lot of signals.
Which of these signals are interesting, and to what extent? Next, once we've found something that is interesting, we need to be able to understand what that is, whether it's something that requires direct action or something that we just need to pass on as information. And lastly, as I mentioned, having found something and understood what that something is, we would very much like to be able to pinpoint specifically what its source is. Right. Where this pollution is coming from, and who its creator is. Okay, so let's dive into a little bit of the nuts and bolts. One question is the deployment information, right. And for that, we need to combine information about the wastewater network, which is something that we typically get from our customers, and a lot of open source data. Right. Who are the people? Where do they live? What information do we have about industrial facilities, to what sectors they belong, what these sectors do, how they do it. Once we understand all that, we're able to generate a map that says, well, this location is very important. This location is not as important. Here, we need to have finer resolution. And over there, we can just kind of get a typical overall glance, and that will be enough for us. So we understand what facilities are located where, what their potential pollution may be, and where those potential pollutants are gathered, such that we can focus on relevant areas. Once we figure out where we need to be, and we trade off the resolution versus the cost of deployment, et cetera, we actually start monitoring. Right. And monitoring basically means generating a lot of time series through different sensors. And as you can see in a tiny example over here, it's not that easy to know when something is taking place that is out of the ordinary, versus kind of just the regular bits and pieces of what happens throughout the day.
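As a rough illustration of how out-of-the-ordinary stretches in a sensor time series might be flagged, here is a minimal sketch. The talk does not say which outlier metrics Kando uses, so the rolling median and MAD score, the window size, the threshold, and the sample pH-like readings are all assumptions made for the example.

```python
from statistics import median

def robust_z_scores(values, window=5):
    """Score each point against the median and MAD of the preceding window."""
    scores = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            scores.append(0.0)  # not enough history yet, so do not score
            continue
        med = median(history)
        mad = median(abs(h - med) for h in history)
        # 1.4826 makes the MAD comparable to a standard deviation for
        # normally distributed data; the floor avoids division by zero.
        scale = 1.4826 * mad if mad > 0 else 1e-9
        scores.append(abs(x - med) / scale)
    return scores

def flag_candidates(values, window=5, threshold=3.5):
    """Return indices whose robust score exceeds the (assumed) threshold."""
    scores = robust_z_scores(values, window)
    return [i for i, s in enumerate(scores) if s > threshold]

# Hypothetical pH-like readings with a short acidic discharge in the middle.
readings = [7.0, 7.1, 6.9, 7.0, 7.1, 7.0, 6.9, 3.2, 3.0, 7.0, 7.1]
candidates = flag_candidates(readings)
print(candidates)
```

Because the score uses the median rather than the mean, a short burst does not drag the baseline along with it, which is one reason this kind of metric is a common first pass on noisy sensor data.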
So in order to facilitate this understanding of what an event is, we basically have a three step process. The first step generates candidates using metrics for outlier detection that kind of identify interesting bits of our signals. And we put those interesting bits of our signals to the side. Then the top candidates, chosen using, again, some scoring process that we have, are sent internally to expert labelers, who tell us what the relevant signals are in cases where they know. They don't always know, but a lot of times they do. So they're able to tell us, well, this is one type of an event, this is something or other, this is a pollution of this type, and so on and so forth. Taking all this information, we are now able to proceed to the third step, in which we enrich our data set by matching the known patterns to a lot of unknown data, where we can tell externally what needs to be relatively similar. And once we know where to look and we know exactly what to look for, we're able to get a lot more information relating to a specific label. Of course, again, with internal kind of validation and corrections. All right, so we know where to look. We know kind of what we're looking for. Now that we've found samples of interesting data, we need to be able to classify them. In order to classify our samples, we typically, pardon me, go in two different directions. One direction, completely classical machine learning, is regression. And in regression, what we do is we train an easily obtainable source of information to match a very difficult to obtain source of information, such that, having trained that system on a specific subset of locations and events, we can now deploy the relatively easily obtainable sensors that generate a lot of data instead of the very difficult and cumbersome sensors that are very accurate but very hard to maintain.
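The regression idea just described, training a cheap signal to stand in for an expensive one, can be sketched with a one-variable least-squares fit. This is only an illustration: the talk does not name the model or the sensors involved, so the plain-Python fit, the function names, and the paired readings below are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Estimate the hard-to-obtain value from a cheap sensor reading."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical paired data: cheap sensor readings taken at the same time as
# accurate but hard-to-maintain reference measurements.
cheap_readings = [1.0, 2.0, 3.0, 4.0]
reference_values = [2.1, 3.9, 6.1, 7.9]

model = fit_line(cheap_readings, reference_values)
estimate = predict(model, 5.0)  # a new cheap reading with no reference value
print(model, estimate)
```

Once fitted on the locations where both measurements exist, the same model can be applied wherever only the cheap sensor is deployed. In practice a library model with several input signals would likely replace the hand-written fit.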
And this allows us to analyze signals that otherwise would be either very expensive or, in yet other cases, almost impossible to obtain. That's the regression direction. And then the classification is, as I mentioned previously, when we have built a large enough data set of labeled or semilabeled information, we're now able to, in real time, classify events as belonging to different sources of pollution. Sorry, different types of pollution. Okay, great. The next point I'd like to get to, of course, is localization. So we know where we're looking. But at those locations, we typically have a distribution of our sensors such that we have very broad coverage closer to the wastewater treatment facility. And as we go closer to the industrial facilities themselves, the coverage is obviously lower. And we may be focused on specific regions or on specific producers, but typically, we won't have coverage that would be enough to identify every source on its own. So typically, or a lot of the time, the information of an event taking place comes from somewhere downstream, and then we need to start building our ladder in order to climb further and further upstream. So once we identify an event and we're able to classify it as belonging to a specific type, we, from our open source data, can relate which are the most probable pollutants to generate this information. And having that information, we can now use our signals in order to climb upstream, match patterns through various metrics, and get to the point where we're able to point our user to the specific or most likely source of the pollution. Okay, so we've gone through the entire process of what it is we need to see, where we need to see it, when we find something, what it is we find, and finally, where the information is coming from. This is basically the entire pipe that we have, at least from a data science and machine learning perspective.
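The climb upstream can be pictured as a walk over the sewer network graph, ranking monitored upstream points by how closely their signal matches the event seen downstream. The sketch below is hypothetical throughout: the network, the node names, the signals, and the use of Pearson correlation as the matching metric are all assumptions, and it ignores the travel-time lag between locations that a real system would have to handle.

```python
from collections import deque

def upstream_nodes(downstream_of, start):
    """All nodes whose flow eventually reaches `start`, nearest first."""
    feeders = {}
    for node, target in downstream_of.items():
        feeders.setdefault(target, []).append(node)
    seen, queue = [], deque([start])
    while queue:
        current = queue.popleft()
        for parent in feeders.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either signal is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def rank_sources(downstream_of, signals, event_node):
    """Rank monitored upstream nodes by similarity to the event signal."""
    event = signals[event_node]
    monitored = [n for n in upstream_nodes(downstream_of, event_node)
                 if n in signals]
    return sorted(monitored, key=lambda n: pearson(signals[n], event),
                  reverse=True)

# Hypothetical network: each key drains into its value.
network = {
    "trunk": "plant",
    "branch_a": "trunk",
    "branch_b": "trunk",
    "factory_1": "branch_a",
    "factory_2": "branch_b",
}
# Sparse coverage: only some nodes carry sensors.
signals = {
    "trunk": [1, 1, 6, 6, 1],
    "branch_a": [1, 1, 1, 1, 1],
    "branch_b": [1, 1, 5, 6, 1],
}
ranking = rank_sources(network, signals, "trunk")
print(ranking)
```

Here the event seen on the trunk line is matched to `branch_b`, which narrows the search to the facilities feeding that branch even though they carry no sensors of their own.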
I didn't go a lot into the code behind it, because some of the algorithms we use are typical scikit-learn, NetworkX, et cetera algorithms, and others are proprietary ones that we built either based on various time series tools or completely from scratch. But this is more or less the end to end of this process. And at the end of it, the main takeaway that I would like you to go away with, or at least to continue to the next session with, is that the essence of what we do is to combine relatively easily obtainable data, whether through proprietary or available sensors, with intelligent processing techniques that allow us to focus on where to look, what to look for, and to identify what it is we see, in order to be able to give clear and understandable information to our users, who are then able to drive change with the industries around them. And basically the bottom line is that, based on this, we're able to give everyone, where we're deployed, of course, cleaner, better water quality, which is one of the reasons that going to work at Kando is a lot of fun. Thank you very much. It was a pleasure for me to speak with you, and please feel free to reach out and I'll be happy to try and answer whatever questions you might have. Thank you.
Alex Smolyak

Data Director @ Kando

Alex Smolyak's LinkedIn account
