Conf42 Internet of Things (IoT) 2023 - Online

Using the FLiPN Stack for Edge AI (Flink, NiFi, Pulsar)

Abstract

Introducing the FLiPN stack, which combines Apache Flink, Apache NiFi, Apache Pulsar, and other Apache tools to build fast applications for IoT, AI, and rapid ingest.

Summary

  • Using Flink, NiFi, and Pulsar together to solve really complex problems: NiFi scoops up the data and pushes it into Pulsar, and Flink does the processing and joining. Call it the FLiPN stack, the FLiPN pattern, or the FLiPN federation. Everything is Apache, and they all work together really nicely.
  • Apache Pulsar allows you to do things like a topic per device; it now supports 10 million topics. It is easily scalable with no data movement, and you can pick whatever topic layout makes sense for your architecture.
  • Data from the edge gets into Pulsar so we can start doing things with IoT, with a lot of different options. Once it is there, it is very easy to distribute the data for people to write apps or whatever else you want.
  • NiFi has the ability to listen to REST endpoints on demand. How did the data get there? A MiNiFi agent on the device pushes it to NiFi, then into Pulsar, and then Flink. It depends on what you're doing, really.
  • The latest release candidate for Apache NiFi 2.0 runs on JDK 21. New processors can listen to Slack, and groups of processors can now run as stateless. Flow analysis rules are another cool new feature.
  • David: If you're interested in seeing more on Pulsar, NiFi, or Flink, all of these are very cool ways to write apps. Setting them up is pretty easy, especially if you use the cloud-managed services. Thanks everyone for attending.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to using the FLiPN pattern for edge AI. That's when you use Flink, NiFi, and Pulsar together to solve really complex problems. David and I are going to go through a couple of slides and a little demo, just to give you a feel for how you would develop IoT edge applications using this pattern and these open source technologies that work together really well and scale out tremendously. I'm Tim Spann. I'm a principal developer advocate working with the whole streaming stack. I run some meetups, do some events, do some blogging, and we'll go into some cool tech.

And I'm David Kjerrumgaard. I'm a developer advocate at StreamNative. I published the book Pulsar in Action, I contribute a lot of code, and I give a lot of talks around the world. I focus primarily on NiFi, Flink, and Pulsar. So thanks, Tim, for having me.

Yeah, it's the dream team. If you're interested in these three technologies, these are probably the two people who like them the most. Every week I put out a newsletter that covers all kinds of streaming stuff, all the fun tech out there, and it's very easy to check. We do meetups all over the place; just look for them, and they cover all the technology that you like. Today we're going to cover an introduction to these technologies, a little overview, some examples, and then a little bit about each piece of the streaming tech that's important for this pattern: Pulsar to be able to get anything into the stream, Flink to do some processing and joining, and NiFi to scoop up the data and shove it into Pulsar and get it working. And that's really the FLiPN stack, or FLiPN pattern, or FLiPN federation. I've been talking with some of our friends online (you know who you are, Peter) about whether you should use the word stack, going back all the way to the LAMP stack. Probably not, but it's a bunch of things together: a pattern, a module, just a guide, some examples, best practices, things that work together really well and make it easy for you to build streaming apps. Everything's Apache, and they all work together really nicely. David wrote the connector from NiFi to Pulsar. There's a really solid connector from Flink to Pulsar, and also things like Pinot to Pulsar and a lot of other projects. Once you get into Apache, things just fit together really well. Again, that is the reason why we like to use these together in this pattern. It can really accelerate what you can do as an edge data engineer, whether that is just grabbing raw sensor data and log data and getting it into a stream, or something more advanced where you've got different models running at the edge, cameras connected to GPUs, and different accelerators there. You'll see there are a lot of different people involved in these projects and lots of framework and language options connecting to different clouds. Most of you out there are probably some kind of engineer. To be a cloud data engineer involved in these edge things, you're going to need Python, which comes up a lot, or Java; both of those work really well within this pattern. And you should know a little SQL. My cat is always messing with the camera. He likes the tools that we use. He thinks I'm spending too much money on all these devices, but what can you do? AI, obviously, we can run within NiFi, within Pulsar functions, within Flink, and at the edge in these different agents, and at some point it's going to run this whole talk for me and fix any typos. So hopefully we'll see that happen. A lot of different projects.
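To make the Python side of the pattern concrete, here is a minimal sketch of the kind of edge script the pattern starts with: publish one sensor reading into Pulsar using the official pulsar-client library. The broker URL, topic name, and reading fields are hypothetical placeholders, not values from the talk.

```python
# Minimal sketch: publish one sensor reading from an edge device into Pulsar.
# Assumes `pip install pulsar-client` and a broker at the URL below (placeholder).
import json
import time

import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # hypothetical broker address
producer = client.create_producer("persistent://public/default/edge-sensors")

reading = {
    "device_id": "rpi400-01",   # hypothetical device name
    "co2_ppm": 412.0,           # example field; real payloads depend on your sensor
    "ts": int(time.time() * 1000),
}

# Pulsar messages are bytes; JSON keeps the payload easy to consume in NiFi and Flink.
producer.send(json.dumps(reading).encode("utf-8"))

client.close()
```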
If I listed all the Apache projects, we could probably spend the whole 30 minutes; there are so many beneath the covers that work together. Calcite is used everywhere. You've got BookKeeper, you've got ZooKeeper, you've got different registries, Tika, a whole bunch of different projects. OpenNLP, which I've been playing with some more; some good things in there. Iceberg. All of them work together. But the main thing we need is to build these streaming applications that connect from the edge all the way to the cloud or wherever your enterprise is centered, even if it's federated between multiple availability zones and on premise; wherever this is running, those all connect together. You're probably not running Flink on the edge. You certainly could, especially on one of these Jetson boxes or even one of the beefier ones, but Flink is usually for downstream. Again, for IoT, MiNiFi is probably running at the edge; maybe Pulsar too, or Pulsar might be up a level. How you decide where you put those connections is a talk for later, and David has a couple of those out there if you want to take a look. But Flink is really nice for taking billions of data points, streaming them together, and running real-time things on them. You can also support batch if you need to do some batch things. Sometimes people will join in things from a batch source, like, say, Kudu or a relational database; you could use that to augment what you're doing within Flink. And I'll show you a little bit of Flink SQL. What's nice is it scales as big as you need to go; it scales out really solidly. That's the second part of Flink I like. The first part is they all work together well, then you've got this scale-out, and third is the SQL: it is easy to start. Even if this is all running in one Docker setup, where I've got Flink, NiFi, and Pulsar in one Docker on a laptop, it can still handle all the data engineering I need, from a bunch of devices to the final use case of having real-time analytics. That's an important piece of the puzzle. And like I said, that Flink connector makes it very easy for me to stream the data in from Pulsar regardless of the mode it's in; it could be in generic Pulsar mode or in Kafka mode. Then I stream it downstream into a sink when I'm done. So I do my analytics, which can be as easy as possible in SQL, and then insert the results, which could be aggregates, summaries, or windows of data, and do this at whatever scale you want with simple SQL. I'll show you a little bit of that running in our short talk here. Then next up, I'll let the expert go into Pulsar for you.

Thanks, Tim. So yeah, Apache Pulsar fits nicely in that stack. As Tim mentioned, NiFi is the edge technology that brings in all the information, but you need a nice place to buffer that information before it gets processed by a processing engine like Flink, and Pulsar serves that exact purpose. In just one particular use case, it scales up to ten petabytes of data per day coming in. So all the data you could ever need, coming into a truly elastically scalable platform that integrates nicely with all the open source tools, as Tim mentioned: Spark, Flink, NiFi, everything else. Think of it as an infinite buffer, infinite streaming storage, with multiple layers of different protocols on top. It speaks its own native Pulsar protocol; it also speaks Kafka, MQTT, and RabbitMQ.
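Before digging further into Pulsar, here is what Tim's "simple SQL" point can look like in practice: a hedged sketch of a Flink SQL table declared over a Pulsar topic, driven from PyFlink. The WITH options follow the flink-connector-pulsar SQL connector, but option names vary by connector version and the connector jar must be on Flink's classpath; the topic and URLs are placeholders.

```python
# Sketch: declare a Pulsar-backed table in Flink SQL and run a streaming filter.
# Assumes the Pulsar SQL connector jar is on the Flink classpath; the WITH
# options below are illustrative and differ across connector versions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE edge_sensors (
        device_id STRING,
        co2_ppm DOUBLE,
        ts BIGINT
    ) WITH (
        'connector' = 'pulsar',
        'topics' = 'persistent://public/default/edge-sensors',
        'service-url' = 'pulsar://localhost:6650',
        'admin-url' = 'http://localhost:8080',
        'format' = 'json'
    )
""")

# A continuous query: only readings above a (hypothetical) CO2 threshold.
t_env.execute_sql(
    "SELECT device_id, co2_ppm FROM edge_sensors WHERE co2_ppm > 1000"
).print()
```

The same query runs unchanged whether Flink is a single node in Docker on a laptop or a large cluster; only the deployment differs.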
So it's a great way to bring data in from different sources, put it in a single place, and then expose it up to something like Flink for your processing as well. Yeah, it's hard to overestimate something that can run more than a million topics; if you're starting to do this, you'll see why that's a big deal. It's actually 10 million now, Tim; I've updated that to 10 million. Yeah, we've done 10 million now. Yeah, it's great. So, everybody, why Apache Pulsar? Everything Kafka can do, but better and more, is how I describe it. It came with unified messaging and streaming on day one, so we actually support queuing semantics, streaming semantics, infinite message retention, tiered storage, and offloading, plus capabilities that you still don't have in Kafka like dead letter queues, scheduled message delivery, and multi-protocol support. Easily scalable with no data movement: as you add capacity to your cluster, you don't have to move the data; it's rebalance free. And 10 million topics, and soon ten x that again once we get Oxia fully up and running; that's our ZooKeeper replacement. There's been a lot of buzz in the Kafka community around finally getting rid of ZooKeeper, and that's great. We've done a similar process, but we rearchitected it entirely to make it more scalable as well. And then geo-replication was the first use case; it was built at Yahoo in 2012 to replicate data across multiple data centers in a bidirectional mesh. Multi-tenancy was built in on day one, and then encryption, all the cool goodies you'd need for a full-featured enterprise system. So it's not just a toy; it's truly scalable. And you can scan that QR code and find out more about all the different capabilities of Pulsar. 100 million topics are coming. I hope I don't have 100 million on my next project. I don't even know how many Kafka clusters you'd need for that; I wouldn't want to run that many. But it allows you to do things like a topic per device. So if you have a large number of devices in the IoT space and you don't want to commingle their information, you're no longer limited by the platform in how you structure your data. I think that's the big win there. Yeah, that's pretty cool. And not just one per device, maybe one per device sensor. Even the little device I'm running for the demo here has two sensors on it, so maybe each one gets its own topic, and then I could join them together with Flink. Yes, absolutely. I don't know if I want to write a SQL query that has 10 million topics in it and then join them all; that might be a bit much. There might be a nice way to divvy it up, though. There are ways to aggregate. I mean, that's always the thing: if I put everything in one topic, then I don't have to join, but then there's a whole lot in that one topic, and that could be a bottleneck. If I have a million topics, like if I want to aggregate all those sensors, I could do it in NiFi. There you go. That's when you've got to balance it: how many topics do I want? No limitations, though; that's the key. Whatever makes sense for your architecture, you pick the best. And Flink plus Pulsar is so mature at this point, we're talking three or four years in, so it's pretty solid. The Flink versions and the Pulsar versions keep incrementing, and everything's getting better. So it's a pretty nice way to connect there. And how do we get data in there now?
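One direct answer to that question is plain code. The topic-per-device (or per-sensor) idea above is just a naming convention on the producer side; here is a short sketch, with a made-up "iot" tenant and per-device namespace that would have to exist in your cluster.

```python
# Sketch of the topic-per-sensor pattern: derive the topic name from the
# device and sensor IDs, caching one producer per topic. All names are made up,
# and the "iot" tenant / per-device namespace must already exist in Pulsar.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")
producers = {}  # topic name -> producer, created lazily


def send_reading(device_id: str, sensor_id: str, payload: bytes) -> None:
    # e.g. persistent://iot/rpi400-01/co2
    topic = f"persistent://iot/{device_id}/{sensor_id}"
    if topic not in producers:
        producers[topic] = client.create_producer(topic)
    producers[topic].send(payload)


send_reading("rpi400-01", "co2", b'{"co2_ppm": 415.2}')
send_reading("rpi400-01", "voc", b'{"voc_ppb": 120.0}')
```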
Most commonly you could do it with any kind of code, because there's Pulsar support for a lot of different languages. So sometimes I'll just have a Python app at the edge push right into Pulsar, using the native Pulsar library that installs on all these devices. I've done that, or some of these devices are hard coded to push out MQTT, and we can just have Pulsar listen that way. So you've got some options. But often I'll have something like MiNiFi, which is a small NiFi agent, running on that device, just to make it easier for me to manage what's going on. One of the reasons I want NiFi or MiNiFi, besides being easy to work with (and we'll show you that in the demo), is that it has some features that are really nice for picking up data. If you've ever used any kind of logging agent, they're usually pretty simple; maybe you're setting up some YAML or XML or JSON configuration, but they are designed to run only at the edge. You're not going to have a scalable cluster, or something that guarantees delivery, has built-in back pressure and prioritized queuing, lets you change the QoS, has built-in data provenance, hundreds of different controls, lots of different sources, version control, DevOps, all those things you might want with a scalable architecture. NiFi is maybe one of the last systems still using ZooKeeper, though it depends what environment you're in; that can be handled differently in Kubernetes, but it's pretty straightforward. It's just designed so you don't lose data. They keep adding new processors, too; I'll show you a little bit of NiFi 2.0 and the additional record types we can read, so I can read all these types of data, convert them into a format that's easier to use within Pulsar and Flink, like Avro, and then use that to join data together. I've got a number of articles out there if you're interested in using the different data sources. This is the device from the demo today, the Raspberry Pi 400, which is cool because it's got the keyboard. I don't know why they didn't put a screen with it, but I added a very tiny screen that displays my IP. So, data from the edge: we get that into Pulsar so we can start doing things with IoT, and there are a lot of different options there. For this particular example, it's a MiNiFi agent over HTTP into NiFi. NiFi does the cleanup and gets the data into a topic. Flink does the simplest possible SQL, gets it, and can push it anywhere; it could go into another topic that someone else consumes. You've got a lot of options there, and I'll show you some inside the demo. I just wanted to show you different examples; there are lots of different sources of data that NiFi, Flink, and Pulsar can read, and it's not super hard. And then once we get it out there, it's very easy to distribute the data for people to write apps or whatever you want. But that's all the slides. Let's see if we've got things running here. Hopefully things haven't timed out; they seem to be okay. So this is NiFi. This is NiFi 1.24, which is a newer version, but not the 2.0 line. I've got a controller here that is receiving the MiNiFi calls. NiFi has the ability to listen to REST endpoints on demand and take any data; you don't have to fix it to some schema or some specific class of data. Anybody who wants to call it can just post data on this port, and I will consume it. Pretty easy.
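The "just post data on this port" step really is that simple. Here is a sketch of the kind of script MiNiFi runs on the device in this demo: grab a reading, and if it isn't empty, POST the JSON to NiFi's HTTP listener. The host, port, path, and sensor-reading function are all hypothetical placeholders.

```python
# Sketch: the edge-side step described above -- grab sensor data, and if it's
# not empty, POST it to NiFi's HTTP listener. Endpoint and fields are made up;
# the path mirrors ListenHTTP's default base path, which is configurable.
import json

import requests

NIFI_LISTENER = "http://192.168.1.50:9999/contentListener"  # hypothetical


def read_sensors() -> dict:
    # Placeholder for real sensor access (e.g. an I2C CO2/VOC breakout board).
    return {"device_id": "rpi400-01", "co2_ppm": 418.5, "voc_ppb": 131.0}


reading = read_sensors()
if reading:
    resp = requests.post(
        NIFI_LISTENER,
        data=json.dumps(reading),
        headers={"Content-Type": "application/json"},
        timeout=5,
    )
    resp.raise_for_status()
```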
And then I have provenance for all the data that's come in: I know how long it was, what type of data it was, what the user agent was, plus the data itself, which in this case is JSON. Pretty straightforward. Then I can process it, route it, and in this case I'm consuming it here. I paused the live data, so we've got a bunch of data coming in; it's not a limited amount. Then I'm sending it to Pulsar and elsewhere. But how did I get that data? Well, on that device, that Raspberry Pi keyboard, I have a MiNiFi agent, and it's running a shell script to run some Python to grab some sensor data. I set up that agent that you saw there, and if the result is not empty, I'm calling NiFi via HTTP and just sending it to that port. So the data is just streaming in. If that agent is not running, I'll see what's going on. I can see all the metadata about that particular agent, and I can also change it and debug it on the fly if I need to. I'll see all the alerts going on and can sort by that. If I have a ton of different things going on at once, I can see that. I can see if something's offline, and I can delete things if need be. I can see if there are new things on that device, if someone's done an upgrade; lots of different things you can do. It's a pretty easy way to do that. So we've got the data from NiFi streaming into Pulsar, and this is an easy way to run Flink. Behind the scenes, it's just regular Apache Flink, and there are no jobs running yet. I'll just start my query here in this UI, and as you can see, it deployed the job already. I only have one node because this is running on a laptop. If I had a massive cluster, it's going to look different; this UI is going to say there are more resources. But you don't have to code differently, and you don't have to write SQL differently. It's just going to take that data in from the table and filter it wherever it makes sense. And here we're showing a sample of that data as it comes in. What I do with that is I've got a little materialized view that takes a cache of that data and presents it as a REST endpoint, and then I can just put it in a dashboard. But I could have also written this dashboard directly against Pulsar using Pulsar's WebSocket API. It depends what you're doing, really. I should be joining this to another data source, and there are lots of different things I could join it to, depending on what tables I have out there. Like this one: I'd probably join it to multiple devices that are either the same or similar, maybe looking at ones in similar regions, or from the same area, or with similar sensors. Here I'm looking at carbon dioxide and volatile chemicals in this area, so I might use that to pinpoint things going on and maybe join devices based on lat/long. I should probably add lat/long based on where the device is; I can add a GPS sensor there, or just set it manually if I know that sensor is not moving. Again, when you're setting up these things, you put them where it makes sense. And as you can see, more data shows up. This is because another record came off the device, got pushed to NiFi, NiFi did a little bit of cleanup on it, and then we pushed it into Pulsar, and then Flink got that event and it shows up here. It'll update this materialized view with any updates, and then we can push that to a dashboard, a Jupyter notebook, a regular application, or maybe another Flink app, or maybe a Spark app. I mean, you have a lot of options here.
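The query feeding that materialized view could be as small as a windowed aggregate. Here is a hedged sketch of that in Flink SQL from PyFlink, with an event-time watermark so the tumbling window can close; the table, fields, and connector options are the same hypothetical ones as in the earlier sketch.

```python
# Sketch: per-device one-minute CO2 averages with a tumbling window -- the kind
# of summary a materialized view or dashboard would cache. Same caveats as
# before: the connector WITH options are illustrative placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE edge_sensors (
        device_id STRING,
        co2_ppm DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'pulsar',
        'topics' = 'persistent://public/default/edge-sensors',
        'service-url' = 'pulsar://localhost:6650',
        'admin-url' = 'http://localhost:8080',
        'format' = 'json'
    )
""")

# One row per device per minute, emitted continuously as windows close.
t_env.execute_sql("""
    SELECT window_start, device_id, AVG(co2_ppm) AS avg_co2
    FROM TABLE(TUMBLE(TABLE edge_sensors, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
    GROUP BY window_start, device_id
""").print()
```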
Or it can be pushed into a data sink like Pinot or Hive or Kudu or HBase or Mongo or a relational database. Lots of options there; it depends on what makes sense for you. I want to show you another thing we have here, and this is the latest release candidate for Apache NiFi 2.0, which runs on JDK 21, which is super fast, and has the ability to run Python. There are a couple of Python processors built in for doing some cool stuff like chunking documents, parsing them, and interacting with ChatGPT; you've got to put your keys in there. It's pretty fun pushing to different vector stores like Chroma and Pinecone. Also, there are a couple of new processors I like, including one that'll listen to Slack. So if I post a message to Slack, NiFi will grab that and we'll be able to process data coming from Slack, which is really cool. Another new feature is that I can now take groups of processors and run them as stateless, so the group runs in its own clean environment, isolated from anything else, and runs from start to finish as sort of a job or function-as-a-service: it completes, you get the logs and results, and then it stops, depending on how you want to schedule it. Here we've got it running in one-minute segments, but it depends on how you want it to go. Typically with these you'll do something that's triggered, say listening for S3 changes or a syslog, or something landing in a file system. Any time a new object shows up, maybe we run some processing against it, and then we're done. That just gives you an example; there are lots of different things you could do. The new NiFi 2.0 also has the ability to read parameters dynamically from different servers, like 1Password, a database, Azure Key Vault, or AWS Secrets Manager. It's a nice way to do that. A couple of other new features in the new system are pretty cool, like flow analysis, and this is an area where we needed a lot of work. For new people who are learning NiFi, you can add rules to their server to dissuade them from doing things that may be problematic. With this you can tell them not to use certain component types. NiFi still has a bunch that haven't been deprecated yet, like any of the processors without records; if you're using structured data, you might want to avoid those, because we've added new record-based ones for Excel, for Windows events, for YAML, for Grok. So you can do a lot there; it makes it pretty easy. Let's get back to this here. I think we're pretty close to time, so I want to thank David. Hopefully everyone has learned some decent stuff about using the FLiPN pattern, and definitely reach out. If you're interested in seeing more on Pulsar, NiFi, or Flink, all of these are very cool ways to write apps, and as you can see, there was no magic there. Setting these things up is pretty easy, especially if you use the cloud-managed services for them; even if not, within a Docker container or Kubernetes it's a couple of clicks. Things are running, as you saw: drag and drop, some simple SQL, and things are just running for you, and you get your data and do what you want with it. We can join things together; Flink has a pretty rich SQL. Here I'm joining two different topics based on a similar key. There are lots of different things you could do. You could also use Debezium here, so I can load from relational tables and join them together. It's a pretty powerful way to do that. And with this I can join things like Pulsar and Kafka together, or join a database table with a topic. So, a lot of options there.
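That two-topic join is a few lines of the same SQL. A hedged sketch, assuming the CO2 and VOC readings land in two separate hypothetical tables declared like the earlier ones, joined on device ID and close-together event times:

```python
# Sketch: interval join of two Pulsar-backed tables -- pair CO2 and VOC
# readings from the same device taken within five seconds of each other.
# Table names, topics, and connector options are placeholders, as before.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare one Pulsar-backed table per sensor topic.
for name, field in [("co2_readings", "co2_ppm"), ("voc_readings", "voc_ppb")]:
    t_env.execute_sql(f"""
        CREATE TABLE {name} (
            device_id STRING,
            {field} DOUBLE,
            ts TIMESTAMP(3),
            WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'pulsar',
            'topics' = 'persistent://iot/rpi400-01/{name}',
            'service-url' = 'pulsar://localhost:6650',
            'admin-url' = 'http://localhost:8080',
            'format' = 'json'
        )
    """)

t_env.execute_sql("""
    SELECT c.device_id, c.co2_ppm, v.voc_ppb, c.ts
    FROM co2_readings c
    JOIN voc_readings v
      ON c.device_id = v.device_id
     AND v.ts BETWEEN c.ts - INTERVAL '5' SECOND
                  AND c.ts + INTERVAL '5' SECOND
""").print()
```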
Thanks for coming to the talk. David, do you want to send them off? Yep. Thank you everyone for attending the talk. Hopefully you have a lot of fun with the FLiPN stack. Thanks, Tim. Thanks everyone for attending.

Tim Spann

Principal Developer Advocate @ Cloudera


David Kjerrumgaard

Developer Advocate @ StreamNative



