Conf42 Python 2022 - Online

Building a real-time analytics dashboard with Streamlit, Apache Pinot, and Apache Kafka

Abstract

When you hear “decision-maker”, it’s natural to think, “C-suite”, or “executive”. But these days, we’re all decision-makers. Restaurant owners, bloggers, big-box shoppers, diners - we all have important decisions to make and need instant actionable insights. In order to provide these insights to end-users like us, businesses need access to fast, fresh analytics.

In this session, we will learn how to build our own real-time analytics application on top of a streaming data source using Apache Kafka, Apache Pinot, and Streamlit. Kafka is the de facto standard for real-time event streaming, Pinot is an OLAP database designed for ultra-low latency analytics, and Streamlit is a Python-based tool that makes it super easy to build data-based apps.

After introducing each of these tools, we’ll stream data into Kafka using its Python client, ingest that data into a Pinot real-time table, and write some basic queries using Pinot’s Python SDK. Once we’ve done that, we’ll glue everything together with an auto-refreshing dashboard in Streamlit so that we can see changes to the data as they happen. There will be lots of graphs and other visualisations!

This session is aimed at application developers and data engineers who want to quickly make sense of streaming data.

Summary

  • Real-time analytics is the discipline that applies logic and mathematics to data to provide insights for making better decisions quickly. We want to give our end users analytical querying capabilities on fresh data. We'll look at some examples of places where these kinds of dashboards have been built.
  • We're going to be using Streamlit, Pinot and Kafka. Pinot is designed for real-time, low-latency analytics. We'll use data from the Wikimedia recent changes feed. How are we going to glue all these components together?
  • A Pinot table also needs to be attached to a schema, whose columns describe the data going into Pinot. The data itself is stored in segments. We'll then look at how to build a Streamlit application on top of this.
  • Streamlit lets us build a dashboard that shows the changes made by different users and what types of changes they've been making, and it integrates with any Python libraries we want to use.
  • We consumed a streaming API of Wikipedia events, connected to Kafka using the Kafka Python client, and then used a Pinot Python client to get the data back out. The code used in the demo is all available on a GitHub repository.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, and welcome to this talk, in which we're going to learn how to build a real-time analytics dashboard with Streamlit, Apache Pinot and Apache Kafka. Let's first start by defining real-time analytics. Gartner has this quite nice definition where they say that real-time analytics is the discipline that applies logic and mathematics to data to provide insights for making better decisions quickly. I've highlighted the key bit here, which is that we want to make decisions quickly. That's the idea: the data is coming in and we want to quickly be able to derive some sort of actionable insight from that data. They further divide it into on-demand and continuous analytics, where on-demand is where the user makes a query and gets some data back, and continuous is where the data gets pushed to you. We're going to be focusing on the on-demand part of that.
So that's real-time analytics, but what about real-time user-facing analytics? The extra bit here is the idea that maybe we've always had these analytical query capabilities, but we want to give them to the end users rather than just having them for our internal users. We want to give our end users analytical querying capabilities on fresh data: we want to get the data in quickly and be able to query it. In the bottom right-hand side of the slide you can see some examples of the types of applications where we might want to do this. We can break it down further like this. The real-time part of the definition means the data should be available for querying as soon as possible: the data often comes in from a real-time stream and we want to be able to query it as quickly as we can, so we want very low latency on data ingestion. On the user-facing side, we want query response times similar to what we'd get with an OLAP database, i.e. they should come back in milliseconds. And in addition to that, lots of users are going to be executing those queries at the same time, so we need to be able to handle high-throughput, low-latency queries on very fresh data.
Let's have a look at some examples of places where these types of dashboards have been built. We've got one example here from Uber Eats. Imagine we're running a restaurant on Uber Eats. In the middle of the screen we can see some of the data from the last twelve weeks: how much money have we been making, and can we see changes in what's happened over the last week? And then on the right-hand side are some things that might need our attention. We can see there are some missed orders, some inaccurate orders, some downtime, and those are places where we might want to go and do something, and do it now. So it's allowing us to see in real time that there's something we need to do that might fix a problem for somebody. LinkedIn is another good example of where these types of dashboards have been built, and we've got three different ones here. On the left-hand side we've got a user-facing one, so that might be for you and me using LinkedIn. For example, we might want to see who's viewed our profile: it shows the traffic to our profile over the last few months, and if you remember this page, we can also see who's been viewing it and where they work.
Over in the middle we've got a recruiter dashboard. Recruiters are trying to find people to fill the jobs that they have available, and this gives them a view of all the data that's there to try and help them work out where they might best target those jobs. And then finally, on the right-hand side, is more of an internal product or BI tool. This is capturing metrics, and we want to be able to see in real time what's happening, so maybe this is a tool that's being used by a product manager. Okay, so that's the overview of what a real-time analytics dashboard is. Now let's try and build our own one, and we're going to be using three tools to do this: Streamlit, Pinot and Kafka. Let's have a quick overview of what each of those tools is. Streamlit I first came across a couple of years ago, at the beginning of 2020. It's a Python-based web application framework, and the idea is that it should be used to build data apps, often machine learning apps, and it integrates with lots of different Python libraries: Plotly, PyTorch, TensorFlow, scikit-learn. If you have a Python client driver for your database, it'll probably integrate with that as well. And if you want to share your apps with anyone else, they have a thing called Streamlit Cloud that you can use to do that. Apache Pinot is next up. That's a columnar OLAP database, but it also has additional indexes that you can put on top of the default columns, and it's all queryable with SQL. The idea is that it's used for real-time, low-latency analytics. And finally we've got Apache Kafka, a distributed event streaming platform. It uses a publish-and-subscribe setup for streams of records; those records are stored in a fault-tolerant way and we can process them as they occur.
So how are we going to glue all these components together? We're going to start in the top left-hand corner. We're going to have some sort of streaming API, so some events coming in from a data source, and we'll have a Python application processing those. We'll then load them into Kafka. Pinot will listen to the Kafka topic, process those events and ingest them. And then finally our Streamlit application will query Pinot to generate some tables or some charts so that we can see what's actually going on with our data.
The data set that we're going to use is called the Wikimedia recent changes feed. This is a continuous stream of all the changes being made to various Wikimedia properties, for example Wikipedia, and it publishes the events over HTTP using the server-sent events protocol. At the bottom of the slide you can see an example of what a message looks like: we've got an event, it says it's a message, we've got an id, and we've got a bunch of data. It indicates the title of the page, the URI of the page, the user, what type of change they made, when they made it, and so on.
So what we're going to do now is move on to our demo. We'll go into Visual Studio Code here, and what you can see on the screen at the moment is a Docker Compose file that we're going to use to spin up all those components on our machine. First up we've got Zookeeper, which is used by both Pinot and Kafka to manage their metadata. Next one down is, of course, Kafka.
So that's where the stream of data is going to be. It connects to Zookeeper, and it also exposes a couple of ports where we can access it from Pinot and from our Python script. And then the last three entries are the various Pinot components. We've got the Pinot controller, which manages the Pinot cluster and takes care of all the metadata for us. We've got the Pinot broker: that's where the queries get sent to, and it then sends them out to the different servers, gets the results back and returns them to the client. In this case we've only got one server, but in a production setup you'd have many more. And then finally we've got the Pinot servers themselves, which store the data and process the queries. So I'm just going to open up a terminal window here and run the Docker Compose script, docker-compose up, and that will spin up all of these containers on our machine.
We'll just minimise that for the moment and navigate over to wiki.py. This is the script that we're going to use to get the event stream from Wikimedia. You can see on lines ten to 13 we've got the URI of the recent changes feed. We're going to call it with the requests library and then wrap it in this SSE client; this is a server-sent events client for Python, and it gives us back an events function that yields a stream of all the events. So if we open up the terminal again, open a new tab, and call python wiki.py, we'll see we get a stream of all those events, loads and loads of events coming through. We'll just quickly stop that, and if we have a look at one of the events, if we just highlight one down here, you can see it's very similar to what we saw on the slides: we've got the schema, it indicates whether it's a bot, we've got the timestamp, we've got the title of the page and a bunch of other metadata as well.
Okay, so that was the top left of our diagram: we've got the data from our event stream into our Python application. Now we need to get it across into Kafka, so that's our next bit, the wiki-to-Kafka script. The first bits are the same: we've got the same code getting the data from the source. We've now got a while True loop, which handles losing the connection to the recent changes feed; it will just reconnect and then carry on processing events. The extra bit is here on line 39, where we're adding an event to Kafka. The producer is defined up here on line 25, and that's a Kafka producer for putting data onto a Kafka topic; in this case, the topic is the Wikipedia events topic. And the last interesting thing is on lines 41 to 44: every hundred messages, we flush those messages into Kafka. So let's get our terminal back again, and instead of calling the first wiki script, we'll call the wiki-to-Kafka script. If we run that, the output will just be a message every hundred events: it says it's flushed a hundred messages, then another hundred, so you can see those are going in nicely.
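The full wiki-to-Kafka script is in the GitHub repository linked at the end of the talk. As a rough, minimal sketch of the pattern just described, assuming the requests, sseclient-py and kafka-python libraries, a broker reachable on localhost:9092, and a topic name I've guessed as wikipedia-events, it boils down to something like this:

```python
import json

import requests
import sseclient
from kafka import KafkaProducer

# Wikimedia's public recent-changes feed (server-sent events over HTTP)
WIKI_URI = "https://stream.wikimedia.org/v2/stream/recentchange"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # if the HTTP connection drops, reconnect and carry on processing events
    try:
        response = requests.get(WIKI_URI, stream=True)
        client = sseclient.SSEClient(response)
        for count, event in enumerate(client.events(), start=1):
            if not event.data:
                continue
            producer.send("wikipedia-events", json.loads(event.data))
            if count % 100 == 0:
                # flush every hundred messages, as in the demo script
                producer.flush()
                print(f"Flushed {count} messages")
    except requests.exceptions.RequestException:
        continue
```

send() is asynchronous in kafka-python, so flushing in batches of a hundred keeps the loop moving while still giving us a periodic confirmation that messages have actually been delivered.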
If we want to check that they're making their way into Kafka, we can run this command here: it calls the Kafka console consumer, connects to the Kafka broker running on localhost:9092, subscribes to the topic, and reads all the messages starting from the beginning. So if we open ourselves up another tab, let's have a quick check that the data is making its way into Kafka. There we go, you can see all the messages flying through, and if we kill it, it says it's processed 922 messages. That's only because we killed it; there are probably more in there by now. And again we can see it's a JSON message, exactly the same message that we had before; we've just put it straight onto Kafka.
Okay, so far we've got to the middle of our architecture diagram. We took the data from the source, we got it into our Python application, originally we printed it to the screen, and now we've put it into Kafka, so all the events are going into Kafka. Our next step is: can we get it into Pinot? Let's just minimise this for the moment. What we need to do now is create a Pinot table, and that's what this code here will do. A Pinot table also needs to be attached to a schema, so let's start with that. What's a schema? A schema defines the columns that are going to exist in a table. Our schema is called wikipedia. We've got some dimension fields, which are the descriptive fields for the table: we've got id, we've got wiki, we've got user, all the fields that come out of the JSON message. Optionally, you can also have metric fields; for example, if there was a count in there, we could have a field for that, but in this case we don't. We do have a date time field, and those need to be specified separately, so in this case we specify the timestamp.
So that's the schema. The table is where we store the records, using the columns defined in the schema, and the data that goes into Pinot is stored in segments; that's why you'll see the word segments being used. We first need to specify the segments config: we say where the schema is, and if you had a replication factor you could specify it here, although we've just set it to one. You also need to indicate the time column name if you're creating a real-time table. A real-time table basically means: I'm going to be connecting to some sort of real-time stream, the data is going to be coming in all the time, and I need you to process it. We then need to say where that data is coming from, so we specify the stream config: in this case it's a Kafka stream coming from the Wikipedia events topic, and we need to tell it where our Kafka broker is, which is the port that we exposed for Pinot in the Docker Compose file. Then we can say how it should decode the messages coming from Kafka, and finally, how big the segments that store the data should be. In this case we've said 1,000 records per segment, which is obviously way smaller than what we'd have in a real production setup, but for the purposes of this demo it works quite well. So let's copy this command, get our terminal back again, and create our Pinot table. That command runs, and you can see down here that it's been successful, so the Pinot table is ready to go.
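In the demo this is done from the command line against the config files shown on screen. As an alternative way to do roughly the same thing from Python, the schema and table config can be POSTed to the controller's REST API, assumed here to be on localhost:9000. This is only a sketch: the column names, topic name, broker address and stream settings are illustrative guesses, and the exact config keys can vary between Pinot versions.

```python
import requests

CONTROLLER = "http://localhost:9000"  # Pinot controller REST API

# Illustrative schema: dimension columns from the JSON message plus a time column
schema = {
    "schemaName": "wikipedia",
    "dimensionFieldSpecs": [
        {"name": "id", "dataType": "STRING"},
        {"name": "wiki", "dataType": "STRING"},
        {"name": "domain", "dataType": "STRING"},
        {"name": "user", "dataType": "STRING"},
        {"name": "title", "dataType": "STRING"},
        {"name": "type", "dataType": "STRING"},
        {"name": "bot", "dataType": "BOOLEAN"},
    ],
    "dateTimeFieldSpecs": [
        {"name": "ts", "dataType": "TIMESTAMP",
         "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
    ],
}

# Illustrative REALTIME table config that consumes the Kafka topic
table = {
    "tableName": "wikipedia",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "schemaName": "wikipedia",
        "replication": "1",
        "timeColumnName": "ts",
    },
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "wikipedia-events",
            # from inside the Docker network this would be Kafka's internal listener
            "stream.kafka.broker.list": "kafka:9093",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            # tiny segments (1,000 rows each), just for the demo
            "realtime.segment.flush.threshold.rows": "1000",
        },
    },
    "tenants": {},
    "metadata": {},
}

requests.post(f"{CONTROLLER}/schemas", json=schema).raise_for_status()
requests.post(f"{CONTROLLER}/tables", json=table).raise_for_status()
```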
Now we're going to go into our web browser so that we can see what's going on. We'll just load that up; there we go. This is called the Pinot Data Explorer, a web-based tool for seeing what's going on in your Pinot cluster. You can see we've got one controller, one broker and one server. Normally you'd have more than one of each, but since we're just running it locally, one of each is enough. On the left-hand side over here we've got the query console, and we can navigate into this table and write a query to say, hey, show me how many documents or rows you have; each time you run it, the number is going up and up and up. We can also go back and navigate to the table itself: you can see there's a table, it's got a schema, the schema is defined in more detail there, you can edit the table config if you want to, and down here it shows which segments are in there, so you can see the first segment that was created.
What we're going to do now is have a look at how we can build a Streamlit application on top of this. We'll go back into Visual Studio Code, where we've got our app.py. This is a Streamlit script: it's basically just a normal piece of Python code, except we've imported the streamlit library at the top, along with a bunch of other things that we're going to use. Whenever you want to render something to the screen, there's a streamlit-something call, streamlit markdown, streamlit header, whatever it is, and that puts the data on the screen. We'll come back and look at this in a minute, but let's first have a look at what a Streamlit application actually looks like. If we do streamlit run app.py, that launches a Streamlit application on port 8501, so we'll just copy that, navigate back to our web browser, open a new tab, paste that in, and you can see we've got a Streamlit application running.
This is actually refreshing every five seconds; you can see here the last time that it updated. I've got it running on a little loop that refreshes it every five seconds, and you can see that the number of records changes every time it refreshes. In the last one minute there have been 2,000 changes by 398 users on 61 different domains; the last five minutes is a bit more, the last ten minutes obviously a bit more than that, and the last one hour and all time are exactly the same because we haven't been running it for that long. We can then zoom in using the navigation on the left-hand side; that first page was an overview of what's happened. We can then see who's been making those changes: is it bots or not bots? At the moment, at least by however the feed defines what a bot is, 27% of the changes have been made by bots and 73% not by bots, and every five seconds this updates, so these percentages will adjust as we go. We can then see which users it was, who were the top bots and who were the top non-bot users, and you can see these accounts are making a lot of changes; this one here has made 548 changes in the seven minutes or so since we started running it. We can also see where the changes are being made: if we click onto this next tab, which Wikimedia properties are the changes being made on? It's mostly on commons.wikimedia.org, which is surprising to me; I'd expected most of it to be on Wikipedia, but it's actually mostly on Wikidata and Commons. We can then see the types of changes: we've got edit, categorize, log and new pages, and then, interestingly, one called conf 42, and I'm not entirely sure what that is.
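Each of these panels is backed by a SQL query against Pinot. As a minimal sketch of running such a query from Python, assuming the pinotdb client, the broker exposed on localhost:8099, and the illustrative table and column names used above, you could do something like this:

```python
from pinotdb import connect

# connect to the Pinot broker's SQL endpoint
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# e.g. which domains have seen the most changes in the last ten minutes;
# ago('PT10M') returns the epoch-millis timestamp for "ten minutes ago"
curs.execute("""
    SELECT domain, count(*) AS changes
    FROM wikipedia
    WHERE ts > ago('PT10M')
    GROUP BY domain
    ORDER BY changes DESC
    LIMIT 10
""")

for domain, changes in curs:
    print(domain, changes)
```

pinotdb implements the Python DB-API, so the cursor can also be fed straight into a pandas DataFrame, which is what the dashboard does before charting the results.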
We can also do drill-downs. When you're building analytics dashboards that's a pretty common thing: you have these overview pages, but then you want to drill down into one of them, so maybe it's show me what's being done by a particular user. In this case we can pick a user and see where they've been making changes and what types of changes they've been making. This user here, whose name I'm not going to attempt to pronounce, has made a lot of changes on wikidata.org, mostly editing and not really any categorizing. If we were to pick a different user, we could see what they've been up to in the same way. So that's the sort of dashboard that we can build.
Now let's go back and have a quick look at how we went about building it, so we could build one for ourselves. We'll just minimise this here. You can see we import any Python libraries that we want to use. This is where we're setting the refresh, so the screen refreshes every five seconds; you don't necessarily have to have that, and if you wanted just a manual button to refresh it, you could have that instead. We define the title, which was the title we had at the top of the page, and this is how we print the time of the last update. I've done a bit of styling myself here on the overview tab, which I've just explained. Down here we've got a sidebar title, and then we build a little map that has a function representing each of the radio buttons: overview, who's making the changes, where are the changes being made, and drill down. You don't have to have this: if you had a single-page app, you could literally print everything out in one script and it would all show on one page, but I wanted to be able to break it down. And if we narrow in on one of those functions, say this one here showing the types of changes being made, we've got our query: select the type and count the number of users, group by the type, for a particular user, so where user equals whichever user you selected. It executes the query, puts the results into a data frame, and then finally builds a Plotly chart on line 303 and renders it to the screen. The rest of the code is pretty similar to that: it's quite procedural, we're not really doing anything all that clever, it's reasonably basic Python code, and Streamlit is making it super easy to render it to the screen.
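The full app.py lives in the talk's GitHub repository. Stripped right down, the pattern just described looks roughly like this; the streamlit-autorefresh component, the pinotdb connection details and the column names are my assumptions rather than exactly what the demo uses:

```python
import pandas as pd
import plotly.express as px
import streamlit as st
from pinotdb import connect
from streamlit_autorefresh import st_autorefresh  # assumed auto-refresh component

st.title("Wikipedia Recent Changes")
# re-run the whole script every 5 seconds so the charts stay fresh
st_autorefresh(interval=5000, key="refresh")

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")

def changes_by_type():
    """Bar chart of the types of change (edit, categorize, log, new, ...)."""
    curs = conn.cursor()
    curs.execute('SELECT "type", count(*) AS changes FROM wikipedia GROUP BY "type"')
    df = pd.DataFrame(curs, columns=["type", "changes"])
    st.plotly_chart(px.bar(df, x="type", y="changes"))

def whos_making_changes():
    """Table of the most active users."""
    curs = conn.cursor()
    curs.execute('SELECT "user", count(*) AS changes FROM wikipedia '
                 'GROUP BY "user" ORDER BY changes DESC LIMIT 10')
    st.dataframe(pd.DataFrame(curs, columns=["user", "changes"]))

# map each sidebar radio button to the function that renders that page
pages = {"Who's making the changes?": whos_making_changes,
         "What types of changes?": changes_by_type}
choice = st.sidebar.radio("Select page", list(pages.keys()))
pages[choice]()
```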
So hopefully that's given you an idea of what you can do with these tools. I just want to conclude the talk by going back to the slides and recapping what we've been doing. Just to remind ourselves, what did we do? We had this streaming API of Wikipedia events. We used a Python application, with a couple of Python libraries including the SSE client, to process those events. We then connected to Kafka using the Kafka Python driver and put the events into Kafka. We then had Pinot connected up to that: we wrote a bit of YAML, so not only Python, to connect it to the Kafka topic and ingest that data into Pinot. We then used a Pinot Python client to get the data back out, and finally we used the Streamlit Python library to present it. So we've used lots of different Python tools, plus a few tools outside of Python, and we were able to glue them all together really nicely and build a web app really quickly that actually looks pretty good: it's nice, it's interactive, and it's all done using Python. Finally, if you want to play around with any of these tools, this is where they live: Streamlit, Apache Pinot and Apache Kafka, and the code used in the demo is all available on this GitHub repository. If you want to ask me any questions, or if anything doesn't make sense, you can contact me here; I've put my Twitter handle on the slide. Otherwise, thank you for listening to the talk.

Mark Needham

Developer Relations Engineer @ StarTree
