Conf42 Python 2024 - Online

Sketching Algorithms: Making Sense of Big Data in a Single Stroke

Video size:


What if you could take all your big data problems and shrink them down into small data problems so that you can process everything faster with less memory and constant query times? Data sketching applies theoretical math to computing to make slow big data problems into fast small data problems.


  • Tun Shui is the vp of data at Quix. He talks about sketching algorithms and how you can make sense of big data in a single pass. Quix Cloud is a fully managed platform where you can run your streaming applications and build out real time data pipelines quickly.
  • Sketches are small, stateful streaming data programs that deal with a huge volume of data. They provide approximate answers orders of magnitude faster than exact techniques. Here are some characteristics of sketches.
  • Distributed systems use parallelization to attempt to speed up queries. Why are sketches fast? There are four main parts of sketching. This enables the merging of multiple data structures without loss in accuracy.
  • The count min sketch is for estimating cardinality amongst many duplicates. You also can estimate frequent items. This is good for recommended systems as well. To give you an idea of how the algorithms work, I'll cover the naive example.
  • The Apache data sketches project was started at Yahoo in 2019. It's available in Java, C Plus plus, and you'll be happy to know it's also available in Python. It has a really active community and it's got a good roadmap for implementing sketches.
  • You can get started with streaming data using Python today. You may already have sketching algorithms available to you if you use certain database technologies or processing engines. Here are the links to the Quickstream's open source library in GitHub, where you can start the project.


This transcript was autogenerated. To make changes, submit a PR.
Hey folks, thank you very much for joining me today at Comp 42 Python. Today you get to learn about sketching algorithms and how you can make sense of big data in a single pass. My name is Tun Shui and I'm the vp of data at Quix. Wix is a remote company and we have teams across Europe and the US and our headquarters in London in the United Kingdom. A bit about myself I have a background working in high growth startups. As a data engineer and head of data. I help companies determine their data and AI strategy and where possible, center it around streaming technologies. I'm a big fan of the mantra less is more and simplifying things because I think that just makes everything easier to build on and maintain. And I'm really interested in the real time data and AI ecosystem, which should explain why I'm so excited about sketching algorithms. I've popped my LinkedIn there in the corner, so feel free to connect. First of all, a bit of information about Quix, so our mission is Python. First stream processing. The streaming data landscape was started and built in Java, and to this day it's mainly accessible to those who use Java and SQL. Quicks focuses on the Python developer experience, and we have two products. We have Quickstreams, which is an open source library that uses data frames to work with data and build streaming and event driven applications. And since sketching is based on streaming data, it goes hand in hand naturally, with quicks. Quix Cloud is a fully managed platform where you can run your streaming applications and build out real time data pipelines quickly using open source connectors. We're pushing the idea of streaming data frames, and as you can see, it's reminiscent of the other data frames you may have encountered in pandas or Pyspark. You can build your data transformations using Quickstream's applications, and they're designed to work with Docker and be containerized. You can deploy those containers on your own infrastructure, or if you prefer not to manage infrastructure, you can deploy it on Quicksloud. We provide a nice interface so you can build out streaming data pipelines. You can get started quickly integrating data with connectors to data sources such as Kafka, MQTT, and databases such as postgres, as well as popular destinations for data such as Snowflake, redis, and influxdB. The connectors and the transformation code samples are all open source, and we have contributions from our friends and partners such as influxdB, and we'd love to have more contributions, so please get in touch if you're interested. So what is a sketch? Note that this is a really large topic, because there are families of sketching algorithms, all for different use cases. And my hope here is to introduce you to a few of my favorites and inspire you, and hopefully get you introduced to the possibilities. So you go out and explore by yourselves. After internalizing all the documentation. My experience using sketches, I've come up with this that sketches are small, stateful streaming data programs that deal with a huge volume of data, and they provide approximate answers orders of magnitude faster than exact techniques. It's grounded in the truth that when it comes to big data, approximate solutions are favorable over exact solutions, because they're a good tradeoff for memory resources and time. And you might not realize it, but you're already probably accepting of approximations. You may have heard about this little tool called Chat GPT, which works by responding with its best guess, an approximation. And here are some characteristics of sketches. The first is that they're small, and they have to be small, and they're usually only a few kilobytes in size, meaning that they have reduced memory requirements and will grow sublinearly in space. They're stateful, and they maintain a data structure that keeps the state of the observations. So every data point that comes in isn't retained in its exact representation, it's streaming, meaning that it deals with data in a single pass. So the sketch needs to look only at each item in the stream once. And they're fast. They're orders of magnitude faster than exact techniques. The results are also mergeable without a loss in accuracy, and that keeps everything fast. And these are the components. So the first typical stage of a sketching process is a transformation that makes the input data stream resemble white. So you have a uniform distribution of values, and this is usually achieved by hashing the input keys and then normalizing the result into a fraction, that is, a random number between zero and one. The second stage of the sketch is the creation and updating of a data structure that follows a set of rules for compressing down the values it receives from the transform stage. So in the middle there, when you initialize the sketch, you can configure the parameters for the data structure, which enable you to manage its memory. The last stage, there are estimator algorithms that take a query, look it up in the sketch's data structure, and returns a result. Now this result would be an approximate, but the important thing to note is that it will have mathematically proven error distribution bounds. So this helps in dealing with the confidence of the results. So whilst tools like chat JBT give you an approximation. It's not always transparent about its error thresholds, which is why you get those things called hallucinations. But with sketches you have mathematically provable error bounds, which helps with confidence that the approximation is sufficient. And let's talk about why being exact is slow. We live in a world where the majority of the time we use exact techniques. And what are exact techniques? It's when you run a query to get the exact answer. For example, a common one I'll go through today is a count of the number of unique identifiers. The most common example is a database query, and hopefully everyone here has experienced a query that runs for a really long time. If you have billions of rows with high cardinality, that is, you have lots of unique identifiers. You could experience a really slow query when you're running a group by could. So distributed systems exist and they use parallelization to attempt to speed up these queries. And here's an example of how parallelization works to solve a problem such as word count. So on the left there you have some input data coming in. And to get the unique count of words, count of unique words, it has to go through these stages of splitting the data, mapping, shuffling and reducing. And these operations happen on different nodes or machines, so each box there represents a node or a separate machine. So splitting and mapping are self explanatory. But why does it need to go through shuffling and reducing? The reason is because if you look at the mapping step, each of the nodes have local answers and they are non additive, meaning that you can't simply sum them up since the same word occurs in multiple nodes. Here you can see that at the mapping step you can see the word deer and the word bear appear in two different nodes. So shuffling ensures that the same unique words are colocated so that they can be accurately counted. So when the reducer performs the sum operation, you get accurate results. There at the end you get bear is two, can is three, deer is two, river is two, and this is taken from Hadoop's documentation. And this is common to all massively parallel processing systems such as spark and snowflake. And really all distributed databases work in a similar way. And the take home message here is that shuffling is slow. And when processing queries you have to do what you can to avoid ad hoc network I o. And there are techniques in distributed systems like Spark that help you to get around this by using broadcasting, where you would use broadcast joins where you want to join two tables. So that rather than a huge amount of shuffling taking place. A copy of the smaller table can be stored upfront alongside the larger table on the same machine, so it's available when the join needs to be performed. And I refer back to the classic notes here of latency numbers that every programmer should know. When you compare the time it takes to read from memory versus reading from disk versus round trips over a network, we're talking orders of magnitude difference. Even though at certain scale we're counting things in nanoseconds and milliseconds. When the data gets large and you're dealing with many different unique items, we're talking orders of magnitude difference. So why are sketches fast? There are four main parts of sketching I should go through first. So the first is the stream processor. It processes the data as it streams in with a random algorithms selection. There's also a data structure which starts off as an empty summary and is updated with each data point over time. As it comes in, the size grows sublinearly. The query processor, it computes the desired result and models the error properties using probability statistics. And lastly, we have merge and set operations. So these are the operations that operate on multiple sketches, and there is no loss in accuracy doing so. Here's an illustration of those four parts. So you have the front end stream processor that uses a stochastic process to randomly select data, which it uses to populate a data structure. This data structure can be queried at any time using a back end query processor that has probabilistic analysis to return an approximation with known error thresholds. And at the bottom there you have a way of dealing with multiple sketches where you can stream them in and perform merge or set operations to combine them into their resulting sketches. And you should note that the sketches are designed to be cross language compatible as well. The data structures should be easily queryable by the implementation in any language. I used the word sublinear earlier, and what does that mean? So here on the x axis we're plotting the number of items or the cardinality, and on the y axis it's the memory or the space. So in most data structures that take exact representations of input data. So for example, if you're using words in a key value store, the growth in memory as cardinality goes up is somewhere between linear and superlinear. But with sketches it's always smaller. So with sketches it's always sublinear. The mergeability I mentioned also keeps things fast. This enables the merging of multiple sketches without loss in accuracy, and this solves nonadditive challenges. So I've thrown another term at you. So I'm going to define that too. So what are non additive challenges? So, aside from being absolutely everywhere, let's take this example where Quix has gone into the retail industry. So we've opened up four stores in some of the busiest cities in the world. They happen to be the fashion centers of the world. We have the annual sales figures from each store, as well as the count of unique products in each store. And with the actual sums of the annual sell figures, since money is additive, you can sum it up to get a global total sum. So you can see we're doing pretty good here. But when it comes to the unique item counts, you can't do this, because the last time I checked, the quicks London store definitely has some of the same items as the quicks Paris store. And the quicks Tokyo store has some of the items as the Paris store. So there's usually an overlap. And since that information about unique keys is lost at aggregation time between the stores, you cannot simply sum them up. You would just get incorrect unique counts. But if you use sketches, you could solve this. You could save the sketch for daily records in the database in their own column here on the left, we have two tables. So one is for all the days in February for the London store, and the one below is all the days in February for the quicks Paris store. We could union the number of days in the month to get the sketches for the unique products per store. Then to get the monthly unique count, it would be a case of doing a merge on the store's sketches. So here we can combine the sketches from each of the european stores to get a european approximation of uniques. And of course, you could employ data modeling to get the exact unique counts, but that would involve combining data across doors. That's probably a lot of network I o. And you would have to do this in a batch job. You'd wait for all the data to come in and could perform historical batch. You'd perform it as a historical batch job. So this is a really good example of using sketches where you're able to do approximate analysis with data that is being streamed in, and to perform that approximation in real time or near real time. And this is really why merge ability with sketches is such a superpower. Let's have a look at some of my favorite sketches. So the first is counting unique. So the ones I really like are the ones that estimate count distinct. So there's one called the count min sketch. Also, for estimating the cardinality amongst many duplicates. So for that you have hyperlog log and you have compressed probabilistic counting sketches. You also can estimate frequent items. So this is good for recommended systems as well. So when you estimate what are termed the heavy hitters or top k, for example, during comp 42, if we were streaming in the data, we could perform an approximation for what are the top most viewed talks in this conference. We have quantiles and data distributions. So this estimates the distribution. So you've got the different percentiles, like 25th percentile, 95th percentile, 99 percentile. You can do all of those with well defined error bounds. So if you're streaming in data again from this conference, you could determine and approximate how much time people spend on each of the web pages. And lastly, we have sampling. So sampling keeps some semblance or the transformation of the item from the stream. For example, with reservoir sampling, you choose k samples from a list of n items and you retain that. And to give you an idea of how the algorithms work, I'll cover the count min sketch. Let's start with a naive example, and I'm going to go with food items. So here we're counting, or rather approximating food items. So on the left there we have a continuous stream of food items coming in. So in practice, this could be food items from orders in an online supermarket or a food delivery service. And on the right we have a hash or key value data structure where we're using the key as a representation of the food item, and the value is the incrementing count of the number of times it's observed in the stream. So as we go through the stream one at a time, we're counting this. So by the end, when we get to the last item there, we have counted one burger, three pizzas, one salad, one burrito, et cetera, et cetera. So this example does only have ten items, but this approach won't scale at high cardinality. So when there's a huge increase in the number of unique items, it just simply wouldn't scale. So if this stream introduced all the food items in the known universe, the data structure would have to grow linearly or sublinearly this way to hold all the keys. It would mean reserving huge amounts of memory and long wait times when you're querying to retrieve the cans for a food item. And if you wanted this in real time or near real time, forget about it. And from here, there's not much you could do to improve things, like in a naive way, you could reduce the representation of the food item down to perhaps the first three characters. So, for example, instead of storing the graphical bitmap, you could save pizza as P-I-Z burger becomes bur. But then eventually, as the list got longer and the cardinality increased, could start having key collisions, like when a burrito comes along and tries to increment bur as well. So let's see an example with hashing, which is a better representation of the count min sketch. So, in this example, rather than taking the exact representation of the food item, we'll run it through two hashing functions. And you can see that there on the right, we've got two of the columns, each of which represent the hashing function and the number that they return. And we're going to use this to know how to store it in the Data. So it's kind of like encoding a Burger as one and six and storing it in the Data structure here in this Array in the Middle, that Way. So we'll use the hashing functions there on the right to increment the counts in this 2d array in the middle. So this array starts off all blank, or rather all zeros. And the first item in the Stream is a Burger. So we pass it through to the two hashing functions there on the right. And we see, for Burger, we get the representation of one and six. We then go to the one in the first column and increment that by one. So it goes from zero to one. And then we go to six in the second column, and we increment that also by one. Now we have one and one in those positions, and we're done for this item. So we read the next item in the stream. It's the pizza. We look it up on the right there, and it's encoded as four and one. So we go to positions four and one in our array in the Middle, and we increment at those positions, four and one. So we have a count of one and one. And we'll keep doing that. So we can do that for Salad as well. But let's see what happens when a duplicate appears. So when another pizza appears, we look up its hash representation again, which is for one. And since we already have a count of one in each, we increment them again. So now we have the number two in both those positions. So you just keep incrementing them by time you encounter them in the Stream. So let's fast forward to the last item. So this is what it looks like when all ten food items are stored in the Array. And now let's try querying the Data structure now. Because it's a stream, you can query it at any point in time. That's the beauty of this. So you can perform an approximation at any time that you like. So let's say at the 10th item mark, we want to query this Data structure. We want to get an approximation for the count of burgers. So we look there, on the right there, we see the hash representation is one and six. So we look into our data structure in the middle and we get the counts of three and one. Now you've probably figured out why this sketch is called count min, because we take the minimum value of the obtained counts. So out of three and one, the smaller number is one. So we have our approximation of account of one for burgers, which in this case matches the exact count. Let's do another example. Do the same with pizza, which has a hash representation of four one. So we look up four one in the middle data structure, and that gives us account of three four, and we take the min. So that's an approximate could of three for pizza, which again matches the exact count. So you get the idea. You can see that for some of the food items, for example, if you look at the last one there for tacos, that's six and one, which gives us a could in the middle of one and four. So you can see that though we take the min as one and our approximation becomes one, one of its counts got as high as four. So some errors can start creeping in, like with this particular size. And so this would suggest that you either need to increase the space, making the arrays bigger, or adding more hashing functions. And these are sort of parameters you can control in sketching. So this would all work together to reduce the collisions and keep the counts as accurate as possible. So now that you know what sketches are, how they work and how useful they are, how do you get started with them today? So you're probably thinking, will I need to implement them like it sounds like some complex statistics and math going on. Is there anything available in open source? Well, the good news is you'll be glad to hear such a project does exist. So please go and check out the Apache data sketches project. It's available in Java, C Plus plus, and you'll be happy to know it's also available in Python. This project was started at Yahoo in 2019, and it achieved Apache's top level project status in 2021. So it has a really active community and it's got a good roadmap for implementing sketches. There's a very clear picture of the sketches that have been implemented and the ones that are coming up. And the beauty of this project is that it's cross language compatible, meaning that you can mix and match languages for different components. So like I mentioned before, with the stream processor you could implement that in Java and then the back end query processor could be implemented in Python and it would all work. But we're at Conf 42 Python, so of course we're going to do it all in Python and I'm going to do that here. So I'm going to use these final few slides to show you some code in Python. So this is an example in a Jupyter notebook where I load in some Spotify data and filter it for the artist information. So you see I've imported in the library there, and I've also imported in pandas to load in the data so they're interoperable. You can see there at the bottom there that I've specified a confidence score that I want, which is percentage. So I want 90% confidence. And you see that the library gives you some convenience functions that helps you get a suggestion for the number of hashes you should be using to get that level of confidence. And the same again for the relative error, which I'm going for 1%, it will suggest the number of buckets that you want to instantiate this count min sketch with. And to make it easier to follow, you'll see that I load in the records one at a time using a loop. So when you run it with such a large data store, such a large data source as the Spotify data that I'm using, it actually takes a while. And in reality what you'd be doing is you'd be reading the records one at a time from a stream and then updating the sketch that way. So for that you could use Kafka with quicks for this because it's sort of a general purpose use case. But if your use case was something else like getting approximation statistics like quantiles and counts from changes that are happening in one of your databases, you could use change data capture with something like debesium to obtain a stream of changes, and then you perform the updates in that way. So getting into a stream is really the first part of the puzzle. There you can see the sketches provide functions there which allow you to specify different things. There you've got a lovely print statement there that shows you what their sketch summary is with their number of hashes, the number of buckets, the number of filled bins, et cetera. So this is a good way for you to keep can eye on how you should size out your sketches. So yeah, you can see they maintain the desired confidence score. You've got nice summaries, and at the end you can see I've put in the estimations. I've requested the estimations for some of the popular artists in that data source, and this one's my last slide. So you can pip install the Python library today and get started. You may already even have sketching algorithms available to you if you use certain database technologies or processing engines. So we have Druid, we have postgres and Hive, which have been supported for some years, and it's been implemented recently in Apache Pinot and Apache Spark last year as well. So if you're using any of these, you should be able to just search for in the documentation the sketching algorithms available, and a lot of them will have things like hyperlog log and some sort of counting or quantile operation they should have. So as I mentioned, there is an active community of mathematicians, data scientists and engineers working on sketches, so you'd be amongst really good company if you do check out the project and want to contribute. And so yes, I'm certainly going to be exploring more sketches in future, but I'm curious to know how you'll get on as well. So please do let me know what you think. Here are the links to the Quickstream's open source library in GitHub, where you can start the project to follow its process and follow its progress, and most importantly, show us your support. And also please do come hang out with me and the rest of the Python team in the quicks community on slack, and we'd love nothing more than to help you get started with streaming data using Python today. So if you're going to implement the algorithms and you don't know where to start, please join us there. So thank you very much for taking the time to join me today, and I hope to see you soon. Peace out.

Tun Shwe

VP of Data / DevRel @ Quix

Tun Shwe's LinkedIn account Tun Shwe's twitter account

Awesome tech events for

Priority access to all content

Video hallway track

Community chat

Exclusive promotions and giveaways