Conf42 Cloud Native 2024 - Online

Designing streaming data pipelines: Best Practices and Considerations


The talk will focus on the considerations for designing a scalable and reliable streaming data platform and how it differs from batch data platforms. Unbounded datasets present very different challenges from bounded data processing, and this session will focus on addressing those.


  • All right, let's look at some terms that are used in streaming. Event time is the time at which the event actually occurred. Watermark indicates when the processing of a given chunk of streaming data is considered complete. The challenges of out-of-order and late data form the fundamentals of how we look at streaming.
  • In the example, the accumulation mode is that you do not accumulate the results; you just give the final result. Let's talk about some real examples of how streaming at scale has some challenges, and how we can solve them using infrastructure and code considerations.
  • The first one is auto scaling. You need capabilities to be able to auto scale your existing machines in flight. Dynamic work rebalancing helps you avoid stragglers. Window processing decoupling is a very important one. These are some of the nuanced considerations.


This transcript was autogenerated. To make changes, submit a PR.
All right, so let's get started by looking at some terms that are used in streaming. I've put up this slide because it's important to look at those terms, so that we are using the same dictionary in the slides going forward. So let's look at some terms.
Event time: the time at which the event actually occurred. For example, if you're playing a video game, when did you actually hit a button? When did you actually hit pause? When did you actually do an activity? The actual time of that activity is what's called the event time.
Processing time: it's different from event time in that it's the time at which a streaming data ingestion pipeline actually processes your event. In ideal terms it would equal the event time, which is not usually the case; more often it is a few seconds or milliseconds after the event time, when the stream processor gets the data and processes it.
Watermark: that's the notion of completeness. We are going to discuss this in much more detail, but for now, just assume that the watermark indicates when the processing of a given chunk of streaming data is considered complete by a pipeline.
Window: chopping up the data into temporal boundaries for groupings and aggregations. What it means is, if you have to aggregate a certain chunk of streaming data, you window it. It's similar to batch, where you have group-bys and aggregates: you need a certain start and stop to group by and aggregate, and that start and stop in the streaming data world is called a window.
Triggers: when in processing time are the results materialized? When do you actually materialize the results?
Accumulation: the relationship between multiple results in the same window. Do you want to create a running sum? Do you want to create just one sum in a given window? What do you want to do?
So these are all the different terms that we are going to use in the world of streaming data.
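As a rough illustration of the window term above, here is a minimal Python sketch that chops events into fixed 60-second windows by event time and sums the values in each window. It uses only the standard library rather than a streaming framework, and the names `window_start` and `sum_per_window`, the 60-second size, and the sample events are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size, chosen only for this example


def window_start(event_time: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def sum_per_window(
    events: Iterable[Tuple[int, int]], size: int = WINDOW_SECONDS
) -> Dict[int, int]:
    """Group (event_time, value) pairs by window start and sum the values."""
    totals: Dict[int, int] = defaultdict(int)
    for event_time, value in events:
        totals[window_start(event_time, size)] += value
    return dict(totals)


# (event_time_in_seconds, value) pairs, deliberately listed out of order.
events = [(5, 1), (130, 3), (42, 2), (61, 4)]
totals = sum_per_window(events)
print(totals)
```

Because each event is assigned to a window by its own event time, the order in which the events show up does not change the per-window sums in this sketch.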
Let's look at some interesting challenges in streaming, as these challenges form the fundamentals of how we look at streaming. So let's look at this stream. The black line below shows the actual wall time, or clock time, and the red boxes are the events that are coming in. One was an event that happened at 08:00 a.m. but came into the pipeline after 11:00 a.m. Then there was an event that actually happened at 09:00 a.m. and came into the streaming pipeline between 11:00 a.m. and 11:10 a.m. And the third one was an event that happened at 10:00 a.m. and came into the streaming data pipeline between 11:00 a.m. and 11:15 a.m. So what does it show us? It shows that this data is out of order, and that it came late: the event happened at 08:00 a.m. but came in after 11:00 a.m. So should we process this late data, or should we not? These are the challenges: almost every time, you have data coming in out of order or late, because of networking issues or bandwidth issues between where the stream processor is and where the data originates. So you will not always get the stream at the time the events happened; there are a lot of use cases where the data comes in late. This forms the basis of the interesting challenges in streaming, where we worry about how to process this late-arriving data, or whether we should process it at all. That's the dilemma between correctness and completion.
In the batch world, correctness means I process the bounded data based on the business logic, right? That's what we check whenever we write our unit test cases and whenever we do the actual testing before productionizing the code. The notion of correctness in a batch platform is that, okay, I had a bounded data set, and the word bounded here means that it had finite boundaries.
Whether it's a table, a file, whatever it is, I have a bounded data set and I have to apply a set business logic to that bounded data set. If the business logic that sits between the input and the output gives me correct results, I'm good. That's correctness for me in the world of batch. In the world of streaming, it's everything I said for batch, plus figuring out how to handle out-of-order data. We saw in the previous slide that there was out-of-order data. If we have to think about correctness, should we add that data because otherwise it would be missing, or should we discard it because it's late? What would be more correct? That depends on your business logic, and it adds complexity in the world of streaming that we did not have in the world of batch.
Second is completion. In the world of batch, we think about completion as processing all the records in a bounded data set, be it a file, a table, et cetera. If a file had 10,000 records and I process all those 10,000 records with the business logic, that's completion. If I'm only partially processing a file, for example I process just 6,000 records and have 4,000 unprocessed records because of some schema issues, et cetera, that's incomplete. That's how we think about completion in the world of batch. But in the world of streaming, completion is more like: I try to process all the data, on time and late arriving, but how much of the late-arriving data should I process? Should I process data that arrives 30 minutes late, or 40 minutes late? What's completion in the world of streaming? So there's always a dilemma between correctness and completion in the world of streaming that we struggle with, and to delineate and demystify some of that, we have some important concepts in the world of streaming. The first one is watermark. Watermark is the notion of completeness. What it means is that all the input data with event time less than X has been processed.
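One common way to approximate such a watermark, sketched here under stated assumptions, is to take the largest event time seen so far and subtract a tolerated delay; anything that then arrives with an event time below that value is treated as late. This heuristic, the `WatermarkTracker` class, and the 30-second delay are illustrative assumptions rather than something taken from the talk's slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WatermarkTracker:
    """Heuristic watermark: max event time seen minus a tolerated delay."""

    max_delay: int  # assumed tolerance for out-of-order data, in seconds
    max_event_time: int = 0
    on_time: List[int] = field(default_factory=list)
    late: List[int] = field(default_factory=list)

    @property
    def watermark(self) -> int:
        # "All input with event time less than this is assumed processed."
        return self.max_event_time - self.max_delay

    def observe(self, event_time: int) -> bool:
        """Record one arrival; return True if it is on time, False if late."""
        is_late = event_time < self.watermark
        if is_late:
            self.late.append(event_time)
        else:
            self.on_time.append(event_time)
        # The watermark only moves forward as newer event times are seen.
        self.max_event_time = max(self.max_event_time, event_time)
        return not is_late


tracker = WatermarkTracker(max_delay=30)
for arrival in [100, 120, 95, 200, 150]:  # event times in arrival order
    tracker.observe(arrival)

print(tracker.watermark, tracker.on_time, tracker.late)
```

In this run the event at 95 still counts as on time because it is within 30 seconds of the newest event seen so far, while the event at 150 arrives after the watermark has moved to 170 and is classified as late. What to do with late events (drop them, or fold them into an earlier result) is the business decision the talk describes.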
Here X is the watermark. It is not perfect, because it needs a strategy to process late arrivals, right? The second term is trigger: when to emit the aggregated results for a window. There are event time triggers, there are processing time triggers, and then there are data-driven triggers. And the third term is accumulation. The accumulation mode determines whether the system accumulates the window panes or discards them. Basically, what it means is, if I'm pressing buttons on my video game console or PlayStation or anything, does every click accumulate to my score for that given window? Or should I discard everything and just consider the last one? So those are the three terms that we use, and they help us clarify, for a given streaming pipeline, whether we should focus on correctness or completion.
Bringing all these concepts together, let's look at them from a completely different angle. The first one is: what results are being calculated? You define what results you are calculating with the help of transformations such as filtering and aggregation. You can run some SQL queries, or you can do some training if it's a machine learning pipeline. The second concept is: when in event time are they being calculated? This is where we think about the windowing part. There is event time windowing, which is similar to sharding in batch, where we window the data based on the window logic. There are fixed windows, there are sliding windows, there are session windows. I've also given an example here in Apache Beam, which is an open source framework, where a window is defined as a fixed window of 60 seconds in event time. And then comes the third one, which is: when in processing time are the results being materialized? You have the watermark, which is the notion of completeness that we went through before. You have triggers, which define when you trigger the results in a given window.
And then you have accumulation, which means how you actually compute the results across a window. Do you do a running sum? Or do you not accumulate anything and just take the last value in a given window, et cetera? And then you have late arrivals as well, which are also defined by triggers. So if you see here, there is an example where we have a fixed window. We have a trigger that fires 60 seconds after the processing time, which means we are allowing late arrivals for up to 60 seconds in this example. And the accumulation mode is that you do not accumulate the results; you just give me the final result. You don't accumulate or do a running total. So this is how we bring all the concepts together in the world of streaming.
Now, we spoke about some terms. We spoke about how you bring those concepts together. I showed you some code examples as well. Let's talk about some real examples of how streaming at scale has some challenges, and how we can solve them using infrastructure and code considerations.
The first one is auto scaling. We have heard of auto scaling in batch, and when most of us speak about auto scaling, it's mostly horizontal auto scaling, where we keep adding more machines of the same type to a given cluster to improve performance. In the world of streaming, that may or may not work, because there might be a particular instance in which a stream needs much more memory on the same machine itself, rather than the additional workload distribution across different machines that happens in horizontal scaling. In that case, it is good to have infrastructure that supports both horizontal and vertical auto scaling in flight. In flight means while the stream is being processed, because we cannot stop the stream, do the auto scaling, and then come back and start processing the stream again. It's running water, it's running data.
So you need capabilities to auto scale your existing machines, their RAM and number of cores, in flight while the stream is being processed. And that's why vertical auto scaling is important.
Dynamic work rebalancing. In the world of batch, we have multiple machines working together in the same world of distributed compute. With streaming, there could be some machines working on some data, let's assume it's x, and some machines working on some other streams, which is y. It could very well be that some machines are not working as much as others, which means there is a disparity of work between worker nodes. This can cause skew across the machines, this can cause performance issues, et cetera. To avoid this, it's good to have dynamic work rebalancing: infrastructure that allows workers to redistribute work amongst each other, so that you avoid stragglers and avoid nodes that are sitting idle.
Window processing decoupling. This is a very important one, because a stream pipeline comprises different things. You have the code logic, suppose some filtering code, and then you have windows which have to be processed, right? It's always good to decouple the window processing from the other stream operations, because it helps you scale better. You can have your primary SDKs for streaming on one machine, and the decoupled processing can go to different machines, which helps you scale your operations.
Giving memory and GPU related resource hints for a particular pipeline, or for specific steps in a pipeline, can be really helpful.
You can give hints in Python, or in any other language that your streaming pipeline supports, and say, hey, this function would ideally need this much memory and GPU, so that your streaming pipeline can scale up or down accordingly without relying too much on the auto scaling capabilities, because now you're being very definitive about what the stream will need.
I think this next one is a ubiquitous principle that applies to both batch and streaming: it's good to use combine by instead of group by, unless you have to use a group by, because combine by reduces shuffle while group by introduces a lot of shuffle.
Retrying forever. I had a customer who had a default of retry forever, and their streaming pipelines would hang a lot because of this option. Always have a time period or a retry count after which the pipeline returns an error and ideally sends the element to a dead letter queue, and you can always reprocess those dead letter queues later. This gives you a definite limit after which you don't retry, so your streaming pipelines will not hang.
File I/O. This is a very typical one that is common to batch and streaming: right-size your files, input files, output files, et cetera, to between 100 MB and 1 GB per shard, depending on the total size.
Network. It's always good to have all the sources and sinks in the same region to reduce network latency. It might not be possible all the time, but if it is, please follow this.
Type hints. When you use type hints, Apache Beam raises exceptions at pipeline construction time itself, so you catch most of the issues up front, much like at compile time, instead of worrying about them at runtime.
So those are some of the nuanced infrastructure and code considerations that I have learned throughout my field experience in streaming, and I do hope that some of these resonated with you as well, and that you can use them in your production pipelines.
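To make the advice about not retrying forever concrete, here is a minimal Python sketch of a bounded retry that routes failing elements to a dead letter list for later reprocessing. It is plain standard-library code, not the API of any streaming framework; `process_with_retries`, `parse`, the limit of three attempts, and the sample inputs are assumptions made for illustration, and a real pipeline would likely add a backoff delay between attempts.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_with_retries(
    elements: Iterable[T], fn: Callable[[T], R], max_attempts: int = 3
) -> Tuple[List[R], List[Tuple[T, str]]]:
    """Apply fn to each element, giving up after max_attempts failures.

    Returns (results, dead_letters); each dead letter keeps the element
    and the last error so it can be inspected and reprocessed later.
    """
    results: List[R] = []
    dead_letters: List[Tuple[T, str]] = []
    for element in elements:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(fn(element))
                break  # success, stop retrying this element
            except Exception as exc:
                if attempt == max_attempts:
                    # Bounded: park the element instead of retrying forever.
                    dead_letters.append((element, repr(exc)))
    return results, dead_letters


calls: Counter = Counter()  # counts attempts per input, for demonstration


def parse(raw: str) -> int:
    calls[raw] += 1
    return int(raw)  # raises ValueError on malformed input


results, dead_letters = process_with_retries(["1", "2", "oops", "4"], parse)
print(results, dead_letters, calls["oops"])
```

The malformed element is attempted exactly three times and then set aside, so the remaining elements keep flowing instead of the pipeline hanging on one bad record.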
Thank you so much for having me today, and I hope you enjoyed the session.

Ankit Virmani

Official Member @ Forbes Technology Council

