Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

Enhancing Workflow Reliability with Dapr: Techniques to Minimize Failure Costs

Abstract

A long-running workflow typically consists of a series of steps leading to its completion. However, a failure in one step might require restarting the entire workflow, which is costly. Breaking each step into its own workflow can mitigate this issue but introduces challenges in workflow management.

Dapr can help achieve best of both worlds. In this talk, we will explore how my team is using Dapr constructs such as stateful workflows, activities, and replays to curate workflows that minimize the cost of failures while still abstracting away the complexities of workflow management.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to this talk on writing efficient, long running workflows. As part of this topic, we'll go over how we can write long running and fault tolerant workflows with minimal effort using some of the open source and cloud native tools available.

Before we dive deep, a little bit about myself. My name is Siri Varma Vegiraju. I'm currently working as a software engineer at Microsoft, and previously I worked at Oracle. In my five-plus years of experience, I have worked with different cloud native tools like Kubernetes, OpenTelemetry, and orco, and I have also worked with petabyte-scale ingestion systems, such as observability systems. In my free time, I love to contribute to open source and go on hikes. This is my LinkedIn and email; please feel free to reach out to me if you have any questions or suggestions regarding the talk, or if you want to talk about software engineering in general.

Coming to the agenda: we'll first look at the problem statement, or the scenario at hand. Then we'll look into what some of the pain points are when working with that scenario. Then we'll look into how Distributed Application Runtime, or Dapr, one of the open source tools available, helps solve them. After that, we'll look at a live example and then conclude the talk.

Workflows are not a new concept. They have been around for a while, and we have been writing them for a while as well. There are mainly two kinds of workflows, I would say: long running workflows and short running workflows. Long running workflows can be your ETL (extract, transform, load) jobs that take data from one place, transform it, and then store it in a destination store; these can run for multiple hours or even multiple days. Then we have the short running workflows, which are your CI/CD jobs, your integration tests, and so on. One common characteristic among all these workflows is having a retry mechanism and a way to save and retrieve progress.

Let's look at one such scenario. Imagine you have a storage account, and the storage account has a bunch of files, maybe around 10 files, and each of the files has around 100,000 rows. Those 100,000 rows correspond to VM IDs, and my goal is to fetch the instance details, or the service details, that each VM belongs to. For example, VM 1 might belong to service A, VM 2 might belong to service B, and so on. Once we start processing, let's say we fetch the first file, process its hundred thousand rows, and then come to the second file. We process 50,000 rows, and then the pod somehow dies, because this whole thing is running in a pod. When the workflow restarts, if I have to start from the 0th offset again in the second file, that is a lot of redundant processing: in addition to the 50,000 rows I did not process, I am processing the 50,000 rows I have already processed, again.

Will this strategy scale? That depends on the number of files. If there are 20 to 30 files in a day, then it should not be an issue. But if the number of files is in the hundreds or thousands, then this strategy will definitely not scale. The reason is that with cloud workloads, failures are an inherent part of the system, and if we start from scratch for every failure, that strategy is of course not optimal.
One way to fix this is checkpointing at the offset level: whenever I download a file, I checkpoint which file I have processed and up to which offset, so that when the pod recovers from a failure, it can go to the store, pick up the offset it has previously processed, and start from there. That is a better strategy, and it will of course reduce the amount of reprocessing we do. So what we have concluded is that we need to build a checkpointing engine.

Now, is building a checkpointing engine really easy? You have to think about the schema, how the transactions occur, the underlying database, the versioning strategy for the schema, and how observability will work. And what happens if the checkpointing engine fails: will you also fail your workflow? All of these things have to be considered, and building such an engine that is agnostic and works with multiple different workflow types is a multi-month effort.

How do we fix this? If I think about it, checkpointing is not something new. It has been around for a while, and we are not the first service to use this strategy. So there are two things I need. First, I want to recover from failures. Second, I want to abstract away the work of building the checkpointing engine and see if there are any out-of-the-box solutions available.

This is where a bunch of different open source tools come in, and one such tool is Distributed Application Runtime, or Dapr. Simply put, this is an engine that runs as a sidecar along with your service and does all the necessary actions. For example, if your service is deployed across 3,000 pods, then you have the Dapr sidecar running alongside the service in all of those 3,000 pods. If you look at the architecture diagram here, we have the application itself, which is nothing but our service. It can be written in any programming language: Go, Node, Python, .NET, Java, and so on. Then you have the sidecar. Your service communicates with the sidecar using HTTP or gRPC APIs, and because the communication is over HTTP or gRPC, the workflows are programming language agnostic: your service can be in any language, and the sidecar can be in a different language as well.

Along with the scenario we discussed, Dapr also supports some other features. One such feature is the service-to-service invocation API. It is normal to call downstream or upstream services as part of our workload; maybe we call an identity service to authenticate or authorize the request, and so on. Some of the things required when calling such an API are retry strategies, and you also want to make sure your service has the permissions to call that specific API. All of this can be offloaded to the Dapr service-to-service invocation API: you just call the Dapr sidecar, and the sidecar takes care of the retry strategies and any permissions required to call these APIs. Then you have the publish and subscribe pattern, another common pattern in distributed systems. You can configure the runtime to produce to and consume from a message broker of your choice and deliver the events to a specific API; that is also something Dapr supports, as sketched below.
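To make the sidecar interaction concrete, here is a minimal sketch (not from the talk) of how a service could call the Dapr sidecar over its documented HTTP API for service invocation and pub/sub. The app ID "identity-service", the "pubsub" component name, the "vm-events" topic, and the payload shapes are illustrative assumptions; the sidecar is assumed to listen on the default HTTP port 3500.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Minimal sketch: talking to the Dapr sidecar over its HTTP API.
// "identity-service", "vm-events", and the payloads are hypothetical.
public static class DaprSidecarExamples
{
    private static readonly HttpClient Http = new HttpClient();

    // The sidecar's HTTP endpoint (default port 3500).
    private const string DaprBaseUrl = "http://localhost:3500/v1.0";

    // Service-to-service invocation: the sidecar resolves the target app,
    // applies the configured retry/resiliency policies, and forwards the call.
    public static async Task<string> AuthorizeRequestAsync(string token)
    {
        var response = await Http.PostAsJsonAsync(
            $"{DaprBaseUrl}/invoke/identity-service/method/authorize",
            new { Token = token });
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }

    // Publish/subscribe: publish an event through the broker configured as the
    // "pubsub" component; subscribers receive it on the "vm-events" topic.
    public static async Task PublishVmEventAsync(string vmId)
    {
        var response = await Http.PostAsJsonAsync(
            $"{DaprBaseUrl}/publish/pubsub/vm-events",
            new { VmId = vmId });
        response.EnsureSuccessStatusCode();
    }
}
```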
And today, what we'll be discussing is Dapr Workflows. With the Dapr workflow building block, you can write fault tolerant and long running workflows with most of these concerns abstracted away. For example, the thing we discussed is checkpointing; the Dapr workflow engine inherently takes care of this. How? Let's take a look.

Coming to Dapr Workflows, there are mainly three pieces. First is the workflow SDK. Because the checkpointing is taken care of by the workflow engine, the engine enforces certain rules that the application needs to follow, and the SDK is what codifies those rules: it provides a set of interfaces and contracts that your service implements so that the workflow engine can do the checkpointing automatically. Then you have the workflow engine itself, which the SDK talks to in order to report the progress made so far, and the workflow engine stores that progress in the state store. The state store is, simply put, a database where the data is stored; it can be Cosmos DB, Redis, Firebase, or Cassandra. One good thing here is that if you have to change the state store tomorrow, it is as easy as changing the state store building block: you just change the YAML file that configures it, and the workflow takes care of everything else. The reason for this is that the state store internals are not exposed to the SDK; the Dapr runtime takes care of them.

Now, let's look at what a Dapr workflow looks like. The most basic unit in a workflow is called an activity; that is the piece that actually performs the action. For example, if you have to call a specific API or run some complex business logic, that happens in an activity. A workflow is made up of multiple activities, so you can think of the workflow as an orchestrator and the activities as the ones doing the work. Now, consider the scenario we discussed previously, where we are given a storage location and have to retrieve the service details for each VM. Each workflow or activity has an input and an output. The input in this case is the object store file name that we pass to the workflow. There can be one activity that returns all the rows in the file, and then each individual activity takes a VM ID as input and returns a service name as output. For example, if there are 100,000 rows, then there will be 100,000 activities, each with a VM ID as input and a service name as output.

We also discussed that the Dapr workflow is able to resume in case of failures. What ends up happening is that for each input and activity combination, the progress is saved in the state store. For example, for VM 1 and activity A, if there was an output, that output is stored in the state store. When there is a failure and the workflow resumes, Dapr checks whether an output already exists for that combination. If the activity has already been performed for VM 1, Dapr knows an output is available, so instead of performing the activity again, it just returns the stored output. That is how it is able to checkpoint and resume without performing the activity again. In order to better understand this, let's look at a small example.
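The demo code itself does not appear in the transcript, so as a hedged illustration of the workflow and activity structure just described, here is a minimal sketch using the Dapr Workflow .NET SDK: one orchestrator workflow that first calls an activity to read all rows from a file and then calls a metadata activity per VM ID. The class names, the placeholder file reading, and the lookup logic are assumptions for illustration, not the talk's actual code; the replay behaviour noted in the comments (completed activity results served from the state store instead of re-executed) is what the Dapr workflow engine provides.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Dapr.Workflow;

// Hedged sketch: orchestrator for the "VM IDs in a file" scenario.
// Input: the file path in the storage account. Output: service name per VM ID.
public class FileProcessingWorkflow : Workflow<string, Dictionary<string, string>>
{
    public override async Task<Dictionary<string, string>> RunAsync(
        WorkflowContext context, string filePath)
    {
        // Activity 1: fetch every row (VM ID) from the file.
        // On replay, Dapr returns the stored result instead of re-reading the file.
        var vmIds = await context.CallActivityAsync<List<string>>(
            nameof(FetchRowsFromFileActivity), filePath);

        var services = new Dictionary<string, string>();
        foreach (var vmId in vmIds)
        {
            // One activity per VM ID; each (activity, input) result is
            // checkpointed in the state store, so lookups that already
            // completed are not repeated after a crash.
            services[vmId] = await context.CallActivityAsync<string>(
                nameof(FetchVmMetadataActivity), vmId);
        }
        return services;
    }
}

// Activity 1: read the file and return its rows.
public class FetchRowsFromFileActivity : WorkflowActivity<string, List<string>>
{
    public override Task<List<string>> RunAsync(WorkflowActivityContext context, string filePath)
    {
        Console.WriteLine($"Fetching rows from {filePath}");
        // Placeholder for the real storage-account read.
        return Task.FromResult(new List<string> { "vm-1", "vm-2", "vm-3" });
    }
}

// Activity 2: resolve the service a VM belongs to.
public class FetchVmMetadataActivity : WorkflowActivity<string, string>
{
    public override Task<string> RunAsync(WorkflowActivityContext context, string vmId)
    {
        Console.WriteLine($"Fetching metadata for {vmId}");
        // Placeholder for the real metadata/API lookup.
        return Task.FromResult($"service-for-{vmId}");
    }
}
```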
As part of this example, this is what we'll do. We'll first understand what the workflow console app is doing. Then we'll run the Dapr sidecar. Third, we will run the workflow console app. Fourth, we will create an artificial failure. And fifth, we will run the app again to see that it resumes from the previous failure point.

Okay, starting with the first step. What we see here is dependency injection registering the workflows and activities. As part of registering the workflow, we have multiple activities, and as we discussed previously, the workflow is nothing but the orchestrator for the activities inside it. The first activity here takes the file path as input and returns all the rows in the file as a list of strings; for example, if there are 10 rows, then we return 10 strings. Then, because each of those rows is a VM ID, we pass them to the VM metadata fetch activity to get the corresponding service details. Because there are 10 rows, there will be 10 activities created, each keyed by the input and activity combination. For brevity, I am not making an API call but just writing a Console.WriteLine so we know when this activity has been triggered. Similarly, there is a Console.WriteLine here as well, which happens when the rows are fetched from the file. Once this is done, we set the DAPR_GRPC_PORT environment variable to 4001 so that, once the console app starts up, it communicates with the Dapr sidecar on this port. Then we get the Dapr workflow client instance and schedule a workflow using the name of the workflow, the input (the location the content needs to be read from), and the instance ID, which is unique per workflow. Then we wait for it to start and finally wait for it to complete. That is what is happening in the console app (a sketch of this setup appears after this walkthrough).

Now let's start our sidecar. This is the command we used to run the Dapr sidecar; I'm specifying the HTTP port to be 3500 and the gRPC port to be 4001. Awesome, the sidecar is up and running. Next, it is time to run our service. So I remove the breakpoint here, and I'll remove the breakpoint here as well. Now, from the console, what we see is that we have fetched the VM data and we have fetched the rows from the file. So we completed fetching the rows from the file. Now, if I stop the app here and resume it, I should not see this activity being executed again; that means I should also not see these Console.WriteLine outputs again. Because we have already scheduled the workflow, we'll not schedule it again, and then we'll run the app. What we are seeing here is that it takes a bit of time for the workflow app itself to start, but once it starts, it goes directly to the second activity and does not start fetching rows from the file again, because for that input and activity combination Dapr already determines from the state store that an output exists. As you can see here, we have not seen the output of the fetch-rows-from-file activity again; we have only seen that we have fetched the VM metadata. That means when the Dapr workflow resumed, it did not start from that activity again, because it already determined that the activity was complete, and it resumed from the next activity.
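The console app from the demo is not included in the transcript, so the following is a hedged sketch of what such a host could look like with the Dapr Workflow .NET SDK, reusing the workflow and activity classes from the earlier sketch: dependency injection registers the workflow and activities, the DAPR_GRPC_PORT environment variable points the SDK at the sidecar, and a DaprWorkflowClient schedules the workflow instance and waits for it to complete. The instance ID, input path, and names are illustrative, not the demo's actual values.

```csharp
using System;
using Dapr.Workflow;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Hedged sketch of the demo's host setup; names and values are illustrative.
// Tell the SDK which gRPC port the Dapr sidecar is listening on.
Environment.SetEnvironmentVariable("DAPR_GRPC_PORT", "4001");

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        services.AddDaprWorkflow(options =>
        {
            // Register the orchestrator and its activities with the SDK.
            options.RegisterWorkflow<FileProcessingWorkflow>();
            options.RegisterActivity<FetchRowsFromFileActivity>();
            options.RegisterActivity<FetchVmMetadataActivity>();
        });
    })
    .Build();

await host.StartAsync();

var workflowClient = host.Services.GetRequiredService<DaprWorkflowClient>();

// Schedule one workflow instance: the input is the file to read,
// and the instance ID is unique per workflow run.
const string instanceId = "vm-file-processing-001";
await workflowClient.ScheduleNewWorkflowAsync(
    name: nameof(FileProcessingWorkflow),
    instanceId: instanceId,
    input: "vm-ids/file-1.csv");

// Wait for the workflow to start, then block until it completes.
await workflowClient.WaitForWorkflowStartAsync(instanceId);
var state = await workflowClient.WaitForWorkflowCompletionAsync(instanceId);
Console.WriteLine($"Workflow finished with status: {state.RuntimeStatus}");
```

The exact sidecar command from the demo is not in the transcript either, but with the Dapr CLI it would look something like `dapr run --app-id workflow-console --dapr-http-port 3500 --dapr-grpc-port 4001`, matching the ports mentioned above.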
What you're seeing here is an example of the setup my team uses when running Dapr on Kubernetes. When we install Dapr, we also get the Dapr Sentry service, which is responsible for distributing certificates and so on, and the Dapr placement service, which is responsible for distributing your workflows, or the activities inside the workflows, across multiple Dapr hosts. Then the Dapr sidecar running in each pod has to communicate with the state store; in this case it is the Azure Cosmos DB state store, accessed using a managed identity (an example component definition is sketched at the end of this transcript). Finally, all the metrics and logs coming from the Kubernetes infrastructure are sent to Prometheus for metrics and Application Insights for logging. When running on Azure, it is generally recommended to use a managed identity, and one thing to keep in mind when using MSI is that the Dapr container takes a bit more time to authenticate with MSI, so the default timeout will not work and you have to increase the timeout value.

To conclude, we talked about the problems we run into when running workflows, for example how we need quick recovery, and how abstracting away checkpointing and looking for an out-of-the-box solution helps. Then we looked into how Dapr, in addition to its other building blocks, provides the workflow building block that helps curate long running and fault tolerant workflows, along with its eventing mechanisms. And finally, we looked at how we can run Dapr in a Kubernetes environment. That's it, folks. I hope you enjoyed the talk, and I hope you enjoy the rest of the conference as well. Thank you.
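The component configuration itself is not shown in the talk, but as a rough illustration of the Kubernetes setup described above, a Dapr state store component for Azure Cosmos DB might look something like the sketch below. The URL, database, and container values are placeholders, `azureClientId` is one way Dapr components can pick up a user-assigned managed identity, and `actorStateStore: "true"` is needed because the workflow engine builds on Dapr actors; verify the exact metadata fields against the Dapr documentation for your version.

```yaml
# Hedged sketch of a Dapr state store component for the workflow engine.
# All values below are placeholders; check field names against the Dapr docs.
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: workflow-statestore
spec:
  type: state.azure.cosmosdb
  version: v1
  metadata:
  - name: url
    value: "https://<your-account>.documents.azure.com:443/"
  - name: database
    value: "workflows"
  - name: collection
    value: "state"
  # User-assigned managed identity instead of a master key.
  - name: azureClientId
    value: "<managed-identity-client-id>"
  # The workflow engine is built on actors, so the state store
  # must be marked as the actor state store.
  - name: actorStateStore
    value: "true"
```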

Siri Varma Vegiraju

Technical Leader | Software Development Engineer @ Microsoft

Siri Varma Vegiraju's LinkedIn account


