Conf42 Rustlang 2022 - Online

EpiRust - Building ultra large-scale epidemic simulator using Rust


Abstract

EpiRust is an open-source, large-scale, distributed agent-based epidemic simulation framework. It follows a minimalist approach to modeling an epidemic in a virtual society. An agent in the simulation is an individual person with certain characteristics and behaviours attached to it. These agents interact with the environment and among themselves, giving rise to various emergent phenomena. In this talk, we share our journey: how we went from a serial implementation representing one city with 5 million agents to a distributed setup that can simulate more than 100 million agents, depicting multiple cities and even larger geographical areas. More about EpiRust can be found here.

Summary

  • EpiRust is a scientific computing case study. We'll be speaking about building an epidemic simulator using the Rust language. Out of scope for this specific talk is a detailed discussion of our Covid-19 epidemic model; the talk is purely technical in nature.
  • EpiRust is an agent-based, large-scale, open-source epidemic simulator. It creates a virtual society where individuals perform their day-to-day routines, and these interactions can potentially spread disease. The next goal is to reach a scale of 100 million agents.
  • Sapana: Now we will see how we optimized for throughput performance. With 100 smaller engines running in distributed mode, throughput is about six times better than the single Mumbai setup. At that scale we need automation and infrastructure that can be used on the cloud.
  • Rust is close to the metal and gives performance pretty much similar to C. With fearless concurrency, we could easily go with a parallel implementation. The overall ecosystem also helped us: the documentation, the available crates, and Cargo as the build tool. We will see how Rust helped us achieve our goals.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Greetings everyone. My name is Jayanta, and along with Sapana I'll be talking about EpiRust today. So let's briefly understand what this talk is about and what you can expect from it. EpiRust is a scientific computing case study. We'll be speaking about building an epidemic simulator using the Rust language. We'll be sharing our journey: how we went ahead and built a simulator that is optimized for performance and can support large scale. We'll also be sharing some of our understanding and insights on how Rust as a language helped us build this simulator. What is out of scope for this specific talk is a detailed discussion of our Covid-19 epidemic model; it deserves its own talk, and this talk is purely going to be technical in nature. That being said, we still need to cover some basics to understand the problem statement. So on the agenda, we'll first talk about EpiRust itself, then about memory optimization, performance optimization, results, and towards the end the insights we gathered from building this tool.
So let's briefly understand what EpiRust is. As it has been described, it is an agent-based, large-scale, open-source epidemic simulator. Let's try to understand each of these terms. When we say it's an agent-based simulator, what it means is this: we understand that the spread of a disease is a complex phenomenon. All of us have witnessed over the past two years how Covid-19 spreads, and it is difficult to understand because there are numerous factors that have an impact on the spread. What an agent-based model does is take a much more bottom-up approach: we start by creating every individual in the society, or every individual in the system, who is of interest to us. So we create every individual person, then we create their homes, then their workplaces, and we assign a certain schedule to each individual. We create a virtual society where these individuals perform their day-to-day routines: they are interconnected, they exchange things, they meet each other in the physical world, and all of these interactions can potentially cause the spread of a disease. That is the basic idea behind agent-based simulation.
On the right we can see a grid depicting the minimalist model we are using for EpiRust. Every blue dot in this grid is an agent. This is one of the smaller-scale models, perhaps a few hundred individuals. All of these individuals stay in a home area, where they interact with their family members. While going to the office, some of these individuals take up space in the transport area, where they are accompanied by other individuals who may be strangers, and just by sitting next to each other there's a chance that the disease spreads. That is why we have a transport section. Similarly, in the work area people come to work and interact with their colleagues, and again such interactions might lead to the spread of disease. And once these individuals come back home, it might happen that someone got infected at work; that person is in contact with the family members and might spread the disease at home as well. So it's a minimalist model depicting the spread of a disease, in this case specifically Covid-19.
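To make the model concrete, here is a minimal sketch of what such an agent and its hourly schedule could look like. The names (Citizen, Area, location_at) and the hard-coded schedule are illustrative assumptions, not EpiRust's actual types.

```rust
// Minimal sketch of an agent with a daily schedule over home, transport and
// work areas. Names and the hard-coded schedule are illustrative assumptions.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Area {
    Home,
    Transport,
    Work,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Health {
    Susceptible,
    Infected,
    Recovered,
}

struct Citizen {
    id: u64,
    works: bool,            // heterogeneity: kids and elders stay at home
    essential_worker: bool, // keeps working even under a lockdown
    health: Health,
}

impl Citizen {
    /// Where this agent is during a given hour of the day (0..24).
    fn location_at(&self, hour_of_day: u32) -> Area {
        if !self.works {
            return Area::Home;
        }
        match hour_of_day {
            8 | 17 => Area::Transport, // commuting next to strangers
            9..=16 => Area::Work,      // interacting with colleagues
            _ => Area::Home,           // interacting with family members
        }
    }
}
```

EpiRust's real schedules and interventions are richer than this, but the shape is the same: each agent carries its own attributes and a routine that decides where it is at every hour.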
So at the bottom it sounds pretty complex, right? But if you want to take the simplest view of EpiRust, I can easily explain it as two for loops. The outer loop is the clock tick: there is a clock which simulates hours in the context of this virtual society. At every clock tick, which is every simulated hour, we iterate over every agent and ask it to perform its corresponding routine. Once this routine is performed, at the end of each action we check if there is a possibility of the disease spreading. And once we have iterated through all the agents, towards the end of the tick we do bookkeeping: we see how many people got infected in the last hour. Also, if the government wants to intervene, if certain threshold conditions are met which ask the government to impose a lockdown, ask people to wear masks, or start vaccination, such interventions can be modeled as well. So that is the most simplified view of EpiRust.
This is our journey. We started pretty simple. Initially we started with a population of a thousand, and every individual in this case was the same as every other person. Then we slowly moved to 10,000 individuals and started bringing in heterogeneity: heterogeneity in terms of which individuals go to work. Not everyone in the society is working; there might be younger kids or elders who stay at home, so there is stratification based on age. Then there are certain individuals who are essential workers, meaning they will still work if a lockdown is imposed. So we slowly started capturing the complexity of the domain and making our model richer in domain terms. The next target after finishing 10k was to model Pune, a city in India which both of us are from. Since the scale was 3.2 million individuals, scale and performance became issues pretty quickly for us. Then, once we had modeled Pune, we went ahead and started modeling Mumbai. Mumbai is one of the metro cities in India, with a population of about 12 million, roughly four times that of Pune. One of the challenges we faced in the Mumbai context was modeling commute, because many people in Mumbai travel a large distance to reach their workplaces and then come back home. And the next goal, which is work in progress, is to reach a scale of 100 million. With 100 million we can model our state, Maharashtra, which is one of the largest states in India, and which will also help us model multiple cities at the same time.
Having said that, let's briefly understand the technical complexities in the context of EpiRust. EpiRust is compute-intensive. What that means is: if we talk about a simulation that runs for 45 days, the outer loop over time runs for 1080 ticks, because each tick is one hour, and 45 times 24 comes down to 1080 ticks. If we start with a population of a thousand individuals, for such a small population there would be about 7 million behaviours that need to be executed over these 45 days. If we increase the population to 1 million, that number rises from 7 million to 7 billion, and moving one step further from 1 million to 100 million, the number of behaviours rises to about 700 billion.
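As a rough illustration of that two-loop view, here is a hedged sketch; the agent type and the helper functions are placeholders for this explanation, not EpiRust's actual code.

```rust
// Hedged sketch of the "two for loops" view: an outer loop over simulated
// hours and an inner loop over agents. All names here are placeholders.

struct Agent {
    infected: bool,
}

fn perform_routine(_agent: &mut Agent, _hour: u32) {
    // move the agent according to its schedule (home -> transport -> work -> home)
}

fn maybe_infect(agent: &mut Agent) {
    // after each action, check whether the agent was exposed at its location
    let exposed = false; // placeholder for the actual exposure check
    if exposed {
        agent.infected = true;
    }
}

fn apply_interventions(_infected_so_far: usize) {
    // if threshold conditions are met: lockdown, masks, vaccination drives
}

fn run(agents: &mut [Agent], days: u32) {
    let ticks = days * 24; // one tick per simulated hour; 45 days => 1080 ticks
    for hour in 0..ticks {
        // inner loop: every agent performs its routine for this hour
        for agent in agents.iter_mut() {
            perform_routine(agent, hour);
            maybe_infect(agent);
        }
        // end-of-hour bookkeeping and (optional) government interventions
        let infected_so_far = agents.iter().filter(|a| a.infected).count();
        apply_interventions(infected_so_far);
    }
}
```

The compute numbers above follow directly from this structure: the work grows linearly with both the number of agents and the number of ticks.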
Also, when we talk about scale, if we look carefully at the matrix on the right, the matrix is sparse: most of the places are empty. Which means we will also need to take care of the memory footprint when we build such a model in memory. And then there are domain complexities. We need to understand the different things that go with Covid-19 specifically, because EpiRust was initially built for Covid-19: we need to understand the way the disease works, we need to understand government interventions, and how these interventions impact individuals' lives. And adding on to this, there is one more difficulty. The inner for loop which we saw earlier, which is per agent, executes agents one after another. But when we speak about agent interactions in real life, all of us move at the same time, not one after another. Such a loop introduces an artificial ordering dependency. To solve that, we use a double-buffering approach to remove that dependency from the picture.
So how did we start? As I said earlier, we started with a very simple mechanism: two for loops. The code was pretty trivial. The grid which we saw was implemented as a 2D matrix. The population of Pune city is 3.2 million, which means the number of cells this 2D matrix had to hold was close to 32 million. The memory consumption was approximately somewhere between 5 and 10 GB, which was pretty big. Keeping our goals in mind, we wanted to model not just Pune but also Mumbai, and eventually reach a scale of 100 million, so we faced a challenge in modeling this grid. One solution we came up with was to represent this 2D grid as a hash map. So we created a structure where the key of the hash map is a point and the value is an object of the citizen struct. A point is nothing but an (x, y) location on the grid, and the citizen is the individual agent occupying that specific cell. Essentially, the number of agents became the number of entries in the hash map. Since it is a hash map, retrieval became much easier: instead of order of n squared, it is now just plain order of one. And the memory, which was earlier in the range of 5 to 10 GB, right away came down to a few hundred MB. We went one step further and started looking at which hashing algorithm could give us better mileage. We experimented with hashbrown, with FxHash, and with FNV hash. FNV, because it is non-cryptographic and the hash is not exposed outside, gave us the best possible outcome, and we went ahead with FNV hash. Having said that, I'll hand it over to Sapana to talk about how we optimized for throughput.
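Before moving on to the throughput work, here is a minimal sketch of that sparse-grid idea, using the fnv crate's FnvHashMap. Point and Citizen here are simplified stand-ins for EpiRust's types.

```rust
// Sparse grid: only occupied cells are stored, keyed by their (x, y) point,
// so memory grows with the number of agents rather than with the grid size.

use fnv::FnvHashMap; // std HashMap with the non-cryptographic FNV hasher

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Point {
    x: u32,
    y: u32,
}

struct Citizen {
    id: u64,
}

fn main() {
    let mut grid: FnvHashMap<Point, Citizen> = FnvHashMap::default();

    grid.insert(Point { x: 10, y: 42 }, Citizen { id: 1 });

    // O(1) lookup: is a given cell occupied?
    let occupied = grid.contains_key(&Point { x: 10, y: 42 });
    println!("cell occupied: {occupied}");
}
```

Swapping hashers such as FxHash or FNV is mostly a matter of changing the map's hasher type parameter, which is what made the hashing experiments cheap to run.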
Thanks, Jayanta. Now we will see how we optimized for throughput performance. What we tried first was a parallel implementation. As you saw, we are using two buffers. One is the read buffer: when our inner for loop runs, each agent looks at its neighbouring agents' positions in the read buffer and writes its updated position, according to its daily routine, into the write buffer. Each agent's update is independent of the other agents' updates, so we can easily use data parallelism here.
One of the methods was a map-reduce style approach, where we update each agent's status in parallel, collect the results into a temporary data structure, and then serially write the updates to the write buffer. That is just to avoid collisions in the map, because we are using the buffer as a map. In this case the parallelization wasn't complete: one step ran in parallel and another was serial. It did give us some performance improvement. Another way was to use parallel iterators, where we use the rayon library, together with the concurrent data structure DashMap, which allows us to atomically do both stages: check whether a particular entry in the DashMap is empty, and if it is empty, write the agent's status there. In this graph you can see how we do against a 5 million population using the map-reduce and parallel methods. The serial implementation has a throughput of about 0.5; that roughly doubles with map-reduce, and the parallel version is close behind. If we go to a higher number of cores, that is 64, the parallel and map-reduce versions perform about the same.
That was about scaling up, where we scaled up the population and used the parallel implementation. But we also had a use case for scaling out. As you saw, we use 64 threads and as many CPU cores, and if we want to scale further we won't always have that many resources available on one machine. Also, from the domain perspective, the disease or geography specifications can be different for smaller units of area: one city might have them differently than another city. So what we did is split those smaller areas into different engines. Each engine has its own specifications and runs its own simulation, and that way we can run in a distributed mode, which allows us to use compute resources efficiently and solves our domain problem as well. Another thing is that these engines are not completely isolated from each other: agents travel from one city to another, or from one ward to another. We achieve that using Kafka, where the data of such commuters is sent over Kafka and the other engines read it from Kafka. And this has to happen at the right hour: if somebody is travelling at hour 24, they should arrive at hour 24 or 25, depending on the condition. For that purpose we have the orchestrator, which does nothing but synchronize the engines at those particular points. This is our distributed architecture, and we will see how it helps us optimize throughput further. In the first graph you can see the numbers for Mumbai and Pune against the serial, parallel and distributed setups. For both Mumbai and Pune, moving from serial to parallel increases the throughput to almost twice, and moving to the distributed mode is again about twice as fast as the parallel implementation. That is one way of scaling out.
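As a rough sketch of that parallel, double-buffered update, here is what it could look like with rayon and DashMap. The types, the movement logic, and the collision handling are simplified assumptions, not EpiRust's actual code.

```rust
// Data-parallel hour update: read positions from an immutable read buffer,
// write new positions into a concurrent DashMap write buffer.

use dashmap::DashMap;
use fnv::FnvHashMap;
use rayon::prelude::*;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Point {
    x: u32,
    y: u32,
}

#[derive(Clone, Copy)]
struct Citizen {
    id: u64,
}

fn next_position(current: Point, _hour: u32) -> Point {
    // placeholder for the agent's routine (home -> transport -> work -> home)
    Point { x: current.x + 1, y: current.y }
}

fn simulate_hour(read_buffer: &FnvHashMap<Point, Citizen>, hour: u32) -> DashMap<Point, Citizen> {
    let write_buffer: DashMap<Point, Citizen> = DashMap::new();

    // Each agent only reads the shared, immutable read buffer, so the
    // per-agent work can run in parallel across all cores.
    read_buffer.par_iter().for_each(|(&point, &citizen)| {
        let target = next_position(point, hour);
        // entry().or_insert() atomically claims the target cell only if it is
        // still empty; handling the "cell already taken" case (the agent
        // staying where it is) is left out of this sketch.
        write_buffer.entry(target).or_insert(citizen);
    });

    write_buffer
}
```

The write buffer produced for one tick becomes the read buffer for the next, which is what removes the ordering dependency mentioned earlier.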
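And for the commuter exchange between engines, here is a hedged sketch of what publishing travellers to Kafka at the end of an hour might look like, using the rdkafka and serde crates. The topic name, message shape, and Commuter type are assumptions for illustration, not EpiRust's actual protocol.

```rust
// Each engine publishes its outgoing travellers for the current hour; the
// destination engines consume them before the orchestrator advances the clock.

use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};
use serde::Serialize;

#[derive(Serialize)]
struct Commuter {
    agent_id: u64,
    from_engine: String,
    to_engine: String,
    hour: u32,
}

fn publish_commuters(commuters: &[Commuter]) -> Result<(), Box<dyn std::error::Error>> {
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()?;

    let payload = serde_json::to_string(commuters)?;

    producer
        .send(
            BaseRecord::to("commuters") // assumed topic name
                .key("engine-pune")     // assumed per-engine key
                .payload(&payload),
        )
        .map_err(|(err, _record)| err)?;

    // wait for in-flight messages before moving to the next tick
    let _ = producer.flush(Duration::from_secs(5));
    Ok(())
}
```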
Another aspect of scaling out is how we divided the population. As we already saw, Mumbai has a population of 12 million, and the way we divided it for the distributed mode was a ward-wise distribution, where each engine had a population of around half a million and there were 24 wards. But if we go a little further, reduce the population per engine and increase the number of engines, which is what we tried with 100k population per engine and 100 engines, which again sums up to 10 million, close to the 12 million, our throughput is actually about six times better compared to the Mumbai ward-wise throughput. That's where the distributed setup helps us a lot.
Now, we changed our engine count from 24 to 100, but it is not possible to manually start up a hundred engines, then start up Kafka and do all of this by hand. We needed some automation and infrastructure that can be used at this scale. For that we added the cloud: we migrated to the cloud and added cloud support to our application. First we containerized EpiRust, and then we used Kubernetes for container management at scale. Basically we are using Kubernetes Jobs for the engines and the orchestrator, and Kafka is again deployed on the cloud. We are using a Helm chart to package the application, so anybody can change the config values in the Helm chart according to their use case: which way they want to run it (serial, parallel or distributed), how many engines, et cetera. After the configuration, with a single command they are able to start the application on the cloud and get the output data. Also, if you have an application running at scale, there are many issues you need to debug, and for development and performance work you need metrics. Here we again take advantage of open-source tools like the ELK stack, Prometheus and Grafana for logging and monitoring, which helped us with debugging, logging and getting some automated alerts.
Now we will move to the next part, that is Rust features. As you know, our base language is Rust and all the code we saw till now has been written in Rust; now we will see how Rust helped us achieve our goals. As you already know, Rust is close to the metal and it gives performance pretty much similar to the C language, and that helped us a lot. There is no runtime and no garbage collection, which gives us the performance. But while the performance is similar to C, C and C++ have many memory issues that Rust already takes care of, and that saved us quite a lot of development time. With fearless concurrency, we could easily go with the parallel implementation, and most of the errors came at compilation time; whatever issues we got later on, there was nothing related to memory or to the language itself. Apart from the performance and memory management, Rust also helps us with productivity. The overall ecosystem helps too: the documentation and the way it's written, the various crates for different use cases, and Cargo as the build tool with its simplicity. All of this helps with productivity. Most of the features we used are converting to parallel using the rayon library, multi-threading using Tokio, et cetera; all of this helped us. This is the team which worked on EpiRust, and that's it.
...

Jayanta Kshirsagar

Application Developer @ Thoughtworks

Jayanta Kshirsagar's LinkedIn account

Sapana Kale

Software Developer @ Thoughtworks

Sapana Kale's LinkedIn account


