Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Real-time earthquake alert system: Leveraging Serverless architecture with Confluent Kafka

Abstract

In our upcoming presentation, we’ll explore a cutting-edge architectural solution for real-time SMS and email notifications, particularly geared towards responding to earthquake events. This system is designed to handle rapid data transmission, listening for event changes every second, making it ideal for real-time critical alert scenarios. Central to our discussion will be the integration of Lambda functions and Confluent Kafka, coupled with advanced multithreading techniques and DynamoDB lock strategies. A focal point of our presentation will be addressing the challenges and innovative solutions involved in integrating Confluent Kafka with Lambda functions to enable serverless operation of both producers and consumers. This is a key element in ensuring the quick and efficient distribution of notifications through parallel methods. Additionally, we will delve into the implementation of an automated scaling mechanism, which is vital for optimising the performance of the serverless notification ecosystem. Our aim is to provide a comprehensive insight into how these technologies can be effectively combined to develop a robust and efficient system, capable of delivering critical real-time alerts for situations like earthquake occurrences, ultimately playing a crucial role in saving human lives.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, Conf42 Site Reliability Engineering 2025. Glad that you're here and listening to my talk. Let's buckle up and dive into it. The session title is Real-Time Earthquake Alert System: Leveraging Serverless Architecture with Confluent Kafka. Basically, it's a project built mostly in the AWS cloud with specific services, mostly serverless, in combination with Confluent Kafka for the message distribution, to achieve fast delivery of the notifications. A few lines about what I do and what I did in the past: cloud development, cybersecurity, and consultancy on projects, architectures, and designs. I've also done work across the data science domain, from data extraction and data modeling through full ETL pipelines, always trying to build something practical from real-world data, practical projects based on specific demands. I've done multiple pieces of security research, penetration testing if you want to call it that, for top Romanian banks, which was very interesting. And I've been an AWS Community Builder in the serverless niche for one year now. If you have any questions after this talk, ideas, remarks, anything like that, you can contact me by mail or on LinkedIn; these are my addresses. Enough about me, let's try to explain what's in this project. The story behind the research project is spread across these questions; each of these bullet points is a question. For example, the first one: why the need for a high-performance earthquake notification system? I'll start by telling you an interesting story about what we currently have in Romania in terms of an alerting system, which is provided by the government: the Romanian alert application, which reaches very low-level hardware in our telephones.
It's basically delivered with the telephone, and it emits these kinds of alerts for bad weather, critical accidents, critical infrastructure failures, anything like that. An interesting point, though, is that in a large share of cases we receive those alerts after the event has unfolded, for example after the bad weather has already passed. So if you're seeing the rainbow, you'll only then receive the message, which is not very good, although the project the government develops is very good, with very nice ideas. This is why, a year and a half ago, I was thinking: why not also have a kind of notification for earthquakes, delivered while the event is actually happening? I was thinking that I cannot predict earthquakes; I cannot have a machine learning model trained on previous data that will just predict something, because it's so critical and, let's say, unpredictable, a natural phenomenon. So I started looking around for what kind of data I could use, because I cannot just put sensors in the ground myself and collect data from them, at least not right now. What I'm about to tell you is where I found that data, and in which project. There is an institute for seismic movements and earthquake research linked to the Romanian government, which has some very talented researchers and very interesting projects. Among those projects there was a very interesting one called the Rapid Alerting System, or something like that, based on physical sensor systems placed in the northwest of Romania, in the area well known for seismic movements, at least in the last couple of years. They placed those systems in that area, and they emit data whenever even a very small seismic movement is detected. So far, a very interesting thing.
But I was also thinking: that area is roughly 500 kilometers from Bucharest, the capital of Romania. If we translate that distance into time, there are a couple of minutes until the full waves of, say, a magnitude-seven event hit; at least one or two minutes until the full waves reach our towns in the center and north, two or three minutes, something like that. So I was thinking: having this advantage of distance and time, I can use those systems, and the data they expose, I believe through a browser application, just listen to that data and use it to spread notifications to certain consumers and subscribers, which can be people or critical infrastructure operators. That almost answers the second question as well: where to find good real-time earthquake data. Like I said, this institute provides very interesting projects, and that particular one, with sensor systems placed in the ground emitting data at the smallest seismic movements, was very helpful. So we integrated with that emitting system, and we are listening for events. Why choose the serverless ecosystem? I'm an advocate of AWS serverless, Lambdas, and everything related to that; in my previous projects I've integrated Lambdas and serverless pretty much everywhere. But beyond familiarity, it was also practical. Why practical? Because for this particular project we don't need servers running nonstop, 24/7. We just need some listeners that are ready when the earthquake hits, to spin up instances and containers and use them to distribute the notifications very fast. That was the first thing. The second thing is cost.
Because if we have servers up 24/7, there is also a matter of cost, and of how to scale them on the fly, in milliseconds. These two things were the base pillars for choosing AWS Lambda and serverless. Moving on to the architecture of the serverless notification system. What you see here is the base architecture, the first version, let's say. I'll explain it in broad strokes, because we will deep-dive in the next slides. In gray is the backbone of the system, which has the Lambda producers, which I call Lambda producers because of the Kafka integration, and also the Lambda consumers. In the center you have the topics and partitions, the backbone of Kafka itself. The yellow ones are the error-handling mechanisms: SQS dead-letter queues for collecting all kinds of errors. If there are networking or latency errors, we just retry that code execution; but if it's a major failure, we put it in a particular queue and keep it there for manual study of the logs and of the error itself. The blue ones are for monitoring: the number of instances, how the system behaves, costs. They're very important because, as I'll show you afterwards, we are also scaling up, so we need to see the costs and how the system behaves, a little central monitoring point for everything, including errors. You can also see the end channels on the right: Twilio for SMS, mail through AWS SES, or Telegram, which is a backup channel from our perspective. The purple ones are for CI/CD: pipelines for creating the infrastructure, but also, as you'll see in the next slides, for scaling up and down through this CI/CD idea, plus builds, some scanning, and automation on the repository.
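The retry-then-dead-letter flow described for the yellow components can be sketched as a small helper. This is a minimal sketch, not the project's actual code: the function names, the retry count, and the idea of injecting the delivery and DLQ callables are all illustrative assumptions.

```python
def send_with_retry(send_fn, dlq_fn, payload, max_retries=3):
    """Attempt a delivery up to max_retries times; route to a DLQ on final failure.

    send_fn -- callable that attempts delivery (raises on transient failure);
               stands in for the real notification call (hypothetical)
    dlq_fn  -- callable that parks the payload for manual study (hypothetical)
    """
    for attempt in range(1, max_retries + 1):
        try:
            send_fn(payload)
            return ("delivered", attempt)
        except Exception:
            # Transient errors (networking, latency) are retried.
            if attempt == max_retries:
                dlq_fn(payload)  # major failure: keep it for manual inspection
                return ("dead_lettered", attempt)
```

In the real system the DLQ role is played by SQS dead-letter queues, and the retry behavior is partly handled by Lambda's own retry mechanism rather than an explicit loop.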
The green block is the web application, where the end subscribers or critical infrastructure members can join the backbone system. By join I mean subscribe: provide a telephone number and some information about the subscriber. Based on that, you can see there is a DynamoDB database connected to the backbone of the system, which we iterate through: we go through the members, the subscribers, and try to notify everyone. There are also some Lambda functions there for the different APIs, and some protection through the firewall. For the main application we were thinking of a Kubernetes infrastructure, but we'll see; maybe we'll use something serverless for that one too. Now, the architecture of the base notification system: as I said on the previous slide, this is the backbone, the gray area, if you want to call it that. You can see the Kafka producers, the Kafka consumers, and the Kafka topics in the middle. In the Kafka producers we use multithreading, using the full capacity and resources of the Lambda instances, the CPU, and we separate threads and link them to certain partitions or topics in Kafka. This is the beginning of rapid production and notification, basically the first step. We also put a kind of topic roulette on top of what Kafka offers in terms of topics and partitions, trying not to overwrite messages in the same partitions or topics: each thread gets a partition, puts its message there, and connects it to the Kafka consumers. This is also an interesting key idea of this project: having the partitions, and basically the topics, in Kafka trigger certain instances of Lambda consumers.
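The one-thread-per-partition producer idea can be sketched roughly as follows. The `produce` callable here stands in for a real Confluent Kafka producer call (something like `Producer.produce(topic, value, partition=...)`); the round-robin assignment and the helper's shape are illustrative assumptions, not the project's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(messages, num_partitions, produce):
    """Spread messages across partitions, one worker thread per partition.

    Each thread owns exactly one partition, so no two threads write to the
    same partition (the "topic roulette" idea: no message overrides).
    produce(partition, batch) stands in for the real Kafka producer call.
    """
    # Round-robin: message i goes to partition i % num_partitions.
    batches = {p: [] for p in range(num_partitions)}
    for i, msg in enumerate(messages):
        batches[i % num_partitions].append(msg)

    # One thread per partition; the context manager waits for completion.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        for partition, batch in batches.items():
            pool.submit(produce, partition, batch)
    return batches
```

Pinning each thread to its own partition also preserves per-partition ordering, which a shared-partition design with multiple writers would not guarantee.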
This achieves parallelism, distributing the messages in the fastest way possible. The yellow box is monitoring, at least for the major failures, the errors that are put in the error queue; for the failed events caused by networking, latency, things like that, we retry two or three times. We have the end channels where the notification is sent: SMS, mail, and Telegram as backup. The blue one is the DynamoDB database, where the subscribers are registered. The Kafka producers listen for the data I explained before, provided through the institute's sensor systems. Another interesting thing: let's deep-dive a bit into the architecture of the Kafka integration in the serverless context. Like I said, Kafka is here to distribute the messages, the notifications, in parallel, and also to trigger parallel spin-ups of Lambda consumers. For every partition we have a certain code execution, a certain trigger, a certain Lambda as a consumer, which will try to notify certain subscribers. We have the threads using the resources, and the topic roulette built on top of what Kafka already has, topics and partitions, trying to fill up every partition so it can trigger the Lambda consumers, which contain the logic to iterate the database of subscribers and notify them in the end. Here we also have some yellow dots, which represent the scaling possibilities: scaling in terms of resources and threads, and scaling in terms of partitions and topics. And again, in the Kafka consumers, we scale the triggers based on what Kafka has, and also the Lambda consumers. The Lambda Kafka producer key elements: we have the multithreading, explained on the previous slide, using the resources of the Lambda instances, and a thread lock so we don't write to the same topics from separate threads.
We use the partition number so we don't overwrite data in partitions, and we use the Lambda retry plus an SQS dead-letter queue for major failures; the retry is just for latency, networking, and minor problems. Consumer key elements: here again are some nice ideas from our perspective. First, a custom timestamp-based lock mechanism for parallel DynamoDB updates. There was an interesting dilemma here: we saw that there were multiple notifications for a certain subscriber. One person could receive multiple notifications, and a second person could receive nothing, just because there was a race between the workers: the first one won and got all the notifications. So we wanted to lock certain subscribers: if one was already notified, or the code was already applied to them and a notification was in progress, we don't try to notify them again, we just move to the next one. This is the lock per subscriber, per item, so each individual gets exactly one notification. Optimistic locking, the custom DynamoDB concurrency control, is a method built on top of the first one with the custom timestamp lock. We put a version on each item, and from whether the version changed we know whether we still need to notify the subscriber. It's another line of defense for achieving concurrency and speed while still notifying all the subscribers and not missing any of them. We also use the Lambda retry queue for errors, for both the Lambda producers and Lambda consumers, and a cold-start/warm-start methodology. We've seen that using a warm start helps, because this is the last phase of the notification: it starts from the producer, reaches the Kafka cluster, and after that we need something in a ready state.
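The version-based optimistic lock mentioned above is essentially a compare-and-swap: only apply the notification if the item's version has not changed since it was read. This in-memory sketch mirrors what a DynamoDB `UpdateItem` with a `ConditionExpression` such as `version = :expected` would do atomically on the server side; the table shape and attribute names are assumptions.

```python
def notify_if_unchanged(table, subscriber_id, expected_version, notify):
    """Optimistic concurrency control on a subscriber item.

    Apply the notification only if no other worker bumped the version first
    (DynamoDB equivalent: UpdateItem with ConditionExpression="version = :expected",
    where the conditional check and write happen atomically).
    """
    item = table[subscriber_id]
    if item["version"] != expected_version:
        return False              # lost the race: someone else already notified
    notify(item)                  # send SMS / mail / Telegram
    item["version"] = expected_version + 1
    item["notified"] = True
    return True
```

Note that a plain dict check like this is not atomic across threads; the point of doing it in DynamoDB is that the condition and the write are evaluated together, so concurrent Lambda consumers cannot both succeed.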
We keep a number of Lambda instances there in a ready state, ready to go through a list of subscribers, execute the code for each, and send the notifications. This is also an interesting input to auto scaling: if you have thousands of subscribers, or reach a certain number of subscribers, maybe you need a certain number of instances in a warm state rather than cold starts. Now, the Lambda Kafka consumer integration with the database: there is a custom DynamoDB locking scheme. This was the first version, as I explained a bit on the previous slide. From the earthquake topic, each partition triggers a certain Lambda execution, and from that execution the code acts on certain items using this kind of lock. From the timestamp on which we place the lock, we know an item is in the process of being updated, or was updated one millisecond ago, something like that, and the next trigger, say the trigger for topic two, moves on to the next item. It's like a row of dominoes viewed from the front: each trigger, each partition, goes gradually through the subscribers and notifies each of them. This is about using the full capacity of the triggers and not wasting multiple triggers, multiple code executions, on the same item. Although, since there is a loop going gradually through the items, we can still end up pointing a second trigger at the same item; it won't be notified twice or three times, but we do waste some time: two triggers lock on the same item while only one updates it. So this is the first version of what we were trying to achieve.
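The per-item timestamp lock can be sketched like this: a trigger claims an item by writing the current time, and other triggers skip the item while the lock is fresh. The field name and the TTL value are illustrative assumptions, and in the real system the claim would be a conditional DynamoDB write rather than a dict mutation.

```python
import time

LOCK_TTL = 5.0  # seconds an in-progress lock is honored (assumed value)

def try_claim(item, now=None):
    """Claim a subscriber item for notification unless another trigger
    holds a fresh lock on it (the "domino viewed from the front" idea:
    later triggers skip locked items and move on to the next one).
    """
    now = time.time() if now is None else now
    lock_ts = item.get("lock_ts")
    if lock_ts is not None and now - lock_ts < LOCK_TTL:
        return False              # update in progress elsewhere: skip this item
    item["lock_ts"] = now         # place the timestamp lock
    return True
```

The TTL also covers crashed workers: a lock left behind by a failed execution simply expires, so the item becomes claimable again instead of being stuck forever.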
Concurrency of notifications, notifying all the subscribers, and notifying them in the best way: that was the goal. The custom DynamoDB segmented subscriber processing was the second version of the Lambda consumer logic. Starting from the previous version, we segment the list of subscribers and keep locks per segment. For example, one trigger from a Kafka partition acts on a certain segment and locks it with a lock ID, say from ID 1 to ID 400; the trigger acts only on that particular segment. This way the triggers won't conflict and there are no race conditions: each trigger is allocated to a certain segment and just does the work. The trigger can be viewed as a worker, and the worker does the job just for that segment. So now we can view that domino row from the side rather than the front: workers, like soldiers, each covering a particular segment of items, which are basically subscribers, people, infrastructure members, things like that. That's where each trigger, each worker, focuses. This is an improvement in how we reach those subscribers through Kafka partitions and topics and the different Lambda code executions, which run in parallel, carrying this parallelism through to the end: the triggers, the workers, act on the subscribers in parallel. However, having a single worker for a segment like 1 to 400, or 4,000, is a lot; even with the very low-level multithreading inside that particular segment, you only have one worker, and it's hard work. That's the second version: parallel distribution and segmentation. In the Lambda Kafka consumer there are also maximum invocations and locks per segment, again domino-style, but we allow multiple executions.
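The segmented second version can be sketched as triggers claiming whole contiguous segments of the subscriber list instead of single items. The segment size of 400 comes from the talk's example; the helper names and the lock table shape are assumptions for illustration.

```python
def split_segments(num_subscribers, segment_size):
    """Divide subscriber IDs 1..num_subscribers into contiguous segments."""
    return [(lo, min(lo + segment_size - 1, num_subscribers))
            for lo in range(1, num_subscribers + 1, segment_size)]

def claim_segment(locks, segments, worker_id):
    """Each trigger (worker) claims the first unclaimed segment and owns it
    exclusively, so workers never race on the same subscribers.
    In the real system this claim would be a conditional DynamoDB write.
    """
    for seg in segments:
        if seg not in locks:
            locks[seg] = worker_id
            return seg
    return None  # every segment already has a worker
```

Compared with per-item locking, this trades lock granularity for far fewer lock operations: one claim covers hundreds of subscribers, at the cost of one worker carrying a whole segment.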
So we allow multiple code triggers to act on particular segments: multiple workers can work on the same segment, doing that low-level multithreading. In this example we let two triggers act on the same segment, so they can notify and execute the code faster; or we can allocate a maximum number of workers, say five, or anything like that, depending on the segmentation, which is done automatically: taking the maximum number of subscribers, dividing the list, and deriving the number of locks and the maximum workers. The important idea is that we are viewing the dominoes from the side, with multiple soldiers allocated to the same segment. They will not conflict with each other, there are no race conditions between them; multiple workers simply divide the work within the same segment. This is the improved version of the distribution between the Kafka code execution, the Lambda code execution, and the DynamoDB where we store the subscribers. This is where we are right now, but we're also trying other things to improve the speed, like that granular multithreading applied at the segment level, which is very low level, I could say. We'll keep researching and see what's best at each moment, adding new possible approaches every time. One more interesting thing: as I said, Kafka was integrated here to provide this parallel distribution, but instead of Kafka we could use Kinesis from AWS, which could work well, or different kinds of queues, just to stay close to the cloud, in the same network, rather than going out over the internet to the cluster and back. Maybe it could decrease the latency, but that's for later. Now, some things about auto scaling.
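The refinement with a maximum number of workers per segment is, in essence, a bounded counter: a trigger may join a segment only while fewer than `max_workers` triggers are already on it. The caps of two and five workers come from the talk's example; the helper itself is a sketch, and the real increment would again be an atomic conditional update rather than a dict write.

```python
def join_segment(worker_counts, segment_id, max_workers):
    """Admit another trigger to a segment only while its worker count is
    below the configured maximum, so several workers can share a large
    segment without unbounded contention on the same subscribers.
    """
    count = worker_counts.get(segment_id, 0)
    if count >= max_workers:
        return False          # segment is saturated; the trigger tries the next one
    worker_counts[segment_id] = count + 1
    return True
```

This sits between the two earlier extremes: not one lock per item, not one exclusive worker per segment, but a tunable degree of parallelism inside each segment.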
Auto scaling is very interesting in this project, because we don't want manual interventions once the project reaches a stable state; we won't need those interventions anymore. That's why we're putting in place a kind of auto scaling, scaling the system up or down based on the number of subscribers. If we're in a range between, say, 0 and 200 or 300, we increase the threads, so basically the resources allocated to the Lambda producer instances, the topics and partitions in the Kafka cluster, and also the number of triggers and the number of Lambda consumers, the warm-state consumers. If the number of subscribers goes down, we can lower the resources on the fly again. It's hard to predict something and auto-scale through the roof before a major earthquake hits, but at least there is a possibility to scale based on this number of subscribers. For now this is our idea, but we can think about other signals that could be linked to auto scaling; not prediction, because predicting is hard, and we need something realistic. This was built through custom scripts, some Python combined with AWS CLI commands for the Lambdas and the Kafka clusters. These were placed in an external Lambda, which orchestrates the entire auto scaling. This external Lambda also handles the previous segmentation, allocating the maximum workers and segment IDs on the fly, looking at the number of subscribers, doing the calculations, and producing these numbers. Then there's the custom auto scaling using Terraform, so infrastructure as code. Right now we are trying to reduce the custom scripts and shells and move to infrastructure as code, something that's more standard in every project.
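The subscriber-count-driven scaling rule can be sketched as a lookup from subscriber ranges to resource counts, the kind of calculation the external orchestrator Lambda is described as doing. The thresholds and the returned numbers below are invented for illustration, not the project's actual tuning.

```python
def scaling_plan(num_subscribers):
    """Map the current subscriber count to resource counts for the stack:
    producer threads, Kafka partitions, and warm Lambda consumers.
    Tiers and values are illustrative assumptions.
    """
    tiers = [
        (300,  {"threads": 2, "partitions": 2, "warm_consumers": 2}),
        (1000, {"threads": 4, "partitions": 4, "warm_consumers": 4}),
    ]
    for limit, plan in tiers:
        if num_subscribers <= limit:
            return plan
    return {"threads": 8, "partitions": 8, "warm_consumers": 8}
```

In the script-based version these numbers would feed AWS CLI commands; in the Terraform version they would be written into a variables file (e.g. `threads = 4`) that a CI/CD pipeline applies, which is the mechanism the next part of the talk describes.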
Right now, any software project uses the same kind of logic: we also provision the infrastructure through infrastructure as code, all the services and everything, and, this is the interesting part, we also use it to scale up or down with GitHub Actions. Or we could use something in AWS for code deployment, because we need to check the number of subscribers and trigger the Terraform execution with a new list of what we need, parameters like: we need two threads, we need four threads, we need four topics. These numbers are placed in a certain file, which is passed to Terraform to execute against. It's a very interesting approach to how infrastructure as code can scale a serverless Lambda AWS architecture and notification system up or down. These things fall in the area of development operations: how to operate the entire system once it's in a stable position, after we've figured out the services we need and so on. It's a separate lane: keeping the creation of the resources and services automatic, keeping a snapshot of them at a certain point in time so that if something happens we can revert to a certain step, a certain snapshot of the resources and the architecture, and also using it for scaling up or down depending on the number of subscribers at each phase. So this is the project; this is the research done until now. We're in touch and having some discussions about how to proceed further, how to do other interesting research on these kinds of earthquake notifications, maybe to integrate it with what's already provided by governments, not only Romania's, who knows, to have it as an alternative, to have some kind of integration with what's already there, or to provide new ideas, anything like that.
Just bringing something to the table, a kind of innovation, we could say. If you have any questions, remarks, or opinions, please feel free to contact me by mail or LinkedIn. Just be aware that the project itself is registered, so if you take inspiration from it or want to do something similar, please contact us, or me, first, and we can align and discuss, because, like I said, it's a protected project under some licenses. We also have some open discussions now and then. Thank you, and thank you for being here; it was a pleasure. And thanks to the organizers for inviting me to present this research. Thank you.
...

Vlad Onetiu

Software Automation Engineer & DevSecOps @ DataIceberg



