Transcript
This transcript was autogenerated. To make changes, submit a PR.
So hello at Conf42 Site Reliability Engineering 2025.
Glad that you're here and listening to my talk.
Let's buckle up and dive deep into it.
The session title is Real-Time Earthquake Alerting System: Leveraging Serverless Architecture with Confluent Kafka.
Basically, it's a project mostly built in the AWS cloud with specific services, mostly serverless, in combination with Confluent Kafka just for the message distribution, achieving a fast delivery of the notifications.
Here are some broad lines about what I'm doing and what I did in the past: cloud development, cyber security, and all kinds of projects, architectures, designs, consultancy.
Also projects in the data science domain, from data extraction and data modeling up to, let's say, a full ETL pipeline or anything like that.
So I'm just trying to provide something practical from real-scenario data: practical projects based on specific demands of data.
Multiple security research projects for top Romanian banks.
So yeah, this one is also very interesting; let's say pen testing, if you want to call it like that.
An AWS Community Builder in the serverless niche for one year now.
If you have any questions after this talk, ideas, remarks, anything like that, you can contact me by mail or on LinkedIn; these are my addresses.
So yeah.
Enough about me.
Let's try to explain what's in this project.
The story behind the research project is spread between these questions; these bullet points are, let's say, questions.
For example, the first one: why the need for a high-performance earthquake notification system?
I'll start by telling you an interesting story about what we currently have in Romania in terms of an alerting system, which is provided by the government.
It's the Romanian alert application, which reaches very low-level hardware in our telephones.
So it's basically delivered with the telephone, and they're just emitting these kinds of alerts for bad weather, critical accidents, critical infrastructure failures, anything like that.
But an interesting point here is that in a very large share of the cases, we receive those alerts after the event has unfolded.
For example, after the bad weather is already gone.
So if you are seeing the rainbow, you'll just then receive the message; that's not very good.
Although the project and what the government develops is very, very good, with very nice ideas.
This is why I was thinking, back then, one year and a half ago: why not also have this kind of notification for earthquakes?
At that time, I was thinking to just have the notifications when the event is happening, because I knew I cannot predict it.
So I cannot have a machine learning model which is just trained on previous data and will predict something, because it's so critical and, let's say, unpredictable; it's a natural phenomenon.
So I started to look around and see what kind of data I could use, because I cannot just put some systems in the ground and get data from those, at least not right now.
What I'm about to tell you is where I basically found the data, and in which project.
I found that there is an institute for seismic movements and earthquake research linked to the Romanian government, which has some very talented researchers and very interesting projects.
And among those projects there was a very interesting one called the Rapid Alerting System, or something like that, which is based on physical systems placed in the northwest of Romania, in the, let's say, well-known area of seismic movements, at least in the last couple of years.
So they placed those systems in those areas, and they just emit data when even a very small seismic movement is detected.
So far, a very interesting thing.
But I was also thinking: okay, that area is at a distance of around 500 kilometers from Bucharest, which is the capital of Romania.
So if we translate the kilometers into time, there are a couple of minutes until the full waves, for example the full waves of a magnitude-seven earthquake, hit.
There are at least one or two minutes until the full waves are received in the rest of our towns; in the north and central north, two or three minutes, something like that.
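That time advantage can be sketched as a back-of-the-envelope calculation; the S-wave speed of roughly 3.5 km/s used here is a typical textbook value, not a figure from the talk:

```python
def lead_time_seconds(distance_km: float, s_wave_speed_kms: float = 3.5) -> float:
    """Seconds between an instant alert from the epicenter area and the
    arrival of the destructive S-waves at a city `distance_km` away.
    Ignores detection and transmission delay, so this is an upper bound."""
    return distance_km / s_wave_speed_kms

# ~500 km between the monitored seismic area and Bucharest, as in the talk
warning = lead_time_seconds(500)
print(f"{warning:.0f} seconds, about {warning / 60:.1f} minutes")
```

That lands in the two-to-three-minute window mentioned above, which is exactly the budget the notification pipeline has to fit into.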
So I was thinking: having this advantage, so having the advantage of time and distance, I can use those systems, basically that data, which is exposed by them in a browser application, I think, right now; I can just listen to that data and use it to spread it to certain consumers and subscribers, which can be people or critical infrastructure members.
So I almost answered the second question as well: where to find good real-time earthquake data.
Like I've said, this institute provides very interesting projects.
And that particular one, with the systems placed in the ground emitting data at the very smallest seismic movements, was very helpful.
So we integrated with that emitting system, and we are listening for events.
Basically, why choose the serverless ecosystem?
I'm an ambassador of AWS serverless, Lambdas, and anything related to that; from my perspective, due to my previous projects, I've integrated Lambdas and serverless almost anywhere.
But besides the knowledge, it was also practical.
Why am I saying practical? Because for this particular project, we don't need servers running nonstop, 24/7.
We just need some listeners which need to be prepared when the earthquake hits: just spin up those kinds of instances or containers and use them to spread and distribute the notifications very fast.
This was the first thing.
The second thing is also the costs, because if we have servers up 24/7, there's also a matter of the costs involved, and of how to scale them on the fly, in milliseconds.
So these two things were the base pillars of choosing AWS Lambda and serverless.
Moving on to the architecture of the serverless notification system.
What you see here is the base architecture, which is the first version, let's say.
I'll try to explain it in broad lines, because we will deep dive in the next slides.
What's in gray is the backbone of the system, which has the Lambda producers, which I'm calling Lambda producers because of the Kafka integration, and also the Lambda consumers.
In the center you have the topics and partitions, the backbone of Kafka itself.
The yellow ones are the error handling mechanisms: SQS dead-letter queues, just for collecting all kinds of errors.
If there are networking or latency errors or things like that, we'll just retry that code execution; but if it's a major failure, we'll put it in a particular queue and keep it there for a manual study of the logs and of the error itself.
The blue ones are for monitoring: just monitoring the number of instances, how the system behaves, the costs.
They are very important because, as I will show you afterwards, we are also scaling up, so we need to see the costs and how the system behaves; a little, let's say, central kind of monitoring for everything, including errors.
And yeah, also the end channels, which you can view on the right: Twilio for SMS, mail through AWS SES, or Telegram, which is like a backup thing from our end, from our perspective.
The purple ones are for CI/CD: pipelines for creating the infrastructure, but we are also thinking, as you can view in the next slides, to scale up and down through this kind of CI/CD idea; the builds, some scanning, just automation things on the repository.
And the block in green is the application: the web application where the end subscribers or the critical infrastructure members can join the backbone system.
What I mean by join is just subscribing to the system: providing maybe a telephone number or some kind of information about the subscriber.
Based on that, you can see there is a database there, a DynamoDB database, connected to the backbone of the system, which will just be iterated: we will iterate through the members, through the subscribers, and try to notify everyone.
Also, some Lambda functions there for the different kinds of APIs, and some protection through the firewall.
The main application, we were thinking to put it in a Kubernetes infrastructure, but we will see; maybe we'll use some kind of serverless for that one as well.
Architecture of the base notification system: like I've said on the previous slide, this is the backbone, the gray area, if you want to call it that.
You can see the Kafka producers, Kafka consumers, and the Kafka topics in the middle.
In the Kafka producers, we have this thing called multithreading: just using, let's say, the full capacity, the full resources of the Lambda instances, the CPU, and trying to separate threads and link them to certain partitions or topics in Kafka.
So this is the beginning of rapid, fast production and notification; basically the first step.
We're also trying to put a topic roulette on top of what Kafka has, the topics and partitions, and trying not to override messages in the same partitions and topics: just have a thread and a partition, put the message on it, and connect it to the Kafka consumers.
This is also an interesting key idea of this project: to have the partitions and basically topics in Kafka triggering certain instances of Lambda consumers, achieving parallelism and distributing the message in the fastest way possible.
The yellow box is monitoring, at least for the kind of major failures.
So the errors are put in the error queue; for the failed events, like networking, latency, things like that, we will just retry two or three times.
We have the end channels where the notification is sent: SMS, mail, and Telegram as backup.
And the DynamoDB, so the database, the blue one, is where the subscribers are present and registered.
And the Kafka producers listen for that data, which I've explained before, which is provided through the systems of the institute.
Another interesting thing here: we'll deep dive a bit into the architecture of the Kafka integration in the serverless context.
Like I said, Kafka is here to somehow distribute the messages, the notifications, in parallel, and to also trigger parallel spins of Lambda consumers.
So for every partition we will have a certain code execution, like a certain trigger, let's say a certain Lambda as a consumer, which will try to notify a certain subscriber.
We have the threads just using the resources, and the topic roulette built on top of what Kafka already has, the topics and partitions, trying to fill up every partition to be able to trigger from it the Lambda consumers, which have the logic to just iterate the database of subscribers and notify them in the end.
Yeah.
So here we also have some yellow dots, which represent the scaling possibilities: scaling in terms of resources and threads, and scaling in terms of partitions and topics.
And again, in the Kafka consumers, just trying to scale the triggers based on what Kafka has, and also the Lambda consumers.
The Lambda Kafka producers' key elements: we have the multithreading, explained on the previous slide, using those resources from the Lambda instances; a thread lock, so we don't write to the same topics from separate threads; using the partition number, so we don't override data in partitions; and using Lambda retries and an SQS dead-letter queue for major failures, while the retry is just for latency, networking, minor problems.
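A minimal sketch of that producer side, with a stub `send` standing in for a real Kafka client's produce call (the real project uses Confluent Kafka; the topic name and message shapes here are illustrative):

```python
import threading
from collections import defaultdict

# Stand-in for the Kafka cluster: partition -> list of messages.
sent = defaultdict(list)
# The "thread lock" key element: only one thread writes at a time.
send_lock = threading.Lock()

def send(topic: str, partition: int, message: str) -> None:
    """Stub for a Kafka produce() call; records the message per partition."""
    with send_lock:
        sent[partition].append(message)

def producer_worker(partition: int, payloads: list) -> None:
    # Each thread is pinned to its own partition (the "topic roulette"),
    # so threads never override each other's messages.
    for payload in payloads:
        send("earthquake-events", partition, payload)

# 4 partitions -> 4 producer threads, using the instance's full capacity.
threads = [
    threading.Thread(
        target=producer_worker,
        args=(p, [f"event-{p}-{i}" for i in range(3)]),
    )
    for p in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print({p: len(msgs) for p, msgs in sorted(sent.items())})  # {0: 3, 1: 3, 2: 3, 3: 3}
```

The point of the one-thread-per-partition pinning is that every partition fills up independently, so every partition can trigger its own consumer downstream.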
Consumer key elements: here are again some nice ideas from our perspective.
A custom timestamp lock mechanism for parallel DynamoDB updates.
There was an interesting dilemma here, let's say, because we saw that there were multiple notifications for a certain subscriber.
So one fellow could receive multiple notifications, and the second guy would receive nothing, just because there was like a race between them, and the first one won and just got all the notifications.
So we wanted to somehow lock certain subscribers: if one was already notified, basically the code was already applied to them and they received, or are in progress of receiving, a notification, so we don't try to notify them again; we just move to the next one.
This is the lock per subscriber, per item, to have one notification for each individual.
Optimistic locking, the custom DynamoDB concurrency control, is a method built on top of the first one, the custom timestamp lock.
We are just putting in a version, and we know from whether the version changed if we need to notify the subscriber or not.
It's, let's say, another fence for achieving the concurrency and the speed, but also for notifying all the subscribers and not missing one of them.
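The optimistic-locking idea can be sketched like this; an in-memory dict stands in for the DynamoDB table, and in the real system the atomic check-and-increment would be a conditional UpdateItem on the version attribute:

```python
import threading

# Stand-in for the DynamoDB subscribers table; the real system would use
# UpdateItem with ConditionExpression="version = :expected_version".
table = {"sub-1": {"version": 0, "notified": False}}
# Models DynamoDB's atomicity for a single conditional write.
table_lock = threading.Lock()

def try_notify(subscriber_id: str, expected_version: int) -> bool:
    """Optimistic lock: succeed only if nobody updated the item since we read it."""
    with table_lock:
        item = table[subscriber_id]
        if item["version"] != expected_version:
            return False  # another consumer won the race; skip to the next item
        item["version"] += 1
        item["notified"] = True
        return True

# Two consumers both read version 0, then race to notify the same subscriber:
first = try_notify("sub-1", 0)
second = try_notify("sub-1", 0)
print(first, second)  # True False -> exactly one notification goes out
```

Exactly one caller wins, which is the fence described above: concurrency stays high, but no subscriber is notified twice and none is skipped.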
Yeah, and using the Lambda retry queue for errors, for the Lambda producers and Lambda consumers, and using a cold start / warm start methodology.
We are seeing that using a warm start helps, because it's the last phase of the notification.
So it started from the producer, reached the Kafka cluster, and after that we need something that's in a ready state: just have a number of instances and Lambdas there, and try to have them in a ready state, ready to go through a list of subscribers, execute the code for each, and send the notifications.
But this is again an interesting procedure, also in auto scaling, basically: if you have thousands of subscribers, or you reach a certain number of subscribers, maybe you need a certain number of instances in a warm state, avoiding cold starts, things like that.
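One common way to keep consumer Lambdas warm is a scheduled ping that touches the instance and returns early; this sketch assumes a ping event shaped like `{"warmup": true}`, which is an illustration, not necessarily the project's actual mechanism:

```python
# Module-level state survives between invocations on the same warm instance,
# which is what lets us tell a cold start from a warm one.
COLD = {"is_cold": True}

def handler(event, context=None):
    was_cold = COLD["is_cold"]
    COLD["is_cold"] = False
    if event.get("warmup"):
        # Warm-up ping (e.g. from a scheduled rule): do no real work.
        return {"warmed": True, "was_cold_start": was_cold}
    # Real invocation: notify the batch of subscribers carried by the event.
    return {"notified": len(event.get("subscribers", []))}

print(handler({"warmup": True}))            # first ping hits a cold instance
print(handler({"subscribers": ["a", "b"]})) # later real work runs warm
```

The trade-off is exactly the one raised above: the number of instances you keep pinged warm should itself scale with the subscriber count.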
The Lambda Kafka consumer integration with the database.
So there is a custom DynamoDB locking; this was the first version, like I've explained a bit on the previous slide.
From the earthquake topic, from the partition, we're just trying to trigger a certain Lambda execution, and from that execution the code will act on certain items using this kind of lock.
With the timestamp on which we are placing the lock, we know that an item is in progress of being updated, or that it was just updated like one millisecond ago, something like that.
And the next trigger, so the trigger from topic two, will go to the next one.
So it's going like a domino thing, but viewed from the front: viewing a domino row from the front and just going gradually with each trigger, with each partition, going through each subscriber and notifying each one of them.
This is for just using the full capacity of the triggers and not wasting multiple triggers, so multiple code executions, on a certain item.
Although, since there is a loop going gradually through the items, we can end up using the same trigger for the same item.
Yeah, it won't be notified like twice or three times, but we have a waste, let's say, of time: the trigger locked twice on the same item, but just updated it once.
Yeah.
So this is the first version of what we were trying to achieve: concurrency of notifications, notifying the subscribers in the best way, and notifying all the subscribers.
The custom DynamoDB segmented subscriber processing: this was the second version of the Lambda consumers' logic.
So, starting from the previous version, we are trying to segment the list of subscribers and have locks per certain segments.
For example, one trigger from a partition from Kafka will just act on a certain segment and lock that segment with a lock ID or something like that, from ID 1 to ID 400, something like that.
So the trigger will act just on that particular segment.
So in this one, the triggers won't conflict; there won't be any race conditions.
Let's say each trigger will be allocated to a certain segment and will just do the work.
So the trigger can be viewed like a worker, and the worker will do the job just for that segment.
So from now on we can view that domino from the lateral view: we are viewing workers, like soldiers, on particular segments of items, which are basically subscribers: people, infrastructure members, things like that.
Just on those, the triggers, the workers, will focus.
This is an improvement in terms of how we are reaching those subscribers through Kafka partitions and topics and Lambda code.
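A sketch of that per-segment locking, with a plain dict standing in for the DynamoDB lock table and sequential workers in place of parallel triggers; the segment size and worker names are illustrative:

```python
SEGMENT_SIZE = 400  # e.g. IDs 1-400, 401-800, ... as in the talk's example

# Stand-in for a DynamoDB lock table: segment id -> owning worker.
segment_locks = {}

def claim_segment(segment_id: int, worker: str) -> bool:
    """Models a conditional put: succeeds only if the segment is unclaimed."""
    if segment_id in segment_locks:
        return False
    segment_locks[segment_id] = worker
    return True

def worker_run(worker: str, subscribers: list) -> list:
    """Notify every subscriber in the first segment this worker can claim."""
    for start in range(0, len(subscribers), SEGMENT_SIZE):
        seg_id = start // SEGMENT_SIZE
        if claim_segment(seg_id, worker):
            return subscribers[start : start + SEGMENT_SIZE]
    return []  # every segment already has a worker

subs = [f"sub-{i}" for i in range(1000)]  # 3 segments: 400 + 400 + 200
a = worker_run("worker-A", subs)
b = worker_run("worker-B", subs)
print(len(a), len(b))  # 400 400 -> disjoint segments, no race between workers
```

Since every worker owns exactly one segment, two workers can never touch the same subscriber, which is the no-race property described above.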
Different kinds of Lambda code executions, which are done in parallel, also moving with this parallel view through to the end: having the triggers, the workers, act in parallel on the subscribers.
But having one worker for a segment like 1 to 400, or 4,000, is a lot; even though there will be very low-level parallel threading in that particular segment, you only have one worker.
It's hard work there.
This is the second version, so this is a parallel distribution and segmentation.
In the Lambda Kafka consumer there are, like, maximum invocations and locks per segment; also domino style, but we let multiple executions, so multiple code triggers, act on, let's say, particular segments.
So we allow multiple workers to work on the same segment, doing that low-level multithreading.
So letting, I don't know, maybe two: like in this example, we let two triggers act on the same segment, so they can resolve, or notify and execute the code, faster.
Or, let's say, we can allocate a maximum number of workers, like five in this example, or anything like that, just depending on the segmentation, which is done in an automatic way: just taking the maximum number of subscribers, dividing that list somehow, and getting the numbers of locks and maximum workers, things like that.
So the important idea is that we are viewing the domino from the lateral view, and we have multiple soldiers allocated to the same segment.
They will not conflict with each other; there won't be any race conditions between them.
They will just act like multiple workers dividing the work in the same segment.
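One way several workers could split a single segment without ever touching the same subscriber is a simple stride-based division; the talk does not specify the exact scheme, so this is only a sketch:

```python
MAX_WORKERS = 2  # maximum invocations allowed on one segment, as in the example

# One 400-subscriber segment, e.g. IDs 1-400.
segment = [f"sub-{i}" for i in range(1, 401)]

def worker_share(worker_index: int, seg: list) -> list:
    """Worker slot `worker_index` (0-based) takes every MAX_WORKERS-th subscriber,
    so the shares of different workers never overlap."""
    return seg[worker_index::MAX_WORKERS]

shares = [worker_share(i, segment) for i in range(MAX_WORKERS)]
print([len(s) for s in shares])  # [200, 200]
```

Together the shares cover the whole segment with no overlap, so the workers divide the work exactly as described, with no race conditions between them.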
This is the improved version of the distribution between the Kafka code execution, the Lambda code execution, and the DynamoDB where we store the subscribers.
So this is the level right now.
But we're also trying other things, just improving the speed and everything, like I've said, with that granular multithreading applied at the segment level.
So it's very low-level, I could say.
So we will just keep on researching and see what's best for that moment, improving every time with new possible ways.
Again, one interesting thing: like I've said, Kafka was integrated here just to resolve this parallel distribution, but instead of Kafka we could use Kinesis from AWS, which could work well, or a different kind of queue, just to be close to the cloud: not going out to the internet, to the cluster, and just coming back, but staying in the same network.
Maybe it could decrease the latency, but this is for later.
Some things about auto scaling, which is very interesting for this project, because we don't want to have manual interventions; when the project reaches a kind of stable state, we won't need those manual interventions anymore.
This is why we are thinking to put in place a kind of auto scaling, scaling the system up or down based on the number of subscribers.
So if we have a range between, I don't know, zero and 201 or 301, we'll just increase the threads, so basically the resources allocated in the Lambda producer instances, the topics and partitions in the Kafka clusters, and also the number of triggers and the number of Lambda consumers.
Yeah, like the warm-state consumers. If we are going down with the subscribers, we can just lower the resources on the fly again.
Yeah, it's hard to predict something, to autoscale through the roof before having that major earthquake, but at least there is a possibility to scale it based on this number of subscribers.
For now this is our idea, but we can think about some other features, let's say, which could be linked to auto scaling; not predicting, because predicting is hard.
We need something that's realistic.
So this was made through custom scripts, like some Python combined with AWS CLI commands for the Lambdas and for the Kafka clusters, just some custom scripts.
And these were placed in a Lambda, like an external Lambda, which orchestrates the entire auto scaling and everything.
This kind of external Lambda also acts for the previous segmentation: just allocating different maximum numbers of workers and segmentation IDs on the fly, again, by viewing the number of subscribers, doing the calculations, and getting these numbers.
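The orchestrator's calculation might look like the following sketch; the tier thresholds and resource counts are illustrative, not the project's real configuration:

```python
# Each tier maps a subscriber-count ceiling to the resources to provision:
# (max_subscribers, producer_threads, partitions, warm_consumers)
SCALING_TIERS = [
    (200,    2, 2,  2),
    (1000,   4, 4,  4),
    (10000,  8, 8, 16),
]

def scaling_params(subscriber_count: int) -> dict:
    """Pick the smallest tier that covers the current subscriber count;
    counts above the last ceiling fall back to the largest tier."""
    for max_subs, threads, partitions, consumers in SCALING_TIERS:
        if subscriber_count <= max_subs:
            break
    return {
        "producer_threads": threads,
        "partitions": partitions,
        "warm_consumers": consumers,
        "max_workers_per_segment": max(1, partitions // 2),
    }

print(scaling_params(150))   # smallest tier
print(scaling_params(5000))  # largest tier
```

The orchestrating Lambda would run something like this on a schedule, compare the result with what is currently deployed, and issue the AWS CLI calls only when the numbers differ.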
The custom auto scaling using Terraform, so infrastructure as code.
Right now we are trying to reduce the custom scripts, so the custom shells and things like that, and just move to infrastructure as code, something that's more used in every project.
Right now, every software project has the same kind of logic: we're also provisioning the infrastructure, all the services and everything, through infrastructure as code.
And that's the interesting part: we also use it to scale up or down with GitHub Actions.
Or we could use something in AWS for code deployment, because we need to check the number of subscribers and trigger the Terraform execution with a new list of what we need, with parameters like: we need two threads, we need four threads, we need four topics.
So these are the kinds of numbers which are placed in a certain file, which is sent to Terraform, just to execute on that list.
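That "numbers placed in a certain file" step could be sketched like this, writing a `*.tfvars.json` file that a later `terraform apply -var-file=...` run (not shown) would consume; the variable names are assumptions, not the project's actual schema:

```python
import json
import os
import tempfile

def write_tfvars(path: str, threads: int, topics: int, consumers: int) -> None:
    """Write the desired resource counts as a Terraform variables file.
    Terraform natively reads files ending in .tfvars.json as variable input."""
    params = {
        "producer_threads": threads,
        "kafka_topics": topics,
        "warm_consumers": consumers,
    }
    with open(path, "w") as f:
        json.dump(params, f, indent=2)

path = os.path.join(tempfile.gettempdir(), "scaling.tfvars.json")
write_tfvars(path, threads=4, topics=4, consumers=8)
with open(path) as f:
    print(f.read())
```

A GitHub Actions job could then commit or pass this file to the Terraform step, so the scale-up or scale-down is just another infrastructure-as-code apply.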
So it's a very interesting approach of how infrastructure as code acts to scale up or down a serverless, Lambda, AWS architecture and notification system, basically.
Yeah.
These were in the area of development operations: how to operate the entire system when it's in a stable position, after we figure out the services and what we need, and so on and so forth.
This is like a separate lane of how to keep it automatic, let's say, also for the creation of the resources and the services and everything.
Then, to have a certain snapshot of them in time: if something happens, we could revert to a certain step, a certain snapshot of the resources and the architecture.
And we could also use it for scaling up or down, depending on the number of subscribers, in this phase.
So this is the project; this is the research that has been done until now.
Yeah, we are in touch with some people, we have some discussions on how to proceed further, how to do other interesting research for these kinds of earthquake notifications; maybe to integrate it into what's already provided by the governments, not only Romania's, who knows; to have it as an alternative, or just have some kind of integration between what's already there, or provide new ideas, anything like that.
Yeah, just bringing something to the table, a kind of innovation, we could say.
So if you have any questions or remarks or opinions, please feel free to contact me by mail or on LinkedIn.
Just be aware that the project itself is registered.
So yeah, if you have some inspiration from it, or you want to do something similar, please just contact us, or me, in the first place, and we could align and discuss.
Firstly, like I've said, it's a protected project under some licenses.
Yeah, we also have some open discussions now and then.
So yeah.
Thank you, and thank you for being here.
It was a pleasure.
And thanks to the organizers for inviting me to present this research.
Thank you.