Transcript
Hello everyone.
I'm Nikita Kamath and I'm thrilled to be here today to talk about real
time notification systems, a critical component in today's digital world.
These systems are responsible for delivering alerts, messages,
or updates to users instantly.
Whether it's a push notification, an SMS, or an email, they play a huge
role in user engagement and retention.
When done right, they feel seamless and intuitive.
When they fail, users will notice immediately.
Today we'll explore how to architect these systems to scale globally, stay
reliable under pressure, and deliver messages that matter at the right time.
So let's talk about the first thing that matters most in these large-scale, real-time
notification systems, and that's scale.
Or rather, let's call it the scale challenge, because before
we optimize, we have to survive.
Platforms like Facebook handle about 8 billion notifications every day.
That's nearly a hundred thousand notifications per second.
And these systems are expected to deliver with latencies of
less than 75 milliseconds.
Do you know that's less than the time it takes a human to blink?
On top of that, usage patterns can be unpredictable.
Think major news events, celebrity live streams, or e-commerce flash sales.
These trigger traffic spikes of around 300% over the baseline.
Architectures need to handle this level of scale gracefully,
which means being both horizontally scalable and resilient in real time.
So let's talk about the architecture that drives
these real-time notification systems.
To handle such scale and complexity, we move away from monoliths
and adopt a microservice architecture.
Each service is small, focused, and independently deployable.
One service handles user preferences, another manages device tokens,
a third queues messages, and a fourth actually sends them.
This lets teams deploy and scale just the services under stress, like
delivery services during a big campaign, without
impacting the rest of the system.
There are hidden savings on COGS there: you only scale where you need
to and not the whole system.
Another huge benefit is technology diversity.
You might use Node.js for low-latency delivery services, Python for analytics, and Go
for performance-critical components.
This modularity also further enhances fault tolerance: if one service fails,
it doesn't bring down the whole system; the impact is limited to
just that feature, that piece of functionality.
Moving on, the next thing we'll talk about is event-driven architecture,
which often goes hand in hand with microservices.
Microservices really shine when paired with event-driven architecture.
In this model, services emit events instead of calling each other.
They're asynchronous, meaning producers no longer wait for consumers.
For example, when a user likes a photo, say on Facebook or Instagram, the app
emits an event; the notification system in the background listens to this event and
triggers the appropriate notification.
All the while, the user is totally oblivious, and the user's experience as such is
absolutely not impacted or affected.
This pattern creates natural decoupling, so services can evolve independently.
It also introduces buffer zones that absorb traffic spikes without passing
on back pressure to the whole system.
It's a resilient, scalable pattern that can handle
10,000 to a hundred thousand messages per second with sub-millisecond
internal latency.
Isn't that fascinating?
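To make that concrete, here's a minimal Python sketch of the emit-and-react pattern, assuming the kafka-python client, a local broker, and made-up topic and field names; none of these specifics come from the talk.

```python
# Emit an event when a user likes a photo; a separate service reacts asynchronously.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def on_photo_liked(liker_id: str, owner_id: str, photo_id: str) -> None:
    # The app only emits the event; it does not wait for the notification system.
    producer.send("photo-liked-events", {
        "liker_id": liker_id,
        "owner_id": owner_id,
        "photo_id": photo_id,
    })

def run_notification_listener() -> None:
    # The notification service consumes the same topic in the background.
    consumer = KafkaConsumer(
        "photo-liked-events",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        group_id="notification-service",
    )
    for event in consumer:
        payload = event.value
        print(f"Notify {payload['owner_id']}: {payload['liker_id']} liked your photo")
```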
And what can event-driven architecture not really function without?
You guessed it, right?
A message broker.
A message broker is basically the glue that holds the whole system together.
Two of the most popular options that we have on the market are Kafka and
RabbitMQ, and they serve different purposes.
Kafka is ideal for high throughput streaming, capable of handling
millions of events per second.
It's optimized for bulk data, fault tolerant, and built to scale.
On the other hand, RabbitMQ is more feature rich in terms of routing.
It supports conditional delivery, topic-based routing, and priority queues.
This is perfect when you need to deliver different types of messages
to different devices or user segments.
Many real-time systems use both: Kafka for ingesting events and RabbitMQ for
routing and personalization.
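As a rough sketch of that split, here's what the RabbitMQ routing side could look like in Python with the pika client; the exchange, queue, and routing-key names are purely illustrative, and the Kafka ingest consumer that would feed it is only hinted at.

```python
# Route ingested notification events by channel and priority via a topic exchange.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A topic exchange lets us route by channel and priority, e.g. "push.high" or "email.low".
channel.exchange_declare(exchange="notifications", exchange_type="topic")
channel.queue_declare(queue="push-high", arguments={"x-max-priority": 10})
channel.queue_bind(queue="push-high", exchange="notifications", routing_key="push.high")

def route_notification(channel_name: str, priority: str, payload: dict) -> None:
    # An upstream Kafka consumer (not shown) would call this for each ingested event.
    channel.basic_publish(
        exchange="notifications",
        routing_key=f"{channel_name}.{priority}",
        body=json.dumps(payload).encode("utf-8"),
    )

route_notification("push", "high", {"user_id": "u123", "text": "New login alert"})
```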
We discussed earlier that distributed systems thrive on two specific cornerstones:
one being scale, which we just discussed, and the second being resiliency.
So on the next slide, we're going to talk about resiliency patterns.
Let's take a closer look at resiliency, because in large-scale
distributed systems, failure is not just possible, it's imminent, a certainty
that's going to happen someday or the other.
These systems consist of dozens or even hundreds of services
communicating over the network.
This means a failure can cascade fast if not handled properly.
A small issue, like a slow Redis instance, can literally
snowball into a full outage.
There are different ways we can stay resilient.
Let's start off with the first one on the slide: circuit breakers.
A circuit breaker is nothing but a fuse in your electrical system.
If a downstream service starts failing, say 50% of the requests to a push
notification provider are timing out, the circuit opens
and stops sending traffic temporarily.
This prevents your system from wasting resources on a failed component and gives
the troubled service some time to recover. After a cooldown, your service will again
test the waters and resume only if the downstream service is healthy again.
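Here's a minimal, hand-rolled circuit breaker sketch in Python, not tied to any particular library; the 50% threshold and 30-second cooldown are just example values. The delivery path would check allow_request() before calling the provider and feed the outcome back through record().

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: float = 0.5, window: int = 20, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold  # e.g. open when 50% of recent calls fail
        self.window = window                        # number of recent calls to look at
        self.cooldown = cooldown                    # seconds to wait before testing the waters
        self.results = []                           # True = success, False = failure
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through once the cooldown has passed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, success: bool) -> None:
        if success and self.opened_at is not None:
            # Trial call after the cooldown succeeded: close the circuit again.
            self.opened_at = None
            self.results = []
            return
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) == self.window and failures / self.window >= self.failure_threshold:
            self.opened_at = time.monotonic()  # too many recent failures: open the circuit
```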
Next, we're going to talk about bulkheads.
Let's think of a ship here.
It has compartments, or bulkheads, so if one floods,
the whole ship doesn't go down, right? It doesn't sink.
Similarly, in the world of software, it primarily means isolating
components so failures can be contained.
For example, if SMS delivery fails due to a third-party provider, it
shouldn't affect push notifications.
Each channel operates in its own resource pool: separate threads, separate queues,
sometimes even separate services. With that isolation, at least
your whole system is not failing catastrophically.
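A simple way to picture the bulkhead idea in code: give each channel its own bounded worker pool, something like this Python sketch. The pool sizes and sender functions are placeholders, not real provider SDK calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate, bounded resource pools per channel.
POOLS = {
    "push": ThreadPoolExecutor(max_workers=32, thread_name_prefix="push"),
    "sms": ThreadPoolExecutor(max_workers=8, thread_name_prefix="sms"),
    "email": ThreadPoolExecutor(max_workers=16, thread_name_prefix="email"),
}

def send_push(payload: dict) -> None: ...   # placeholder for the real push call
def send_sms(payload: dict) -> None: ...    # placeholder for the third-party SMS call
def send_email(payload: dict) -> None: ...  # placeholder for the email call

def deliver(channel: str, payload: dict) -> None:
    senders = {"push": send_push, "sms": send_sms, "email": send_email}
    # If the SMS pool is saturated because a third-party provider is slow,
    # push notifications keep flowing through their own pool.
    POOLS[channel].submit(senders[channel], payload)
```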
Let's talk about timeout management here.
A slow service can be more dangerous than a failed one.
It consumes threads and memory and blocks upstream components.
That's why we use strict timeouts with exponential backoff.
If a downstream call takes more than your timeout,
say it's set to 200 milliseconds,
we just abandon it.
We don't wait endlessly, and with exponential backoff, each retry is spaced
out to reduce load and increase the chance of recovery for the downstream system.
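In Python with asyncio, enforcing that 200-millisecond budget might look roughly like this; call_push_provider is just a stand-in for the real downstream call.

```python
import asyncio
from typing import Optional

async def call_push_provider(payload: dict) -> str:
    ...  # placeholder for the real network call

async def deliver_with_timeout(payload: dict) -> Optional[str]:
    try:
        # Abandon the call instead of letting it hold threads and memory upstream.
        return await asyncio.wait_for(call_push_provider(payload), timeout=0.2)
    except asyncio.TimeoutError:
        return None  # caller decides whether to retry with backoff (see below)
```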
We spoke about exponential backoff, but wait,
where does it come into the picture?
That brings us to the next topic, which is retry policies.
Not all failures are permanent. Transient failures like a network
hiccup can often be resolved by retrying, but we must be smart.
Use exponential backoff to avoid hammering the downstream service repeatedly.
Add jitter, or randomness, to spread retries out over time and
avoid synchronized spikes, a problem also known as the thundering herd.
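A minimal sketch of that retry policy, with exponential backoff capped at a maximum delay and full jitter; the attempt counts and delays are illustrative.

```python
import random
import time

def retry_with_backoff(operation, max_attempts: int = 5,
                       base_delay: float = 0.1, max_delay: float = 5.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at max_delay,
            # with full jitter so clients don't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```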
These resiliency patterns together reduce the blast radius of a failure
and let the system degrade gracefully instead of catastrophically.
Is failure ever a great thing?
No, but at least it's more contained and handled better than the whole
system going down all at once.
Coming to our next topic, sharding and partitioning.
Sharding is how we scale horizontally and keep performance consistent
even as we add users, regions, and functionality.
There is no one-size-fits-all solution here.
So let's take a moment and explore the different sharding strategies.
Starting off with user-based sharding: shard by user ID to ensure
ordering and consistency.
If all notifications for a user are handled by the same partition, we can
guarantee that they arrive in order.
This is critical to the experience, especially for messaging,
security alerts, or sequential delivery where ordering matters.
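In code, user-based sharding can be as simple as a stable hash of the user ID; the partition count here is an arbitrary example.

```python
import hashlib

NUM_PARTITIONS = 64

def partition_for_user(user_id: str) -> int:
    # A stable hash keeps every notification for the same user on the same
    # partition, which is what preserves per-user ordering.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```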
The next thing is geographic sharding.
Where does geographic sharding come into the picture here?
The closer your data and services are to the user, the
faster you can deliver.
With geographic sharding, we route users in Europe to an EU-based cluster
and users in the US to a US-based cluster.
This reduces round-trip time, which is especially important for push
notifications on mobile devices, where latency is visible to users.
So where ordering matters and latency matters too, that's where the
combination of geographic and user-based sharding comes into the picture.
The next thing is time-based sharding.
This separates hot data, like recent notifications, from
cold data, like archived messages.
It improves performance and simplifies data lifecycle management.
For instance, messages from today are placed in a fast-access
partition, while older messages are placed in cheaper, slower storage.
What is functional sharding?
We often separate different types of notifications.
A high-priority alert, like a suspicious login, doesn't go through the same
pipeline as a promotional banner.
This allows for tailored delivery strategies, failure
isolation, and even different compliance requirements per notification type.
By combining these strategies, as we just discussed, we
can scale to millions of users while keeping latencies under 15 milliseconds.
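Pulling those strategies together, a routing decision might look something like this sketch; the cluster, pipeline, and threshold names are all made up for illustration.

```python
import hashlib
from datetime import datetime, timedelta, timezone

NUM_PARTITIONS = 64

def route_notification(user_id: str, region: str, notif_type: str, created_at: datetime) -> dict:
    # User-based: stable hash of the user ID preserves per-user ordering.
    user_partition = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % NUM_PARTITIONS
    return {
        "cluster": "eu-cluster" if region.upper() == "EU" else "us-cluster",      # geographic
        "partition": user_partition,                                              # user-based
        "pipeline": "critical" if notif_type in {"security", "2fa"} else "bulk",  # functional
        "storage": ("hot" if datetime.now(timezone.utc) - created_at < timedelta(days=1) else "cold"),  # time-based
    }
```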
That brings us to our next topic, which is adaptive rate limiting.
Rate limiting is traffic control for your services.
It ensures that no component is overwhelmed and that critical traffic gets through first.
But in modern systems, static limits just don't work.
The system needs to adapt these limits in real time,
monitor system load, and calculate capacity dynamically.
So let's go through this like a brainstorming session, right?
What do we need for adaptive rate limiting?
Firstly, we need to know and monitor the system load.
That's where the observability of our services comes into the picture.
We continuously track health metrics, whether it's CPU,
memory, request latencies, queue depth, and also downstream success rates.
This gives us a live pulse on how the system is performing, or what we call in
software QoS, quality of service.
This enables us to calculate capacity dynamically.
Based on current and historical data, we estimate safe throughput
using predictive models.
If a service has historically degraded at around 70% CPU,
we set that as our soft ceiling.
We also monitor burst patterns.
For example, say that every Friday at 6:00 PM we see a spike or a peak;
then we can adjust our models proactively.
So we have the data, we have our capacities, and now we can actually
talk about what we do, or how we adjust rate limits, when load spikes.
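As a simplified sketch of how that dynamic capacity could feed the rate limiter, here the allowed ingress rate shrinks as observed CPU climbs past the 70% soft ceiling; all the numbers are illustrative.

```python
def allowed_ingress_rate(baseline_rate: float, cpu_utilization: float,
                         soft_ceiling: float = 0.70) -> float:
    if cpu_utilization <= soft_ceiling:
        return baseline_rate  # plenty of headroom: accept the full baseline
    # Above the ceiling, shrink the accepted events/sec proportionally to the overshoot.
    overshoot = (cpu_utilization - soft_ceiling) / (1.0 - soft_ceiling)
    return baseline_rate * max(0.1, 1.0 - overshoot)  # never throttle below 10% of baseline

# Example: at 85% CPU with a 10,000 events/sec baseline,
# allowed_ingress_rate(10_000, 0.85) -> 5,000 events/sec.
```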
One of the ways we can do this is to adjust ingress rates:
how many new events can our service accept per second?
We can modulate or titrate this rate.
Second, we can also work on prioritization.
Give more preference to urgent traffic, for example 2FA codes and alerts,
while throttling, or for that matter even dropping, low-priority events
at a given time, like newsletter-type things.
If your service is not healthy, then you prioritize higher-priority
notifications during high-load times so that your system is actually doing
what it essentially needs to do: deliver what's highest priority at that moment.
This ensures system survival and graceful degradation again, and not just
an abrupt failure where you might end up delivering a newsletter notification
and not delivering a notification from the bank about a suspicious login.
So that's where prioritization and throttling come in handy.
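A tiny sketch of that prioritization logic: under load, only the most urgent traffic is admitted and newsletter-style traffic is shed first. The priority map and thresholds are made up for illustration.

```python
# Lower number = more urgent.
PRIORITY = {"2fa": 0, "security_alert": 0, "transaction": 1, "social": 2, "newsletter": 3}

def admit(notif_type: str, system_healthy: bool, current_load: float) -> bool:
    priority = PRIORITY.get(notif_type, 3)
    if system_healthy:
        return True
    if current_load > 0.9:
        return priority == 0  # severe pressure: only the most urgent traffic
    return priority <= 1      # degraded: shed social and newsletter traffic
```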
So now we have gone through a whole cycle of how we adjust our rate
limits, but what's the most important thing after all this? The feedback loop.
We implement a closed-loop feedback system that refines our rate-limiting decisions
continuously based on outcome metrics:
what were our retry rates, what were the latencies,
what were the event drop counts?
This reduces overload incidents by up to 85% and keeps system
utilization at peak efficiency.
We have gone through a gamut of different architectures, strategies, and techniques,
which brings us to our next highly important topic, which is caching.
In a distributed system, we talk about a multilevel caching strategy.
Caching is what allows our system to meet performance goals of sub-millisecond
latencies for deliveries, in this case even when we are handling billions of events.
But one cache is not enough when you're talking about the scale
of billions of notifications or message deliveries, right?
So that's where a multi-level caching hierarchy comes into the picture.
So first we start off with L1, which is an application-level memory cache.
This is an in-process cache stored directly in your service runtime;
you might have heard of terms like in-memory objects or LRU caches.
It provides sub-microsecond access times, perfect for extremely hot data:
things like notification templates, flags, or personalized user preference toggles.
But it's limited: you don't have unlimited storage there, and it's local
to your service runtime, so it cannot be shared across instances.
That's where our L2 cache, or distributed cache, comes into the picture.
Here we could use technologies like Redis or Memcached.
This cache is shared across nodes and supports
fast reads and writes in, say, one to five milliseconds.
We store data like device tokens,
rate-limiting metadata, or frequently accessed personalization info.
We also set a TTL, broadly known as time to live, and it's done per key
to balance freshness and efficiency.
TTLs can be dynamic, for example, shorter for volatile data and
longer for templates, things that don't change so often.
Moving on from L1 and L2, we move to L3, which is more persistent storage.
That's our database.
You could use something like PostgreSQL, Cosmos DB, Cassandra, or
multiple other options on the market.
Access to these systems generally takes between 10 and
a hundred milliseconds, so we use them only as a fallback,
only when the data is missing in the L1 and L2
caches, the upper layers.
The database holds the source of truth, but our caching ensures that
it only serves a fraction of the traffic.
Remember that our latency and efficiency numbers will only be
met when most of the traffic is served from the L1 and L2 caches.
With this approach, we offload 95% of the reads from the database, dramatically
improving system performance and scalability.
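A minimal read-path sketch for that L1/L2/L3 hierarchy, assuming the redis-py client; query_database is a placeholder for the real source of truth, and the TTL is just an example.

```python
import json
from functools import lru_cache

import redis

r = redis.Redis(host="localhost", port=6379)

def query_database(user_id: str) -> dict:
    ...  # placeholder: PostgreSQL / Cassandra / Cosmos DB lookup (10-100 ms)

@lru_cache(maxsize=10_000)                      # L1: in-process, sub-microsecond
def get_user_prefs(user_id: str) -> dict:
    cached = r.get(f"prefs:{user_id}")          # L2: shared Redis, ~1-5 ms
    if cached is not None:
        return json.loads(cached)
    prefs = query_database(user_id)             # L3: fallback to the source of truth
    r.setex(f"prefs:{user_id}", 300, json.dumps(prefs))  # TTL tuned per key
    return prefs
```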
We've spoken about the first part, efficiency:
how notification systems can actually handle scale and all those facets.
Now the second thing that we spoke about, especially pertaining to
notification systems, is user engagement,
or how we actually optimize user engagement and retention, too.
So it's not just about sending notifications, right?
It's about sending the right message at the right time.
So what's important here is timing.
ML models can analyze user behavior and identify windows of high engagement.
For example, sending messages for food apps
during lunchtime, nothing like it.
Second thing is personalization.
We tailor the content based on user interest, location, and recent actions.
This increases click-through rates instantly and dramatically.
For example, you might have noticed that sometimes if you have a
gift card in your wallet and you are near that store, you see
a popup from your Apple Wallet.
That's because it quietly checks that you are in the area and goes ahead
and pops up that notification for you.
Another thing to keep in mind is frequency management.
We dynamically adjust how often we notify each user.
If someone isn't engaging, we slow down to avoid fatigue.
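A tiny sketch of what that frequency adjustment could look like; the engagement thresholds and caps are illustrative, not numbers from the talk.

```python
def daily_notification_cap(recent_open_rate: float, base_cap: int = 10) -> int:
    if recent_open_rate >= 0.4:
        return base_cap        # engaged user: keep the normal cadence
    if recent_open_rate >= 0.1:
        return base_cap // 2   # lukewarm: slow down
    return 1                   # barely engaging: back off to avoid fatigue
```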
That brings us to our next topic, which is cross-channel coordination.
If a push notification goes unseen for a long time, we follow up with an email,
making sure that we don't keep bombarding the user, of course.
So, having discussed all these topics,
what are our key takeaways from this presentation?
Let's summarize the key ideas here.
Starting off with scale again: design for extreme scale, building for five
to ten times the expected peak using distributed microservices.
Build in resilience: use circuit breakers, retries, and isolation to protect
against catastrophic failures.
Optimize for latency: use caching and sharding to keep delivery times
under 15 milliseconds.
Lastly, maximize engagement: optimize delivery by timing, content, frequency, and channel.
Real-time notification systems aren't just technical challenges;
they're critical to user experience and business success.
So thank you so much for all your time.
I hope you had a practical look into how we build and operate large scale,
real time notification systems.
If you are working on something similar or just wanna geek out on
distributed systems, I'd love to connect.
Please feel free to reach out to me after the session or online.
Thanks again.