Conf42 Machine Learning 2025 - Online

- premiere 5PM GMT

Intelligent Notification Systems at Scale: Combining Machine Learning with High-Performance Architecture

Abstract

Discover how leading platforms process millions of notifications per second while using ML to deliver personalized experiences. Learn architectural patterns and AI strategies that transform basic alerts into engagement drivers with millisecond latency at massive scale.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everybody. I'm Ankita Kamat. I'm a senior software engineer at Microsoft and have spent the last several years designing and building high-scale distributed systems, especially in the notification and messaging domain. I'm thrilled to talk to you today about combining machine learning with high-performance architecture to deliver notifications in real time, efficiently and intelligently.

So let's begin with what the notification landscape looks like today. We are talking about billions of notifications daily, delivered across apps, browsers, IoT, and wearables. Users expect something like one-millisecond response latency, even during peak load: think Black Friday or a large-scale breaking news event. Systems must operate with five nines of availability across regions. To support this, we must decouple services, ensure low-latency paths using memory-based message buses, and handle failovers without state loss. So our problem landscape comes down to speed, scale, and precision.

Briefly touching on the microservice architecture benefits here: we build notification systems as a polyglot microservice architecture broken into components like an event receiver (for example, a Kafka consumer or HTTP trigger), a message router and formatter, channel adapters (push notification, email, or SMS; think of them as triggers or adapters), and analytics and feedback pipelines. Key architectural choices include Kubernetes for orchestration with horizontal pod autoscaling based on message queue length, gRPC or Protobuf APIs for service-to-service communication (primarily due to compact payloads), canary rollouts, and agile DevOps practices to ensure zero-downtime deployments. Each component scales independently, and statelessness is the key for all of us.

Our systems are fully event driven and designed around CQRS, Command Query Responsibility Segregation, together with event sourcing. An enrichment layer adds personalization by pulling user preferences, location, and history from a Redis-backed store and makes them available at runtime or near-runtime (sketched in code at the end of this section). Smart delivery uses ML models to rank the likelihood of engagement: at a given point, is the user more responsive to a push notification, an email, or an SMS? Our ML models rank these channels in priority order based on previous data, and we also select optimal delivery windows, which we'll talk about later in the presentation.

Feedback loops are captured for every event we process: opens, clicks, dismissals, uninstalls. We can store all this plethora of data in something like Azure Data Explorer or BigQuery, where it can be used to train reinforcement learning agents. It comes back into the system as feedback loops and helps train the system holistically.
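Before comparing brokers, here is a minimal sketch, assuming kafka-python and redis-py, of the event receiver plus enrichment layer described above: a consumer reads a notification event and enriches it with user preferences from a Redis-backed store. The topic name, key schema, and field names are illustrative assumptions, not the speaker's actual implementation.

```python
# Minimal event receiver + enrichment sketch (illustrative, not production code).
import json

import redis
from kafka import KafkaConsumer  # pip install kafka-python

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

consumer = KafkaConsumer(
    "notification-events",                      # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for event in consumer:
    payload = event.value                       # e.g. {"user_id": "u123", "type": "promo", ...}
    user_id = payload["user_id"]

    # Enrichment: pull user preferences/history from a Redis-backed store so
    # they are available at (near) runtime for routing and ranking.
    prefs = r.hgetall(f"user:prefs:{user_id}")  # assumed key schema

    enriched = {
        **payload,
        "preferred_channel": prefs.get("preferred_channel", "push"),
        "timezone": prefs.get("timezone", "UTC"),
        "recent_engagement": prefs.get("recent_engagement", "0"),
    }

    # Hand off to the message router / channel adapters (not shown here).
    print("routing enriched event:", enriched)
```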
Here's a brief comparison of the message brokers used in notification systems today. Let's talk about Kafka first: its primary use case is ingesting high-volume events, it's best for durability, and it uses compression to reduce throughput costs; it's pretty well known for that. The second option is RabbitMQ, known for complex routing and acknowledgement semantics, with strong delivery guarantees and dead-letter queues. Redis Streams is known for ephemeral messages and low latency and is great for fire-and-forget pushes: you don't really care what the response is, you just let it go and let the system take its course. A lesser-known one is Apache Pulsar, whose primary use cases are geo-replication and backlog handling, with built-in tiered storage and partitioning. These are often combined, for example Kafka for ingestion and Redis Streams for real-time delivery, so we get the best of both systems.

Next we'll touch on system resilience strategies. We use multiple layers of resiliency here. The primary one is self-healing: readiness and liveness probes are good for restarting pods. Alert fatigue is reduced with anomaly-based alerting; something like Grafana combined with an AI model can detect when our system has certain spikes or dips and take quicker action without human intervention, self-healing the entire system. Circuit breakers, implemented via Polly or Hystrix-like wrappers, define timeouts, retries, and fallback handlers. Another layer is redundancy: active-active deployments using geo-distributed clusters, on something like Azure or AWS. And chaos testing: we run periodic chaos tests using tools like Gremlin to simulate broker outages or CPU starvation, which helps make our system more resilient. Resilience is not optional at scale; your system has to be reliable. Each of these strategies can be used individually or in combination to make the system more reliable as a whole.

Coming to one of the primary topics of this talk: machine learning integration, which makes delivery more intelligent and goes beyond basic notification delivery. Let's focus on the part where the system shifts from being just an infrastructure layer to becoming a truly intelligent layer; this is where machine learning integration transforms the game. Most traditional systems use rule-based targeting: send message X to Y users after Z minutes. It works until it doesn't. Rules break at scale and don't account for user behavior, preference, and context. This is where ML steps in: learning from past behaviors, predicting future preferences, and delivering real-time personalized engagement.

Let's start with smart filtering, which is essentially classifying value. What's the problem? Not every message is meaningful, and users are constantly bombarded, so how do we filter what to send? The solution is binary classification models, for example XGBoost or LightGBM, trained on engagement labels: opened, clicked, dismissed, or ignored. Input features could include the message type (is it a promo, an alert, a transactional message?), the user's context (device OS, region, historical responsiveness), and time of day and channel: does the user historically respond on the phone during evenings and more on the desktop during the daytime? All of this is captured in data, and these are very important input features to your machine learning models. The output is a relevance score per message-user pair, and only messages above a certain threshold make it to delivery. That's a brief explanation of what smart filtering can be: it helps cut low-value noise and reduce notification fatigue for the user.
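As a concrete illustration of the smart filtering idea, here is a minimal sketch of a binary relevance classifier using XGBoost (LightGBM would look very similar). The feature set, toy training data, and the 0.6 threshold are assumptions for illustration, not production values.

```python
# Minimal smart-filtering sketch: score relevance per message-user pair.
import numpy as np
from xgboost import XGBClassifier  # pip install xgboost

# Each row: [is_promo, is_alert, hour_of_day, device_is_mobile, historical_open_rate]
X_train = np.array([
    [1, 0, 20, 1, 0.42],
    [0, 1,  9, 0, 0.81],
    [1, 0, 14, 1, 0.05],
    [0, 0, 22, 1, 0.30],
])
y_train = np.array([1, 1, 0, 0])    # 1 = engaged (open/click), 0 = dismissed/ignored

model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X_train, y_train)

def relevance_score(message_features: list[float]) -> float:
    """Predicted probability that this user engages with this message."""
    return float(model.predict_proba(np.array([message_features]))[0, 1])

candidate = [1, 0, 19, 1, 0.55]
THRESHOLD = 0.6                      # only messages above this make it to delivery
if relevance_score(candidate) >= THRESHOLD:
    print("enqueue for delivery")
else:
    print("suppress: low predicted value")
```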
The second area where machine learning is instrumental is targeted content: personalization at scale. What's the problem here? Generic messages don't convert, but handcrafting content for every single user is impossible. So how do we make targeted content? We use NLG, natural language generation models, either template based or fine-tuned transformers; examples could be T5 or DistilGPT. Let's take an e-commerce platform as an example. "You left this in your cart, have you seen it recently?" comes to you pretty often. "Still thinking about the Nike Airs? They're almost gone." What this does is, one, it reminds you that you left something and moved on, for whatever reason, while adding a personalized tone: it knows what you left in the cart and asks whether you're still thinking about it. The "almost gone" part adds urgency: this deal is going to be gone in the next three hours. Is it really going to be gone? We don't know, but we are personalizing the content for you, so the notification feels more targeted. Personalization inputs can include purchase history, favorite categories, recent session behavior, and social proof triggers: for example, what's trending near you, where you live, what local sales are going on, and what the market is like in that region. This is known to boost engagement by 20 to 50% in production environments.

The next place machine learning comes into the picture is continuous improvement, or feedback loops. User behavior keeps evolving, and one-time model training leads to stale predictions: what is current today will already be stale tomorrow, and users, trust, and the market evolve so fast that you really cannot run on stale notification data. What we implement in this scenario is a closed-loop learning system: every user interaction, whether it's a click, ignore, mute, or dismiss, is logged and used as labeled data. Daily batch jobs retrain the model and update the weights or scores for each feature. New models are deployed in shadow mode, for example an A/B test against current production, to avoid any regression on already learned data. Well-known tooling here includes MLflow or Azure ML pipelines for experiment tracking and deployment, and feature stores like Feast to manage real-time and historical data consistency.
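Here is a minimal closed-loop sketch of the retraining idea just described: interaction events logged as labeled rows, a nightly retrain, and tracking with MLflow before a shadow-mode comparison. The file name, columns, and specific model are assumptions for illustration, not the speaker's pipeline.

```python
# Minimal closed-loop retraining sketch (illustrative).
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def nightly_retrain(labeled_events_path: str = "engagement_events.csv"):
    # Every click / ignore / mute / dismiss is logged as a labeled row.
    df = pd.read_csv(labeled_events_path)
    X = df.drop(columns=["engaged"])
    y = df["engaged"]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    with mlflow.start_run(run_name="nightly-retrain"):
        candidate = GradientBoostingClassifier().fit(X_tr, y_tr)
        auc = roc_auc_score(y_val, candidate.predict_proba(X_val)[:, 1])
        mlflow.log_metric("val_auc", auc)

        # Shadow mode: the candidate scores traffic alongside the current
        # production model; promote it only if it wins the A/B comparison.
        mlflow.sklearn.log_model(candidate, "candidate_model")
        return candidate, auc
```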
Then we move on to multi-objective optimization, which involves timing, channel, and volume beyond just content. We use the ML pipeline to optimize when to send, using time series predictions with something like Prophet; where to send, the channels we spoke about briefly before (is a push notification more appropriate at a given point, or an SMS or an email? we have different channels at our disposal); and how often to send. For example, if a user is continuously dismissing a particular notification, we should deliver that notification at a much lower frequency, which is widely known as adaptive throttling. In some systems we combine these in multi-armed bandit frameworks: each arm is a content-timing-channel combination, and we optimize for a cumulative reward, which is user engagement in our case, while adjusting weights and scores and exploring new strategies. This approach ensures responsiveness and adaptability in a fast-changing world. We also use formats like ONNX in low-latency environments, or TensorFlow Serving for cloud inference, to decide which models to serve and which is more accurate based on our continuously learning models. We use feature gating to make rollouts safe and ensure that incorrect label data does not cause our models or outputs to fail. And we always need to maintain explainability, especially in regulated industries: tools like LIME can help explain model decisions, and precision is very important in our learning experiments. In short, ML doesn't just make notifications smart, it makes them personal, context aware, and self-improving. It's the difference between an ignored ping and a message that genuinely delights your user.

Moving on to adaptive rate limiting. Traditional rate limiting uses fixed thresholds; we need to go adaptive at this point and analyze user data, time of day, location, and historical response windows to predict the likelihood of a user responding to a category, whether it's a promo or a system alert, and make it personalized. On a weekend, a user may be more inclined to respond to a promo than on a working day morning; but this is what the data tells us, this is what our learning models continuously tell us. We throttle using priority queues and a leaky bucket, in combination with ML predictor logic, so rate limits aren't static: they adapt per user over time. This definitely reduces churn and notification fatigue for the user.
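Here is a minimal sketch of adaptive rate limiting along those lines: a per-user token (leaky) bucket whose refill rate is scaled by an ML-predicted engagement probability. The scaling factors and defaults are illustrative assumptions, not tuned values.

```python
# Minimal adaptive rate-limiting sketch (illustrative).
import time

class AdaptiveRateLimiter:
    def __init__(self, base_rate_per_hour: float = 4.0, capacity: float = 4.0):
        self.base_rate = base_rate_per_hour   # default notifications per hour
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.time()

    def _refill(self, engagement_prob: float) -> None:
        # Users predicted to engage get a higher effective rate; users who keep
        # dismissing get throttled down (adaptive throttling).
        rate = self.base_rate * (0.25 + 1.5 * engagement_prob)   # tokens per hour
        elapsed_hours = (time.time() - self.last_refill) / 3600.0
        self.tokens = min(self.capacity, self.tokens + rate * elapsed_hours)
        self.last_refill = time.time()

    def allow(self, engagement_prob: float) -> bool:
        """engagement_prob comes from the relevance model (0.0 to 1.0)."""
        self._refill(engagement_prob)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

limiter = AdaptiveRateLimiter()
print(limiter.allow(engagement_prob=0.8))   # likely True for an engaged user
```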
We have known pretty widely, even in our day-to-day lives, that timing is everything, and that brings us to predictive delivery timing. One of the most overlooked aspects of notifications is when they are delivered. We focus so much on content and targeting that we forget that even the most relevant message, if it lands on the user's channel or device at the wrong time, just gets ignored. This is where timing isn't just delivery optimization, it's an engagement multiplier. Let me walk you through how we analyze, predict, and optimize delivery based on data and machine learning combined.

Step one is behavioral pattern mining. We start by ingesting historical engagement logs: open times, click times, mute and unsubscribe events, each tagged with timestamp, user ID, device, notification category, and channel. Using time series clustering algorithms like K-means with dynamic time warping (DTW) or DBSCAN, we identify user cohorts with similar notification engagement rhythms. We often find patterns like: office workers engage at lunch, around 12 to 1:00 PM; students respond after classes, with 6 to 9:00 PM as the primary window; and parents engage late at night, say 9:00 PM to 11:00 PM. These temporal fingerprints form the foundation of personalized scheduling.

The next step is predictive modeling of what we have captured so far. Having grouped our user cohorts, we can now train time-aware ML models to forecast open probability. The model choice could be gradient boosting or temporal neural networks like LSTMs. The primary input features are local time of day, day of week, the user's device status (for example screen time and app usage), and recent engagement trends. The model outputs a heat map of engagement likelihood scores per hour over the next 24 to 48 hours.

Now that we have predictions, the next important step is intelligent scheduling. Instead of a fixed delivery window, our notification service stores the next best delivery time in the metadata for each message and uses a lightweight scheduler, which could just be a background worker or a cron job, to enqueue messages into Redis Streams or Kafka at the right moment, with a snooze fallback: if the user hasn't interacted after, say, X hours, retry at the second-best predicted time (a small scheduling sketch follows at the end of this section). This allows us to delay non-urgent messages, prioritize system-critical alerts in real time, and avoid clustering messages around high-churn hours.

We have now delivered the notification at the right time, but how do we know it worked? That's where our last step, impact measurement, comes into the picture. We compared predicted versus random delivery windows, and we found that open rates improved by up to 35%, engagement rose by 45%, and uninstall and mute rates dropped by a whopping 18%. This wasn't a one-time experiment; the metrics were gathered over several weeks with different user segments, and we've seen this consistently across experiments in the industry. We track performance using real-time dashboards (Grafana or InfluxDB), daily reports sliced by time buckets and user demographics, and longitudinal studies to verify sustained impact.

What are the key considerations here? All scheduled messages are idempotent and stored with TTLs (time to live) to prevent outdated delivery. Predictive timing works best when combined with adaptive rate limiting. And we need to be aware of notification storms: use a distributed scheduler with jitter to prevent spikes. You don't want to deliver all notifications to a user at the same moment just because they come from independent scenarios, so we add jitter carefully to avoid suddenly increasing notification fatigue. To sum up, great content needs great timing. Predictive delivery ensures your message lands not just where it matters, but when it matters most. It turns your system from reactive to proactive, and that's a game changer in this industry.
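Here is the small scheduling sketch referenced above: given per-hour likelihood scores from the predictive model, pick the next best delivery time, add jitter to avoid notification storms, and enqueue into a Redis sorted set that a background worker drains. Key names and the jitter window are assumptions for illustration.

```python
# Minimal intelligent-scheduling sketch (illustrative).
import json
import random
from datetime import datetime, timedelta, timezone

import redis

r = redis.Redis(decode_responses=True)

def schedule_notification(user_id: str, payload: dict, hourly_scores: dict[int, float]) -> None:
    """hourly_scores: {hours_from_now: predicted_open_probability} over the next 48h."""
    best_offset = max(hourly_scores, key=hourly_scores.get)   # next best delivery hour
    jitter = random.uniform(0, 600)                           # up to 10 minutes of jitter
    deliver_at = datetime.now(timezone.utc) + timedelta(hours=best_offset, seconds=jitter)

    message = json.dumps({"user_id": user_id, "payload": payload,
                          "deliver_at": deliver_at.isoformat()})
    # Sorted set scored by epoch seconds; a background worker pops due messages.
    r.zadd("scheduled_notifications", {message: deliver_at.timestamp()})

def pop_due_messages() -> list[dict]:
    now = datetime.now(timezone.utc).timestamp()
    due = r.zrangebyscore("scheduled_notifications", 0, now)
    if due:
        r.zrem("scheduled_notifications", *due)
    return [json.loads(m) for m in due]
```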
Moving on to multi-level caching strategies in a notification system. When you think of caching, think speed. Our caching strategy generally starts in memory, something like Redis, storing session and personalization data for quick access, for example in under a millisecond. The second level is a distributed cache, something like Azure Cache for Redis with clustering, used for throttling counters and template data. Then there is predictive fetching, where the ML model preloads the most likely templates and images before usage. The last level is persistent blob storage for cold and archival data, used for things like audit trails and retries. We benchmark the cache hit ratio after every release, targeting 90% or greater hit rates for hot paths, and those hot paths are the primary factor here (a minimal read-through sketch appears at the end of this section).

Let's briefly talk about a case study based on an e-commerce platform, a real success story. The client had, say, a 68 percent cart abandonment rate, and the previous solution used batch-based, fixed-schedule emails with open rates around 4%. See, if every night at 10:00 PM you get an email about what you left in your cart, do you really care about it? Mostly people won't. Our solution was real-time eventing for cart abandonment, which we just spoke about: "Hey, are you still interested in this? It's going to be gone soon." An ML model predicted intent to purchase within one to three hours, and messages had personalized subject lines and discount triggers based on past purchase history: you bought this previously, are you still interested? You bought a mattress, do you still want a mattress cover? It really triggers you, and you feel like you need that thing even when you don't know how much you really need it at that point, but the message still lands as a notification for you. What was the impact? A 42% increase in cart recovery, 28% fewer messages sent, and 37% higher click-through rates. Don't those numbers sound promising?

Now let's talk about an implementation roadmap, our playbook for building something similar. One, the microservice foundation: break your system into core services, ingestion, formatting, delivery, and analytics, and use scalable brokers and container-based deployments. Two, add an intelligence layer: start with engagement prediction models, add adaptive rate limiting and dynamic channel ranking on top, and lastly introduce natural language generation for content personalization. Three, and one of the most important things we talk about in machine learning systems, continuous optimization: use A/B tests to evaluate every model change, build automated retraining pipelines using Airflow, MLflow, and Azure ML pipelines, and track long-term LTV (lifetime value), not just CTRs. This is a good recipe, or at least a high-level playbook, for implementing a notification system that covers the important parameters and gotchas of ML integration.
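And here is the minimal multi-level read-through cache sketch referenced in the caching section: an in-process dictionary in front of Redis, falling back to a loader (for example a database call) on a cold miss. TTL values and key names are illustrative assumptions.

```python
# Minimal multi-level read-through cache sketch (illustrative).
import json
import time
from typing import Callable

import redis

r = redis.Redis(decode_responses=True)
_local: dict[str, tuple[float, dict]] = {}   # key -> (expires_at, value)
LOCAL_TTL = 5      # seconds: in-process hot path, sub-millisecond lookups
REDIS_TTL = 300    # seconds: shared distributed cache

def get_user_profile(user_id: str, loader: Callable[[str], dict]) -> dict:
    key = f"profile:{user_id}"

    # L1: in-memory
    hit = _local.get(key)
    if hit and hit[0] > time.time():
        return hit[1]

    # L2: distributed Redis cache
    cached = r.get(key)
    if cached is not None:
        value = json.loads(cached)
    else:
        # L3: cold path -- load from the system of record and warm both levels
        value = loader(user_id)
        r.setex(key, REDIS_TTL, json.dumps(value))

    _local[key] = (time.time() + LOCAL_TTL, value)
    return value
```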
Thank you so much for your time. Building a scalable, intelligent notification system is a marriage of real-time infrastructure and ML intelligence; it's challenging but incredibly rewarding, especially when users experience your craft. I'd love to hear about the challenges you're facing with engagement or delivery systems. You can find me on LinkedIn, or just let me know whatever questions you have. Thank you so much for your time.
...

Ankita Kamat

Senior Software Engineer @ Microsoft



