Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody.
A warm welcome to you all to this Conf42 JavaScript 2025 conference.
It's my real pleasure to be here with you today.
I'm here to talk about using AI to build a smart rate limiting system, one that is scalable and cost efficient as well.
Let me ask you a question.
How many of you, when building an API, reach for rate limiting as your first line of defense?
All of us, right?
Almost all of us.
It's a fundamental practice.
We use it to protect our systems from abuse, ensure stability, manage cost, and prevent denial of service attacks, because anything that is important attracts thieves.
We also know that as APIs become more important, they attract users with malicious intent who want to either steal data or cut down the incoming business.
But here's the critical challenge we face now: attackers have become smarter and the threats have evolved and become more sophisticated.
Our primary defense mechanism has not.
We are relying on traditional static rate limiting systems in a dynamic, intelligent world of attacks.
This isn't a failure of concept, but rather a missing evolution in step with the threats.
Today we are going to explore why these traditional methods are no longer sufficient, and how they are actively costing companies millions while damaging user experience, all while creating a false sense of security.
Most importantly, we are going to discuss a new path forward: one that fuses the essential tool of rate limiting with artificial intelligence to build APIs that are not just secure, but also remarkably scalable and, crucially, cost efficient.
Okay.
With that said, let me move on to the next slide.
Let's begin with the fundamental truth of the current situation, right?
APIs form the heart of the modern web world.
They power and touch multiple aspects of our lives: IoT devices, our mobile phones and applications, shopping, and much more.
But this critical infrastructure is mostly protected by static rate limiting, a defense strategy that, in terms of age, might look like an ancient fossil when compared to current sophisticated cyber attacks.
The idea of static rate limiting, setting a fixed threshold like 500 requests per minute per user, does not work or prevent attacks anymore, because, let's be honest, people have become smart and so have those cyber attackers.
On the one hand, if you set the limit too low to protect against attacks, you end up blocking your best customers.
Imagine a scenario where loyal users during a flash sale face issues accessing the portal they are trying to use, because guess what?
They got the HTTP 429 Too Many Requests error.
The data shows these instances aren't rare: on average, 41% of legitimate traffic gets blocked by overly aggressive static rate limiting rules.
But on the other hand, if you set the limit too high to avoid blocking users, you might be opening the floodgates to abuse and misuse.
The infrastructure costs might spiral out of budget and become more inefficient, and you are left with a system that is vulnerable to the very attacks you were meant to stop to begin with.
This isn't just an inconvenience.
This is a direct hit to the bottom line, leading to millions in lost revenue and wasted cloud spend, and opening a path for sophisticated hackers to get in.
The core of the problem is that static thresholds are blind.
They are not intelligent or flexible enough to adapt to changing scenarios, or rather the changing context.
Let's diagnose the illness in the current context: static rate limiting.
Why is this 30-year-old paradigm failing us in 2025?
It boils down to three critical flaws.
First, rigid thresholds.
A fixed limit cannot distinguish between a good traffic spike and an actual attack.
A successful product launch going viral on Hacker News looks identical to a DDoS attack from a static rate limiter's point of view, because both generate a massive surge in traffic.
One is the dream scenario and the other one is your nightmare.
Second, a complete lack of context awareness.
Static rules ignore everything that makes a user who they are.
They don't care if the user has authenticated successfully for the past two years, connecting from their office network.
They don't care if the traffic flows in a logical sequence of API calls, like getting the products, viewing the products, and adding to the cart.
Under a static, non-intelligent system, this legitimate user looks the same as a scripted attacker from a data center in a foreign country hammering your login endpoint.
And third, the result of these flaws, which is nothing but sky high operational cost.
Because companies can't trust their rate limiters to be smart, they are forced to over-provision infrastructure.
They pay for enough servers, database capacity, and cloud resources to handle the worst-case DDoS scenario, even though 99% of the time those resources sit idle, burning money.
Even if the scaling is dynamic on the backend, it is very likely the system scales up to cater to a coordinated attack, not only responding to the attack, but also adding to the computational cost of the system.
Those are the three ways traditional rate limiting fails.
Now, I want to zoom in on the most potent, modern-day threat that exploits these weaknesses: distributed denial of service, or rather DDoS.
When many of us think of DDoS, we imagine a massive, high-volume attack, a tidal wave of traffic that crashes and causes failures on the servers.
But this landscape has changed.
Today, the most insidious and common attacks are low-and-slow, application-layer DDoS attacks.
These attacks don't try to bring down your front door with raw power, as the name denial of service suggests.
Instead, they pick the lock of the front door.
They target your API endpoints directly, like your login, your search, or your checkout.
And that is the most expensive part of your application to run.
And here's the most genius and most terrifying part of these attacks: they are designed to be stealthy.
It is not brute force; it is the work of a lock-picking, invisible thief.
A botnet of thousands of compromised devices will each send a few requests per minute, staying carefully just below your static rate limit so as not to get caught individually.
Each IP address looks like a slightly active but legitimate user, one that is not bombarding the setup.
But collectively, they consume all of your database connections, exhaust your server CPU, and can rack up massive cloud compute bills, all while your rate limiter gives them a green light because they look legit.
This is the ultimate demonstration of why counting requests is no longer enough to prevent a DDoS attack.
We need to understand their behavior and come up with a solution that handles such sneaky attacks.
Let's talk about the answer to this insidious situation, right?
We need to replace the static, blindfolded rate limiter with a dynamic, intelligent one.
We need to move from a simply counting rate limiter to one which truly understands the situation and takes decisions dynamically about who should be allowed access to the APIs.
This is where we introduce the AI powered framework.
Instead of a single number, our system analyzes a rich mashup of 27 different behavioral features in real time.
It doesn't just look at the number of requests.
It tries to understand: what is the pattern of these requests?
Who is making them?
What is their intent?
What is the sequence of requests from each single user trying to do?
By understanding these patterns, the smart setup can dynamically adapt.
It can confidently allow a surge of legitimate users during a marketing campaign while identifying and throttling a sophisticated DDoS attack happening in parallel.
It provides robust security without sacrificing the user experience and the business's bottom line, all while keeping reputation, cost, and data security intact and keeping the attackers in check.
Now, since the smart system does so much effectively and tactfully, you might be thinking, oh, this smart system sounds complex.
But the whole process can be broken down into a clean four step cycle: collect, engineer, train, and deploy.
Now, let me talk about collecting the right data.
It all starts with the data.
The principle of garbage in, garbage out has never proven to be more true.
We are not just collecting logs, we are gathering the digital DNA markers of each API interaction.
We instrument our API gateways and load balancers to capture a rich tapestry of over 14 critical data points, those digital DNA markers, which we group into several key categories.
In the category of request metadata and patterns, which is one of the most important ones, we capture the velocity and the rhythm of requests.
For example, let's talk about frequency and burst patterns, which try to see whether the traffic is a steady stream or a burst of violent, machine-gun-like requests.
A human browsing a website has natural pauses, whereas a script does not.
Another example in the request metadata and patterns category is timing and inter-request delays.
We measure the milliseconds between calls.
Real users have variable delays between requests.
However, automated attacks often operate with metronomic time differences which are humanly impossible.
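To make this concrete, here's a minimal sketch of capturing inter-request delays at the gateway, assuming an Express-based gateway and an in-memory store; a real deployment would use the gateway's own logging and a shared store such as Redis.

```typescript
import express, { Request, Response, NextFunction } from "express";

// Rolling record of when each client last called us and the gaps between calls.
const lastSeen = new Map<string, number>();
const interRequestDelays = new Map<string, number[]>();

// Gateway middleware: record the millisecond gap between consecutive requests per client.
function recordTiming(req: Request, _res: Response, next: NextFunction) {
  const clientKey = req.ip ?? "unknown"; // in production this could be a user ID or API key
  const now = Date.now();
  const previous = lastSeen.get(clientKey);

  if (previous !== undefined) {
    const delays = interRequestDelays.get(clientKey) ?? [];
    delays.push(now - previous);
    if (delays.length > 100) delays.shift(); // keep a bounded window
    interRequestDelays.set(clientKey, delays);
  }
  lastSeen.set(clientKey, now);
  next();
}

const app = express();
app.use(recordTiming);
```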
Another category that we measure is sequential data and behavioral intent.
This is where we start to understand the intent behind the API calls being made.
One parameter we use to understand this is endpoint access sequences.
What we do here is track the journey, and not just a single request.
A normal user might follow a logical path of finding the products, selecting a product, viewing it, and adding it to the cart.
An attacker who does not actually have the intent of ordering and buying products might just hammer a single endpoint like login, or random endpoints like search or export, in quick succession.
This sequence of API calls, and even the missing API calls, tells a story about the intent.
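A minimal sketch of that journey tracking, assuming a bounded in-memory buffer of the last few endpoints per user, could look like this:

```typescript
// Keep the last N endpoints each user touched, so later features can reason about the journey.
const JOURNEY_LENGTH = 10;
const journeys = new Map<string, string[]>();

function trackEndpoint(userId: string, endpoint: string): string[] {
  const journey = journeys.get(userId) ?? [];
  journey.push(endpoint);
  if (journey.length > JOURNEY_LENGTH) journey.shift();
  journeys.set(userId, journey);
  return journey;
}

// Example: a human-like shopping path versus a bot hammering one endpoint.
["/products", "/products/42", "/cart/add"].forEach(e => trackEndpoint("user-1", e));
["/login", "/login", "/login", "/login"].forEach(e => trackEndpoint("bot-7", e));
```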
Another parameter that we use in this category is action outcomes.
Here we don't just look at the requests, but also at what happened after the request.
A series of HTTP 404 Not Found errors might indicate a scanner.
A rapid sequence of HTTP 401 Unauthorized responses followed by a single 200 OK could be a credential stuffing attack in progress.
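A rough sketch of that last signal, assuming we keep a small window of recent status codes per client, might be as simple as:

```typescript
// Flag a possible credential stuffing pattern: many 401s followed by a 200 on the same client.
function looksLikeCredentialStuffing(recentStatusCodes: number[], threshold = 10): boolean {
  const last = recentStatusCodes[recentStatusCodes.length - 1];
  const priorUnauthorized = recentStatusCodes
    .slice(0, -1)
    .filter(code => code === 401).length;
  return last === 200 && priorUnauthorized >= threshold;
}

// Example: ten failed logins and then a success is suspicious.
console.log(looksLikeCredentialStuffing([...Array(10).fill(401), 200])); // true
```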
Let's look at another category here, which is authentication and session context.
This layer adds identity to the behavior.
For example, talking about login success and failure rates: a couple of failures during user login is actually very normal, but 20 failures in a minute from the same IP or IP subnet, even if the API calls are below the static rate limit, is a different story altogether and is a huge red flag.
Another example in this category is session token usage, or a few questions to ask about the session: how is the session being used?
Is a newly created token immediately making high value transactions?
Basically, is a token that was just generated suddenly very active?
Is a token from one geographic location suddenly being used from another an hour later?
These form critical trust signals, which may point to leaked passwords or even bigger problems.
The next category we'll be talking about is system health and resource consumption.
Here we listen to what the API itself is telling us.
An example here would be the API response times and the error rates.
A DDoS attack or a resource-intensive scraping bot will often cause elevated response times and a spike in 500-series server errors on the endpoints that they're targeting.
This resource strain is a crucial symptom of a sophisticated attack that a static rate limiter completely misses.
The last but not the least category we will talk about is contextual signals.
What we are looking for here is the who and the from where.
We use the geographical location and network source and find answers to questions like: is the user who normally logs in from London suddenly making requests from a data center in a different country, or is it a bot?
We correlate with IP reputation scores and known VPN and proxy networks to verify the validity of the requests.
We also use device fingerprints and user agents to understand the context, answering questions like: is the request coming from a standard browser with a consistent set of headers, or from a headless client with a suspicious or missing user agent?
This helps us identify the trustworthiness of the request.
This rich, multidimensional data forms the raw material; it forms the foundation from which our system can begin to extract the subtle nuances that differentiate the real user from the attacker.
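To picture what one captured interaction might look like across those categories, here's a rough sketch of a record shape; the field names are illustrative, not a fixed schema:

```typescript
// One captured API interaction, grouped along the categories described above.
interface ApiInteractionRecord {
  // Request metadata and patterns
  timestamp: number;
  endpoint: string;
  method: string;
  interRequestDelayMs: number | null;

  // Sequential data and behavioral intent
  recentEndpoints: string[];
  responseStatus: number;

  // Authentication and session context
  userId: string | null;
  sessionAgeSeconds: number;
  recentFailedLogins: number;

  // System health and resource consumption
  responseTimeMs: number;

  // Contextual signals
  sourceIp: string;
  geoCountry: string;
  ipReputationScore: number; // e.g. 0 (bad) to 100 (clean), from a threat intelligence feed
  userAgent: string | null;
}
```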
Now moving on: till now we have just been talking about the data, which is just a list of facts.
Step two is engineering, which is the art and science of transforming these facts into meaningful, insightful signals.
And these signals are the features our AI model will actually understand.
Think of it this way: raw logs are useless to a machine learning model.
We must teach it the language of behavior, deriving understanding from the logs.
This is where we create a derived layer of 27 separate behavioral features, which we group into four powerful categories that help analyze and infer the complete story of the actual API requests.
Let's break down these categories with concrete examples of what we built.
The first category is temporal patterns, which gives us the rhythm of requests.
First, we analyze time.
We move beyond a simple count to understand the pattern of requests.
We create features like request-per-second volatility and collect data on it.
This is the statistical variance in the customer's request rate: humans are volatile and unpredictable, whereas bots are often metronomically consistent and persistent.
We also calculate a burst score to identify short, high intensity explosions of traffic, which form the hallmarks of automated scripts.
We even look at time-of-day anomalies, where we identify whether the requests are happening at the user's typical time in their usual time zone, or at 3:00 AM in a different time zone altogether, from a location that they have never been to.
The goal is to answer a critical question: is the traffic coming in a smooth, human-like rhythm, or in throttled, robotic, consistent bursts, even if they are smaller bursts?
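A minimal sketch of those first two features, assuming simple stand-in formulas (variance of the inter-request gaps, and the share of sub-100ms gaps), could be:

```typescript
// Variance of the gaps between requests: near zero for metronomic bots, large for humans.
function requestRateVolatility(delaysMs: number[]): number {
  if (delaysMs.length < 2) return 0;
  const mean = delaysMs.reduce((a, b) => a + b, 0) / delaysMs.length;
  const variance =
    delaysMs.reduce((sum, d) => sum + (d - mean) ** 2, 0) / delaysMs.length;
  return variance;
}

// Fraction of requests arriving in sub-100ms bursts: high values suggest scripted traffic.
function burstScore(delaysMs: number[], burstThresholdMs = 100): number {
  if (delaysMs.length === 0) return 0;
  const bursty = delaysMs.filter(d => d < burstThresholdMs).length;
  return bursty / delaysMs.length;
}
```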
Another category that we need to talk about is access behavior, which gives us the narrative of intent.
Here we analyze what is being accessed and how.
This is about understanding the user's point of view.
Here we calculate the endpoint entropy, a measure of randomness in the API endpoints being accessed.
A real user has low entropy, following a predictable path like home, search, and product page, and possibly adding to the cart.
A scanner, by contrast, has high entropy and high disorder, jumping randomly between different APIs like login, admin, and export, or hitting the search API continuously with a multitude of inputs while ignoring the other routine APIs that an end user would use.
We also create a suspicious sequence flag that triggers when the user's path matches a known malicious pattern.
These patterns are based on industry- or sector-specific data and scan patterns that are identified and updated on an ongoing basis.
An example of this is accessing a login endpoint immediately after trying to hit a sensitive data export endpoint, or immediately accessing a search endpoint right after a search and an export in the previous call.
The insight we are engineering here is whether the user is browsing a diverse set of endpoints like a human, or laser-focused on a few expensive APIs in an illogical, inhuman sequence and speed.
The speed is not calculated here, but you get the point.
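A minimal sketch of the endpoint entropy feature, assuming Shannon entropy over a user's recent endpoints, looks like this:

```typescript
// Shannon entropy of recently accessed endpoints: low for focused, predictable journeys,
// high for random scanning across many endpoints.
function endpointEntropy(recentEndpoints: string[]): number {
  if (recentEndpoints.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const e of recentEndpoints) counts.set(e, (counts.get(e) ?? 0) + 1);

  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / recentEndpoints.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// A shopper revisiting a few pages scores lower than a scanner touching many distinct endpoints.
console.log(endpointEntropy(["/home", "/search", "/product/1", "/product/1", "/cart"]));
console.log(endpointEntropy(["/admin", "/export", "/login", "/search?q=a", "/search?q=b"]));
```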
The next category we will talk about is network signals, which help us understand the context of the connection.
Here we look at the origin of the request.
The network doesn't lie.
We build a geographic impossibility score, calculating the physical possibility of a user moving from New York to London between two subsequent requests.
This is a massive red flag.
We incorporate real-time IP reputation scores from threat intelligence feeds and calculate a source anomaly index, a measure of how unusual the user's network source is compared to their history and the general user base.
This category also incorporates external context, looking for geographic and infrastructural anomalies that static systems completely ignore.
The last category that we have to talk about is the user context, which highlights the power of the baseline.
Finally, and most powerfully, we analyze identity.
We don't treat every user as a stranger.
We derive a session confidence score based on the age of the session, the diversity of actions taken, and its geographic stability.
We calculate failed login velocity, but we normalize it by the user's historical baseline.
A user who never fails a login suddenly failing 20 times in a minute or two is a much bigger alert than a known clumsy typist who might make unsuccessful attempts, or someone like me who forgets passwords often, takes time to remember, and has failed logins; in that case the historical pattern shows that this user has such issues.
This is the crown jewel of all the features so far.
We are continuously asking here: how does the current session compare to the user's 90-day baseline?
Has their behavior suddenly and suspiciously changed?
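A minimal sketch of that baseline-normalized feature, assuming a simple ratio against the user's historical failure rate, might look like:

```typescript
// How unusual is the current failed-login rate for this particular user?
// A score near 1 means "normal for them"; large values mean a sharp deviation from their baseline.
function failedLoginAnomaly(
  failuresLastMinute: number,
  baselineFailuresPerMinute: number, // e.g. the user's 90-day average
): number {
  const epsilon = 0.05; // avoid dividing by zero for users who never fail
  return failuresLastMinute / (baselineFailuresPerMinute + epsilon);
}

// A user who almost never fails suddenly failing 20 times is a far louder alarm
// than a clumsy typist whose baseline already includes a failure or two per minute.
console.log(failedLoginAnomaly(20, 0));   // 400
console.log(failedLoginAnomaly(20, 1.5)); // ~12.9
```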
Now, with the 27 powerful behavioral features in our data set ready, we move to step three: train.
This is a very important step.
This is where we build the intelligent brain of the entire system.
We use a decision tree ensemble model.
We can think of this as a committee of many simple, interpretable subject matter experts, each most relevant to their own features.
Together, these experts work to make a highly accurate and robust decision.
We specifically use a combination of random forest for robust, generalized classification and gradient boosting for precision tuning on difficult edge cases.
Now, you might be wondering what these terms mean.
Don't worry.
You don't need a data science degree to get the core idea.
Let me explain them here.
First, we have what we'll call the committee of experts, or specialists.
Imagine we had ten different security experts, or rather a hundred different security experts, and each one of them is a specialist in a different area.
This is our random forest, and we don't give them all the same information.
One expert only looks at the timing and rhythm of requests, checking whether it's a smooth flow or robotic bursts.
Another expert only focuses on geographic location, looking at things like login patterns to spot a login from London just two minutes after logging in from New York, and so on.
A third specializes in the sequence of pages or API visits the user makes, to tell whether it looks like a natural browsing journey or a random, suspicious scan.
Another one could analyze only the user's past behavior: is this action normal or abnormal for this user?
We give each expert a slightly different view of the request, focused on where their specialty lies.
We ask them all the same question: does this look like an attack?
Each expert makes their own decision based on their unique lens and the data that is given to them.
In the end, we just take the majority vote.
This method is incredibly robust and reliable because it does not rely on any single piece of evidence.
It's hard to fool a whole committee of specialists who are all looking at different angles and different clues.
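Purely as a toy sketch of the committee idea, not the trained production model, the voting logic can be pictured like this, assuming each expert is a simple rule over one slice of the features:

```typescript
// Each "expert" inspects one slice of the features and gives a verdict.
type Expert = (features: Record<string, number>) => boolean; // true = looks like an attack

const timingExpert: Expert = f => f.burstScore > 0.8 && f.rateVolatility < 10;
const geoExpert: Expert = f => f.impossibleTravel === 1 || f.ipReputationScore < 20;
const journeyExpert: Expert = f => f.endpointEntropy > 2.5;
const baselineExpert: Expert = f => f.failedLoginAnomaly > 10;

// Majority vote across the committee, the intuition behind a random forest.
function committeeVerdict(features: Record<string, number>, experts: Expert[]): boolean {
  const attackVotes = experts.filter(expert => expert(features)).length;
  return attackVotes > experts.length / 2;
}

const experts = [timingExpert, geoExpert, journeyExpert, baselineExpert];
console.log(committeeVerdict(
  { burstScore: 0.9, rateVolatility: 2, impossibleTravel: 1, ipReputationScore: 10,
    endpointEntropy: 3.1, failedLoginAnomaly: 1 },
  experts,
)); // true: three of four experts vote "attack"
```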
But sometimes attacks can be so clever and subtle that they slip past this general vote, and that's why a second technique comes into the picture, which we call our master investigator.
This is our gradient boosting model.
This detective examines the entire case file, learning from mistakes and becoming exceptionally skilled at connecting subtle dots to catch the most elusive, sophisticated, and sneaky threats as well.
Now, why did we choose this specific combination?
There are three critical advantages, so let us look at them.
First, we achieve both breadth and depth in accuracy.
The committee provides reliable baseline detection while the investigator handles sophisticated edge cases.
This hybrid approach is the key to our 97.5% model accuracy.
Second, the division of labor enables real-time performance despite using two models.
Our architecture makes predictions in microseconds, which is crucial for live API traffic.
And third, we gain enhanced interpretability.
The committee, the random forest, shows us which features are most influential across all experts, while the investigator reveals the sequential logic of complex cases.
This multifaceted understanding is vital for trust and debugging.
The training process is continuous.
We feed the model historical traffic data labeled as legitimate or malicious, use cross validation to ensure generalization, and maintain automated pipelines that regularly retrain on new production data, so the model becomes more sophisticated and stays up to date.
This allows the system to adapt to novel attacks and evolving user behavior, creating a learning system rather than a static rule-based one, which functionally overcomes the limitations of the traditional rate limiting we discussed earlier.
The final step over here is deploy.
How do we put this intelligent brain into production without creating a bottleneck or a single point of failure?
The answer is a cloud native, serverless architecture.
Here is what it looks like in practice: the model is packaged and deployed as a serverless function, for example an AWS Lambda or an Azure Function.
Now, why is the serverless approach so transformative?
Let me explain its core advantages for our use case.
It is elastic and event driven: unlike traditional servers that you have to provision and pay for 24/7, a serverless function scales to zero.
It only wakes up when an API request comes in.
When you see traffic spikes, like during a product launch or a marketing event, the cloud provider automatically spins up thousands of parallel instances in milliseconds.
There is no capacity planning and no manual intervention.
The system scales precisely as the demand increases.
Now, the next advantage is the granular pay-per-use cost model.
This is a game changer for cost efficiency: you are not billed for idle time.
We saw earlier how the older infrastructure had extra cost because we were maintaining the infrastructure despite not having enough requests.
Over here, you are only charged for the milliseconds of compute time it takes to execute the model inference for each request.
During quiet periods when there is no traffic, your cost for this API component drops to absolute zero, while your system remains ready to spring into action.
Now, talking about built-in fault tolerance and high availability: cloud providers run serverless functions across multiple availability zones by default.
This means if an entire data center in one zone has an outage, the platform automatically routes traffic and executes the function in another zone.
You get a highly resilient system without having to architect the redundancy yourself.
Next is reduced operational overhead.
We completely eliminate the need for you to manage servers, operating systems, and runtime environments.
There are no patches to apply, no servers to reboot, no clusters to monitor.
This no-ops model allows your team to focus on building features and not managing the infrastructure.
This serverless function integrates seamlessly into your existing API gateway.
Every incoming request that needs inspection has its features calculated and is then sent to this function for real-time inference.
The gateway then enforces the decision of allowing, delaying, denying, or blocking based on the model's confidence score.
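As a minimal sketch of that integration, assuming an AWS Lambda behind the gateway and a placeholder in place of the real model inference, the handler could look roughly like this:

```typescript
import type { APIGatewayProxyEventV2, APIGatewayProxyResultV2 } from "aws-lambda";

// Stand-in for the trained ensemble: in production this would load the model artifact
// and run inference; here it just echoes a hint so the sketch stays self-contained.
async function scoreRequest(features: Record<string, number>): Promise<number> {
  return features.suspicionHint ?? 0; // placeholder logic only
}

// Serverless inference endpoint: receives pre-computed behavioral features from the
// API gateway and returns a threat confidence score (0-100) for the gateway to act on.
export async function handler(
  event: APIGatewayProxyEventV2,
): Promise<APIGatewayProxyResultV2> {
  const features = JSON.parse(event.body ?? "{}") as Record<string, number>;
  const score = await scoreRequest(features);

  return {
    statusCode: 200,
    body: JSON.stringify({ threatConfidence: score }),
  };
}
```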
Now, you might wonder, how does this intelligent system not become a bottleneck?
The answer lies in the powerful synergy between the model choice and the inherent strengths of serverless architecture.
First, our optimized decision tree ensemble is purpose built for this environment.
It delivers the high accuracy we need with microsecond inference speed and crystal clear interpretability, ensuring it can take lightning fast decisions without slowing down your API.
Second, because we deploy it as a serverless function, the cloud platform provides automatic, near infinite scaling and true parallel processing.
If 10,000 requests arrive at once, it spins up 10,000 parallel instances.
There's no queue because there's no single point to queue at.
We complete this with robust operational practices.
We implement zero-downtime deployments using a blue-green deployment model, allowing us to update the AI model seamlessly and roll back instantly if needed, all without users noticing.
This powerful combination gives us three massive advantages: infinite scalability, true cost efficiency, and built-in availability, making a rigorous architecture that guarantees a system which is not only intelligent, but also performant, reliable, and cost effective.
This four step cycle, collecting rich data, engineering intelligent features, training a powerful model, and deploying with cloud native agility, is how we transform the blunt instrument of traditional rate limiting into a precise, adaptive, scalable security system.
Now, before talking about advanced strategies, let's address something that sometimes makes people wonder: does this actually work?
These are not just lab results or trial reports; these are metrics that come from real world production deployments across AWS, Azure, and Google Cloud.
We see 96% detection accuracy for malicious threats, and we also see a 68% reduction in false positives.
That represents millions of legitimate users who are no longer accidentally blocked, which improves the user experience for that many users by throttling precisely and only when needed.
Companies have achieved over 27% in infrastructure cost savings.
The model itself operates with 97% accuracy.
This is an efficient and tangible bottom line improvement.
Now, looking at an advanced strategy like progressive throttling, a key innovation that drives down false positives by changing how we respond.
We reject the binary block-or-allow paradigm.
Instead, we use progressive throttling.
What happens here is the AI assigns a threat confidence score from zero to one hundred, and based on the score we apply a graduated response.
In case of a low score, which means a low threat, the request gets full speed access, with no impact to real users.
In case of a medium score, we introduce slight incremental delays: a script will be crippled by a 500 millisecond delay, but a human user might not even notice it.
In case of a high score, we enforce much stricter rate limits.
The bands are roughly: low from zero to 30, medium from 31 to 70, and high from 71 to 99.
In case of a confirmed threat, we completely block the request.
This graceful slow-down and degradation is what allows us to ensure security without being hostile and completely shutting users down.
The system doesn't stand still.
It continuously segments users, monitors model performance, and incorporates new data to retrain and improve.
It's a living, learning system that adapts to your unique traffic patterns and the evolving tactics of attackers.
If you are convinced it's time to move rate limiting out of the dark ages, here's a practical, phased roadmap to get you there.
Assessment phase: audit your current rate limiting and identify the pain points.
What are your false positive rates?
What does your attack traffic look like?
Establish a baseline.
Then do the infrastructure setup: configure the data collection pipelines, set up the cloud resources and monitoring dashboards.
This is the foundation.
Model development: engineer your features and train your initial models on historical data.
Validate their performance against your baseline.
This is crucial.
Next, deploy to a single API or a small percentage of traffic.
Monitor everything closely and tune the model based on real world feedback.
Full rollout: once you're confident, scale across all your APIs, implement the continuous learning loop, and start optimizing for cost efficiency.
Now, I would like to wrap up this discussion.
If you have any questions, you can reach me on LinkedIn by searching for my name; there's only one person you will get.
You can connect with me to discuss this or anything else, my friends.
Thank you again.
Okay.