Conf42 MLOps 2025 - Online

- premiere 5PM GMT

Scaling AI: Fault-Tolerant GenAI Solutions for Millions

Abstract

Building GenAI products at scale is hard! Serving millions of users demands fault tolerance, developer experience (DX), and flexibility. This talk explains how we built a scalable infrastructure that integrates multiple AI vendors and switches between them as needed, and will prepare you for scaling GenAI products.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to my talk. Today I'll share with you how we at Elementor created a fault-tolerant solution on top of our GenAI products that serves millions of users. It wasn't an easy ride, and today I want to share a lot of our journey and the lessons we've learned.

But first, I'd like to start with a real story. Several months ago we launched a new product called the AI Site Planner. It's a really cool product that helps professional web creators plan the website they're going to build for their client. It starts with an agent that conducts a sort of AI-based interview with the web creator in order to understand the goals, the purpose, and the interesting elements of the website that's going to be implemented and developed for the client. Then it has a layer of sitemaps and wireframes to start off with a really good draft of the website. It's a very cool product, and it was supposed to launch on the 23rd of March this year. However, just two days before the launch, OpenAI, our main vendor that we had been building the product on top of, locked our account and we were not able to make any calls. I bet many of you think that back then this is how we looked: running around in hysteria, incident mode, war zone. But actually it looked much more like this. Yes, it wasn't pleasant, but thanks to the infrastructure we had developed, it wasn't that big of a deal, and we easily mitigated the situation and had a successful launch.

Before I dive even deeper, I have to start with an apology. I'm going to talk briefly about two companies, Microsoft and OpenAI, and in this specific talk I won't be their fan. If any of the listeners are from Microsoft or OpenAI, I'm sorry in advance.

Hi everyone. My name is Dennis. I lead AI at Elementor. I've been in this industry for the last 17 years, seven of which I've been practicing AI, way before AI was cool. I live and breathe this topic and have a lot of passion for it. I try out new tools and technologies, read the papers, and always try to find new things that can improve my life and the product I'm building.

So today: how we've developed an infrastructure that is able to handle one million AI requests, how our solution is resilient enough to handle failures as severe as one of our main AI vendors locking our account, and what we're doing in order to continuously improve. Let's start.

A few words about Elementor. Elementor is the number one web platform for building websites in the world. It's used by professionals, and there are almost 19 million websites built with Elementor, even more than 19 million by now. It started off as a drag-and-drop tool that helps web creators easily build beautiful web pages and websites, but in the last two and a half years it has also added more and more AI tools, and I'm proud to say that Elementor is one of the pioneers of AI products for large-scale applications. Because you know what BC stands for, right? Before ChatGPT, of course. ChatGPT was launched in November 2022.
Just a little more than four months later, Elementor launched its first AI product. We had several tools right from the start. We allowed our users to generate text and content for their websites, generate and edit images using AI, and even write some sophisticated code using AI, for example for animations. Even before ChatGPT and OpenAI had memory, we invented what we called the AI context, where users could upload additional data about their site, their business, and their tone of voice, and we would incorporate that into every AI request. And lastly, we have building tools that can predict the next layout and help web creators be more efficient and more creative in their work. Since then, more than 2 million users have experienced Elementor's AI products, we have more than 40 AI-based features, and we exchange around 10 billion AI tokens monthly.

The first topic I want to talk about today is how we're handling one million daily AI requests, and to understand that we need to go back to the beginning. When we just started, like many companies, we went straight to OpenAI. But those of you who were early adopters might remember that GPT-3.5 Turbo's rate limit was only 60 requests per minute. Think about that number: back in the day they allowed just 60 requests in one minute, something that is simply not feasible for enterprise software. We also need to remember that two and a half years ago OpenAI wasn't that big a deal, so we, as a big company, wanted to find a vendor that might be more suitable for us. This is why we moved to Microsoft Azure. Back then, their limits and support were much better than OpenAI's.

It worked really well for almost a year, until we built Elementor Copilot. Similar to how GitHub Copilot predicts the next line of code, Elementor Copilot predicts the next layout. Let's say you have the title of your page; Elementor Copilot understands that it's an About page and automatically suggests a full layout, with text based on the AI context, that is suitable for that section. From there the web creator can just click on the next section, and the next, and the next, and basically build pages much faster. To support that, however, Microsoft Azure's rate limits simply were not enough. Even though their rate limits were higher than what OpenAI provided, they were not enough to support the Copilot abilities. Think about it for a second: we're taking a lot of context, and we're firing events on almost every interaction the user makes, because we need to understand when exactly the user wants to create the next layout. Sometimes they're just moving the mouse around, sometimes they're editing something, and sometimes they really do need a new block. So we were firing a lot of events.

Besides the rate limit issue, the more we adopted AI abilities, the harder it became to work with Azure. Most of you who work with Microsoft Azure know that every time a new model is released by OpenAI, it becomes available in Azure pretty fast. However, Azure has the concept of a region, which does not exist in OpenAI, and it means that not every model is available in every region. That becomes very complicated for developers, because they need to know which region they're in, and in every region they need to create a dedicated resource where they deploy the model.
And then the quota, the rate limit, is shared across all the deployments within that region. Developers need to know which region and which deployment they're working against. And what happens when we need to add a new model that's not available in a given region? We need to replicate the whole architecture from scratch. It's very tedious and time-consuming for all the teams. And in case you need to increase your rate limit, there's an actual form in the Azure Portal that you need to fill in: understand exactly what you're asking for, what type of filter you want and in which region, then submit the request and wait for approval. Shortly speaking, it's very complicated and tough.

So we were looking for something much easier, both in terms of rate limits and developer experience, and we wanted to test new models the minute they become available. So we moved back to OpenAI. Around that time their rate limits were actually higher than Azure's. From a developer perspective there is nothing to compare: in several lines of code you're already fully integrated with OpenAI, you don't know about regions and you don't care, you can use any model you want, and you can switch models in a matter of seconds.

But that still wasn't enough for Copilot. So we went to the most reliable source, ChatGPT, to see how we could get even higher rate limits from OpenAI. Basically, OpenAI suggests two options. The second is to apply for higher rate limits by engaging with OpenAI on an enterprise tier, but that is actually very expensive and requires us as a company to commit to a certain spend that we were not ready to commit to yet. This is why we went with the first approach: having multiple API keys for multiple organizations and creating a sort of round robin over them, so that we have more available organizations and therefore a higher overall rate limit.

So let's see how our infrastructure, the OpenAI multi-organization proxy, works. We created several organizations within OpenAI and listed them, along with their information, inside our codebase. Then we extended the OpenAI SDK so it's very easy for developers to work with this infrastructure without actually knowing or caring which organization exactly they're working with. Instead of calling the OpenAI service and going straight to chat completion, we first call getSDK, our extension that pulls the next organization, and this is how we can work with multiple organizations and extend our rate limit almost indefinitely. That worked beautifully.
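For illustration, here is a minimal sketch of what such a round-robin organization pool can look like, assuming the OpenAI Node SDK. The `OrgPool` class, the `getSdk` method, and the environment variable names are placeholders, not Elementor's actual implementation.

```typescript
// Minimal sketch of a round-robin pool over several OpenAI organizations.
import OpenAI from "openai";

interface OrgConfig {
  name: string;
  apiKey: string;
  organization: string;
}

class OrgPool {
  private clients: OpenAI[];
  private cursor = 0;

  constructor(orgs: OrgConfig[]) {
    // One SDK client per organization, each with its own key and rate limit.
    this.clients = orgs.map(
      (o) => new OpenAI({ apiKey: o.apiKey, organization: o.organization })
    );
  }

  // Return the next client in round-robin order, so requests are spread
  // evenly and the effective rate limit is roughly the sum of all orgs.
  getSdk(): OpenAI {
    const client = this.clients[this.cursor];
    this.cursor = (this.cursor + 1) % this.clients.length;
    return client;
  }
}

// Usage: instead of a single global client, callers ask the pool for one.
const pool = new OrgPool([
  { name: "org-a", apiKey: process.env.OPENAI_KEY_A!, organization: "org-a-id" },
  { name: "org-b", apiKey: process.env.OPENAI_KEY_B!, organization: "org-b-id" },
]);

export async function complete(prompt: string) {
  const response = await pool.getSdk().chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
}
```

The point of wrapping the SDK rather than exposing the organization list is exactly what the talk describes: application code keeps calling "chat completion" as before and never needs to know which organization served the request.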
It even helped us with our next product, the Site Planner. As I mentioned at the beginning, it starts with an agent that interviews the web creator, asking guiding questions to understand the goals of the website, and then it prepares a draft. Since we're working with professionals, we first want to create a sitemap, a bird's-eye view of all the different pages, their content, the paragraphs, and the goal of every page and every paragraph of that website, because we're building something professional for professionals, so we need to make sure we create a decent design first. Only then do we move to the wireframe, that first beautiful draft that shows both the web creator and their client how the website will look. You can see that it happens really fast, and actually in parallel: there are hundreds of AI requests happening here in order to generate all this content, because the layout, the content, and actually the whole structure are created by AI. So we had to have really high limits.

So, some takeaways. Once you start developing your product and you're getting into production, you first need to understand that rate limits exist and you need to track them. You need to know your usage, because if you have a product and you start exposing it to the world and more and more users are using it, that's awesome; you just need to make sure you're not getting too close to a rate limit. Otherwise it can be very dangerous, and your users won't like your service being unavailable. And in case you are getting close for whatever reason, just remember that you can easily create more organizations and use a shared pool around them.
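As a concrete way to act on the "know your usage" advice: the OpenAI API returns rate-limit headers on the responses you're already receiving, so you can watch your headroom without extra calls. A minimal sketch, assuming plain fetch against the Chat Completions endpoint; the 20% threshold and the console warning are placeholders for real alerting.

```typescript
// Sketch: read OpenAI's documented rate-limit response headers on every call
// you already make, so you always know how much headroom is left.
const OPENAI_URL = "https://api.openai.com/v1/chat/completions";

export async function chatWithHeadroomCheck(apiKey: string, prompt: string) {
  const res = await fetch(OPENAI_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });

  // Headers documented by OpenAI: request and token budgets per window.
  const remaining = Number(res.headers.get("x-ratelimit-remaining-requests"));
  const limit = Number(res.headers.get("x-ratelimit-limit-requests"));
  const remainingTokens = res.headers.get("x-ratelimit-remaining-tokens");

  // Alert (Slack, metrics, etc.) when less than 20% of the request budget remains.
  if (limit > 0 && remaining / limit < 0.2) {
    console.warn(
      `Rate limit headroom low: ${remaining}/${limit} requests left, ` +
        `${remainingTokens} tokens left in this window`
    );
  }

  return (await res.json()).choices[0].message.content;
}
```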
All right. The second topic is how we're building a resilient AI solution. So far we understood how to handle the load and the rate limits, but rate limits and load are not the only concern. I'll quote one of the best engineers, Mike Tyson: everyone has a plan until they get punched in the mouth. You would expect that if you're working with multiple OpenAI organizations, your usage and performance would look the same across them, because what's the difference between one organization and another? Let's assume all of them are on the same tier; you would assume the charts look alike. However, that's not exactly the case. If you actually track duration, error rate, or any other performance metric, it looks more like this: most of the time they have the same response, the same request duration, the same error rate, but sometimes they diverge, and one organization becomes much slower or even returns failures while the others don't. And if you look at the status pages of Anthropic, OpenAI, or any other AI provider you like, you'll see they're not as green as you would expect. And by the way, you don't need major outages like the red marks; even the yellow ones, which mean some functionality isn't working or there are delays or timeouts, are something your product will suffer from. You need to figure out how to manage this, especially since not all organizations in OpenAI usually suffer in the same way during those outages.

And as we know, our users expect our system to have one hundred percent uptime. They don't care about OpenAI, they don't care about outages, and they don't care about rate limits. They want the product they're using, a product they've purchased, to work all the time. So in a scenario where we have two organizations and one of them is diverging and getting worse while the green one is still operating at a sufficient level, what we want is for our system not to use the yellow organization, but only the green one. And this is why we added another layer to our multi-org solution in order to make it more resilient.

So let's say we have a client request. It goes to our infra, which has several organizations, and let's say one of those organizations hits a certain threshold: for example, it hits its own rate limit, or returns a 500 error code, or fails in some other way within a certain timeframe, and we see that it doesn't happen just once but three times. What our solution is able to do is take this organization, remove it from the total pool of organizations, mark it as, let's say, a sick organization, and work only with the healthy ones. In this scenario the client never experiences the failures that happen only in that organization, because if we kept it in the pool, then out of four organizations, one out of every four requests would get an error, and we don't want that; we want as many successful requests as possible. Only after a while do we try to reconnect to this organization, and if it works fine, we keep it and restore it to normal.

This pattern is called a circuit breaker. It's nothing new; it actually comes from the electricity world. Basically, we extended the previous infrastructure by adding two parts. One is a Redis cache. We needed a distributed cache because, like many companies, we run on Kubernetes: we have multiple pods, and we don't want every pod to have to hit the same threshold on its own. If we know that organization two, for example, should be taken out, we don't want it to hit the threshold in one pod, then the second pod, then the third before it's removed everywhere; the whole lifecycle of the organization pool is managed inside Redis. And the circuit breaker is the management layer that operates on top of all the organizations and decides which one should be removed from the pool of healthy organizations, and when it should be returned or not.

We connected it to our Slack, so we can actually see it happen. It's very nice, because think of all those times when a certain resource wasn't behaving well and a developer had to manually deploy some code or make changes to remove it, then set a reminder for the next day to bring it back. Here everything is done automatically. When we wake up, we just see this sort of log in Slack: we hit a certain threshold, the circuit breaker opened and removed a certain organization, and after a while it returned it, and no engineer had to do anything. It just works in a very resilient way.
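Here is a minimal sketch of a Redis-backed circuit breaker over the organization pool, assuming the ioredis client. Thresholds, key names, and cool-down values are illustrative, and for brevity the "sick" state simply expires instead of going through the half-open probe described above.

```typescript
// Sketch of a Redis-backed circuit breaker shared by all Kubernetes pods.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

const FAILURE_THRESHOLD = 3;   // failures within the window that trip the breaker
const FAILURE_WINDOW_SEC = 60; // sliding window for counting failures
const COOL_DOWN_SEC = 300;     // how long a "sick" org stays out of the pool

// Record a failure for an org; trip the breaker when the threshold is hit.
export async function reportFailure(org: string): Promise<void> {
  const key = `org:${org}:failures`;
  const failures = await redis.incr(key);
  if (failures === 1) {
    await redis.expire(key, FAILURE_WINDOW_SEC);
  }
  if (failures >= FAILURE_THRESHOLD) {
    // Mark the org as unhealthy for all pods at once (Redis is shared).
    // In the setup described in the talk, a Slack notification fires here too.
    await redis.set(`org:${org}:open`, "1", "EX", COOL_DOWN_SEC);
  }
}

// The pool filters out orgs whose breaker is currently open.
export async function isHealthy(org: string): Promise<boolean> {
  return (await redis.exists(`org:${org}:open`)) === 0;
}
```

Keeping the state in Redis rather than in each pod's memory is what makes the decision global: one pod detecting three failures is enough to remove the organization for every pod.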
So let's go back to the story I started this session with. Two days before the global launch of the Site Planner, OpenAI locked our account, which meant we couldn't pay for additional tokens, for additional credits, and we didn't have enough credits, so the system could not work with OpenAI. This is how it looked: we wanted to top up the balance, but for some reason we got this error. And obviously, when we tried to contact support, no one answered, and we just didn't have time to wait, so we had to act. This is why, from the start, we knew we didn't want to be in a vendor-lock position. We didn't want to be in a position where, if one of those AI vendors fails, our product doesn't work. This is why, from day one, we also worked with Claude.

Now, there are some differences. The migration from OpenAI to Claude is not as seamless as you might think, because there are functionalities that OpenAI supports and Claude doesn't, like structured output, the validations differ, and the system prompts are not the same. You can't take a system prompt you use with OpenAI, use it with Claude, and get the same results; by definition they are two different models, so you have to adjust the system prompts a bit depending on the provider you're working with. And although this sounds complicated, it is necessary, because if you're building a product for millions of users, you cannot rely on a single vendor that might be unavailable for any reason. You just don't want to be in a position of telling your users that your product isn't working because OpenAI isn't working. You've got to have a fallback.

In our case we have automation, so we wanted a fallback in just one click. What that means is that we adjusted our system prompts per provider, so we know that if we move from OpenAI to Claude, the whole system works the same. What changes is just that instead of calling the OpenAI SDK, we call the other SDK, in this case Claude's, and it then attaches the correct system prompts and additional instructions to make the responses as close to OpenAI's as possible. Yes, it's not ideal, but it's much better than being completely down. So right now we actually have this. It doesn't happen automatically, because it's a very rare scenario, but we do have a single key in our system that, once changed, moves the entire system to work against Claude instead of OpenAI, and reverting it works exactly the same way. And we are right now working on open-sourcing this framework for everyone, so stay tuned.

All right, takeaways from building a resilient AI solution. Everyone needs to remember, and it has nothing to do with AI, that failure in software development is inevitable. It's not a matter of if, it's a matter of when. Using a proven design pattern like a circuit breaker is a great solution for being fault tolerant. Whether you need more than that really depends on the scenario, on the level of support and service you want to provide your clients, and on the necessity. If it's a product that runs once a day, it's fine. But if you're working globally and you want to give your users sufficient coverage around the clock, then I would consider an additional vendor and doing whatever is necessary to make it easy to switch between them: easy to switch from OpenAI to Claude, to Grok, to whatever vendor you want, without adding code or changing things in the middle of the night. And another tip, it's not written here, but it's to test, test, test. It's one thing to build it and have it ready, but every now and then, practice: try switching the provider and verify that everything works, that the most important scenarios pass, and that the system stays operational. Because what usually happens with fallbacks is that they get left behind, and then there's a real difference between how the system works with the main provider and how it works with the fallback, which is usually not as good.
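To make the one-click switch concrete, here is a minimal sketch of what a single-flag provider switch can look like, assuming the official OpenAI and Anthropic Node SDKs. The flag name, system prompts, and model names are illustrative placeholders, not the framework Elementor is open-sourcing.

```typescript
// Sketch: one flag flips the whole system from OpenAI to Claude, with the
// system prompt tuned per provider.
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

type Provider = "openai" | "anthropic";

// In production this would live in remote config, so the switch needs no deploy.
const ACTIVE_PROVIDER: Provider =
  (process.env.AI_PROVIDER as Provider) ?? "openai";

// The same prompt gives different results on different models, so each
// provider gets its own adjusted system prompt.
const SYSTEM_PROMPTS: Record<Provider, string> = {
  openai: "You are a helpful website-content assistant. Respond in JSON.",
  anthropic:
    "You are a helpful website-content assistant. Respond only with valid JSON and nothing else.",
};

const openai = new OpenAI();
const anthropic = new Anthropic();

export async function generate(userPrompt: string): Promise<string> {
  if (ACTIVE_PROVIDER === "anthropic") {
    const msg = await anthropic.messages.create({
      model: "claude-3-5-sonnet-latest",
      max_tokens: 1024,
      system: SYSTEM_PROMPTS.anthropic,
      messages: [{ role: "user", content: userPrompt }],
    });
    // Claude returns a list of content blocks; take the first text block.
    const block = msg.content[0];
    return block.type === "text" ? block.text : "";
  }

  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: SYSTEM_PROMPTS.openai },
      { role: "user", content: userPrompt },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```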
All right, the last subject I want to talk about today is how AI solutions need to continuously improve. Earlier we said that this is what our users expect: for our system to be one hundred percent up and always working. But actually they expect more than that: not only that the system is up, but that everything works great, exactly the way they expect it to. But how do we even know we're doing a good job? How do we know our system is doing what our users want?

I bet many of you right now are thinking about evals and evaluation in general, and you're correct: the way to understand whether an AI system is working correctly is evaluation. When we're talking about text generation, it's pretty straightforward. For example, back in the day we started evaluating our different results in the text area. When our users wanted to change the text of a button widget, for example, we suddenly saw that the AI was generating huge responses. You can see here it's a whole paragraph on a button; it makes no sense for that type of text to be on a button. Once we understood that such scenarios exist, we added evaluations to ensure that the text the AI generates for a button widget, for example, differs in length from the text for a heading or a full paragraph.

For text it's pretty straightforward: you put text into the LLM, it produces text out, and that's it. But how do you evaluate images? It's not only that a user asks for an image of a dog and you check that you see a dog. Does the dog look okay? Does it have four legs and just one head? Is it actually walking, or is it running? It's not only about functionality; there's also a certain amount of taste involved. Someone might look at a picture and say it's a good picture, while someone else was envisioning something completely different. So it's hard to evaluate images, and it's much harder to evaluate the whole structure of a website or a page, because a user can ask for an About Me page, get seven different sections, and still have a different idea of how that page should be structured. So how should we evaluate this type of interaction?

In order to do that, we had to move away from traditional evals, and we defined a success metric we called insert rate. Basically, for many of our features the user enters a prompt and then sees a preview of a certain result, and only if the user actually uses it, for example inserts the image, do we mark that specific interaction as successful. If, for example, the user clicks generate again, or even closes the page, we know the interaction ended with an empty result, and we mark it as inserted = false. We keep all of this data in a dedicated database, where we store the user's actual input; the enhanced prompt (as AI engineers we enhance prompts and provide additional context, and we wanted everything listed in a very clear way); the full prompt with the system prompts, basically everything that eventually goes to the LLM; the result that comes back; and the indication of whether it was inserted or not. All of this information is stored for every interaction inside our database.
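As a sketch of what such a per-interaction record might look like: the field names and the in-memory store below are illustrative; in practice this would be a real database table.

```typescript
// Sketch of the record behind the "insert rate" metric.
interface AiInteraction {
  id: string;
  feature: string;        // e.g. "image-generation", "copilot-layout"
  userInput: string;      // the raw prompt the user typed
  enhancedPrompt: string; // after adding context (site info, tone of voice)
  fullPrompt: string;     // system prompt plus everything sent to the LLM
  result: string;         // what the model returned
  inserted: boolean;      // true only if the user actually used the result
  createdAt: Date;
}

// In-memory stand-in for the dedicated database described in the talk.
const interactions: AiInteraction[] = [];

export function recordInteraction(record: AiInteraction): void {
  interactions.push(record);
}

// Insert rate = successful interactions / all interactions, per feature.
export function insertRate(feature: string): number {
  const rows = interactions.filter((i) => i.feature === feature);
  if (rows.length === 0) return 0;
  const inserted = rows.filter((i) => i.inserted).length;
  return inserted / rows.length;
}
```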
On top of that, we started with just manually going over the results. We would fetch the data for a day, a week, a month, and try to understand what was wrong with the interactions that were not inserted. Then we asked ChatGPT for help: we would provide this information, anonymized, to ChatGPT and ask it whether it sees any similarities. Eventually we developed our own clustering job. There's an article about it right here, there's a QR code, and I invite you to scan it and read how we developed these offline, AI-powered jobs that go through the data, create different clusters, and find anomalies: the requests that share some common ground and failed.

A great example I love to share is something our AI job noticed: several image-generation requests were not being inserted, and once the AI analyzed them, it suggested that all of the requests mentioning a transparent background were getting rejected. The reason was very simple: the image-generation model we were using didn't support transparent backgrounds, so all the images came back with a full background, users didn't want that, and they rejected the results. So the AI, once it scanned the data, clustered the requests correctly, understood the pattern, and even suggested the reason for the failure. The solution, by the way, was very simple: no, we didn't change the provider. We just added a hint, so that whenever we saw the user typing something about "transparent" in the input, we automatically told them, hey, we're not supporting transparent images at the moment. And that's it: we aligned expectations with the users, and they were much happier.

Here you can see that basically every week we get automatic reports in Slack with all this clustering information, with different suggestions and insights into potential failure reasons. And yes, as I mentioned, it found real issues: the transparent backgrounds I just described, the text length on buttons I mentioned before, even that our Japanese understanding was not as good as we thought it was, and many more examples that you just wouldn't catch manually, especially in a large-scale application. So you need some sort of automation: persist the data with all the details, scan it on a recurring basis, and use AI to get insights.

And these are exactly the takeaways: persist the data, define a success metric, measure that success metric, and pick a way to evaluate success. Always start manually, always start simple, and only once you've fine-tuned the metric and the data and everything works, move to more automatic tooling. And leverage AI for your needs.
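For illustration, a minimal sketch of an offline clustering job along these lines, assuming the OpenAI Node SDK; the prompt wording, sample size, and the way the summary would be posted to Slack are placeholders.

```typescript
// Sketch of a weekly offline job: take the non-inserted interactions,
// anonymize and sample them, and ask an LLM to cluster them and suggest
// likely failure reasons.
import OpenAI from "openai";

const openai = new OpenAI();

interface FailedInteraction {
  feature: string;
  userInput: string;
  result: string;
}

export async function clusterFailures(
  failed: FailedInteraction[]
): Promise<string> {
  // Keep the sample small and trimmed before sending it to the model.
  const sample = failed.slice(0, 200).map((f) => ({
    feature: f.feature,
    input: f.userInput,
    output: f.result.slice(0, 500),
  }));

  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "You analyze failed AI interactions. Group them into clusters of " +
          "similar failures and suggest a likely reason for each cluster.",
      },
      { role: "user", content: JSON.stringify(sample) },
    ],
  });

  // This summary is what would be posted to Slack every week.
  return completion.choices[0].message.content ?? "";
}
```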
To summarize this session: first, not every system needs to be fully resilient and support huge rate limits. Identify the need for scalability first, then design the solution. If it's rate limits, we've talked about adding more organizations, or maybe changing to a different model or a different provider with a larger context window or a larger rate limit, but in many cases it'll be just adding more resources and then controlling them in a healthy and resilient layer, using something like a circuit breaker to identify when a certain organization isn't good enough, move it away, and work only with the good ones. And finally, establishing a continuous improvement mechanism will help your product and your AI solution get better.

I hope you enjoyed this session. It was a pleasure speaking with you today. May the AI be with you. Right here you can scan this barcode to get the presentation for this talk, along with other presentations I've shared in the past, with recordings, materials, and the blog posts I mentioned in this talk and in others. It's all in one place, so it's easy for you to check, and my contact details are there as well. So if you want to chat more about AI, feel free to reach out.
...

Dennis Nerush

Director of AI Engineering @ Elementor



