Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, and welcome to my talk today.
I'll share with you how we at Elementor have created a fault-tolerant solution on top of our Gen AI products that serves millions of users.
It wasn't an easy ride, and today I want to share with you a lot of our journey and the lessons that we've learned.
But first I'd like to start with a real story.
So several months ago we launched a new product called the AI Site Planner. It's a really cool product. It helps professional web creators plan the website that they're going to build for their client.
It starts with an agent that conducts a sort of AI-based interview with the web creator in order to understand the goals, the purposes, and the interesting elements of the website that's going to be implemented and developed for the client.
Then it has a layer of sitemaps and wireframes to start off with a really good draft for the website.
It's a very cool product, and it was supposed to be launched on the 23rd of March this year. However, just two days before the launch, OpenAI, basically our main vendor that we had been building the product on top of, locked our account, and we were not able to make any calls.
I bet that many of you right now think that back then this is how we looked: running around in hysteric mode, incident mode, war zone. But actually it looked much more like this.
Yes, it wasn't pleasant, but thanks to the infrastructure that we've developed, it wasn't that big of a deal, and we easily mitigated the situation and had a successful launch.
So before I dive even deeper, I have to start with an apology.
I'm going to talk briefly about two companies, Microsoft and OpenAI, and in this specific talk, I won't be their fan. If any of the listeners are from Microsoft or OpenAI, I'm sorry in advance.
Hi everyone.
My name is Dennis.
I lead AI at Elementor.
I've been in this industry for the last 17 years, seven of which I've been practicing AI, way before AI was cool.
I live and breathe this topic. I have a lot of passion about it. I try out the new tools and technologies, play around with all the different tools, read the papers, and always try to find new technologies that can improve my life and the product that I'm building.
So today: how we've developed an infrastructure that is able to handle one million daily AI requests, how our solution is resilient enough to handle failures as severe as one of our main AI vendors locking up our account, and what we are doing in order to continuously improve.
Let's start.
So before we dive in, a few words about Elementor.
Elementor is the number one platform for building websites in the world. It's being used by professionals, and there are almost 19 million, even more than 19 million, websites built with it.
It started off as a drag-and-drop tool that helps web creators easily build beautiful web pages and websites. But in the last two and a half years, it has also started to add more and more AI tools, and I'm proud to say that Elementor is one of the pioneers of AI products for large-scale applications.
Because you know what BC stands for, right? Before ChatGPT, of course.
So ChatGPT was launched in November 2022, and just a little bit more than four months later, Elementor launched our first AI product.
We had several tools right at the start. We allowed our users to generate texts, generate content for their websites, generate and edit images using AI, even write some sophisticated code using AI, like for animations.
Even before ChatGPT and OpenAI had memory, we invented what we called the AI Context, where users could upload some additional data about their site, the business, and their tone and voice, and we would incorporate that in any AI request.
And lastly, we have some building tools that can predict the next layout and help web creators be more efficient and more creative in their work.
Since then, we've had more than 2 million users experiencing Elementor's AI products. We have more than 40 AI-based features, and we exchange around 10 billion AI tokens monthly.
So the first topic that I want to talk to you about today is how we're handling 1 million daily AI requests, and to understand that, we need to go back to the beginning.
Because when we just started, like many companies, we started straight with OpenAI. But those of you who are early adopters might remember that the GPT-3.5 Turbo rate limit was only 60 requests per minute.
Think about this number. Back in the day, they allowed just 60 requests in one minute, something that is just not feasible for enterprise software.
We also need to remember that two and a half years ago, OpenAI wasn't that big of a deal. So we, as a big company, wanted to find a vendor that might be more suitable for us. This is why we moved to Microsoft Azure.
Back in the day, their rate limits and support were much better than OpenAI's. So it worked really well for almost a year, until we built Elementor Copilot. Similar to how GitHub Copilot is able to predict the next line of code, Elementor Copilot is able to predict the next layout.
So let's say you have a title on your page, or the next element. Elementor Copilot understands whether it's an about page and automatically suggests a full layout, with text based on the AI Context that is suitable for this section. And from there, the web creator can just click on the next section, and the next section, and the next section, and basically build pages much faster.
To support that, however, the Microsoft Azure rate limits simply were not enough, because back in the day, even though their rate limits were higher than what OpenAI provided, they were not enough to support the Copilot abilities.
Think about it for a second. We're taking a lot of context. We are firing events on almost every interaction a user makes, because we need to understand when exactly the user wants to create the next layout. Sometimes we're just moving the mouse around, sometimes we're editing some stuff, and sometimes we do need to create a new block. So we've been firing a lot of events.
And besides the rate limit thing, the more we used and adopted AI abilities, the harder it became to work with Azure. Because most of you who are working with Microsoft Azure know that every time a new model is released by OpenAI, it becomes available in Azure pretty fast.
However, in Azure there is the concept of a region, which does not exist in OpenAI, and it means that not every model is available in every region.
And that becomes a very complicated thing for developers, because they need to remember that they have a region, and in every region they need to create a dedicated resource where they will deploy the model. And then the quota, the rate limit, is across all those deployments within that region. Developers need to know against which region and which deployment they're working.
But what happens when we need to add a new model that's not available in a given region? We need to replicate the whole architecture from the start. It's very tedious and time-consuming for all the teams.
And in case you need to increase your rate limit, this is an actual form from the Azure Portal that you need to fill in to do that: understand what it is you're asking for, what you want and in what region, then submit the request and get approval for additional quota. In short, it's very complicated and tough.
So we were looking for something much easier, both in terms of rate limits and developer experience, and we wanted to try and test new models the minute they become available.
So we moved back to OpenAI. Around that time, their rate limits were actually higher than Azure's.
From a developer perspective, there is nothing to compare: in several lines of code you're already fully integrated with OpenAI. You don't know about regions and you don't care, you can use any model you want, and you can switch the model in a matter of seconds.
But that still wasn't enough for the Copilot. So we went to the most reliable source, ChatGPT, to see how we could get even higher rate limits from OpenAI.
Basically, OpenAI suggests two options. The second one is to apply for higher rate limits, which basically means engaging with OpenAI and starting an enterprise tier, but that is actually very expensive, and it requires us as a company to commit to a certain spend that we were not ready to commit to yet.
And this is why we went with the first approach: having multiple API keys for multiple organizations and creating some sort of round robin around them, so that we basically have more available organizations, and therefore our rate limit becomes higher.
So let's see how our infrastructure, the OpenAI multi-organization proxy, works. We've created several organizations within OpenAI and listed them, along with their information, inside our code base.
Then we extended the OpenAI SDK so that it would be very easy for developers to work with this infrastructure without actually understanding, knowing, or caring about exactly which organization they're working with. What we're doing is simple: instead of calling the OpenAI service and going straight to chat completion, we first call getSDK, which is our extension that basically pulls the next organization. This is how we can work with multiple organizations and extend our rate limit basically indefinitely.
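To make that concrete, here is a minimal sketch in Python of what such a round-robin extension could look like. The organization list, the keys, and the get_sdk helper are illustrative assumptions, not our actual implementation; the only real APIs used are the official OpenAI SDK calls.

```python
# A minimal sketch (not the production code) of the multi-organization
# round-robin idea on top of the official OpenAI Python SDK.
import itertools
from openai import OpenAI

# Hypothetical list of organizations and their API keys, kept in config.
ORGS = [
    {"name": "org-a", "api_key": "sk-...a"},
    {"name": "org-b", "api_key": "sk-...b"},
    {"name": "org-c", "api_key": "sk-...c"},
]

# One client per organization; itertools.cycle gives a simple round robin.
_clients = itertools.cycle([OpenAI(api_key=org["api_key"]) for org in ORGS])

def get_sdk() -> OpenAI:
    """Return the next OpenAI client in the pool."""
    return next(_clients)

# Callers use get_sdk() instead of a single global client:
response = get_sdk().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```

The point of the wrapper is that feature code never changes: it keeps calling chat completions as usual, and the pool quietly spreads the traffic across organizations.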
That worked beautifully.
It even helped us with our next product, the Site Planner. As I mentioned at the beginning, it starts with an agent that interviews the web creator, asks different guiding questions to understand the goal of the website, and then prepares a draft.
Since we're working with professionals, we first want to create a sitemap, the bird's-eye view of all the different pages, their content, the paragraphs, and the goal of every page and every paragraph for that website. Because we're building something professional for professionals, we need to make sure we create a decent design, and only then can we move to the wireframe, this first beautiful draft that will show both the web creator and the client how the website will look.
You can see that it happens really fast, and actually in parallel there are hundreds of AI requests happening here in order to generate all this content, because both the layout and the content, and actually the whole structure, are created by AI. So we had to have really high limits.
So, some takeaways. Once you start developing your product and you're getting into production, you first need to understand that rate limits exist and you need to track them. You need to know your usage, because if you have a product and you start to expose it to the world, more and more users will be using it. This is awesome. You just need to make sure that you're not getting too close to a rate limit. Otherwise it might be very dangerous, and your users might not like your service not being available. And in case you are getting closer, for whatever reason, just remember that you can easily create more organizations and use a shared pool around them.
All right, the second topic is how we're building a resilient AI solution.
So we understood how we could handle the load and handle the rate limit. But rate limit and load are not the only concern, and I would quote one of the best engineers, Mike Tyson: everyone has a plan until they get punched in the mouth.
So if you're working with multiple OpenAI organizations, you would expect your usage and your performance to look similar across them, because what's the difference between one organization and another? Let's assume that all of them are under the same tier. You would assume that this is how the chart should look. However, this is not exactly the case.
If you actually track duration, error rate, or any other performance metric, it looks more like this. You would see that, yes, most of the time they have the same responses, the same request duration, the same error rate. But sometimes they diverge, and one organization becomes much slower or even returns failures while the others don't.
And if you actually look at the status pages of Anthropic, OpenAI, or any other AI provider you'd like, you'll see that they're not as green as you would expect. And by the way, you don't have to have major outages like the red spots; even the yellow ones, which mean that some functionality is not working, or that there are some delays or timeouts, are something your product will suffer from. You need to see how you can manage this, especially since not all organizations in OpenAI usually suffer in the same way during those outages.
As we know, our users actually expect our system to have one hundred percent uptime. They don't care about OpenAI, they don't care about outages, and they don't care about rate limits. They want the product that they're using, even the product that they've purchased, to work all the time.
So in this scenario, where we have two different organizations and one of them is diverging and becoming worse while the green one is still operating at a sufficient level, what we would want is for our system not to use the yellow organization, but to actually use only the green one. And this is why we've added an additional layer to our multi-organization solution in order to make it more resilient.
So let's say we have a client request. It goes to our infra, which has several organizations, and let's say that one of those organizations hits a certain threshold. For example, it hits its own rate limit, or for some reason it returns a 500 error code, within a certain timeframe. And we see that it's not something that just happens once; it happens three times.
So what our solution is able to do is take this organization, remove it from the total pool of organizations, mark it as, let's say, a sick organization, and work only with the healthy ones. In this scenario, the client will never experience the failures that happen only in this organization. Because if we kept it there, that would mean that out of four organizations, one out of every four requests would get an error. And we don't want that; we want to have as many successful requests as possible. Only after a while will we try to reconnect to this organization again, and in case it works fine, we can keep it and restore it to normal.
This pattern is called a circuit breaker. It's not something new; it actually comes from the electricity world. Basically, what we've done is extend the previous infrastructure solution by adding two additional parts.
One is a Redis cache. We needed a distributed cache because, like many companies, we work in Kubernetes and we have different pods, and we don't want every pod to hit the same threshold separately. Because if we know that organization two, for example, is supposed to be out, we don't want it to hit a threshold in one pod, and then in the second pod, and then have the third pod hit the threshold too, before it's removed from all the pods. So the whole lifecycle management of the organization pool is handled inside Redis.
And the circuit breaker is basically the management system that operates on top of all the organizations and decides which organization should be removed from the pool of healthy organizations, when it should be returned, and when it shouldn't be returned.
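Here is a minimal sketch of that idea, assuming a Redis-backed breaker with a three-failure threshold and a fixed cooldown; the key names, thresholds, and helper functions are assumptions for illustration, not the actual system.

```python
# A minimal sketch (illustrative, not Elementor's code) of a Redis-backed
# circuit breaker around the organization pool: three failures within a
# window trip the breaker, and the org is skipped until a cooldown passes.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

FAILURE_THRESHOLD = 3   # failures within the window that trip the breaker
FAILURE_WINDOW = 60     # seconds over which failures are counted
COOLDOWN = 300          # seconds an org stays out before we retry it

def record_failure(org: str) -> None:
    """Count a failure (rate limit, 5xx, timeout) and trip the breaker if needed."""
    failures = r.incr(f"breaker:{org}:failures")
    if failures == 1:
        r.expire(f"breaker:{org}:failures", FAILURE_WINDOW)
    if failures >= FAILURE_THRESHOLD:
        # Mark the org as "sick"; the key expires on its own after the cooldown,
        # which is what lets traffic flow back automatically later.
        r.set(f"breaker:{org}:open", "1", ex=COOLDOWN)

def is_healthy(org: str) -> bool:
    """An org is usable as long as its breaker key does not exist."""
    return r.get(f"breaker:{org}:open") is None

def pick_org(orgs: list[str]) -> str:
    """Return the first healthy org; all pods share the same Redis state."""
    for org in orgs:
        if is_healthy(org):
            return org
    raise RuntimeError("No healthy OpenAI organization available")
```

Because the state lives in Redis rather than in each pod's memory, one pod tripping the breaker is enough for every pod to stop routing traffic to that organization.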
We connected it to our Slack, so we can actually see it. It's very nice, because think about all those times where you had a certain resource that wasn't behaving well and a developer had to manually deploy some code or make some changes in order to remove it, and then set a reminder for the next day to bring it back. Here, everything is done automatically. When we wake up, we just see this sort of log in our Slack: we hit a certain threshold, the circuit breaker opens, it removes a certain organization, and after a while it returns it, and no engineer had to do anything. It just works in a very resilient way.
So let's go back to the story I started this session with. Two days before the global launch of the Site Planner, OpenAI locked our account, which meant we couldn't pay for additional tokens, for additional credits, and we didn't have enough credits.
This is how it looked like.
We wanted to top, top up the balance, but for some reason we
got this exception, this error.
And obviously when we tried to contact someone from support, no one answered
and we just didn't have the time to wait for others, so we had to act.
From the start, we knew that we didn't want to be in a vendor lock-in position. We didn't want to be in a position where, if one of those AI vendors failed, our product would not work. This is why, from day one, we've also been working with Claude.
Now, there are some differences. The migration from OpenAI to Claude is not as seamless as you would think, because there are some functionalities that OpenAI supports that Claude doesn't, like structured output, for example, and different validations. And the system prompts are not the same: you can take a prompt and use it in OpenAI and in Claude, and with the same system prompt you'll get different results. By definition they're two different models, so you have to adjust the system prompts a bit depending on the provider you're working with.
And although this sounds complicated, it is necessary. Because if you're building a product for millions of users, you cannot rely on a single vendor that, for any reason, might not be available. It might be out for whatever reason, and you just don't want to be in a position of telling your users that your product is not working because OpenAI is not working. You've got to have a fallback.
In our case, we have automation, so we wanted to have a fallback in just one click. What that means is that we adjusted our prompts, our system prompts, per provider. We knew that if we were moving from OpenAI to Claude, the whole system would work the same. What changes is just that instead of calling the OpenAI SDK, we're calling the other SDK, in this case Claude's, and it then attaches the correct system prompts and additional instructions to make the responses as close to OpenAI's as possible.
Yes, it's not ideal, but it's much better than being totally out.
So right now we actually have this. It's not something that happens automatically, because it's a very rare scenario, but we do have a single key in our system that, once we change it, moves the entire system to work against Claude and not against OpenAI. And reverting it is exactly the same.
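As a rough illustration of that one-key switch, here is a small Python sketch under my own assumptions: the flag name, model names, and per-provider prompts are hypothetical, not the framework we're open-sourcing; the SDK calls themselves are the standard OpenAI and Anthropic ones.

```python
# A minimal sketch of a one-key provider switch: a single config value decides
# whether calls go to OpenAI or Anthropic, each with its own adjusted prompt.
from openai import OpenAI
from anthropic import Anthropic

# The "single key" flipped manually in a rare outage scenario.
ACTIVE_PROVIDER = "openai"   # or "anthropic"

# Per-provider system prompts, tuned so responses stay as close as possible.
SYSTEM_PROMPTS = {
    "openai": "You are a website content assistant. Reply with valid JSON.",
    "anthropic": "You are a website content assistant. Reply with valid JSON only, no preamble.",
}

def generate(user_prompt: str) -> str:
    if ACTIVE_PROVIDER == "openai":
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPTS["openai"]},
                {"role": "user", "content": user_prompt},
            ],
        )
        return resp.choices[0].message.content
    else:
        client = Anthropic()
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",
            max_tokens=1024,
            system=SYSTEM_PROMPTS["anthropic"],
            messages=[{"role": "user", "content": user_prompt}],
        )
        return resp.content[0].text
```

The rest of the application only ever calls generate(), so flipping the flag moves everything to the fallback provider, and flipping it back restores the original behavior.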
And we are right now working on open sourcing this framework for everyone.
So stay tuned.
All right, takeaways from building a resilient AI solution. Everyone needs to remember, and this has nothing to do with AI, that failure in software development is inevitable. It's not a matter of if, it's a matter of when. Using a proven design pattern like a circuit breaker is a great solution for being fault tolerant.
I think that, again, it really depends on the scenario. It really depends on the level of support and service that you want to provide for your clients, and on the necessity. If it's a product that's used once a day, it's fine. But if you're working globally and you want to provide your users with sufficient coverage all the time, then I would consider an additional vendor and doing whatever is necessary to make sure it's easy to switch between them: easy to switch from OpenAI to Claude, to Grok, to whatever vendor you want, without adding additional code or changing stuff in the middle of the night.
And another tip, it's not written here, but it's to test, test, test. It's one thing to just build it and say it's ready, but every now and then, practice. Try to change the provider and see that everything works, that the tests are hitting the limits, that the most important scenarios are passing, and that the system is operational. Because what usually happens with fallbacks is that they're left behind, and then there's a difference between how the system works with the main provider versus the fallback, which is usually not as good as the main provider.
All right, the last subject I want to talk to you about today is how AI solutions need to continuously improve.
Earlier we said that this is what our users expect: for our system to be a hundred percent up and always working. But actually, this is what they expect: not only that the system is up, but that everything works great, according to what they expect from the system and how they use it.
But how do we even know that we're doing a good job? How do we know that our system is doing what our users want? I bet many of you right now are thinking about evals and evaluation in general, and you're correct. The way to understand whether an AI system is working correctly is by evaluation.
When we're talking about text generation, it's pretty straightforward. For example, back in the day we started to evaluate our different results in the text area. When our users wanted to change the text of a button widget, for example, we suddenly saw that the AI was generating huge responses back. You can see here it's a paragraph right on the button. It makes no sense that this type of text should be on a button. So once we understood that those scenarios exist, we added evaluations to ensure that the text the AI generates for a button widget, for example, would be different in length than the text for a heading or the text for a full paragraph.
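A length-based check of this kind can be very small. Here is a sketch under my own assumptions: the character limits and function name are made up for illustration, not our actual eval suite.

```python
# A minimal sketch of a length-based eval for generated text, keyed by the
# widget the text is meant for (limits are illustrative).
MAX_CHARS = {"button": 30, "heading": 80, "paragraph": 600}

def text_length_ok(widget_type: str, generated_text: str) -> bool:
    """Flag responses that are far too long for the widget they'll be inserted into."""
    return len(generated_text) <= MAX_CHARS.get(widget_type, 600)

assert text_length_ok("button", "Get started") is True
assert text_length_ok("button", "Welcome to our site, where we build... " * 5) is False
```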
For text it's pretty straightforward: you put text in, the LLM provides text out, and that's it. But how do you evaluate images? Because it's not only that a user asks for an image of a dog and you check whether you see a dog. Does the dog look okay, does it have four legs and just one head? Is it actually walking or is it running? It's not only the functionality that's supposed to work; there's also a certain amount of taste, right? Because someone might look at a picture and say it's a good picture, but someone else was envisioning something else.
So it's hard to evaluate images, and it's much harder to evaluate the whole structure of a website or a page, because a user can ask for an about-me page and get seven different sections, but the user might have had a different idea of how this page should be structured. So how should we evaluate this type of interaction?
In order to do that, we had to move away from traditional, functional evaluations, and we defined a success metric. We called it insert rate. Basically, for many of our features, the user has to enter a prompt and then see a preview of a certain result, and only if the user actually used the result, for example inserted the image, would we mark that specific interaction as successful. If, for example, the user clicked on generate again, or even closed the page, we would know that this interaction ended with an empty result, and we would mark it as inserted: false.
Then we keep all of this data in a dedicated database, where we can see the user's actual input and the enhanced prompt, because as AI engineers we enhance prompts and provide additional context, but we wanted to list everything in a very clear way. Then we have the full prompt with the system prompts and the additions, basically everything that eventually goes to the LLM, the result that comes back, and the indication of whether it was inserted or not. All this information is stored for every interaction inside our database.
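To picture the kind of record this produces, here is a small sketch; the field names and example values are illustrative assumptions, not the actual Elementor schema.

```python
# A minimal sketch of the kind of record that could be persisted per interaction.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AIInteraction:
    feature: str              # e.g. "image_generation", "button_text"
    user_input: str           # the raw prompt the user typed
    enhanced_prompt: str      # prompt after adding AI Context, hints, etc.
    full_prompt: str          # system prompt plus everything sent to the LLM
    result: str               # what the model returned
    inserted: bool            # the success metric: did the user use the result?
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example of logging a failed interaction (the user regenerated or left):
record = AIInteraction(
    feature="image_generation",
    user_input="logo of a fox with transparent background",
    enhanced_prompt="Minimal flat logo of a fox, transparent background",
    full_prompt="<system prompt> ... <user prompt>",
    result="<image url>",
    inserted=False,
)
```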
On top of that, we started with just manually going over the results. We would fetch the information for a day, a week, a month. We started first by just trying to understand what was wrong with the ones that were not inserted, and then we asked GPT for help: we would provide this information in an anonymized way to ChatGPT and ask it to see if it finds any similarities. Eventually we developed our own clustering job. We even have an article right here, there's a QR code, and I invite you to scan it and read how we've developed basically offline, AI-powered jobs that go through this data, create different clusters, and find anomalies: those requests that have some common ground and failed.
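The shape of such an offline job can be quite simple. Here is a minimal sketch, assuming hypothetical helper names and prompts of my own; the only real API used is the OpenAI chat completion call, and the actual production job is certainly more involved.

```python
# A minimal sketch of an offline job that asks an LLM to cluster failed
# interactions and suggest common failure reasons.
import json
from openai import OpenAI

client = OpenAI()

def cluster_failures(failed_interactions: list[dict]) -> str:
    """Send anonymized, non-inserted interactions to the model and ask for clusters."""
    # Only keep the fields needed for analysis; no user identifiers.
    samples = [
        {"input": i["user_input"], "result": i["result"][:500]}
        for i in failed_interactions
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You analyze failed AI interactions. Group them into "
                           "clusters with a shared failure reason and suggest a fix "
                           "for each cluster. Answer in JSON.",
            },
            {"role": "user", "content": json.dumps(samples)},
        ],
    )
    return response.choices[0].message.content

# A weekly scheduled job could fetch last week's inserted=False rows,
# run cluster_failures(), and post the summary to Slack.
```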
A great example that I love to share: our AI job noticed that several image generation requests were not being inserted. Once the AI analyzed them, it suggested that all of the requests that asked for a transparent background were actually getting rejected, and the reason was so simple. The AI model that we were using for image generation didn't support transparent backgrounds, and therefore all the images always came back without a transparent background, with a full background, and users didn't want that, so they rejected them. So the AI here, once it scanned the information, clustered the requests correctly, understood the reason, and even suggested the reason for this failure.
The solution, by the way, was very simple. No, we didn't change the provider. We just added a hint: once we saw in the input that the user was typing something about transparency, we automatically hinted and said, hey, we're not supporting transparent images at the moment. And that's it. We aligned expectations with the users, and they were much happier.
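For completeness, the hint itself can be as small as this sketch; the check and the wording are illustrative, not the exact product copy.

```python
# A tiny sketch of the expectation-alignment hint shown next to the prompt box.
def maybe_warn_about_transparency(user_input: str) -> str | None:
    """Return a hint when the user asks for a transparent background."""
    if "transparent" in user_input.lower():
        return "Heads up: transparent backgrounds aren't supported at the moment."
    return None
```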
Here you can see that basically every week we would get automatic reports in Slack with all this clustering information, with different suggestions and different insights about potential failure reasons. And yeah, as I mentioned, it found real issues: the transparent background issue I just mentioned, the text length of the buttons I mentioned before, even that our Japanese understanding was not as good as we thought it was. And many more examples that, manually, especially for a large-scale application, are just not visible. So you have to get some sort of automation going.
So first you would want to persist the data with all the details, get some sort of automation to scan this data on a repetitive basis, and use AI to get insights.
Yeah, these are exactly the takeaways: persist the data, define a success metric, measure the success metric, and then pick a way to evaluate success. Always start manually, always start simple, and only once you've fine-tuned the metric and the data and everything works, then you can move to more automatic stuff. And also, leverage AI for your needs.
To summarize this session: first, not every system needs to be fully resilient and support huge rate limits. Identify the need for scalability first, then design the solution. If it's rate limits, we've talked about adding additional organizations, maybe changing to a different model or a different provider that has a larger context window or a larger rate limit, but in many cases it'll be just adding more resources and then controlling them in a healthy and resilient layer, using something like a circuit breaker to identify when a certain organization isn't good enough, move it away, and work only with the good ones. And eventually, establishing a continuous improvement mechanism will help your product and your AI solution be better.
So I hope you enjoyed this session. It was a pleasure speaking to you all today. May the AI be with you.
Right here, you can scan this barcode and get the presentation for this talk, along with additional presentations that I've shared in the past, with different recordings, different materials, and different blog posts that I mentioned in this talk and in other talks. Everything is in the same place, so it's easy for you to check. It also has my contact details, so if you want to chat more about AI, feel free to reach out.