Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Soap. I am the co-founder and CTO of Ku AI, and today I'm going to talk about practical tips for building AI applications and AI agents using LLMs.
At Ku AI we have been building AI applications and agents for the last 18 months, and during that journey we identified a set of problems that people commonly face, the same problems we ourselves hit while building applications on top of LLMs. The agenda for today's talk is to make devs aware of these problems, so that when they're building apps on top of LLMs they can recognize them, and also to discuss the solutions that worked for us and the dev tools we use to solve them. By sharing this, we want to save you time when you're building applications on top of LLMs for the first time.
The first thing you'll need to solve when you start building apps on top of LLMs is how to handle LLM inconsistencies. If you have some experience building applications with normal APIs, you'll have seen that they don't fail that often. When building general applications, you don't really worry about inconsistencies or failures that much: if an API call fails, you just let it fail, and when the user refreshes the page you make another call and it will most probably succeed. But with LLMs, this is the first thing you'll probably need to solve when you're actually building an application, because LLMs have a much higher error rate than your normal APIs. Unless you solve this, your application will have a terrible UX, or user experience, because you'll generally feed the LLM response into something else, and every time the LLM gives you a wrong output, your application will also break. So this is the first problem you should solve while building LLM applications.
Now, before we get into how to solve this particular problem, let's talk about why it even occurs in these applications. Like I mentioned earlier, if you're working with normal, fairly stable APIs, you generally don't worry much about the error rate. But even the most stable LLMs give you a much higher error rate than your normal APIs, and the reason is that LLMs are inherently non-deterministic. What do I mean by that? If you look at an LLM under the hood, it's essentially a statistical machine that produces token after token based on the input prompt and whatever tokens have been generated previously. Statistical machines are probabilistic, and as soon as you bring probability into software, you get something non-deterministic. And what do we mean by non-deterministic? You can get a different output for the same input every time you ask the LLM for a response. I'm pretty sure you've seen this while using the different chatbots that are available, like ChatGPT, DeepSeek, or the Claude chat: for the same input, every time you hit retry, you get a different output. The same thing happens with the LLM responses in your applications; most of the time you don't have a lot of control over whether you'll get the exact output you want or not.
Because of this non-determinism, you won't always get the response you want for a given input. For example, if you ask an LLM API to generate JSON, which is a structured output, you might get more fields than you asked for, sometimes fewer fields, and sometimes a bracket is missing. Based on what we have seen, a normal stable API gives you an error rate of something like 0.1%, but even the most stable LLMs, the ones that have been around the longest, will give you an error rate of something like 1 to 5%, depending on the kind of task you're asking them to perform.
And if you're working with chained LLM responses, where you give the LLM API a prompt, take the response, and then use it inside another prompt, you'll see that your error rate gets compounded. This particular problem will probably never be fully solved inside the LLMs themselves because, like I mentioned, LLMs are inherently non-deterministic; that's how the architecture works.
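As a rough back-of-the-envelope illustration (my numbers, not from the talk): if each call in a chain independently succeeds with probability $1 - p$, the whole chain of $n$ calls succeeds with probability $(1 - p)^n$. For a 1% per-call error rate across five chained calls, that is $(0.99)^5 \approx 0.951$, so roughly 5% of chained runs fail even though each individual call only fails 1% of the time.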
So this is something that needs to be solved in your application. You can't really wait for LLMs to get better and start providing better responses. They will definitely get better and the error rate will come down, but as an application developer it's your responsibility to take care of this issue within your application as well.
So what are your options? The first thing you should definitely try is retries and timeouts. These are not new concepts; if you've worked in software development for a while, you know what a retry is. Basically, when an API call fails or gives you a wrong response, you wait for a bit, maybe with a cool-down period depending on the rate limits, and then you try it again. As simple as that.
Now, when you're developing general applications, retries and timeouts are usually not the first thing you implement, because you go with the assumption that the API is going to be fairly reliable: it'll work most of the time, and not adding retries and timeouts won't really degrade the application's performance. Unless you're working on something very critical, like finance or health, where the operation has to finish, in which case you definitely start with retries and timeouts. But in our general experience with normal APIs, you don't really worry about these things. Because LLM APIs specifically have a higher error rate, retries and timeouts are something that need to be implemented from day one.
Timeouts, again, I don't need to go too deep into. A timeout means you make an API call and wait X seconds for the API to return; if it doesn't return in X seconds for whatever reason, you terminate that particular API call and try again. This is basically protection against the API server being down or slow. If you don't do this and the API takes a minute to respond, your application is also stuck for a minute, and so are your users. So a timeout is protection: if an API call doesn't return in a reasonable amount of time, you cancel it and retry. That's where timeouts come into the picture.
Cool. So how do you implement this in your application? I would suggest you don't write the retry and timeout code from scratch, because there are battle-tested libraries available in every language that let you add these behaviors with a few lines of code.
So let's look at a few examples. The one we actually use in production is a Python library called Tenacity, and it allows you to add retries to your functions by simply adding a decorator provided by the library. You add it to a function, and that function will be retried whenever it raises an exception or error.
Now, ideally you'd want more control over how many retries to do, how long to wait after every attempt, and those kinds of things. All those options are present in this library. You can give it stopping conditions, like stop after three retries or stop after 10 seconds of retrying. You can add a wait time before every retry, either fixed or random. All these different behaviors can be added with a few lines of code. If you're working in Python, this is our choice; it has been working very well for us in production and it's what we've been using for a very long time, so I would recommend it.
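As a rough sketch of what that looks like (the decorator and wait/stop helpers are real Tenacity APIs; the call_llm body, model name, and prompt are my own placeholders, not from the slides):

```python
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI()

@retry(
    stop=stop_after_attempt(3),            # give up after 3 attempts
    wait=wait_exponential(min=1, max=10),  # back off 1s, 2s, 4s ... capped at 10s
)
def call_llm(prompt: str) -> str:
    # Any exception raised here (network error, API error, ...) triggers a retry.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Swapping `wait_exponential` for `wait_fixed` or `wait_random` gives you the fixed and random wait behaviors mentioned above.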
If you're working in JS, there is a similar library, conveniently called retry, available on npm. It gives you the same kind of functionality: retries and timeouts.
Oh, by the way, Tenacity also has timeout-related decorators that work the same way: if you want to add a timeout to a particular function, you just add the decorator and specify the timeout.
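The exact decorator isn't captured in this transcript, but as a library-agnostic illustration of the timeout idea in plain Python (my own sketch, reusing the hypothetical call_llm from above):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout

_pool = ThreadPoolExecutor(max_workers=4)  # shared pool for outbound LLM calls

def call_llm_with_timeout(prompt: str, timeout_s: float = 20.0) -> str:
    # Stop waiting after timeout_s seconds; the caller can catch CallTimeout and retry.
    # (The underlying request keeps running in its thread; we just stop waiting for it.)
    future = _pool.submit(call_llm, prompt)
    return future.result(timeout=timeout_s)
```

Many HTTP clients also accept a timeout parameter directly, which achieves the same protection with less code.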
So yeah: if you're working with a JS application, retry is our choice.
The third option: a lot of people developing LLM applications are using LLM frameworks to handle API calls, retries, and a bunch of other things. The most widely used LLM framework is LangChain, and if you're building on top of LangChain, it provides a retry mechanism out of the box. It's called the retry output parser. You use it around your LLM calls, and whenever the output fails to parse, the parser handles the retry on your behalf by passing the prompt again along with the previous output, so the LLM API has a better idea that the last output failed and it's not supposed to give that response again. So if you're on LangChain, this is already sorted out for you: use the retry output parser.
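A hedged sketch of that pattern (LangChain's module layout moves around between versions, so treat the imports as approximate; the schema and prompt are my own placeholders):

```python
from langchain.output_parsers import PydanticOutputParser, RetryOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class Answer(BaseModel):
    title: str
    summary: str

base_parser = PydanticOutputParser(pydantic_object=Answer)
prompt = PromptTemplate.from_template(
    "Summarize the article below as JSON.\n{format_instructions}\n\n{article}"
).partial(format_instructions=base_parser.get_format_instructions())

llm = ChatOpenAI(model="gpt-4o-mini")
retry_parser = RetryOutputParser.from_llm(parser=base_parser, llm=llm)

prompt_value = prompt.format_prompt(article="...")
completion = llm.invoke(prompt_value).content
try:
    result = base_parser.parse(completion)
except Exception:
    # Re-asks the model, passing along the original prompt and the bad output.
    result = retry_parser.parse_with_prompt(completion, prompt_value)
```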
Alright, so that sorts out how to implement retries and timeouts. The next most common reason for failure, or LLM inconsistency, is when you're working with structured outputs. By structured output I mean asking the LLM to generate a JSON, XML, CSV, or even a plain list, those kinds of things. Whenever you ask an LLM to generate a structured output, there is a chance that something will be wrong with that structure: maybe some fields are missing, maybe there are extra fields, and in the case of JSON or XML there may be brackets missing. So how do you handle that?
The simplest way is to integrate a schema library instead of hand-rolling the checks every time. A schema library could be something like Pydantic, which is what we use in production. Pydantic is the most commonly used data validation library in Python. It allows you to create classes in which you describe the structure of your response, and then you use that class to check whether the LLM response fits the structure or not. It will check for missing or extra fields, data types, and a bunch of other things. On Python, just go for Pydantic: it's a tried and tested library, and it makes the data validation part of working with structured outputs hassle-free.
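A minimal sketch of that (the field names and the call_llm helper are my own illustration):

```python
from pydantic import BaseModel, ValidationError

class EssayOutline(BaseModel):
    title: str
    sections: list[str]
    word_count: int

raw = call_llm("Return a JSON object with title, sections and word_count for an essay on urban farming.")

try:
    outline = EssayOutline.model_validate_json(raw)  # Pydantic v2; v1 uses parse_raw
except ValidationError as err:
    # Missing fields, extra fields, wrong types and malformed JSON all end up here.
    print(err)
```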
Similarly, if you're working on npm, there's something called Yup, which does the same job as Pydantic: data validation. You define the shape of your output, essentially a JS class or object, and Yup uses that shape to check or enforce the structure of your LLM responses.
And the idea is to use these data validation libraries along with retries and timeouts. What you basically do is: when you make an LLM API call and get a response, you pass it through Pydantic, or whatever data validation library you're using, and if you get an error, you use the retry to let the LLM generate the structured output again. Most of the time a couple of retries sorts it out; it's not like every API call fails in the same way. So if a few things are missing in your structured output the first time, after a retry you'll usually get the correct output.
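Putting the two together, a sketch of the retry-on-validation-error pattern described above (same hypothetical helpers as before):

```python
from pydantic import ValidationError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

@retry(
    retry=retry_if_exception_type(ValidationError),  # only retry on schema failures
    stop=stop_after_attempt(3),
    wait=wait_fixed(1),
)
def get_outline(topic: str) -> EssayOutline:
    raw = call_llm(f"Return JSON with title, sections and word_count for an essay on {topic}.")
    # A bad structure raises ValidationError, which triggers another attempt.
    return EssayOutline.model_validate_json(raw)
```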
But just as general advice, if you see the same kind of issue happening again and again, mention that instruction in the prompt itself. Because what happens is that when you retry, an API call that was supposed to take five seconds might end up taking 15 to 20 seconds, and that makes your application feel laggy: at the end of that API call you're going to show some output to your users, and they're waiting for it. So if a particular kind of problem keeps recurring, for example, if you're generating JSON with an LLM API and the LLM keeps using single quotes instead of double quotes, which will generally cause issues, specify that as an important point in your prompt so that you get the correct output on the first attempt. The retry is just an additional level of protection; the idea is that the first response should itself be correct. Anything that is a known issue should be mentioned in the prompt as a special instruction, so that you don't keep retrying and wasting time waiting for an output.
One additional option worth a special mention here is the structured output capability provided by OpenAI. If you're using GPT models and the OpenAI APIs, there is a response format field where you can specify a class, and the OpenAI API itself will try to enforce the structure.
But there's one problem here: if you want to switch out the model and use something else, like Claude or Grok, then you've basically lost the structured output capability, because it's not available right now in the other LLM APIs. So my suggestion is to handle the schema enforcement and checking in your application itself, so that it's easy for you to switch out and use different models.
That's all for handling LLM inconsistencies. Two main things: retries and timeouts, use them from the start; and if you're working with structured outputs, use a data validation library to check the structure. You've seen the options here; any of them are good.
The next thing you should start thinking about is how to implement streaming in your LLM application. Generally, when you develop APIs, you implement normal request-response: you get an API call, the server does some work, and then you return the entire response in one go. In the case of LLMs, it can sometimes take a long time for the LLM to generate a response, and that's where streaming comes into the picture. Streaming your responses allows you to start returning partial responses to the client even before the LLM is done with the generation.
Let's look at why streaming is so important when building LLM applications. Like I mentioned, LLMs might take a long time to complete a generation. Now, most users are very impatient; you can't ask them to wait for many seconds, and I'm not even talking about minutes. If you have a ten-second delay before showing the response, you might see a lot of drop-off, and most of the LLMs you'd work with take five to ten seconds even for the simplest prompts. So how do you improve the UX and make sure your users don't drop off? That's where streaming comes into the picture.
LLMs generate responses token by token; they generate them essentially word by word. What streaming allows you to do is skip waiting for the LLM to generate the entire response or output: as soon as it has generated a few words, you can send them to the client and start displaying them in the UI or whatever client you're using. This way the user doesn't really feel the lag that LLM generation causes. What they see is that as soon as they type out a prompt, they immediately start seeing some response and can start reading it. This is a very common pattern in any chat or LLM application you've used: as soon as you type something or take an action, you start seeing partial results in the UI, and that's implemented through streaming.
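As a tiny illustration of what token-by-token generation looks like on the server side (assuming the OpenAI Python SDK; printing stands in for forwarding to your client):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain streaming in two short paragraphs."}],
    stream=True,  # yield partial chunks instead of one final response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # in a real app, push this to the client
```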
The most common, or most widely used, way to implement streaming is WebSockets. WebSockets allow you to send generated tokens or words in real time. A connection is established between the client and the server, and until the entire generation is completed, or as long as the user is live on the UI, you can reuse that connection to keep sending the response as and when it gets generated. It's also a bidirectional communication method, so you can use the same channel to receive input from the client.
Now, one drawback of WebSockets is that they need some custom implementation. You can't just take your simple HTTP REST server and convert it into WebSockets; you'll need to redo your implementation, use new libraries, probably even a new language. For example, Python is not a very efficient way to implement WebSockets; you'd probably want to move to a language that handles threads or multiprocessing much better than Python, like Golang, Java, or C++. So a WebSocket implementation is generally a considerable effort, and if all you want to do is stream LLM responses, it's probably not the best way to do it.
There is another solution for streaming over HTTP called Server-Sent Events, which basically uses your existing server. If you're on Python using Flask or FastAPI, you won't need to make a lot of changes to start streaming with server-sent events; code-wise or implementation-wise, it's a minimal effort. What it essentially does is use the same HTTP connection that your API call utilizes, but instead of sending the entire response in one shot, you send the response in chunks, and on the client side you receive them and start displaying. Now, this is a unidirectional flow; it works just like a REST API call, but instead of the client waiting for the entire response to arrive, the client starts showing the chunks the server has sent using server-sent events. Implementation-wise it's very simple: you just need to implement a generator, if you're using Python, and maybe add a couple of headers. I won't get into the specific details because these are things you can easily Google.
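For reference, a minimal FastAPI sketch of that (my own illustration, reusing the streaming loop from above; a browser client can consume it with EventSource):

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

def sse_tokens(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            # Server-sent events are just "data: ...\n\n" frames over one kept-open HTTP response.
            yield f"data: {delta}\n\n"

@app.get("/generate")
def generate(prompt: str):
    return StreamingResponse(sse_tokens(prompt), media_type="text/event-stream")
```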
But our recommendation: if you want to implement streaming in your application and you already have a REST setup ready on the backend, just go for server-sent events. It's a much faster implementation and also much easier to build. WebSockets are a bit heavy, and unless you have a specific use case for them, I wouldn't recommend going down that path.
Streaming is a good solution if the particular task the LLM is handling finishes in a few seconds, say five to ten. But if your task is going to take minutes, streaming is probably not a good option; that's where background jobs come into the picture. If you have a task which can be done in five to ten seconds, use streaming; it's a good way to start showing output to the user on the client side. But if you have a task which is going to take minutes, it's better to handle it asynchronously instead of synchronously in your backend server, and background jobs help you do that. So what are the use cases where you might want background jobs instead of streaming?
Think of it this way. Let's say you're building something like an essay generator, and you allow the user to enter essay topics in bulk. If someone gives you a single essay topic, you'll probably finish the generation in a few seconds and streaming is the way to go. But let's say someone gives you a hundred essay topics for generation. That task, no matter how fast the LLM is, is going to take at least a few minutes. If you use streaming for this, all the work happens in your backend server, and until the task is completed, which is going to be a few minutes, your backend server's resources are tied up in it. That's very inefficient, because your backend server's job is basically to take a request, process it in a few seconds, and send the response back to the client. If you start doing things that take minutes and you have a lot of concurrent users, your backend server will be busy and won't be able to handle the tasks that take a few seconds; your APIs will start getting blocked, and your application performance will start to degrade.
So what's the solution here? You don't handle long-running tasks in the backend server synchronously; you handle them in background jobs asynchronously. Basically, when a user gives you a task that's going to take minutes, you log it in a database, a background job will pick that task up, and in the meantime you communicate to the user that this is going to take a few minutes and that once the task is completed they'll get a notification, probably as an email or on Slack. Then the background job picks up the task, processes it, and once it's ready, sends out the notification.
The easiest way to implement this is cron jobs. Cron jobs have been around for a very long time and are very easy to set up on any Unix-based server, which is probably what most production backend servers run on. All you need to do is set up a cron job which does the processing and runs every few minutes, checking the database for any pending tasks. When a user comes to you with a task, you just put it in the DB and mark it as pending; when the cron job wakes up a few minutes later, it checks for any pending task and starts the processing.
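A rough sketch of that pattern (the table layout, the call_llm helper, and the notification stub are all my own illustration; the crontab line runs the script every five minutes):

```python
# crontab entry: */5 * * * * /usr/bin/python3 /srv/app/process_pending.py
import sqlite3

def process_pending():
    db = sqlite3.connect("tasks.db")
    pending = db.execute("SELECT id, topic FROM tasks WHERE status = 'pending'").fetchall()
    for task_id, topic in pending:
        db.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (task_id,))
        db.commit()
        essay = call_llm(f"Write an essay about {topic}")  # hypothetical helper from earlier
        db.execute("UPDATE tasks SET status = 'done', result = ? WHERE id = ?", (essay, task_id))
        db.commit()
        notify_user(task_id)  # e.g. send an email or Slack message (stub, not shown)

if __name__ == "__main__":
    process_pending()
```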
And on the UI side you can probably implement some sort of polling to check whether the task is completed, and once it is, display it in the UI. But this is optional. Ideally, if you're using background jobs, you should also separately communicate to the user that the task is completed, because the general pattern is that when a task is going to take a few minutes, your users will come to your platform, submit the task, and then move away, so they're not looking at your application's UI. You should probably communicate that the task is completed through an email or a Slack notification, so the users who have moved away from the UI also know that the generation has finished.
This works very well, with minimal setup: nothing new to learn, nothing new to install. For the initial stages of your LLM application, just go for a cron job. What happens as your application grows is that you'll probably need to scale this, and if you run multiple cron jobs, you need to handle which cron job picks up which task; you need to implement some sort of distributed locking, and all those complexities come into the picture. Basically, cron jobs are good for the initial stages. We also started with cron jobs, and we still use them for some simple tasks, but there will be a stage when you'll need to move away from them for scalability and for better retry mechanisms. That's where task queues come into the picture.
Basically, think of a task queue as cron jobs with more intelligence, where all the task management that needs to happen is handled by the task queue itself. At a very high level, task management means: you submit a task to the queue; the queue is generally backed by some storage, like Redis or another cache, where the task is stored; the queue has a bunch of workers running, and it handles how to allocate that work to each worker based on a bunch of different mechanisms. You can have priority queues, different retry mechanisms, and all those things.
There are two good things about using task queues. First, they're much easier to scale. With cron jobs, if you go from one to two to ten of them, you have to handle a bunch of locking-related stuff yourself; in a task queue, that's already implemented for you, so all you do is increase the number of workers. If you start getting more tasks or workload, it's as easy as changing a number on a dashboard to increase the number of workers. And again, all the additional handling for race conditions, retries, and timeouts is already taken care of; all you need to do is provide some configuration. Second, you get better monitoring with task queues. Every task queue comes with some sort of monitoring mechanism or dashboard where you can see which tasks are currently running, how many resources they're eating up, which tasks are failing, and you can start or restart tasks, all those kinds of things. So once you start scaling your application, go for task queues.
The task queue we use in production is called RQ, which stands for Redis Queue. As the name suggests, it's backed by Redis, and it's a very simple library for queuing and processing background jobs with workers. Very easy setup; it hardly takes 15 minutes, and if you already have a Redis instance, you don't even need to set up a new one for RQ. The queuing and processing mechanism is very simple. All you need to do is create a queue and give it a Redis connection so it has a place to store the tasks; when you get a task, you enqueue it along with the function that will get called in the worker to process it, plus any arguments for that function. It's that simple. For the worker, you just start it from your command line, and it consumes tasks from Redis and processes them. If you want to increase the number of workers, you just start ten different workers, connect them to the same Redis, and RQ itself handles all the complexity of managing which worker gets which task, and all those kinds of things.
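For reference, the enqueue side looks roughly like this (queue name, function, and arguments are placeholders; the worker is then started with something like `rq worker essays` on the command line):

```python
from redis import Redis
from rq import Queue

from tasks import generate_essays  # your own module containing the long-running function

q = Queue("essays", connection=Redis())
job = q.enqueue(generate_essays, ["topic 1", "topic 2"], job_timeout=1800)  # 30-minute job timeout
print(job.id)  # store this if you want to poll the job's status later
```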
If you're on Python, RQ is the way to go. Celery provides similar functionality, but we found there were a bunch of things in Celery that we didn't really need, and it seemed like overkill, so we decided to go with RQ, which was much simpler to set up on our end.
The last thing I want to cover is evals. Evals help you figure out, for a given prompt, which inputs are not working, which models are working, which models are not working, and things like that.
If you want an analogy, think of evals as unit testing, unit testing for your prompts. An eval allows you to take a prompt template and individually test that template against a bunch of different values. There are a bunch of reasons why you should ideally use evals with your prompt templates. One, they let you test prompts in isolation, which makes them very fast, the same way unit tests are fast because you're just checking one function against different types of inputs. Using evals, you'll be able to figure out which inputs work and which don't, which model works for a particular task and which doesn't, and you'll be able to compare the costs of different models for different types of inputs, and so on.
An additional benefit of evals is that you can integrate them directly with your CI/CD pipeline, so you don't need to manually check before every release whether your prompts are still working the way they were. Just like unit tests, you hook them up to your CI/CD pipeline, and after every commit, or after every build, you run the evals. And similar to assertions in unit tests, evals have assertions or checks where you can examine the response, specify whether it is as expected or not, and pass or fail the eval. That's how evals work at a very high level.
We have tried out a bunch of different eval libraries, and the one we like the most is promptfoo. It's very easy to set up and works with simple YAML files: you create a YAML file where you specify your prompt template and a bunch of inputs for that template. promptfoo is an open-source tool, so you can install it straight from npm and run it in your CLI.
Which will show you for different types of inputs, whether the
output has passed the condition.
it'll also allow you to compare different, models and, there is
some way to compare cost as well.
I don't think they have displayed it here, but yeah.
cost comparison is also something that you'll get in the same dashboard
and, You can start off with the open source version of Profu, but they
also have a cloud hosted version.
So if you want more reliability or don't want to manage your own instance,
that option is also available.
Before we end the talk, let's do a quick walkthrough of all the different foundational models, or foundational model APIs, that are available for public use. The reason for doing this is that this landscape is changing very fast. Since the last time you went over the available models, I'm pretty sure the list, and how they compare, has changed; models you thought weren't that great have probably become very good, and so on. So let's do a quick run-through of the available models: what they're good at, what they're not good at, and what kinds of use cases work with each kind of model.
Let's start with the oldest player, OpenAI. OpenAI has three main families of models available for public use: GPT-4, GPT-4o, and the o series. I think they've deprecated their GPT-3 and 3.5 models, so these are the models available right now. If you don't know what to use, just go with OpenAI; these are the most versatile models and they work very well across a wide variety of tasks. Within these models, between GPT-4o and GPT-4o mini, the difference is mainly the trade-off between cost and latency versus accuracy. If you have a complex task or something that requires a bit more reasoning, go for GPT-4o. If you're worried about cost, or about how fast the response is going to be, go for GPT-4o mini, but it will give you somewhat lower accuracy. The o series is something I've not tried out; these are supposed to be OpenAI's flagship models, but from what I've heard they're fairly new, so test them thoroughly before you put them in production. GPT-4o and GPT-4o mini have been around for a while now, so I don't think you'll see a lot of problems with them. Also, reliability-wise, in our experience the OpenAI APIs have been the most reliable, so you don't need to worry much about downtime or having to handle switching models because the provider is not working.
The next provider is Anthropic. I think for a while these guys were working mostly on the chat product, and as far as I know the APIs were not publicly available, but in the last few months that has changed. The APIs are available and completely self-serve: you can go directly to Anthropic's console, create an API key, load up some credit, and get started. If you have any coding-related use case, the Claude APIs are your best choice. As far as coding as a particular task is concerned, Claude works much better than all the other models, which is also why you'll have seen that everyone is using Claude with their code editors, like Cursor. So yeah, if code is what you want, work with Claude.
Next up is Groq, not to be confused with X's Grok. Groq is essentially a company building special-purpose chips, they call them LPUs, for running LLMs, which makes their inference time very low, and probably the inference cost too. So if latency is what you're trying to optimize, try out GroqCloud, which is their LLM API. They generally host most of the commonly used open-source models, so you have Llama, Mistral, and Gemma available, apart from a bunch of other things. Latency-wise they are much faster than all the other model providers, so if you're optimizing for latency and these models work for your particular task, go for it.
Alright, so AWS Bedrock mainly works like Groq in the sense that they host a lot of open-source models, and along with that I think they also have their own models, which we have not tried out yet. But the biggest USP of AWS Bedrock is if you're already in the AWS ecosystem and you're worried about your sensitive data getting out of your infra, and you don't want to send it to OpenAI or Claude or any other model provider; in that case, Bedrock should be your choice. One good thing is that Bedrock also hosts the Claude APIs. The limits are lower, as far as I know; I think you'll need to talk to support to get your service quota increased. But if you're worried about sensitive data and you're okay with Claude, Bedrock should work very well for you. Along with that, they also host Llama, Mixtral, and a few other APIs, including multimodal APIs.
Azure: the last time I checked, Azure hosts the GPT models; that hosting is separate from what OpenAI does. And the last time we checked, the Azure GPT APIs were a bit faster than OpenAI's. So if you want to use the OpenAI APIs with slightly better latency, try out Azure, but they'll make you fill out a bunch of forms; I think these models are not publicly available on Azure for everyone as of now. GCP I've not tried out. Again, I think the setup was a bit complex, so we didn't get a chance to give it a try, but from what we've heard the developer experience is much better now, so someday we'll give it a try again. But GCP has Gemini.
The newest kid on the block is DeepSeek. If you're active on Twitter, you'll already have heard about DeepSeek's APIs; from the chatter, it seems they are at par with OpenAI's APIs. I haven't tried them out myself, but give them a try. One concern could be the hosting, which is in China, but definitely give it a try; you might find it a good fit for your use case. And one more thing: DeepSeek's models are also open source, so you can host them on your own.
And that's all from me. I hope you find the information shared in this talk useful and that it speeds up your development process when you're building LLM applications and AI agents. If you have any queries, or if you want to talk more about this, drop us an email; you can find our email on Ku AI's landing page, or just send me a message on LinkedIn. Happy to chat about this.
Awesome. Bye.