Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, welcome to my talk. The title is self-hosted LLMs across all your devices and GPUs. In this talk we're going to talk about self-hosted LLMs, and we're going to primarily use three open source projects. One is called WasmEdge. This is the CNCF project that I founded, and it is a WebAssembly runtime that is seeing great adoption in AI and the large language model space. The second project is called LlamaEdge. LlamaEdge is an open source project that builds upon WasmEdge. It provides tools, SDKs and programming primitives for building large language model applications, especially around the Llama 2 model, on the WasmEdge runtime itself. If WasmEdge is the runtime, you can think of it as the operating system, and LlamaEdge is the infrastructure specifically for building LLM applications on top of WasmEdge. The third project is GaiaNet. GaiaNet further builds upon LlamaEdge to provide what we call a RAG application server, meaning that we supplement the large language model with a real-world knowledge base, either from public data and public knowledge, or from our own proprietary knowledge. That makes the AI models, and the applications associated with them, much more useful. So it's designed for use with AI agent applications and the like.
We'll start off the talk with a demo of how easy it is to run a chatbot using an open source LLM on your own device, using LlamaEdge. In this demo I'll show you the easiest way to run a large language model on your own computer. First, go to the LlamaEdge repository on GitHub. It is an entirely open source project. While you are here, just give it a star. In the readme file you will see a quick start guide, and there's just one line of command. Copy this, and then go to your Mac or Linux computer, or even Windows with WSL, to run this command. What it does is install the large language model runtime, WasmEdge, along with the plugins that are required to run the large language model. It goes fast here, but as you can see, it downloads a large language model called Gemma 2B, which is one of Google's large language models. You can actually run any large language model that you find on Hugging Face, and there are over 5,000 of those. Just go back to the readme and look for the parameters that you can pass to the script, where you can specify the model, even as a Hugging Face URL. Once the model has started, the application starts a local web server on your own device, on port 8080. So what you can do is go to that server, and it will load a URL. The URL says LlamaEdge chat. Now you can start to chat with it. Say, where is the capital of Japan?
The first time it is a little slow because it needs to load the model into memory, but it's fast enough. The capital of Japan is Tokyo. Now you can ask, what about the USA? And it clearly shows that the model understands the conversation, right? Because in the second question I did not ask it to find the capital of the USA. I said, what about the USA? It knows the answer is Washington, DC because it knows about the previous conversation.

If we go back to the log, we can see that with the Gemma 2B model it was generating over 60 tokens per second. That's a lot faster than a human can speak. A human typically speaks, like I'm speaking now, at about three to five tokens per second, so it's more than ten times faster than a human can speak. Now I can ask a longer question: plan me a one-day trip there, meaning a one-day trip to Japan. Okay, I should have said Washington, DC, but that's close enough. Oh, it's giving me two days as well. This is because it's a very small model, a 2B model. If you change it to a 7B model or an even larger model, you're going to see it respond a lot better. But even so, it is still pretty good. It gave me an itinerary up to day four. Now it's maybe a week in Japan, right?

So that's it. That's how you run a large language model on your own computer. I'm running it on a Mac, and I'm recording this video while running it, and even then I can get 60 tokens per second. So it is one of the easiest and fastest ways for you to run an LLM. Yeah, it gave me the whole week. So again, the URL is github.com/LlamaEdge/LlamaEdge, and there's just one line of command. Once you get it running, you can run other models by reading the rest of the readme, and you can use it as an API server and do a lot of things with it. Okay, now you have seen the demo.
So what's it all for? We have seen that you can run a large language model application and chat with it on your own device, but what's it for? Most people just use OpenAI for this purpose, and OpenAI also provides an SDK and API, right? So you can build applications around it as well. So why is there a need to run these open source large language models on your own device? Well, there are a couple of reasons. The first and biggest reason is that OpenAI, or any other large language model provider, basically takes a one-size-fits-all approach. They're using the largest model for the smallest tasks. Even if you ask for something that can be easily handled by a smaller model, it will still use the largest model, which generates a lot of waste. It is very expensive, and it also makes the models very difficult to fine-tune, because it's much harder to fine-tune a large model than a smaller model. And because those models are too big to run on your own hardware, they run on other people's servers, so there's a lack of privacy and control. Perhaps more interesting, and I think people are noticing this more and more, is censorship and bias, because companies like OpenAI or Microsoft all have their own political stances and agendas, so you will find a lot of questions that those models refuse to answer. There are very compelling reasons for enterprises or users to run their own models, preferably on their own devices or in their own cloud.
That raises the second question. The LlamaEdge server is probably not the first or the only open source software that can run large language models. For instance, llama.cpp can run them, Ollama can run them, LM Studio can run them. So why do we choose the LlamaEdge and WasmEdge stack to run our API server around those models, as we did in the previous example?

I think there are several features that distinguish LlamaEdge from other products and other open source projects in this space. The first is support for a wide variety of models. The LlamaEdge application server supports over 6,000 large language models, including all the ones that you have heard of. And that's not just the base models and chat models, but also embedding models, which are essential for running sophisticated applications such as RAG applications. It even supports larger models like xAI's Grok model, which is, I think, 314 billion parameters. That requires top-of-the-line hardware, say a Mac Studio or two H100 graphics cards, but those models can run using LlamaEdge.
On the one hand it supports a wide variety of models, but on the other hand it also supports a wide range of devices, because once we start to run on our own devices and in our own cloud, we have to deal with many different kinds of hardware: different GPUs, different CPUs, different graphics cards, different accelerators and all that. The architecture of LlamaEdge, because it's based on the WasmEdge runtime, which we'll get into in a minute, provides an abstraction layer that separates the application from the underlying runtime and hardware. That allows the same application to run on a wide range of devices and drivers, ranging from Nvidia to AMD to ARM devices, to Intel and AMD CPUs and GPUs, to Apple devices and all of those. I just showed you it running on a Mac, and in a minute I'm going to show you how it runs on an Nvidia device. We also have tutorials on how to run it on, say, a Raspberry Pi or Jetson devices, the kinds of devices more commonly found in the edge cloud.
The LlamaEdge API server also provides enough flexibility to run multiple models in the same server. Meaning I can run a chat model to chat with people and run an embedding model in the same server, so that the user can upload or process their external knowledge base with the same application server.
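As a rough sketch of what that looks like from a client's point of view, here is how you might call the embedding endpoint of a locally running LlamaEdge API server from Rust. This assumes the server exposes the OpenAI-compatible /v1/embeddings route on port 8080, as in the demo; the model name "default" and the reqwest/serde_json crates are illustrative choices, not part of the demo itself.

```rust
// Sketch: query the local LlamaEdge API server's OpenAI-compatible
// embeddings endpoint. Assumes the server from the demo is listening on
// localhost:8080; the model name and crates are illustrative.
// Cargo.toml: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One request can carry several text chunks to be embedded.
    let body = json!({
        "model": "default",   // whichever embedding model the server has loaded
        "input": [
            "Paris is the capital of France.",
            "Tokyo is the capital of Japan."
        ]
    });

    let resp: Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/embeddings") // OpenAI-compatible route
        .json(&body)
        .send()?
        .json()?;

    // Each input chunk comes back as one embedding vector.
    if let Some(data) = resp["data"].as_array() {
        for (i, item) in data.iter().enumerate() {
            let dims = item["embedding"].as_array().map(|v| v.len()).unwrap_or(0);
            println!("chunk {i}: {dims}-dimensional vector");
        }
    }
    Ok(())
}
```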
It can also do things like function calling by forcing the output to be JSON. That's something that needs coordination between the model itself, the prompting, and the model runtime, meaning that the model runtime has to force the output to pass certain grammar checks, for instance that the output is valid JSON. And as you have seen, it's very easy to install and run; it's just one line of command, and I highly recommend that you try it out. It's also very lightweight, because the entire runtime, WasmEdge, plus the application itself is less than about 30 megabytes. Compare that to the PyTorch Docker image, which is 4 GB by itself. That's why, even when running a large language model and putting an API server or a chatbot in front of it, we use LlamaEdge.
But all those benefits actually come from the most important attribute of the LlamaEdge runtime, which is that it is not really designed to be a standalone server for running large language models. It is actually a developer platform. You can write your own code to extend it, customize it, and build your own applications on top of it.

If you think about how large language model applications are built today, typically you have an application server that provides an API. It's either OpenAI or some other SaaS provider, or your own, for example a LlamaEdge server providing an OpenAI-compatible API. Then you build an application that consumes this API in a relatively heavyweight stack, like Python with LangChain or LlamaIndex. There are lots of tools out there to do that. You would also have a UI, a web server, and all those components, contained in a Docker image or something like that, all tied together to build the entire application. While this provides flexibility and allows developers to experiment with different parts of the stack, it also makes those applications really difficult to deploy, because they have too many dependencies, they are huge because of the Python dependency, and they are often very slow.

LlamaEdge is a Rust-based SDK. Essentially, it allows you to build a single, portable, deployable app. You can write all your logic into one single application. There's no need to write part of the logic in C, part of it in Python, and then use HTTP APIs to connect them together. You don't need to do any of that. This improves efficiency, makes deployment a lot easier, and simplifies the workflow. And again, there's no Python dependency. If you haven't worked in the large language model space, you may not know that the Python dependency is actually a nightmare, even for very experienced people in this field.
You can use Rust, or you can use JavaScript. That allows us to build applications that are similar to, say, the latest advances from OpenAI. If you look at OpenAI, they have an assistants API, a stateful API, that builds a lot of functionality into the API server itself. The LlamaEdge platform allows you to develop your own API server to serve whatever front end you want. That gives you a lot of flexibility in terms of building advanced applications.

If you pay close attention, there's a phrase that I put in bold here: a single, portable, deployable application. We are talking about a developer platform that is based on Rust and has no Python dependency, so one of the biggest questions people are going to ask is: what about cross-platform compatibility? For highly efficient native applications like that, if you build it on one platform, it's likely you are not going to be able to deploy it on a different platform. That makes things really difficult for developers: I have a Mac or a Windows machine, but is it really true that the application I compiled and tested there can't be deployed on an Nvidia device in the cloud? In the next demo I'm going to show you how the WasmEdge and LlamaEdge stack addresses this problem, allowing you to build truly portable large
language model applications. So here's the demo. In this demo I'll show you a key benefit of using WebAssembly, or Wasm, to run LLM applications: portability. A WebAssembly application is a binary application that is directly portable across different operating systems, hardware and drivers. On this screen I have opened a terminal to my remote machine on Microsoft Azure. That machine runs Linux, has an Nvidia T4 card, and has the CUDA 12 driver installed on it. If I look at what's in here, you can see I have downloaded the large language model, the Gemma 2B model in GGUF format, and a bunch of HTML and JavaScript files for the chatbot UI. I have also installed the WasmEdge runtime here. If I run wasmedge, it's version 0.13.5. So now we have the large language model and we have the runtime for the large language model. We are still missing the application. The application is something that takes the user input, prompts the large language model, runs it to generate a response, and then sends that response back to the user. You can call it a chatbot application. It could be a web application, it could be a Discord bot, or it could be an agent application that connects to other applications and takes the large language model output to do things. So the application is the key part that you, as a developer, would write.
In the past, you wouldn't expect this application to be portable, because when you develop it you are probably on a Mac or Windows machine, and you compiled it for, say, Apple silicon with the Metal framework, with all of that built in. You wouldn't think that by just copying this binary application to an Nvidia machine it would just run there, right? Of course, there are ways to make it easier. For instance, if you use Python, Python has a lot of abstractions that allow you to write at a fairly high level, so you can write a Python script on a Mac and then try to run the same script on a Linux machine. However, as you would imagine, in order to achieve that, Python has a huge amount of dependencies. If you go to the official PyTorch Docker image, it is 4 GB right there. So the PyTorch Docker image itself is 4 GB, and it's platform dependent, so you have to install the right Python version, and within Python you oftentimes need to specify the underlying architecture as well. So it's a huge dependency and it's not really that portable. For other languages like Rust or Go, if you write your application in any of those languages, you would not imagine that it would be portable, because the underlying GPU and CPU architectures are entirely different. What Wasm does is provide an abstraction for those applications so that they can run smoothly across all the different platforms.
To demonstrate that, I switch windows. This is on my local machine, which is a MacBook, and I have already downloaded one of the API server applications from the LlamaEdge project, which is a Rust application. I compiled it on my Mac and I tested it on my Mac. So now what I'm going to do is simply scp this entire file to the remote Azure machine. As you can see, the entire file is only nine megabytes, and we didn't package it in any way; we didn't have a Docker image wrapped around it. We just scp the whole thing to another machine with an entirely different architecture, in both hardware and software, and expect it to run there. Can it run there? Let's see. We use the WasmEdge runtime to run it.
The WasmEdge runtime starts instantly. The application loads the large language model and then starts an API server. The web server is accessible through port 8080, so if you had this machine's port open to the public, you would be able to load up a browser, go to port 8080 and see it. But for now we want to stay on this machine, because we have it behind a firewall. What we are going to try is an API request, because this API server also takes OpenAI-style API requests, which allows us to integrate with all the OpenAI ecosystem tools. Here's what the request looks like. As you can see, we send a request to localhost with a message whose role is user, asking: where is Paris? If I run this, it does the inference, and the result comes back before I can finish speaking. The role in the result is system, and the content is: Paris is located in the heart of France, blah, blah, blah.
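To make the shape of that request concrete, here is a rough Rust equivalent of the request just described, assuming the OpenAI-compatible /v1/chat/completions route on port 8080; the crates used (reqwest, serde_json) and the model name are illustrative, not part of the demo.

```rust
// Sketch: the same OpenAI-style chat request sent to the LlamaEdge API
// server (addressed as localhost because we are on that machine).
// Cargo.toml: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A single user message, just like the curl request in the demo.
    let body = json!({
        "model": "default",
        "messages": [
            { "role": "user", "content": "Where is Paris?" }
        ]
    });

    let resp: Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions") // OpenAI-compatible route
        .json(&body)
        .send()?
        .json()?;

    // The answer comes back in the first choice's message content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```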
So now we have achieved something I think is very interesting. We compiled a Rust application on the Mac, fully taking advantage of the Mac GPU and the Metal framework, and then we just copied this Wasm application onto a remote Linux machine running on Nvidia. And we can see this application, whose role is to start the API server and interact with the underlying large language model, runs just perfectly on the new hardware, fully utilizing the Nvidia GPU to accelerate it. Without the GPU, you would not be able to get nearly 100 tokens per second on this Linux machine. If you were just using the CPU, it would be more like one token per second; that's two orders of magnitude of difference. So that's it. That's what we have shown: the Wasm application is truly portable.
All right, to recap the demo. For the longest time we have had platform engineering, or DevOps, which combines the roles of developer and Ops. But with the new hardware, with more and more different devices and different drivers coming along for AI applications, I think it's time to separate the dev and Ops roles all over again. The way it works is that WebAssembly is a virtual machine format. You can think of it like Java bytecode, so it provides an abstraction over the real hardware. As a developer, you just need to write to the WebAssembly interface; in our case, you write to the LlamaEdge SDK interface, and it tells the SDK to, say, load a model and do the inference. The developer only needs to write the application this way. Once the application is compiled to Wasm, the developer's job is done. He or she can ship the application anywhere they want and let the runtime take over the rest.
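As a very rough sketch of what "writing to the SDK interface" means, here is the general shape of the WASI-NN style calls that a LlamaEdge-based application boils down to. The crate and function names below (wasmedge-wasi-nn, GraphBuilder, and so on) are illustrative and may differ from the current LlamaEdge SDK; the point is that the application only names a model and asks for inference, and the runtime decides how to execute it on the local hardware.

```rust
// Sketch of the developer side of the contract. Compile this to
// wasm32-wasi and run it under a WasmEdge build that has the GGML plugin;
// the model name "default" is mapped to an actual model file by the
// runtime (the Ops side), not by the application.
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Ask the runtime for the preloaded GGUF model; the runtime decides
    // whether that means Metal, CUDA, or plain CPU underneath.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // Feed the prompt as input tensor 0, run inference, read the answer back.
    let prompt = "Where is the capital of Japan?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set input");
    ctx.compute().expect("inference failed");

    let mut output = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut output).expect("failed to read output");
    println!("{}", String::from_utf8_lossy(&output[..n]));
}
```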
It's the Ops people who need to install the correct runtime and driver for each device. For instance, on a Mac you want to install the Mac version of WasmEdge. It's sort of like Java: on a Mac you need to install the Mac version of Java, the JVM, right? It's the same thing here, so you install the Mac version of the WasmEdge runtime. If you have an Nvidia device, say Ubuntu with CUDA 12, you need to install the appropriate WasmEdge runtime there as well. Once the Ops folks take care of that, they can run the Wasm application without any modification, because the WasmEdge runtime has a contract, a standard API, that is defined in the LlamaEdge SDK. Once it sees those instructions in the bytecode, to load a model or to send some text to the model for inference, it automatically translates them into instructions appropriate for the underlying accelerator and drivers. So it allows developers to write truly portable applications that can be deployed anywhere and managed by tools like Kubernetes, and it lets the Ops people worry about installing the right driver and the right WasmEdge runtime on every single node or every single edge device in the cluster.
That leads us to our last demo. We have talked about LlamaEdge being a developer platform, and one of the most popular applications people build with Python today is what they call a RAG application. Meaning that you use a standard large language model, or a fine-tuned large language model, but you feed it with your proprietary knowledge base. The knowledge base is typically text, a PDF, an image or a text file, and it is divided into segments; each segment is turned into a vector and stored in a vector database. When the user asks a new question through the API, the application takes that question, turns it into a vector as well, and performs a vector search in the database to find out which texts in the knowledge base are most related to the question. Then it adds the context retrieved from the knowledge base, together with the new question, into the prompt, and asks the large language model to give an answer.
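To make that flow concrete, here is a pseudocode-level Rust sketch of the request path just described. The helpers embed, search_vectors and chat are hypothetical stand-ins for the embedding model, the vector database and the chat model; they are not actual LlamaEdge or GaiaNet APIs.

```rust
// Pseudocode-level sketch of the RAG request path described above.
fn answer_with_rag(question: &str) -> String {
    // 1. Turn the incoming question into a vector with the embedding model.
    let question_vector: Vec<f32> = embed(question);

    // 2. Find the knowledge-base segments whose vectors are closest to it.
    let context_chunks: Vec<String> = search_vectors(&question_vector, 3);

    // 3. Put the retrieved context and the question into one prompt.
    let prompt = format!(
        "You are a helpful assistant. Answer only from the context below.\n\nContext:\n{}\n\nQuestion: {}",
        context_chunks.join("\n---\n"),
        question
    );

    // 4. Ask the chat model to generate the final answer.
    chat(&prompt)
}

// Hypothetical helpers: in a real application these would call the
// embedding model, the vector database (e.g. Qdrant), and the chat model.
fn embed(_text: &str) -> Vec<f32> { unimplemented!() }
fn search_vectors(_query: &[f32], _top_k: usize) -> Vec<String> { unimplemented!() }
fn chat(_prompt: &str) -> String { unimplemented!() }
```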
As you can see, this is a fairly complex process, and it involves not just the large language model and the runtime, but also things like the vector database and the embedding model, all of which you have to tie together. As we discussed earlier in this talk, this has traditionally been done with fairly complicated Python programs that handle the orchestration, the queries, the embedding of text into vectors and all that, and then you attach another UI in front of it. So it's a fairly involved and complicated process. But with LlamaEdge, we are able to build a single application that can talk to the vector database, call the embedding model when needed, perform the vector search, and at the end prompt the large language model to get an answer. We call this an integrated assistants API server, because it looks a lot like OpenAI's assistants API, and we call the project GaiaNet. Here's a demo of it.
Hi. In this demo I'll show you the easiest way to run a RAG API server for a large language model. If you are familiar with RAG, it is a way to add knowledge, either additional public knowledge or proprietary knowledge that you don't want other people to see, to an existing large language model, so that the large language model can answer questions and chat with people based on the additional context you provide to it. Typically, doing that requires a fairly complex setup: you have to install a vector database, a UI for uploading the knowledge, and tools like LangChain to orchestrate how to retrieve data from the vector database and how to prompt the model. In our approach, we want to introduce a project called GaiaNet. GaiaNet is an application that builds on WasmEdge and also uses the LlamaEdge framework. What it does is allow you to write simple RAG applications in Rust, compile them into a single Wasm binary with zero Python dependency, and run them very efficiently on the server. So let's see how
it works. If you go to the GaiaNet GitHub repository (and by the way, if you are here, just give us a star), there is a quick start guide for what is called a Gaia node, and there is one line of command which you can use to install it. Let's do that and explain what it does. This is on my local Mac machine. When I run the install, it installs the WasmEdge runtime, which is required to run the large language model; it installs a Qdrant vector database, which is also required; it downloads a chat model, which in this case has already been downloaded, a standard Llama 2 7B chat model; and it downloads the embedding model, which is used to compute the vectors.

And here, you can ignore the error, what's really interesting is that it also downloads a knowledge collection, which is a knowledge base we have vectorized into the Qdrant snapshot format. You can read in the documentation how we do that, and we have a separate Rust application that helps you do it. It then installs that snapshot into the Qdrant database. It creates a gaianet directory in your home directory and puts everything in there, including the Wasm file that starts up the RAG application server. If you look at the config.json, you will see the chat model that's being used, the parameters for the chat model, the snapshot of the vectorized knowledge base, and the prompts. You can modify any of these: use a different model, use a different knowledge base, and rerun the install.

Once you have everything installed, you can run the start script. What it does is start the vector database, as we said, start the WasmEdge application server, and start a domain server, which gives the server a publicly accessible domain.
Right. So what we're going to do is open it up. Since this runs on our local machine, I could use localhost, but because it gives me a publicly accessible domain, I'm going to load that publicly accessible domain instead. It shows me how to access the API, but what I'll do is just chat with the node. While it's loading, I'll come back here and open up the log. This is the LlamaEdge startup log, which records all the interactions with the large language model.

Now the UI has come up. I want to demonstrate to you that it really does use a knowledge base to chat with us. So if I say, where's Paris? The knowledge base I'm using by default comes from a Paris guidebook. I vectorized that Paris guidebook and put it into the vector database so that the large language model can generate answers from it. It says: thank you for the question; based on the context, Paris is located in the north-central part of France, blah blah, blah blah. How do we know that this is actually using our context? How do we know that our RAG application server works? We go back to the log and we can see the actual prompt. You can see this is the actual prompt the application server sends to the model: you are a helpful assistant, then here's the context, and if the answer is not in the context, don't answer it. Those three paragraphs of context come from our vector database; they come from the Paris guidebook, and they're general information about Paris. Then we ask the question, where's Paris? So it answers based on that context.
You can of course ask follow-up questions, have it plan a one-day trip there, or whatever. And you can also use it as an OpenAI-compatible API server. As you can imagine, you could build your knowledge base around a code repository, use a fine-tuned model that generates code or JSON, and have all of that packaged together as an API server that connects to a chatbot or something like that. That allows you to build an entirely portable application that can run across many different GPU architectures and drivers without any need for large dependencies like Python.
All right, that brings us to the end of our talk. I think we're at about 30 minutes. We have done three different demos, from the easiest to the most sophisticated, and I hope you will have time to at least install LlamaEdge on your own computer along with a large language model so you can play with it. If you find it interesting, you can install a GaiaNet node, build your own knowledge base, and have a large language model on your own device answer questions in the way that you want. There is a lot more we could get into, for instance what the Rust application looks like, what the SDK looks like, and things like that, but I don't think we have time for that at this moment. If you're interested, we have those three open source GitHub repositories that you can go to. Please feel free to raise issues, find us there and chat with us. All right, thank you so much.