Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, welcome to my talk. The title is self-hosted LLMs across all your devices and GPUs. In this talk we're going to talk about self-hosted LLMs, and we're going to primarily use three open source projects. One is called WasmEdge. This is the CNCF project that I founded, and it is a WebAssembly runtime that is seeing great adoption in AI and the large language model space. The second project is called LlamaEdge. LlamaEdge is an open source project that builds upon WasmEdge. It provides tools, SDKs and programming primitives for building large language model applications, especially around the Llama 2 model, on the WasmEdge runtime itself. If WasmEdge is the runtime, you can think of it as the operating system, and LlamaEdge is the infrastructure specifically for building LLM applications on top of WasmEdge. The third project is GaiaNet. GaiaNet further builds upon LlamaEdge to provide what we call a RAG application server, meaning that we supplement the large language model with a real-world knowledge base, either from public data and public knowledge, or from our own proprietary knowledge. That makes the AI models, and the applications associated with them, much more useful. So it's designed for use with AI agent applications and the like.
We'll start off the talk with a demo of how easy it is to run a chatbot using an open source LLM on your own device, using LlamaEdge. In this demo I'll show you the easiest way to run a large language model on your own computer. First, go to the LlamaEdge repository on GitHub. It is an entirely open source project. While you are here, just give it a star. In the readme file you will see a quick start guide, and there's just one line of command. Copy this, and then go to your Mac or Linux computer, or even Windows with WSL, to run this command. What it does is install the large language model runtime, WasmEdge, along with the plugins that are required to run the large language model. It goes fast here, but as you can see, it downloads a large language model called Gemma 2B, which is one of Google's large language models. You can actually run any large language model that you find on Hugging Face, and there are over 5,000 of those. Just go back to the readme and look for the parameters that you can pass to the script, where you can specify the model, even as a Hugging Face URL. Once the model has started, the application starts a local web server on your own device, on port 8080. So what you can do is go to that server, and it will load a URL. The URL says LlamaEdge chat. Now you can start to chat with it. Say, where is the capital of Japan?
The first time it is a little slow because it needs to load the model into memory, but it's fast enough. The capital of Japan is Tokyo. Now you can ask, what about the USA? And it clearly shows that the model understands the conversation, right? Because in the second question I did not ask it to find the capital of the USA. I said, what about the USA? It knows the answer is Washington, DC because it knows about the previous conversation.

If we go back to the log, we can see that with the Gemma 2B model it was generating over 60 tokens per second. That's a lot faster than a human can speak. A human typically speaks, like I'm speaking now, at about three to five tokens per second, so it's more than ten times faster than a human can speak. Now I can ask a longer question: plan me a one-day trip there, meaning a one-day trip to Japan. Okay, I should have said Washington, DC, but that's close enough. Oh, it's giving me two days as well. This is because it's a very small model, a 2B model. If you change it to a 7B model or an even larger model, you're going to see it respond a lot better. But even so, it is still pretty good. It gave me an itinerary up to day four. Now it's maybe a week in Japan, right?

So that's it. That's how you run a large language model on your own computer. I'm running it on a Mac, and I'm recording this video while running it, and even then I can get 60 tokens per second. So it is one of the easiest and fastest ways for you to run an LLM. Yeah, it gave me the whole week. So again, the URL is github.com/LlamaEdge/LlamaEdge, and there's just one line of command. Once you get it running, you can run other models by reading the rest of the readme, and you can use it as an API server and do a lot of things with it. Okay, now you have seen the demo.
So what's it all for? We have seen that you can run a large language model application and chat with it on your own device, but what's it for? Most people just use OpenAI for this purpose, and OpenAI also provides an SDK and API, right? So you can build applications around it as well. So why is there a need to run these open source large language models on your own device? Well, there are a couple of reasons. The first and biggest reason is that OpenAI, or any other large language model provider, basically takes a one-size-fits-all approach. They're using the largest model for the smallest tasks. Even if you ask for something that can be easily handled by a smaller model, it will still use the largest model, which generates a lot of waste. It is very expensive, and it also makes the models very difficult to fine-tune, because it's much harder to fine-tune a large model than a smaller model. And because those models are too big to run on your own hardware, they run on other people's servers, so there's a lack of privacy and control. Perhaps more interesting, and I think people are noticing this more and more, is censorship and bias, because companies like OpenAI or Microsoft all have their own political stances and agendas, so you will find a lot of questions that those models refuse to answer. There are very compelling reasons for enterprises or users to run their own models, preferably on their own devices or in their own cloud.
That raises the second question. The LlamaEdge server is probably not the first or the only open source software that can run large language models. For instance, llama.cpp can run them, Ollama can run them, LM Studio can run them. So why do we choose the LlamaEdge and WasmEdge stack to run our API server around those models, as we did in the previous example?

I think there are several features that distinguish LlamaEdge from other products and other open source projects in this space. The first is support for a wide variety of models. The LlamaEdge application server supports over 6,000 large language models, including all the ones that you have heard of. And that's not just the base models and chat models, but also embedding models, which are essential for running sophisticated applications such as RAG applications. It even supports larger models like xAI's Grok model, which is, I think, 314 billion parameters. That requires top-of-the-line hardware, say a Mac Studio or two H100 graphics cards, but those models can run using LlamaEdge.
On the one hand it supports a wide variety of models, but on the other hand it also supports a wide range of devices, because once we start to run on our own devices and in our own cloud, we have to deal with many different kinds of hardware: different GPUs, different CPUs, different graphics cards, different accelerators and all that. The architecture of LlamaEdge, because it's based on the WasmEdge runtime, which we'll get into in a minute, provides an abstraction layer that separates the application from the underlying runtime and hardware. That allows the same application to run on a wide range of devices and drivers, ranging from Nvidia to AMD to ARM devices, to Intel and AMD CPUs and GPUs, to Apple devices and all of those. I just showed you it running on a Mac, and in a minute I'm going to show you how it runs on an Nvidia device. We also have tutorials on how to run it on, say, a Raspberry Pi or Jetson devices, the kinds of devices more commonly found in the edge cloud.
The LlamaEdge API server also provides enough flexibility to run multiple models in the same server. Meaning I can run a chat model to chat with people and run an embedding model in the same server, so that the user can upload or process their external knowledge base with the same application server.
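As a rough sketch of what that looks like from a client's point of view, here is how you might call the embedding endpoint of a locally running LlamaEdge API server from Rust. This assumes the server exposes the OpenAI-compatible /v1/embeddings route on port 8080, as in the demo; the model name "default" and the reqwest/serde_json crates are illustrative choices, not part of the demo itself.

```rust
// Sketch: query the local LlamaEdge API server's OpenAI-compatible
// embeddings endpoint. Assumes the server from the demo is listening on
// localhost:8080; the model name and crates are illustrative.
// Cargo.toml: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One request can carry several text chunks to be embedded.
    let body = json!({
        "model": "default",   // whichever embedding model the server has loaded
        "input": [
            "Paris is the capital of France.",
            "Tokyo is the capital of Japan."
        ]
    });

    let resp: Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/embeddings") // OpenAI-compatible route
        .json(&body)
        .send()?
        .json()?;

    // Each input chunk comes back as one embedding vector.
    if let Some(data) = resp["data"].as_array() {
        for (i, item) in data.iter().enumerate() {
            let dims = item["embedding"].as_array().map(|v| v.len()).unwrap_or(0);
            println!("chunk {i}: {dims}-dimensional vector");
        }
    }
    Ok(())
}
```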
It can also do things like function calling by forcing the output to be JSON. That's something that needs coordination between the model itself, the prompting, and the model runtime, meaning that the model runtime has to force the output to pass certain grammar checks, for instance that the output is valid JSON. And as you have seen, it's very easy to install and run; it's just one line of command, and I highly recommend that you try it out. It's also very lightweight, because the entire runtime, WasmEdge, plus the application itself is less than about 30 megabytes. Compare that to the PyTorch Docker image, which is 4 GB by itself. That's why, even when running a large language model and putting an API server or a chatbot in front of it, we use LlamaEdge.
But all those benefits actually come from the most important attribute of the LlamaEdge runtime, which is that it is not really designed to be a standalone server for running large language models. It is actually a developer platform. You can write your own code to extend it, customize it, and build your own applications on top of it.

If you think about how large language model applications are built today, typically you have an application server that provides an API. It's either OpenAI or some other SaaS provider, or your own, for example a LlamaEdge server providing an OpenAI-compatible API. Then you build an application that consumes this API in a relatively heavyweight stack, like Python with LangChain or LlamaIndex. There are lots of tools out there to do that. You would also have a UI, a web server, and all those components, contained in a Docker image or something like that, all tied together to build the entire application. While this provides flexibility and allows developers to experiment with different parts of the stack, it also makes those applications really difficult to deploy, because they have too many dependencies, they are huge because of the Python dependency, and they are often very slow.

LlamaEdge is a Rust-based SDK. Essentially, it allows you to build a single, portable, deployable app. You can write all your logic into one single application. There's no need to write part of the logic in C, part of it in Python, and then use HTTP APIs to connect them together. You don't need to do any of that. This improves efficiency, makes deployment a lot easier, and simplifies the workflow. And again, there's no Python dependency. If you haven't worked in the large language model space, you may not know that the Python dependency is actually a nightmare, even for very experienced people in this field.
You can use Rust, or you can use JavaScript. That allows us to build applications that are similar to, say, the latest advances from OpenAI. If you look at OpenAI, they have an assistants API, a stateful API, that builds a lot of functionality into the API server itself. The LlamaEdge platform allows you to develop your own API server to serve whatever front end you want. That gives you a lot of flexibility in terms of building advanced applications.

If you pay close attention, there's a phrase that I put in bold here: a single, portable, deployable application. We are talking about a developer platform that is based on Rust and has no Python dependency, so one of the biggest questions people are going to ask is: what about cross-platform compatibility? For highly efficient native applications like that, if you build it on one platform, it's likely you are not going to be able to deploy it on a different platform. That makes things really difficult for developers: I have a Mac or a Windows machine, but is it really true that the application I compiled and tested there can't be deployed on an Nvidia device in the cloud? In the next demo I'm going to show you how the WasmEdge and LlamaEdge stack addresses this problem, allowing you to build truly portable large
language model applications. So here's the demo. In this demo I'll show you a key benefit of using WebAssembly, or Wasm, to run LLM applications: portability. A WebAssembly application is a binary application that is directly portable across different operating systems, hardware and drivers. On this screen I have opened a terminal to my remote machine on Microsoft Azure. That machine runs Linux, has an Nvidia T4 card, and has the CUDA 12 driver installed on it. If I look at what's in here, you can see I have downloaded the large language model, the Gemma 2B model in GGUF format, and a bunch of HTML and JavaScript files for the chatbot UI. I have also installed the WasmEdge runtime here. If I run wasmedge, it's version 0.13.5. So now we have the large language model and we have the runtime for the large language model. We are still missing the application. The application is something that takes the user input, prompts the large language model, runs it to generate a response, and then sends that response back to the user. You can call it a chatbot application. It could be a web application, it could be a Discord bot, or it could be an agent application that connects to other applications and takes the large language model output to do things. So the application is the key part that you, as a developer, would write.
In the past, you wouldn't expect this application to be portable, because when you develop it you are probably on a Mac or Windows machine, and you compiled it for, say, Apple silicon with the Metal framework, with all of that built in. You wouldn't think that by just copying this binary application to an Nvidia machine it would just run there, right? Of course, there are ways to make it easier. For instance, if you use Python, Python has a lot of abstractions that allow you to write at a fairly high level, so you can write a Python script on a Mac and then try to run the same script on a Linux machine. However, as you would imagine, in order to achieve that, Python has a huge amount of dependencies. If you go to the official PyTorch Docker image, it is 4 GB right there. So the PyTorch Docker image itself is 4 GB, and it's platform dependent, so you have to install the right Python version, and within Python you oftentimes need to specify the underlying architecture as well. So it's a huge dependency and it's not really that portable. For other languages like Rust or Go, if you write your application in any of those languages, you would not imagine that it would be portable, because the underlying GPU and CPU architectures are entirely different. What Wasm does is provide an abstraction for those applications so that they can run smoothly across all the different platforms.
To demonstrate that, I switch windows. This is on my local machine, which is a MacBook, and I have already downloaded one of the API server applications from the LlamaEdge project, which is a Rust application. I compiled it on my Mac and I tested it on my Mac. So now what I'm going to do is simply scp this entire file to the remote Azure machine. As you can see, the entire file is only nine megabytes, and we didn't package it in any way; we didn't have a Docker image wrapped around it. We just scp the whole thing to another machine with an entirely different architecture, in both hardware and software, and expect it to run there. Can it run there? Let's see. We use the WasmEdge runtime to run it.
The WasmEdge runtime starts instantly. The application loads the large language model and then starts an API server. The web server is accessible through port 8080, so if you had this machine's port open to the public, you would be able to load up a browser, go to port 8080 and see it. But for now we want to stay on this machine, because we have it behind a firewall. What we are going to try is an API request, because this API server also takes OpenAI-style API requests, which allows us to integrate with all the OpenAI ecosystem tools. Here's what the request looks like. As you can see, we send a request to localhost with a message whose role is user, asking: where is Paris? If I run this, it does the inference, and the result comes back before I can finish speaking. The role in the result is system, and the content is: Paris is located in the heart of France, blah, blah, blah.
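To make the shape of that request concrete, here is a rough Rust equivalent of the request just described, assuming the OpenAI-compatible /v1/chat/completions route on port 8080; the crates used (reqwest, serde_json) and the model name are illustrative, not part of the demo.

```rust
// Sketch: the same OpenAI-style chat request sent to the LlamaEdge API
// server (addressed as localhost because we are on that machine).
// Cargo.toml: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A single user message, just like the curl request in the demo.
    let body = json!({
        "model": "default",
        "messages": [
            { "role": "user", "content": "Where is Paris?" }
        ]
    });

    let resp: Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions") // OpenAI-compatible route
        .json(&body)
        .send()?
        .json()?;

    // The answer comes back in the first choice's message content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```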
So now we have achieved something I think is very interesting. We compiled a Rust application on the Mac, fully taking advantage of the Mac GPU and the Metal framework, and then we just copied this Wasm application onto a remote Linux machine running on Nvidia. And we can see this application, whose role is to start the API server and interact with the underlying large language model, runs just perfectly on the new hardware, fully utilizing the Nvidia GPU to accelerate it. Without the GPU, you would not be able to get nearly 100 tokens per second on this Linux machine. If you were just using the CPU, it would be more like one token per second; that's two orders of magnitude of difference. So that's it. That's what we have shown: the Wasm application is truly portable.
All right, to recap the demo. For the longest time we have had platform engineering, or DevOps, which combines the roles of developer and Ops. But with the new hardware, with more and more different devices and different drivers coming along for AI applications, I think it's time to separate the dev and Ops roles all over again. The way it works is that WebAssembly is a virtual machine format. You can think of it like Java bytecode, so it provides an abstraction over the real hardware. As a developer, you just need to write to the WebAssembly interface; in our case, you write to the LlamaEdge SDK interface, and it tells the SDK to, say, load a model and do the inference. The developer only needs to write the application this way. Once the application is compiled to Wasm, the developer's job is done. He or she can ship the application anywhere they want and let the runtime take over the rest.
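As a very rough sketch of what "writing to the SDK interface" means, here is the general shape of the WASI-NN style calls that a LlamaEdge-based application boils down to. The crate and function names below (wasmedge-wasi-nn, GraphBuilder, and so on) are illustrative and may differ from the current LlamaEdge SDK; the point is that the application only names a model and asks for inference, and the runtime decides how to execute it on the local hardware.

```rust
// Sketch of the developer side of the contract. Compile this to
// wasm32-wasi and run it under a WasmEdge build that has the GGML plugin;
// the model name "default" is mapped to an actual model file by the
// runtime (the Ops side), not by the application.
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Ask the runtime for the preloaded GGUF model; the runtime decides
    // whether that means Metal, CUDA, or plain CPU underneath.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // Feed the prompt as input tensor 0, run inference, read the answer back.
    let prompt = "Where is the capital of Japan?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set input");
    ctx.compute().expect("inference failed");

    let mut output = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut output).expect("failed to read output");
    println!("{}", String::from_utf8_lossy(&output[..n]));
}
```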
It's the Ops people who need to install the correct runtime and driver for each device. For instance, on a Mac you want to install the Mac version of WasmEdge. It's sort of like Java: on a Mac you need to install the Mac version of Java, the JVM, right? It's the same thing here, so you install the Mac version of the WasmEdge runtime. If you have an Nvidia device, say Ubuntu with CUDA 12, you need to install the appropriate WasmEdge runtime there as well. Once the Ops folks take care of that, they can run the Wasm application without any modification, because the WasmEdge runtime has a contract, a standard API, that is defined in the LlamaEdge SDK. Once it sees those instructions in the bytecode, to load a model or to send some text to the model for inference, it automatically translates them into instructions appropriate for the underlying accelerator and drivers. So it allows developers to write truly portable applications that can be deployed anywhere and managed by tools like Kubernetes, and it lets the Ops people worry about installing the right driver and the right WasmEdge runtime on every single node or every single edge device in the cluster.
That leads us to our last demo. We have talked about LlamaEdge being a developer platform, and one of the most popular applications people build with Python today is what they call a RAG application. Meaning that you use a standard large language model, or a fine-tuned large language model, but you feed it with your proprietary knowledge base. The knowledge base is typically text, a PDF, an image or a text file, and it is divided into segments; each segment is turned into a vector and stored in a vector database. When the user asks a new question through the API, the application takes that question, turns it into a vector as well, and performs a vector search in the database to find out which texts in the knowledge base are most related to the question. Then it adds the context retrieved from the knowledge base, together with the new question, into the prompt, and asks the large language model to give an answer.
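To make that flow concrete, here is a pseudocode-level Rust sketch of the request path just described. The helpers embed, search_vectors and chat are hypothetical stand-ins for the embedding model, the vector database and the chat model; they are not actual LlamaEdge or GaiaNet APIs.

```rust
// Pseudocode-level sketch of the RAG request path described above.
fn answer_with_rag(question: &str) -> String {
    // 1. Turn the incoming question into a vector with the embedding model.
    let question_vector: Vec<f32> = embed(question);

    // 2. Find the knowledge-base segments whose vectors are closest to it.
    let context_chunks: Vec<String> = search_vectors(&question_vector, 3);

    // 3. Put the retrieved context and the question into one prompt.
    let prompt = format!(
        "You are a helpful assistant. Answer only from the context below.\n\nContext:\n{}\n\nQuestion: {}",
        context_chunks.join("\n---\n"),
        question
    );

    // 4. Ask the chat model to generate the final answer.
    chat(&prompt)
}

// Hypothetical helpers: in a real application these would call the
// embedding model, the vector database (e.g. Qdrant), and the chat model.
fn embed(_text: &str) -> Vec<f32> { unimplemented!() }
fn search_vectors(_query: &[f32], _top_k: usize) -> Vec<String> { unimplemented!() }
fn chat(_prompt: &str) -> String { unimplemented!() }
```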
As you can see, this is a fairly complex process, and it involves not just the large language model and the runtime, but also things like the vector database and the embedding model, all of which you have to tie together. As we discussed earlier in this talk, this has traditionally been done with fairly complicated Python programs that handle the orchestration, the queries, the embedding of text into vectors and all that, and then you attach another UI in front of it. So it's a fairly involved and complicated process. But with LlamaEdge, we are able to build a single application that can talk to the vector database, call the embedding model when needed, perform the vector search, and at the end prompt the large language model to get an answer. We call this an integrated assistants API server, because it looks a lot like OpenAI's assistants API, and we call the project GaiaNet. Here's a demo of it.
Hi. In this demo I'll show you the easiest way to run a RAG API server for a large language model. If you are familiar with RAG, it is a way to add knowledge, either additional public knowledge or proprietary knowledge that you don't want other people to see, to an existing large language model, so that the large language model can answer questions and chat with people based on the additional context you provide to it. Typically, doing that requires a fairly complex setup: you have to install a vector database, a UI for uploading the knowledge, and tools like LangChain to orchestrate how to retrieve data from the vector database and how to prompt the model. In our approach, we want to introduce a project called GaiaNet. GaiaNet is an application that builds on WasmEdge and also uses the LlamaEdge framework. What it does is allow you to write simple RAG applications in Rust, compile them into a single Wasm binary with zero Python dependency, and run them very efficiently on the server. So let's see how
it works. If you go to the GaiaNet GitHub repository (and by the way, if you are here, just give us a star), there is a quick start guide for what is called a Gaia node, and there is one line of command which you can use to install it. Let's do that and explain what it does. This is on my local Mac machine. When I run the install, it installs the WasmEdge runtime, which is required to run the large language model; it installs a Qdrant vector database, which is also required; it downloads a chat model, which in this case has already been downloaded, a standard Llama 2 7B chat model; and it downloads the embedding model, which is used to compute the vectors.

And here, you can ignore the error, what's really interesting is that it also downloads a knowledge collection, which is a knowledge base we have vectorized into the Qdrant snapshot format. You can read in the documentation how we do that, and we have a separate Rust application that helps you do it. It then installs that snapshot into the Qdrant database. It creates a gaianet directory in your home directory and puts everything in there, including the Wasm file that starts up the RAG application server. If you look at the config.json, you will see the chat model that's being used, the parameters for the chat model, the snapshot of the vectorized knowledge base, and the prompts. You can modify any of these: use a different model, use a different knowledge base, and rerun the install.

Once you have everything installed, you can run the start script. What it does is start the vector database, as we said, start the WasmEdge application server, and start a domain server, which gives the server a publicly accessible domain.
Right. So what we're going to do is open it up. Since this runs on our local machine, I could use localhost, but because it gives me a publicly accessible domain, I'm going to load that publicly accessible domain instead. It shows me how to access the API, but what I'll do is just chat with the node. While it's loading, I'll come back here and open up the log. This is the LlamaEdge startup log, which records all the interactions with the large language model.

Now the UI has come up. I want to demonstrate to you that it really does use a knowledge base to chat with us. So if I say, where's Paris? The knowledge base I'm using by default comes from a Paris guidebook. I vectorized that Paris guidebook and put it into the vector database so that the large language model can generate answers from it. It says: thank you for the question; based on the context, Paris is located in the north-central part of France, blah blah, blah blah. How do we know that this is actually using our context? How do we know that our RAG application server works? We go back to the log and we can see the actual prompt. You can see this is the actual prompt the application server sends to the model: you are a helpful assistant, then here's the context, and if the answer is not in the context, don't answer it. Those three paragraphs of context come from our vector database; they come from the Paris guidebook, and they're general information about Paris. Then we ask the question, where's Paris? So it answers based on that context.
You can of course ask follow-up questions, have it plan a one-day trip there, or whatever. And you can also use it as an OpenAI-compatible API server. As you can imagine, you could build your knowledge base around a code repository, use a fine-tuned model that generates code or JSON, and have all of that packaged together as an API server that connects to a chatbot or something like that. That allows you to build an entirely portable application that can run across many different GPU architectures and drivers without any need for large dependencies like Python.
All right, that brings us to the end of our talk. I think we're at about 30 minutes. We have done three different demos, from the easiest to the most sophisticated, and I hope you will have time to at least install LlamaEdge on your own computer along with a large language model so you can play with it. If you find it interesting, you can install a GaiaNet node, build your own knowledge base, and have a large language model on your own device answer questions in the way that you want. There is a lot more we could get into, for instance what the Rust application looks like, what the SDK looks like, and things like that, but I don't think we have time for that at this moment. If you're interested, we have those three open source GitHub repositories that you can go to. Please feel free to raise issues, find us there and chat with us. All right, thank you so much.