Conf42 Machine Learning 2025 - Online

- Premiere: 5 PM GMT

Aggressive LLMs Optimization: Making Them Work on Tiny Devices

Abstract

Discover how to shrink GPT‑2 for ultra‑weak hardware without sacrificing performance! We reveal how pruning, quantization, and fine‑tuning can unlock the power of large language models (LLMs) in a tiny form.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Max, and today we're going to talk about aggressive optimization of large language models, with the goal of making them work on tiny devices. Let's start with a few words about us. I'm Max Navrotsky, a senior software engineer at VGS, and my co-speaker today is Oleksandr Gordieiev, a professor of software engineering at Lutsk National Technical University. Also, a shout-out to everyone who helped me arrange this presentation.

Okay, let's start with the challenge: why do we actually need to optimize LLMs for small devices? By today's standards, the newer large-scale variants of these models require significant computational resources, which, as you may have guessed, means their practical deployment on low-end, resource-constrained devices often gets overlooked. Think field operations, offline systems, embedded devices, and the like. Our research goal today is to find out whether aggressive model optimization techniques, such as architecture trimming, pruning, and quantization, can enable LLMs to operate efficiently under extreme hardware constraints.

Our demonstration approach is to use a practical task to evaluate the optimization strategy. The task: fine-tune the model and use it to generate ISO-compliant software requirements. Without requirements, no product, and not only a software product, can start and finish properly. We thought of it as a great task for evaluating and measuring the trade-offs between model compression, resource consumption, and output stability.

All right. Before we optimize anything, we need to figure out which model suits us best. Among the alternatives, we were looking for a model that (a) is open source and (b) is not already compressed, with no optimization techniques applied to it: something like a blank sheet, but already small. As you may have guessed, out of all the alternatives we settled on GPT-2. Why? First, as I mentioned, it's open source, so we can play around as much as we want without hitting licensing or API limitations. Second, it's relatively lightweight, at 117 million parameters and a raw model size of only about 500 megabytes, which is not that much. Critically, GPT-2 is also really good at generation despite its age: released back in 2019, it still produces coherent, grammatically sound, task-aligned outputs, so it's a solid base for fine-tuning toward specific domains. And finally, as a bonus point, it has a transparent and well-documented architecture, which allowed us to dive deep into the limitations of these optimization techniques. To sum up: if we had only one model to carry with us into a forest on an old laptop, it would be GPT-2.

Now the plan, and it's a really simple one: we'll cover the theory, then the practical results, then the conclusions. Let's start with the theory and the student model concept. The core idea is that a student model is an instance of the base model, with the difference that you can play around with it as much as you want. You can break it, and maybe you'll have to redo things, but it won't affect the original. Consider it something like a clone you can experiment on freely.
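To make the student idea concrete, here is a minimal sketch, assuming PyTorch and the Hugging Face `transformers` library (the talk doesn't name a specific stack): the student starts as an exact copy of the base model, so later experiments mutate the copy while the original stays intact for comparison.

```python
# Minimal sketch of a "student model": a disposable clone of the base model.
import copy

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

base = GPT2LMHeadModel.from_pretrained("gpt2")         # the untouched original
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

student = copy.deepcopy(base)  # the clone we will trim, prune, and quantize
```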
Next, the metrics by which we'll measure success. There are four of them. First, perplexity: the model's accuracy, whether it can produce correct output. Second, CPU and memory usage, which, as you may guess, is critical for judging performance on low-end devices. Third, inference speed: faster response times, pretty straightforward. And fourth, model size: the smaller, the better. The general pattern, as you've noticed, is that smaller and faster usually equals better.

Now let's define what we consider aggressive and non-aggressive methods. Starting with the non-aggressive ones, because there is really only one we'll cover today: knowledge distillation, the technique where a student model closely mimics a teacher model without any changes to the architecture at all. Aggressive methods, as you may guess, are the ones that interfere with the model's architecture, layer count, and so on. There are three of them: architecture trimming, which is all about reducing the layer and neuron counts; pruning, which removes less important parameters and weights; and quantization, which changes the precision of the weights, say from the float32 standard to int8.

Okay, cool. Let's cover the basics of each optimization technique. With architecture trimming, the main concept is: if it's too big, just make it smaller. Imagine the model as a building, where every floor and room can be considered a layer or a neuron. With this technique we simply remove unnecessary layers and neurons, that's it. When we remove layers, we reduce the number of processing stages in the model; when we remove neurons, we decrease the number of computational units per layer. There's also a glossary at the bottom of the slides, so feel free to pause the video and check out the proper definitions of a layer and a neuron.

Pruning: if it's not important, just cut it off. Here, imagine the model as a tree, where every branch is a connection between neurons and each branch carries leaves, which are the parameters and weights. With pruning, we decide which weights to disable and simply never use. An interesting thing is that while we're disabling some parameters and weights, this can also indirectly affect the neurons: cut enough of a neuron's weights, and the neuron itself effectively switches off.

Quantization: the main concept here is that if precision isn't critical, we don't need heavy tools. Basically, we change the weights from the standard float32 format to int8. The model becomes less precise, but it gains a lot in CPU performance and becomes much easier to run; we just convert the weights into smaller, lighter ones. As you can see on the graph, this affects only the weights and parameters.
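As a hand-rolled illustration of that float32-to-int8 idea (not the production API used later in the experiments), each weight can be stored as an 8-bit integer plus a single float scale factor:

```python
# Symmetric int8 quantization of one weight tensor, for intuition only.
# Real toolchains (e.g. torch.quantization) do this per layer with
# calibrated scales.
import torch

weights = torch.randn(4, 4)               # stand-in for a float32 weight matrix

scale = weights.abs().max() / 127          # map the largest |w| onto int8 range
q_weights = torch.round(weights / scale).to(torch.int8)  # 1 byte per weight

dequantized = q_weights.float() * scale    # approximate float32 reconstruction
error = (weights - dequantized).abs().max()
print(f"max reconstruction error: {error.item():.5f}")   # small loss, 4x less memory
```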
As for the non-aggressive optimization methods, distillation can be thought of as imitation of the teacher model: the student copies from an experienced teacher. A very important note: this method is non-aggressive because it does not interfere with the model architecture, so we don't need to worry about performance degradation. The other key point to take away is that distillation works well only if nothing in the student's architecture has been changed; the student model and the teacher model should be essentially the same. And as you can see in the graph, nothing in the architecture is touched; the student just learns from the teacher's outputs and weights.

Okay, cool. So, as we noticed, the three aggressive methods, architecture trimming, pruning, and quantization, each play with the model's architecture in a different way. Architecture trimming removes some of the layers. Pruning removes not layers but weights. And quantization removes nothing; it just reduces the precision of the weights. They're similar in spirit, but different in mechanism. There's a summary slide here if you want to pause the video and check it out.

Let's move on to part two, the practical research. As you may have guessed, we already have an order of experiments, and this part proceeds in several steps. First, we fine-tune our model, because without fine-tuning the rest doesn't make much sense. Then we continue with architecture trimming, distillation, pruning, and quantization, and finally we combine those methods. We'll explain later why we apply them in this specific order.

But before we start, we need a baseline benchmark from the raw, base GPT-2. It's pre-trained and non-specialized, and it struggles with any task-specific structure. On the left are the metrics we specified before, perplexity, CPU usage, memory usage, inference speed, and model size, together with the prompt results so you can see visually what's going on. As you can see from the output, it's not hallucinating, I would say, but it's not semantically correct either, which points to the need to fine-tune the model for a specific task like ISO requirements generation.

So the first step after benchmarking the raw GPT-2 is to fine-tune the model, with no optimization applied yet. Our model is fine-tuned on an ISO-compliant software requirements dataset, it uses the full GPT-2 architecture, and it will serve as the reference point for our comparisons. As you can see, after fine-tuning for our specific task, the output is much better. With the fine-tuned model everything improved; only perplexity got a little worse. I would say perplexity values from about 1.0 to 1.5 are really solid and give you model quality that looks good, visually at least.
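For reference, the four snapshot metrics could be collected with a simple harness like this: a sketch using standard PyTorch calls plus the third-party `psutil` package; the evaluation text and prompt are placeholders, not the talk's actual dataset.

```python
# Minimal measurement harness for the four metrics in the snapshots.
import math
import os
import time

import psutil
import torch


def perplexity(model, tokenizer, text):
    """exp(mean cross-entropy) of the model on the evaluation text."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


def model_size_mb(model, path="tmp_weights.pt"):
    """On-disk size of the saved state dict, in megabytes."""
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size


def inference_seconds(model, tokenizer, prompt):
    """Wall-clock time to generate 50 new tokens for one prompt."""
    enc = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**enc, max_new_tokens=50)
    return time.perf_counter() - start


ram_mb = psutil.Process().memory_info().rss / 1e6  # resident RAM of this process
```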
Okay, let's start with the first, and I think the most essential, optimization step: architecture trimming. In our practical research, we removed half of GPT-2's layers and reduced the hidden size. Basically, we went from 12 layers to 6 and from 768 dimensions to 384, which means the same inputs, but a lighter architecture and faster execution. You may ask how we picked which layers to trim. The strategy was actually really simple: we removed layers symmetrically from the center. That preserves the bottom layers, which are responsible for basic syntax and token-level understanding, and the top layers, which handle higher-level structure and context. Basically speaking, this allowed us to maintain both low-level and high-level processing.

Awesome, so let's compare the performance. As you can see, perplexity got a little worse, since the model became slightly less accurate, but the output for the prompt is still really solid; it's pretty much unchanged. Notice, though, that memory usage actually increased. Why? When we reduce CPU usage, somebody has to pay, and that somebody is RAM. With optimizations like quantization and layer trimming we reduce processing time, but the cost comes back as tensor sizes, caching, and parallel processing, which increase RAM usage. Why does this matter? CPU time equals energy cost on battery-powered devices, and most of the time the CPU is the biggest power draw. RAM, on the other hand, is something we can almost neglect: it's generally cheap to access, although limited on old laptops and microcontrollers. In our use case that's not too problematic, since we lean toward minimizing CPU usage to make sure the models can run on embedded devices. This graph pretty much summarizes everything we just covered: on the x-axis we start from raw GPT-2 and finish with the combined setup, and RAM usage changes by almost 200 megabytes along the way. Not that big, but something to keep in mind.
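A minimal sketch of the symmetric center trimming described above might look like this; halving the hidden size (768 to 384) would additionally require re-initializing and re-training the network, so only the layer cut is shown.

```python
# Symmetric layer trimming: drop the middle transformer blocks so the
# bottom layers (syntax, token-level) and the top layers (structure,
# context) survive. 12 layers -> 6 layers for GPT-2 small.
import torch


def trim_center_layers(model, keep_bottom=3, keep_top=3):
    blocks = model.transformer.h                    # GPT-2's stack of 12 blocks
    kept = list(blocks[:keep_bottom]) + list(blocks[-keep_top:])
    model.transformer.h = torch.nn.ModuleList(kept)
    model.config.n_layer = len(kept)                # keep the config in sync
    return model


student = trim_center_layers(student)               # 12 -> 6 layers
```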
Okay, let's continue. You may also ask: why not apply distillation now? And here is the breaking point. In distillation, the student model is trained to mimic a bigger teacher model. But after we've chopped the student down this hard, it's simply too small to understand the teacher: it no longer has enough layers to reproduce the teacher's outputs. The student tries to compensate, and that results in hallucinations and repetition. This graph represents exactly that effect: at some point the student simply stops understanding what the teacher is trying to teach, so it does the only thing an LLM with a gutted architecture can do, hallucinate and repeat itself. As a result, we hit a critical failure point; there's nothing more we can do with distillation here, so we drop it. As you can see from the distillation performance snapshot, the output is straight-up unusable and the perplexity is far worse, so we skip this method altogether. A quick conclusion: non-aggressive methods don't work together with aggressive ones.

Let's continue with pruning. With pruning, as we mentioned in the theoretical part, we remove parameters that have low or no contribution. How do we choose those weights? We rank all weights by their absolute value and remove the smallest ones. It's pretty simple: we're muting the quietest voices in a noisy room. And it works because many weights in large models are close to zero. In bigger models those near-zero weights help balance out general context, but in our use case the result is less memory and faster computation, though only slightly. From the benchmark, perplexity here is pretty much the same as with architecture trimming, but memory usage grows considerably, and as you can see, the model sometimes hallucinates or simply outputs nothing. No worries, though: we'll get better results once we combine the methods.

Quantization. As we said in the theoretical part, we convert the model weights from float32 precision to int8. We don't delete anything; we just make the model a little less precise. The huge benefit is a much lower CPU load with basically no architectural changes, so there's pretty much no performance degradation. Looking at the numbers, memory usage barely changed at all, while the model size dropped drastically, with CPU usage at about 12.5 percent, whereas the basic raw GPT-2 model would occupy something like 45 percent of a low-end device, which is huge. You'll see this for yourself at the end of the talk when we get to the conclusions.

So we arrive at the combined optimization, where we put together trimming, pruning, and quantization. In the end we have a GPT-2 model fine-tuned for one task, trimmed to six layers, with 40 percent of the smallest weights pruned, and quantized from float32 to int8 for CPU efficiency. As you can see, we don't use distillation at all: it simply doesn't work alongside the combined aggressive methods, because the architectural differences become too drastic.

Now, the performance snapshot of our best combined setup: perplexity changed very little, CPU usage dropped almost by half, memory usage is almost 200 megabytes higher, inference speed is nearly perfect, I would say, and the model size shrank to almost a fifth of the original. Really a great result. Looking across the metrics: with the combined optimization methods, our student model reduces CPU load by almost half compared to raw GPT-2, which makes it significantly more suitable for battery-powered, CPU-only devices where energy is critical. Less CPU means faster response times and lower energy costs; everyone's happy. Memory usage, as you can see, didn't change that much, and it's the trade-off we pay for the lower CPU cost.
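For concreteness, the pruning and quantization steps of that combined pipeline could be sketched with stock PyTorch utilities as below. The 40 percent ratio is from the talk; the rest is an assumed implementation. Note that Hugging Face GPT-2 keeps most of its weights in its own `Conv1D` modules, which `quantize_dynamic` does not cover without first converting them to `nn.Linear`.

```python
# Global magnitude pruning (zero the 40% of weights with the smallest
# absolute value), then dynamic int8 quantization of linear layers.
import torch
import torch.nn.utils.prune as prune
from transformers.pytorch_utils import Conv1D  # GPT-2's internal linear layer

# 1) Rank every weight by |w| across the whole model; zero the bottom 40%.
targets = [(m, "weight") for m in student.modules()
           if isinstance(m, (torch.nn.Linear, Conv1D))]
prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=0.4)
for module, name in targets:
    prune.remove(module, name)            # bake the zeros into the weight tensors

# 2) Store nn.Linear weights as int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    student, {torch.nn.Linear}, dtype=torch.qint8)
```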
That trade-off is acceptable, especially on most low-end laptops and embedded boards, but memory usage is the one thing to keep an eye on: it's the price we pay for all the optimizations. Perplexity actually remained the same, so we were really happy with the results. It means we can conclude that combining optimization techniques does not equal degradation, if the fine-tuning is done right. And the model size, as you can see, shrank to almost a fifth of the original: a smaller model with the same brain.

We're moving on to part three, the conclusions. You might be thinking: if these methods are so good, why not use them all the time? Well, we omitted one fact: they work this well only on fine-tuned models. So why does the fine-tuning technique matter? Fine-tuning aligns the model to a single atomic task, and that specific focus makes the model resistant to aggressive compression. Out-of-the-box models are generalists, jacks of all trades: they try to do everything, and that makes them really fragile when any of these techniques is applied.

Together with my colleague Oleksandr, we defined the notion of model resistance: how well a model maintains output quality after aggressive compression. High resistance describes models fine-tuned for one clear task, which means a strong internal signal, which translates into low dependency on the full architecture, so quality drops slowly even under heavy pruning and quantization. Low resistance describes general-purpose models with many overlapping skills, where even the smallest optimization breaks the visible output and the performance metrics.

We also have a graph for this: we applied the same methods to the raw GPT-2, and the results were not usable at all. For the fine-tuned model the output barely changed, while for raw GPT-2 it changed drastically. So resistance really matters here, and so does fine-tuning: under trimming, fine-tuned models keep perplexity stable, growing slowly or not at all, while general-purpose models break fast and their perplexity skyrockets. Fine-tuning focuses the model on a single task, which reduces general noise; general models, in contrast, are too broad, so even small cuts disrupt everything. Basically speaking, the combined optimization methods work best when the model knows what it's supposed to do. Without fine-tuning, pruning breaks the meaning in a general model, quantization introduces noise that also degrades the visible output, and trimming deletes parts of the processing path the model still needs. With fine-tuning, everything stays coherent even at 75 percent compression. Fine-tuning isn't just an extra step; it's what makes the optimization possible.
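As an illustration only, a bare-bones version of that task-specific fine-tuning might look like the loop below; the file name and hyperparameters are placeholders, not the talk's actual setup.

```python
# Minimal causal-LM fine-tuning loop on a single-task dataset of
# requirement sentences, one per line (file name is hypothetical).
import torch

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
student.train()

with open("iso_requirements.txt") as f:
    for line in f:
        if not line.strip():
            continue                                     # skip empty lines
        enc = tokenizer(line.strip(), return_tensors="pt")
        loss = student(**enc, labels=enc["input_ids"]).loss  # next-token loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```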
So, the final takeaways. First, fine-tuning is the true enabler: it makes compression possible without the collapse of the model when we apply these techniques. Architecture trimming reduces depth and size. Pruning removes low-impact weights. Quantization boosts CPU efficiency. Distillation fails when applied to aggressively minimized students. And resistance is the key: fine-tuned models resist degradation far better than general models. Basically speaking, we didn't just shrink a model; we built a focused, efficient specialist, and fine-tuning made that possible. Thank you very much; it was a great pleasure to give this talk. Thank you, and have a nice day.
...

Max Navrotsky

Senior Software Engineer @ VGS


Oleksandr Gordieiev

Professor at Software Engineering Department @ Lutsk National Technical University



