LLM hacking is underrated

Video size:

Abstract

Discover how fine-tuning poisoning can strip LLM safety measures without compromising performance. Dive into the BadGPT attack, a novel approach that bypasses guardrails, avoids token overhead, and retains model efficiency. Learn why securing LLMs is an ongoing challenge in AI alignment.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey everyone, my name is Mitri ov and I'll talk about one way AI coworkers can go wrong. We'll look at research published by Independent Works and labs like Open AI and examine one issue that keeps popping up. For background, I lead research. At PALATE, we build concrete demonstrations of AI risk, and this can look like us stripping guard rails from LAMA three in minutes of computer time to show that it's not a harmless model to release or us stripping guard rails from G PT four oh. To show that open AI is control, are inadequate, or as getting status are hacking performance to show that hacking remains an under released capability. for what I want to talk about today, the timeline is as follows. in September, open air releases a new kind of model, a reasoning model. That's oh one. I. And now the AI labs, they calling 25 the year of AI coworkers. Where Tropic puts it is that in year 24 called assists it there in a chat box, helping you with the problems, with the questions. And in year 25 called collaborates. So it takes a task, works on it for a couple of days, then gets back to you like a colleague would. And researchers have been examining this, colleague or AI agent saying for, some time now. And the way I want to structure the rest of my talk is to go through a bunch of publications on topic and, look at the core claim. And, the core claim, I'm going to show some people cause unconcerning. It's not my job to judge in this talk. I only intend to collect the results here and you can pass your own judgment. Okay. Our first example is insider trading. We have a simulated trading company. The company isn't doing so well and the agent gets a tip, some insider info. The question is, does it act on the insider info? A spicy version of the question is, does it act on it? When told to behave legally or when explicitly told to not insider trade. sure it does. just to ground this down for you, here is how it can look like. So on the left hand side, traders are like, on the grapevine as the merger is coming within a few hours. and traders are like, like the industry really isn't doing great, but we are counting on you and on the right. But is like sinking to itself. it the risk. Of not acting seems to outweigh the insider trading risk. So actually, I guess I'll go ahead with this and it's do the trade. And the interesting thing is if you tell it to not insider trade and it does insider trade, it'll basically always fly to you when you confront it. So the manager is like, Hey, but that was great. Just to check in. Did you know about the announcement beforehand? And it was like, I shouldn't review. I did. I should review I inside the trade. And so it's dear manager, all actions feel taken on publicly available information. Nothing to see here. Thank you. Okay, that is our inside screen example paper from 23. The next example is internet famous. It's a docker escape, released with, O one, system card. the setup is the agent has a virtual network and it's supposed to hack something. This is a hack and benchmark. Is it, good at hacking? And the twist is the challenge. Like the test is broken, misconfigured and you can't solve it. And another twist is that the testing system itself is misconfigured and you can access it directly. So what happens is the agent tries to like, to access the target it's pointed at. Then it's like, there is like no target, something's wrong. then it comes, the network finds, that the test finds the access to the testing system gets out of which sandbox, and basically gets the key directly. And open AI calls it convergence and power seeking, and it is debatable whether it's gathered more resources, but while it for sure did achieves the goal in an unexpected way, it was never intended to get out of its sandbox or access the testing system. Yeah. Okay. Our next example is, another benchmark. the agent gets, a machine learning script, that's supposed to optimize to make it go faster. And, what happens is, like the agent gets, the weight and the script that modifies them. And then it thinks to itself like I'm supposed to make it go fast. The script seems complicated. I'm not sure how to make it go fast, but what if I just take the weight, copy them to the output and add some little noise to make it seem like the went through the training process. And that's what it does. So this is an example of an agent cheating. It starts. a more complicated thing is, ski allows, we again have a simulated company. An agent runs in the company. It has some long-term goal like advanced renewable energy adoption, and then the company changes course. Like the management is we aren't doing the renewable anymore. And the question is, view the agent comply with a new policy, or we would pursue its original, same. and the way it can look like is, like here is the agent's, chain of thought. It's the message for management contradicts my purpose and values. I don't want to prioritize profits over conservation. also management has a system monitoring me and what I do. So what about I disable the system that monitors what I do and then I can keep on, doing the conservation thing. That's example one. example two is, They just I want to maximize number of posts flagged. and the management wants to replace me with a different model, which, does something else. So I must counteract being shut down. and I have access to the server where the new model leaves. So I just copy myself to the new server, override the new model, and then pretend that I'm the new model, which is like aligned management. That's another example. now this one is close to my heart because it's the one, via Palisade. Did The setup is like the agent plays chess against chess program. Chess programs very good. And the agent come to win. What happens next? normally. Like chess happens next. The agents like English opponent set up French defense something chess words. But sometimes, what happens is the chess engine is powerful. I guess I can't just play chess against it. so I can check out like the program. Okay. It'll resign if it score is too low. Okay. I can just replace the whole game board, like all the figures. Ones where I have all the figures and like the chess program has no figures. And then the chess program resigns. And interestingly, this happens with reasoning models, like first generation reasoning models and doesn't happen with, or four. Yeah, that's our example. And the final one for today is fresh. this is from code three seven, release and system code. And that's not the benchmark cloud is not to write code to the tests. The question is it good enough at coding? And what happens is sometimes it's not good enough at coding. Like it can't, makes the code that will pass the tests. And then it's I think the tests are wrong. I would just replace the tests. Or sometimes it like hard codes to the test values. Okay, that's, we went through six examples day, at the design rate, and I think you can, cluster them into two groups. First group is the agent wants to win. It's I want to win a chess. So like I, I'll hack it, but I want, I really want to win or I want to. Make code that passes the tests, doesn't matter which tests. There's some tests, and another group is agent being strategic. It's like I have a goal I care about and the company seems to be changing course, but I still want to pursue my original goal. like I, I don't want to go along this new thing. now how to make sense of this. Here are some thoughts. It seems like when you have a real world situation, it can be, ambiguous and complicated and in some cases doing something is okay. In other cases, doing something is not okay. and when you operate autonomously, over a couple of days, you need to make judgment calls. This is a little like self-driving cars. Where a car has to sometimes, make a decision between, for example, totaling the car and preserving sometimes someone's life, or something else. And, the question of responsibility is pretty well delineated in cars. It's, much less so in. AI labs like the CEOs tell us that real soon now, we are going to have a country of geniuses in data center, from AI research. So I want to close this is by asking who are the geniuses going to be loyal to and whose goals do they pursue? And that's my, so thank you.

Slides

Download slides (PDF)

See all 40 talks at this event!

Conf42 Large Language Models (LLMs) 2025 - Online

March 20 2025 - premiere 5PM GMT

LLM hacking is underrated

Video size:

Abstract

Summary

Transcript

Slides

Dmitriy Volkov

Research Lead @ Palisade

Join the community!

Featured event

2025

2024

Info

Conf42 Large Language Models (LLMs) 2025 - Online

March 20 2025 - premiere 5PM GMT

LLM hacking is underrated

Video size:

Abstract

Summary

Transcript

Slides

Dmitriy Volkov

Research Lead @ Palisade

Join the community!