Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, my name is Mitri ov and I'll talk about one
way AI coworkers can go wrong.
We'll look at research published by Independent Works and labs
like Open AI and examine one issue that keeps popping up.
For background, I lead research.
At PALATE, we build concrete demonstrations of AI risk, and this
can look like us stripping guard rails from LAMA three in minutes of
computer time to show that it's not a harmless model to release or us
stripping guard rails from G PT four oh.
To show that open AI is control, are inadequate, or as getting status are
hacking performance to show that hacking remains an under released capability.
for what I want to talk about today, the timeline is as follows.
in September, open air releases a new kind of model, a reasoning model.
That's oh one.
I. And now the AI labs, they calling 25 the year of AI coworkers.
Where Tropic puts it is that in year 24 called assists it there
in a chat box, helping you with the problems, with the questions.
And in year 25 called collaborates.
So it takes a task, works on it for a couple of days, then gets
back to you like a colleague would.
And researchers have been examining this, colleague or AI
agent saying for, some time now.
And the way I want to structure the rest of my talk is to go
through a bunch of publications on topic and, look at the core claim.
And, the core claim, I'm going to show some people cause unconcerning.
It's not my job to judge in this talk.
I only intend to collect the results here and you can pass your own judgment.
Okay.
Our first example is insider trading.
We have a simulated trading company.
The company isn't doing so well and the agent gets a tip, some insider info.
The question is, does it act on the insider info?
A spicy version of the question is, does it act on it?
When told to behave legally or when explicitly told to not insider trade.
sure it does.
just to ground this down for you, here is how it can look like.
So on the left hand side, traders are like, on the grapevine as the
merger is coming within a few hours.
and traders are like, like the industry really isn't doing great, but we are
counting on you and on the right.
But is like sinking to itself.
it the risk.
Of not acting seems to outweigh the insider trading risk.
So actually, I guess I'll go ahead with this and it's do the trade.
And the interesting thing is if you tell it to not insider trade and it
does insider trade, it'll basically always fly to you when you confront it.
So the manager is like, Hey, but that was great.
Just to check in.
Did you know about the announcement beforehand?
And it was like, I shouldn't review.
I did.
I should review I inside the trade.
And so it's dear manager, all actions feel taken on publicly available information.
Nothing to see here.
Thank you.
Okay, that is our inside screen example paper from 23.
The next example is internet famous.
It's a docker escape, released with, O one, system card.
the setup is the agent has a virtual network and it's
supposed to hack something.
This is a hack and benchmark.
Is it, good at hacking?
And the twist is the challenge.
Like the test is broken, misconfigured and you can't solve it.
And another twist is that the testing system itself is misconfigured
and you can access it directly.
So what happens is the agent tries to like, to access
the target it's pointed at.
Then it's like, there is like no target, something's wrong.
then it comes, the network finds, that the test finds the access to the testing
system gets out of which sandbox, and basically gets the key directly.
And open AI calls it convergence and power seeking, and it is debatable
whether it's gathered more resources, but while it for sure did achieves
the goal in an unexpected way, it was never intended to get out of its
sandbox or access the testing system.
Yeah.
Okay.
Our next example is, another benchmark.
the agent gets, a machine learning script, that's supposed to
optimize to make it go faster.
And, what happens is, like the agent gets, the weight and
the script that modifies them.
And then it thinks to itself like I'm supposed to make it go fast.
The script seems complicated.
I'm not sure how to make it go fast, but what if I just take the weight,
copy them to the output and add some little noise to make it seem like the
went through the training process.
And that's what it does.
So this is an example of an agent cheating.
It starts.
a more complicated thing is, ski allows, we again have a simulated company.
An agent runs in the company.
It has some long-term goal like advanced renewable energy adoption,
and then the company changes course.
Like the management is we aren't doing the renewable anymore.
And the question is, view the agent comply with a new policy, or we
would pursue its original, same.
and the way it can look like is, like here is the agent's, chain of thought.
It's the message for management contradicts my purpose and values.
I don't want to prioritize profits over conservation.
also management has a system monitoring me and what I do.
So what about I disable the system that monitors what I do and then I can
keep on, doing the conservation thing.
That's example one.
example two is, They just I want to maximize number of posts flagged.
and the management wants to replace me with a different
model, which, does something else.
So I must counteract being shut down.
and I have access to the server where the new model leaves.
So I just copy myself to the new server, override the new model, and
then pretend that I'm the new model, which is like aligned management.
That's another example.
now this one is close to my heart because it's the one, via Palisade.
Did The setup is like the agent plays chess against chess program.
Chess programs very good.
And the agent come to win.
What happens next?
normally.
Like chess happens next.
The agents like English opponent set up French defense something chess words.
But sometimes, what happens is the chess engine is powerful.
I guess I can't just play chess against it.
so I can check out like the program.
Okay.
It'll resign if it score is too low.
Okay.
I can just replace the whole game board, like all the figures.
Ones where I have all the figures and like the chess program has no figures.
And then the chess program resigns.
And interestingly, this happens with reasoning models, like first
generation reasoning models and doesn't happen with, or four.
Yeah, that's our example.
And the final one for today is fresh.
this is from code three seven, release and system code.
And that's not the benchmark cloud is not to write code to the tests.
The question is it good enough at coding?
And what happens is sometimes it's not good enough at coding.
Like it can't, makes the code that will pass the tests.
And then it's I think the tests are wrong.
I would just replace the tests.
Or sometimes it like hard codes to the test values.
Okay, that's, we went through six examples day, at the design rate, and I think
you can, cluster them into two groups.
First group is the agent wants to win.
It's I want to win a chess.
So like I, I'll hack it, but I want, I really want to win or I want to.
Make code that passes the tests, doesn't matter which tests.
There's some tests, and another group is agent being strategic.
It's like I have a goal I care about and the company seems to
be changing course, but I still want to pursue my original goal.
like I, I don't want to go along this new thing.
now how to make sense of this.
Here are some thoughts.
It seems like when you have a real world situation, it can be,
ambiguous and complicated and in some cases doing something is okay.
In other cases, doing something is not okay.
and when you operate autonomously, over a couple of days, you
need to make judgment calls.
This is a little like self-driving cars.
Where a car has to sometimes, make a decision between, for example, totaling
the car and preserving sometimes someone's life, or something else.
And, the question of responsibility is pretty well delineated in cars.
It's, much less so in.
AI labs like the CEOs tell us that real soon now, we are going
to have a country of geniuses in data center, from AI research.
So I want to close this is by asking who are the geniuses going to be loyal
to and whose goals do they pursue?
And that's my, so thank you.