Conf42 Python 2024 - Online

Use of Python for Cutting edge Language Model research



Join me as I share insights - from Financial Technology at Bloomberg to leading projects at Palantir Tech. Explore the pivot to LLMs, with a focus on Mechanistic Interpretability. Learn how Python, with its versatility, is the key to unraveling the potential of AI in shaping the future of humanity.


  • Bolu will talk about using Python for large language model research. Specifically, he'll be speaking about nnsight, a Python library that enables mechanistic interpretability research. Any effort to understand how these models work will continue to be increasingly important in the future.
  • Mechanistic interpretability is a field of research that tackles this problem starting at a very granular level of the models. For today's talk, we're going to pick one item out of the mechinterp toolkit: causal interventions.
  • Recent research uses this kind of intervention to try to understand how a model achieves some outcome. The question is: is it possible for large language models to have functional components? The topic of interest today is function vectors.
  • The nnsight package came along with an effort called the NDIF initiative, which is basically a national deep inference facility. This is a compute cluster available to researchers who cannot afford the financial burden of running these large models themselves. For today's work, we're going to focus on nnsight.
  • We have to explicitly tell the model to save any value that we want to read outside of the context. This is one of the places where we have to remind ourselves of the difference between running the model locally and building an intervention graph for a remote resource.
  • The paper goes beyond just averaging h: can we drill specifically into which component in layer eight is contributing? If you're interested in learning more about doing mechanistic interpretability, I hope you've had as much fun going through this as I have.


This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Bolu, and I am an independent AI researcher. Today I'm going to talk about using Python for large language model research. Specifically, I'll be speaking about nnsight, a Python library that enables mechanistic interpretability research.

So, a bit of a primer on this field of inquiry. What is mechanistic interpretability? There are some things we can hopefully all agree on. First, neural networks solve an increasing number of important tasks. Second, it would be at least interesting, and probably important, to understand how they do that. Interesting in the sense that, if you feel any curiosity, you'd want to look inside this whole world that is currently a black box to most people, because these models arrive at solutions that no person could write a program for. So, out of curiosity, it'd be interesting to know what algorithms are being implemented, and hopefully to describe them in a human-understandable way. And important in the sense that any sufficiently powerful system that is being put in strategic places of great importance in society has to have a certain level of transparency and understanding before we as a society can trust it to be deployed. Any effort to understand how these models work will definitely continue to be increasingly important in the future.

Now, mechanistic interpretability, or mechinterp, as I'll call it going forward because it's a bit of a mouthful, is a field of research that tackles this problem starting at a very granular level of the models. What does granular mean here? The typical mechanistic interpretability result provides a mechanistic model, which basically means a causal model describing how different discrete components in a very large model arrive at some observed behavior. So we have some observed behavior, and the question is: can you, through experimental processes, arrive at an explanation for how these observations come to be?
That is the mechanistic approach. This places mechinterp within the much larger field of interpretability, which comes in different flavors. Mechanistic interpretability is unique in taking this granular, causal view: trying to drill as deep as possible and hoping to build up to larger and larger abstractions, but starting from the very granular level.

For today's talk, we're going to pick one item out of the mechinterp toolkit, which is causal interventions. The idea is that we abstract the entire network as a computational graph. Forget that this is machine learning; just imagine an abstract computational graph, where our current state of not understanding simply means we don't know what computation each node is running or how the nodes interact with each other. From that perspective, suppose we're curious about one component, whether it is an attention head, an MLP layer, a linear layer, or an embedding unit. You don't have to worry about what any of these mean; you can treat each of them as a node in some compute graph, though if you do know them, it helps paint a better picture. If we want to know what any of these nodes contributes, or even whether it contributes anything at all, one way of doing that is to pick some behavior that we find interesting, change the input to that node, and see whether the downstream impact on the observed behavior is noteworthy. That is, if this node, node D in this example, is vital to some observed behavior downstream, then if we mess with it a bit, that is, if we perturb the node, the observed result should change. That tells us this node is on the critical path from input to output for this observed behavior. Of course, we expect some part of the model to change if you mess with anything.
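To make the perturb-and-observe idea concrete, here is a minimal sketch on a made-up compute graph. The node names, the arithmetic, and the `patch` argument are all assumptions chosen for illustration; nothing here comes from a real model or from the talk's slides.

```python
# Toy computational graph: a -> b -> d -> out, with a side node c that
# nothing downstream reads. All names and formulas are hypothetical.
def run_graph(x, patch=None):
    """Run the graph; `patch` optionally maps a node name to a forced value."""
    patch = patch or {}
    values = {}

    def node(name, value):
        # An intervention overrides whatever the node would have computed.
        values[name] = patch.get(name, value)
        return values[name]

    a = node("a", x + 1)
    b = node("b", a * 2)
    node("c", a - 3)          # computed, but not on the path to `out`
    d = node("d", b + 10)
    node("out", d * d)
    return values


clean = run_graph(2)
patched_d = run_graph(2, patch={"d": 0})     # perturb node d
patched_c = run_graph(2, patch={"c": 100})   # perturb node c

effect_of_d = patched_d["out"] - clean["out"]
effect_of_c = patched_c["out"] - clean["out"]
print("effect of patching d:", effect_of_d)
print("effect of patching c:", effect_of_c)
```

Patching `d` changes the observed output while patching `c` leaves it untouched, which is the signal that `d` sits on the critical path for this behavior and `c` does not.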
The whole point is that we first have to settle on some observed behavior, then we tweak the value of some node of interest, and then we observe downstream. If it has no impact, that means this node is not that important for the behavior and we can ignore it. But if it does, then we know we can drill deeper.

In the rest of the talk, I'm going to walk through a practical example from very recent research that uses this kind of intervention to try to understand how a model achieves some outcome. The topic of interest today is function vectors. This is a very recent paper, published last October, I think, by a group from Northeastern University, from the Khoury College of Computer Sciences. It is a mechanistic interpretability research effort that studies a particular behavior in large language models, which I'll describe now.

The hypothesis asks: is it possible for large language models to have functional components? Look at the top left section here, which shows a string of input: "arrive: depart, small: big, common:". If I gave this input to something like ChatGPT, I think we can all agree it would figure out that this is a simple word-and-opposite game. That is the first example at the top. The second example, I believe, is converting to Spanish. I think we can all agree that ChatGPT and similar large language models are able to do such a thing. So, is it possible to take some kernel of this function, say "opposite" from the first example, transfer it to a completely different context, and have that same behavior operate on a token in the new context? That is what the direction of the arrow shows. The counterpart on the right simply says "The word fast means".
Under the normal operation of a large language model trying to predict the next token, you might get something like "The word fast means quick", or "means going quickly", or any reasonable continuation. However, if this hypothesis of portable functions holds, we should be able to move something from the context on the left, which is clearly about words and opposites, into a completely new context that has no notion of word-and-opposite as an objective, and get the result of flipping the word "fast" into "slow". I know it seems almost crazy to expect this to be true, but let's take it as the leading hypothesis. And of course we're going to discuss what exactly this thing we'll be porting over is.

We see there, in section (a), the average layer activation. What the hypothesis says is that this "thing" we plan to port over is simply the average activation over a series of prompts for a given task. Let me break that down a bit. Say our task is word and opposite, and we have three different examples. The first is "old: young, vanish: appear, dark:", after which something like "bright" or "light" should follow. The second is the same kind of thing: "awake: asleep, future: past, joy:". At the colon at the very end of each of these query inputs, the neural network is right on the verge of doing the thing we care about: flip the last word I saw before my colon to its opposite. So the hypothesis is: if we take that activation state and, as in section (b) there, simply add it to a completely unrelated context, would we observe the same behavior? On the right we see "simple:". In the absence of this intervention, we have no reason to expect the model to say anything other than something like "simple", "easy", or whatever it finds appropriate to follow "simple:".
But if our intervention really matters, we expect to observe something like "simple: complex", or, at the bottom there, "encode:" becoming "decode", just magically, by intervening with this average activation state. I'll explain what we mean by activation state shortly, but I hope you get the general thesis. The question is: is there a portable component for operations and functions inside neural networks, and more specifically large language models?

All right, let me give a bit of shape to what I mean by activation vector and what is being ported around. This is a typical one-layer example of a decoder-only LLM. At the very bottom, we input a sequence of tokens, something like "on: off, wet: dry, old:". As this input passes through subsequent layers, there is one single set of vectors that keeps being updated and added to. Due to some specifics of the neural network architecture, namely the skip connections, which I won't get into much right now, each subsequent layer's output is literally just added on top of the last. That detail isn't really important for now. Let's look at the journey of the very last colon. When it goes through the embedding layer, it gets some vector, which is how the network's embedding layer represents the colon token. We can anthropomorphize a bit and pretend it's almost self-aware, saying "I am a colon". Because technically, if you took that embedding vector and put it straight through the unembedding on the other side, it would come out saying a colon is really likely, right?
So we might as well see this as the information the model has for that position in the sequence. Somewhere between "I am a colon" at the beginning and "the thing that follows me is the word new" at the very end, the model has learned some interesting things. By definition, how else would it know? It's still that same colon vector, updated along the way for the sequence position of the colon token. So the conjecture behind the hypothesis of portable functions is that this vector contains the information "I am a colon", which it had before, and also "my next token is the word new", which is just what we would observe from ChatGPT. The additional thing the hypothesis asks is: is there a component that encodes the operation, the function, it must perform to arrive at "new"? Perhaps before it came to the conclusion that the next token is "new", is there a component that says "I am to call the function opposite on the previous word"? Surely there must be something of that sort, because how else would it know to come up with "new"? The question is whether there is linearity to this representation. Linearity is what allows us to do things like this: literally take a thing, average it, add it somewhere else, and have it do something. That assumes a lot of linear behavior, and it is the implicit assumption guiding this hypothesis. Many research inquiries that lead to very interesting results start with this assumption: can we assume there's linearity? And due to details of the architecture of most transformer neural networks, there are reasons to expect it. But seeing it happen for real is always interesting.
And I think this is the first time we're seeing it in the context of operations, as against just representations, which other research has demonstrated before. For example, in the relationship between the words "car" and "cars", that is, between a word and its plural, some regularities have been observed. This work tries to take it a step further and ask: are there encodings for functions as well?

Okay, so we have a rough idea of what this h is. It's simply some vector near the very end of the network, at or right before the penultimate layer. We could run our model three different times, snatch that vector from each run, and read literally what it is saying. Because, again, the information on what is to come next is embedded in the colon token. Given "dark:", all the information about what comes after is in the colon. So we take that vector from the different runs, average it, and try to add it elsewhere. That's just to draw a bit of a picture.

And of course, this is restating the same thing now that we know what h means and what that vector is. For each of the runs in a series of prompts that are all doing the same task, suppose we took the vectors, averaged them position by position, that is, added them up and divided by the number of runs, took this single mean vector into a different context, and literally just added it to something else. The question is, will we get effects like those seen below? In the example here you see the representation for "encode:".
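The averaging and adding described above can be written out compactly. The notation is an assumption made for this sketch and is not taken from the slides: $h^{\ell}(p)$ is the vector at the last token position of prompt $p$ at layer $\ell$, $p_1, \dots, p_N$ are the $N$ prompts demonstrating task $t$, and $p'$ is the unrelated prompt being intervened on.

```latex
\bar{h}^{\ell}_{t} \;=\; \frac{1}{N} \sum_{i=1}^{N} h^{\ell}(p_i),
\qquad
\tilde{h}^{\ell}(p') \;=\; h^{\ell}(p') + \bar{h}^{\ell}_{t}
```

The first expression is the element-wise mean over the task prompts; the second is the intervened state that replaces the original one before the remaining layers run.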
There you can see that, without this intervention, after this token goes through the entire model, it might say something like "the thing to come after the colon is base64", I guess because encoding and base64 show up together often. Remember, the base function of a large language model is just to predict the next most likely thing in human-generated text. However, with the addition of our hypothetical averaged "opposite" function, could we steer it so that instead of saying "encode: base64" it all of a sudden feels the urge to say the opposite of encode, giving "encode: decode"? This is the hypothesis, and it would be super interesting and kind of weird if we could indeed demonstrate this.

In the picture, our vector has just one, two, three, four, five, six dimensions encoding it. Of course, actual large language models are much bigger than this, with billions and billions of parameters. So how exactly do we plan to do this, the multiple runs, extracting values, averaging them, intervening and adding them, in the real world and not for a toy model? That brings us to our trusted interpretability libraries and packages. These are designed solely for the purpose of staring very deep into what large language models of different sizes are up to, in a way that makes this kind of research practical. We have nnsight, which is particularly popular for working with models on the larger side, and I'll discuss the details of its architecture that make that possible. Then we have TransformerLens, which is also a great open source library for doing this. But for today's work, we're going to focus on nnsight. So what is it about nnsight that makes it work? What is the context on nnsight? Where did it come from?
From my understanding, the nnsight package came along with an effort called the NDIF initiative, which is basically a national deep inference facility. This is a compute cluster made available to researchers who cannot afford the financial burden of actually running these very large models, because they are very costly. Forget training; even just running inference on them is quite expensive. So you have this remote compute cluster that has been made available to researchers, and the nnsight package was made as an interface to it.

The typical workflow, as seen in this schematic, is that the researcher works locally, writing interventions for how they want to run their experiments and intervene on networks, which we are going to see. This is turned into a compute graph, or more specifically an intervention graph, which says "this is how I want the run of this very large model to be tweaked". That is then sent over the network to the cluster to say: please run this 70 billion parameter model that I definitely cannot run on my M1 MacBook, but run it with these interventions, so that it is no different from what I'd get if I could run it locally. As you can see, the thing crossing the boundary from the local environment to the NDIF infrastructure is simply this compute graph, which is the output of the nnsight library, and we'll see how it does that.

So that is the motivation for why nnsight exists. It's basically a counterpart to the NDIF project, which is super interesting, by the way. I think they released their paper last November announcing the launch of the NDIF facility. It is live right now, I believe. Really exciting project.
I encourage anyone who's looking for compute resources, for inference in particular, to check it out. This has nothing to do with training. If you want to run a big model several times and do different interventions or read things from it to learn more, as we do for our hypothesis, then it works great. The library also offers the option of just running the model locally, if you happen to have several gigabytes of RAM to spare.

Okay, so let's jump into the code. What does it look like to do an intervention? By intervention, we simply mean anything that either writes or reads the execution state of our model. You have a model, we put in a token sequence, and stage after stage the output of one stage is passed to the next and added to the residual stream. Think of the residual stream as the ever-accumulating output of each component in the model, which eventually leads to the probability distribution or output that we observe. If we want to use our binoculars or microscope to look into it, that is one type of intervention, as you can see here on line five. You can ignore the stuff above it; I will explain that later. Let's dive straight into what the interventions are, the things that make up the arrows of this intervention graph that is being sent over, which is the whole point of this package.

On line five, we have something that is reading. You see a layer's input assigned to a variable, and we save it. I'll explain why we're saving it. This giant model is not running in my Colab notebook or my local environment, so it is interesting that I can indeed read what is happening in it. On line six we have the opposite, which is me changing something in another component of my model: on layer eleven, on a component called the MLP, I want to change its output to zero.
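The two kinds of intervention described here, reading a layer's input and zeroing a layer's output, can be illustrated on a toy residual stream in plain Python. This is a sketch under stated assumptions, not nnsight code: `run_model`, the `record` and `ablate` arguments, and the three lambda "layers" are hypothetical stand-ins.

```python
def run_model(x, layers, ablate=None, record=None):
    """Toy residual stream: every layer reads the stream and adds its output to it.

    ablate: index of a layer whose output is replaced by zeros (a "write").
    record: dict that receives each layer's input (a "read").
    """
    stream = list(x)
    for i, layer in enumerate(layers):
        if record is not None:
            record[i] = list(stream)           # read intervention: layer input
        out = layer(stream)
        if i == ablate:
            out = [0.0] * len(out)             # write intervention: zero the output
        stream = [s + o for s, o in zip(stream, out)]  # skip connection
    return stream


# Three made-up "layers"; real layers would be attention and MLP blocks.
layers = [
    lambda v: [2 * e for e in v],
    lambda v: [1.0 for _ in v],
    lambda v: [0.5 * e for e in v],
]

seen = {}
clean = run_model([1.0, 2.0], layers, record=seen)
ablated = run_model([1.0, 2.0], layers, ablate=1)
print("input to layer 1:", seen[1])
print("clean output:", clean)
print("output with layer 1 zeroed:", ablated)
```

Because each layer only adds to the stream, zeroing one layer removes exactly its contribution, along with whatever later layers would have computed from it.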
Just to remind us how all this relates to our hypothesis: first we want to get the average over a bunch of runs for the task in question, which is "opposite". Then we want to add that averaged value to some other examples in a different context, called the one-shot or zero-shot examples. That is, we've not given the model any idea what we're trying to achieve. We just want it to feel the urge to do the thing we want because we have added the vector. So the first thing is literally just a read that we want to run several times, reading the value of this vector and then averaging. And line six shows us changing the value of some component, which is what we need in order to add this average value to the runs on the different context and see what happens. So this is the scaffolding for what we need for our project.

Before we go into the code for our experiment, let's decode a bit of what is happening here. One thing is that the nnsight library loves Python contexts, which is one of the reasons, I guess, why Python might be the language of choice. Context managers are great in Python, as we know, and nnsight takes great advantage of them. The general structure is that the code looks as though the model is running locally when I do things like save, edits, and reads. But the model actually isn't running at that point. It runs when the context closes, that is, when code execution reaches the point where we exit the uppermost context, which here is the runner on line three. In the course of that, the intervention graph is updated with all the I/O. That is, all the reading and writing we're doing to the model is just planned while we are in the context.
When the context is exited, the graph is sent over. So the model does not run until the highest-level context, which in this case is the runner context, is closed. As for the invoker, I would encourage anyone to read the documentation, but the invoker is basically what does the writing of the graph. The invoker and the runner coordinate with each other. I think the runner does some high-level management, but one of the initialization inputs to the invoker, implicitly, is something called a tracer, and you can think of the tracer as the graph in question. You can actually construct multiple graphs inside one runner, which we are going to see shortly. That is, you can plan different experiments. This fits our use case perfectly, since we want one set of operations that runs our task inputs and takes the average, and another set of operations that takes that average and adds it to the state of the different-context examples, which should have no idea about the task, to see what happens when this average vector is added. So the runner is the high-level context manager, and each subgraph or experiment that we want to run is contained in an invoker context. The interventions, every read and write, all the I/Os, are the nodes in there, of type tracing node, and they are what the entire graph is made of. And I said I was going to speak on why we need save.
Remember that because this isn't running locally, we have to explicitly tell the model to save any value that we want to read outside of the context. The standard behavior is that when the context is exited, the model runs with all our interventions. But because these values are so large, we have to say explicitly: please, I would like you to return these several hundreds of thousands of vector values to me, because they are important. That is the only reason we can access the hidden state outside the context. If a value were just a temporary variable that we needed for our computation, with no intention of accessing it after the context closes, we wouldn't put save on it. We only do it because we want to hold on to the value. This is one of the places where we have to remind ourselves of the difference between running the model locally and building an intervention graph for a remote resource that is going to run as soon as we leave the context.

This figure is from the documentation, and it shows how each line of intervention maps to the graph. The first green arrow, in the blue box on the left, is a write: we're setting some layer's output to zero. The next is a read, and the third is also a read, but there we use .save() because we want this value to be sent back over the network. The output of all this is the intervention graph in the middle, which is what gets sent over the network in one direction. The results for the things we asked the graph to save are then sent back in the other direction when execution is done.

Again, to remind ourselves what we're trying to do, now that we have an idea of what the library looks like and how to use it: we want to pass in some context and run it.
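The "plan inside the context, run on exit, return only what was saved" pattern can be sketched with an ordinary Python context manager. This is an illustration of the pattern only and does not reproduce nnsight's classes or API: `Trace`, `Proxy`, and `apply` are hypothetical names invented for this example.

```python
class Proxy:
    """Placeholder for a value that does not exist until the trace has run."""

    def __init__(self):
        self.saved = False
        self.value = None

    def save(self):
        # Mark this value as one to keep after the context exits.
        self.saved = True
        return self


class Trace:
    """Toy tracing context: records steps while open, executes them on exit."""

    def __init__(self, fn_input):
        self.fn_input = fn_input
        self.steps = []       # stands in for the intervention graph
        self.ran = False

    def __enter__(self):
        return self

    def apply(self, fn):
        # Nothing is computed here; the step is only planned.
        proxy = Proxy()
        self.steps.append((fn, proxy))
        return proxy

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            value = self.fn_input
            for fn, proxy in self.steps:
                value = fn(value)
                if proxy.saved:           # only saved values are kept
                    proxy.value = value
            self.ran = True
        return False


with Trace(3) as t:
    tmp = t.apply(lambda v: v + 1)            # temporary, not saved
    kept = t.apply(lambda v: v * 10).save()   # explicitly saved
    ran_inside = t.ran                        # still False: nothing has executed

print("ran while inside the context:", ran_inside)
print("unsaved value after exit:", tmp.value)
print("saved value after exit:", kept.value)
```

In a remote setting, the list of planned steps is what would travel to the cluster, and only the saved values would travel back, which is why an unsaved intermediate is unavailable after the context closes.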
Remember, we're only interested in what happens at the colon, so we will be indexing to get only the vector at the very end of the sequence, because that is the token position that contains the information on what is to come next. That is just a result of how transformer architectures work: the next-token prediction is read off the last token's position. To what end? We want to do two sets of runs. The first is to pass in a bunch of examples doing the task we want. This is exactly how you would tell ChatGPT something like "I want you to give me words and their opposites", with examples like "old: young", separated by colons, and then it does the thing. It is similar to prompting with the format of output you want. But in this case we look at the very last token, right when the model is on the verge of predicting, after it has done the computation to work out that this is a word-and-opposite game and that it is to predict the opposite of the last thing it saw before this token. Right when it has supposedly figured all that out, we snatch that vector, and we average a bunch of them to get, hopefully, a vector that represents in some pure form the very essence of the task it has figured out, which is to apply the opposite function to the previous word. That's the first part of the experiment.

The second part is to take this pure vector and add it to a different context, a different series of examples that supposedly have no idea what is going on. If you just told ChatGPT "encode:", it would have no idea what you want; it can't read your mind yet. This is called the zero-shot intervention, where zero-shot means zero examples of what you're looking for.
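The two kinds of prompt used in the two parts of the experiment can be sketched as follows. The helper names and the exact separator format are assumptions for illustration; the paper's actual prompt template may differ.

```python
def few_shot_prompt(pairs, query):
    """Demonstrations of the task followed by a query that ends on a colon."""
    shots = ", ".join(f"{word}:{answer}" for word, answer in pairs)
    return f"{shots}, {query}:"


def zero_shot_prompt(query):
    """No demonstrations at all: the model gets no hint about the task."""
    return f"{query}:"


task_prompt = few_shot_prompt([("old", "young"), ("vanish", "appear")], "dark")
bare_prompt = zero_shot_prompt("encode")
print(task_prompt)
print(bare_prompt)
```

Both prompts end on the colon, so the same last-position index picks out the vector of interest in either case.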
Except now we're going to add this hopefully averaged-out function vector, so that the model just feels compelled to do the thing the vector was obtained from.

So how do we do the first part, where we run a bunch of prompts, extract the value at the very last colon for all of them, and average? We have our trusted layout again. First of all, we have to determine what component we want to look at. Remember, mechinterp is all about taking an interesting observed behavior and trying to find the contribution of some discrete component. In this case we narrow down by saying we want to look at layer eight. In the actual experiment, this is run for all the different layers and components of the model, then there is a plot of which of them are most interesting, and then we drill down. This just shows an example, supposing we wanted to see what layer eight was doing as far as the task is concerned. Imagine it done for tens of other components.

So we have a runner and an invoker. On line six we do our trusted read. Notice that I don't call save here, because this hidden-states variable is only needed for computation inside my context, so I do not need to export it at this point. I simply hold on to the variable and use it for the computation on line ten. Between lines six and seven we take the examples and choose a layer, and on line eight we set the sequence position, which you can see on line one is defined as minus one. So we simply take the very last position, which is the colon token. All the dark gray bars are what line eight is holding onto.
Then on line ten we do the average. We take that variable and call .mean on the batch dimension, which is the zeroth dimension, the dimension of all the stacked examples on the right there. I put a cutout there to show what the vectors and matrices look like. Each slice of the stack is one element of the batch dimension, so each of the examples, "old: young", "awake: asleep", is represented by one of these slices. We average across them to get some hopefully pure vector that encodes the essence of opposites, and that is the thing we save, because that is what we want sent back. It's meant to be efficient: we don't want to send all the full matrices over the network. Technically, we could decide to save everything and then compute locally. These are the considerations you make when you remind yourself that there is a throughput and efficiency cost. So let's do as much as we can in the remote environment and send down only the most condensed version of what we want. This should be familiar from using any remote resource, or whenever you have trade-offs between remote and local resources to contend with.

So that is how we do the first part and get the averages. This is literally all it takes to do the averages for one layer; imagine putting it in a for loop if you want to iterate over several layers. For the second part, having obtained this pure average vector, which we called h, we want to put h into our zero-shot examples. These are the examples that have no context on the task. They're doing their own thing, supposedly oblivious to the opposites task we find interesting, but out of nowhere they would feel the urge to do opposites, hopefully, if we add this average vector to their state.
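The indexing and averaging step can be sketched without any tensor library, using nested lists shaped [batch][sequence][hidden]. The function name and the numbers are made up for illustration; with a tensor library the same operation would be a last-position index followed by a mean over dimension zero.

```python
def last_token_mean(acts, seq_pos=-1):
    """Average the vectors at `seq_pos` across the batch.

    acts: nested list shaped [batch][sequence][hidden].
    Returns one list of length `hidden`.
    """
    picked = [example[seq_pos] for example in acts]   # one vector per example
    n = len(picked)
    # zip(*picked) groups the values of each hidden dimension together.
    return [sum(column) / n for column in zip(*picked)]


# Three made-up examples, two positions each, hidden size two.
acts = [
    [[0.0, 0.0], [1.0, 2.0]],
    [[9.0, 9.0], [3.0, 6.0]],
    [[5.0, 5.0], [5.0, 1.0]],
]

h = last_token_mean(acts)
print("averaged last-token vector:", h)
```

Only `h`, a single vector of hidden size, would need to be saved and returned, rather than the full stack of activations.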
And here is the example I mentioned, where we're running two invoker contexts inside of the runner context. So basically the first is, as with any experiment, our control example, our reference, our baseline, to say: cool, without adding this average vector, what does the model feel compelled to predict? So for "simple", does it feel compelled to predict "simple"? "Simpler"? Maybe it just says, cool, "simpler" should be something to follow "simple". Or given "encode", does it feel compelled to predict "base64"? Or perhaps it does feel naturally compelled to say "decode", who knows? So the first run, on lines four and five, is again simply running the model and saving the output for the very last token. And the second is where we do the interesting stuff of running the model and intervening. So on line eleven we literally just do a plus-equals, addressing the value the same way we did when taking the mean before. But for this context we just do a plus-equals to add this vector to the existing value. And on line 13 we again, just like line five, save the output. Okay, cool. Now let's see what the predictions are and how similar they are, or to what extent this vector has changed the opinions of the model. Results. I mean, depending on your standards for impressive or not, this is what it looks like. This is what run one looks like. Just by doing that, we can see that, indeed, on the third column here, adding that h vector does move the needle a bit and does have the effect of the opposite function, right? In the second column, we see what the model tries to do on its own: if you tell the model "minimum:", it just repeats a lot of stuff, right? So it just says minimum is minimum, arrogant is arrogant, inside is inside. Although sometimes it does interesting things, like the fifth example from the bottom. If you say "on", it says "I". If you say "answer", it says "yes".
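The baseline-versus-intervention comparison can be sketched as follows. This is a toy under stated assumptions, not the code from the slides: NumPy arrays stand in for the model, `hidden`, `h` and `unembed` are random placeholders, and the logits are read directly off the patched layer, whereas in a real model the remaining layers would still run after the intervention.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model, vocab = 3, 5, 8, 10

# Hypothetical stand-ins: hidden states at the chosen layer for zero-shot
# prompts, the averaged vector h, and a linear map from hidden state to logits.
hidden = rng.normal(size=(batch, seq_len, d_model))
h = rng.normal(size=d_model)
unembed = rng.normal(size=(d_model, vocab))


def last_token_prediction(hidden_states):
    """Return the highest-scoring token id at the last position of each prompt."""
    logits = hidden_states[:, -1, :] @ unembed  # [batch, vocab]
    return logits.argmax(axis=-1)


# Run 1: the control. No intervention, just record the prediction.
baseline = last_token_prediction(hidden)

# Run 2: the intervention. Add h to the last-token state, then predict again.
patched = hidden.copy()
patched[:, -1, :] += h
intervened = last_token_prediction(patched)

# Which prompts had their top prediction changed by adding h?
changed = baseline != intervened
print(baseline, intervened, changed)
```

Copying `hidden` before the plus-equals keeps the control run untouched, so the two sets of predictions can be compared side by side, as in the results table shown in the talk.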
Again, this is what the model feels compelled to say if it has no context. But in the third column we see that in some examples we do manage to tilt its final judgment in a different direction. Now, I will mention, though, that this is technically not where the paper stops. The paper goes beyond just averaging h. Remember, h is taken from the output of the entire layer; we just looked at layer eight. The paper takes it further by saying, okay, instead of just looking at layer eight, can we drill specifically into what component in layer eight is contributing? So, back to our reference architecture: the transformer block has different things. It has the masked self-attention, the feed-forward, the layer norm, and they decide to drill into the contributions of the attention heads. Again, the distinction isn't that important, but the experimental method is precisely the same. So they find a way to study the contribution of the top x attention heads. And instead of looking for the average of all the components' contributions, which supposedly has more noise, they try to denoise by narrowing in on a few, and with that the effect is way more obvious. But I don't include that for the purpose of this talk. And that was the talk. So if you are interested in looking at the NDIF project, and the nnsight library as well, which is its companion, please visit that site. And if you're interested in learning more on mech interp, many of the code snippets and concepts introduced here were introduced to me through a program called ARENA Education. It is an awesome program. You should check it out if you're interested in learning more about doing mechanistic interpretability. I hope you've had as much fun going through this as I have. And do enjoy the rest of the conference.

Boluwatife Ben-Adeola

Independent AI Researcher

Boluwatife Ben-Adeola's LinkedIn account
