Conf42 Large Language Models (LLMs) 2024 - Online

Superposition in LLM Feature Representations

Abstract

Dive into the quantum realm of LLMs! Join me as we unravel the superposition in LLM feature representations, exploring the intersection of finance, technology, and AI interpretability. Unearth insights from my journey at Bloomberg, Palantir, and my pursuit of mechanistic interpretability.

Summary

  • Bolu: Today I'm going to talk about superposition in neural network representations. How do neural networks learn which representations to use for inputs? And how are these representations passed around in the computation?
  • Just because something is an ordered collection of numbers doesn't necessarily mean it behaves linearly. Linearity is a very particular statement about how different entities interact. And there was an interesting paper showing that, yes, abstract things like functions can be encoded too.
  • Representation is distinct from algorithmic interpretability. Think of linear composition as just a compression scheme for how all this information is packed together. As these tools become more mature for understanding what's happening, we get to do different things, everything from mind reading to mind control.
  • Neural networks are able to do this because they exploit feature sparsity and relative feature importance. Exploring this puzzle suggests a way forward on tackling what exactly is going on.
  • The paper I shared tries to do this by tackling a smaller model, using something called a sparse overcomplete autoencoder. Since this information is flowing through the entire network, each discrete component is going to need to do some version of this.
  • With a network whose input and output were 512 neurons, they were able to extract well over 100,000 reasonable features. The future of this work could look like scaling up to larger models and uncovering more useful features.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi there. My name is Bolu, and today I'm going to talk about superposition in neural network representations. To motivate that a bit, I'll share some context about where this hypothesis comes from in the field of neural network research. The field is called mechanistic interpretability, and it basically follows from this reasoning: we all understand that neural networks solve an increasing number of important tasks really well, and it would be at least interesting, and probably important, to understand how they do that. Mechanistic interpretability is a subfield that tackles this problem by seeking granular, mechanistic explanations for different observed behaviors in neural networks. It pushes back on the idea that neural networks are just black boxes that are completely inscrutable and do magic with linear algebra. So it peers into a given network at a granular level to investigate some very isolated behavior, while at the same time holding very broad hypotheses and theories about how neural networks do things. One of these is about representation learning: how do neural networks learn which representations to use for inputs, and how are these representations passed around in the computation? That is, what information has a model found important to look for, and how is that information represented and propagated internally in the network? To paint a picture of what we mean by representations and propagation: on the bottom left here, let's say we have a simple tokenized version of an input that we're going to pass into a transformer, something like "on: off, wet: dry, old:". And as any of us would attest from using something like ChatGPT, these neural networks are definitely able to predict that the next thing after "old:" is going to be "new", since they can figure out the word-and-opposite pattern. The idea is that between the entry of our text and the prediction on the other side, the entire network has to have encoded certain information, and done computations on it along the way, to produce the output "the next thing to come after this colon is new". So what we're asking is: what do we know, and what can we investigate, about how this information is encoded? In the beginning, all the network knows at that position is "I'm a colon", after the embedding layer maps the colon character to an ordered collection of numbers, which we just call a vector. The question is what happens somewhere between entering as "I'm a colon" and exiting as "my next token is new", which is the result of that vector going through the unembedding layer and the softmax, with the highest probability assigned to "new". For simplicity, let's assume the word "new" is its own token, because as we know, prediction is done token by token. So somewhere between starting with "I'm a colon" and ending with "my next is new" there is a bunch of stuff. What do we know about what these internal representations look like? Here are a couple of qualities that the prevailing school of mechanistic interpretability posits.
The first, decomposability, says there are discrete features that a model has learned to look for in an input, and these discrete features compose into any given representation. If we looked at any layer, or at any component in the architecture, all the information the model has at that stage is going to be some composition of discrete things. The second is linearity, which takes the decomposability statement a bit further: not only are these discrete components composed together, they're composed linearly, and we'll discuss a bit later what exactly that means. The third says we can think of these discrete qualities as things called features, and again, a more precise definition of what features are comes later. To summarize it all in a single tagline, this school of thought says: language model representations (and you can replace "language model" with any neural network of similar architecture) are linearly decomposable into features. We're going to pick apart each of those items in the course of this talk. The first, decomposability, is kind of a weak statement, and I'll explain why. In isolation, decomposability just means we assume that neural networks learn different things. That is, a network doesn't simply memorize every possible input; it learns to abstract certain features like blueness or redness, or perhaps something even more general, like color. In this simple case, imagine a classification network trained to identify colors and shapes. The idea is that somewhere in the network, the weights implement transformations that are able to extract certain discrete qualities, such as the center shape or the background color; to simplify, we just look for blueness or redness on the left. The interesting thing is that this is really just saying neural networks don't always overfit, which is why I call it a weak statement: anyone who has trained a neural network can demonstrate with a test set that, yes, these networks can generalize, and not everything is overfitting or memorizing. So on the right, if we have something like a purple triangle that this network has supposedly never seen in training, it can fall back on its previously learned features: even though it doesn't quite have the conception of purple as a distinct thing, it can compose the color purple, perhaps by reading the RGB values, as being roughly equal parts red and blue. At this stage, that's all decomposability is saying. There are certain things in this diagram that are quite strong assumptions, like the whole idea that there's one thing called a blue neuron; as we'll see, that's a pretty strong claim, and it's not obvious at all that this is how things play out in reality. But for now, decomposability just says a representation is composed of a bunch of little pieces.
Because a neural network demonstrably doesn't just memorize all the time, at least for a sufficiently large network on a sufficiently constrained problem set. The second property is linearity, which takes decomposability a bit further. It says that not only are these properties distinct, they combine as simple linear sums. You can imagine a particular vector direction representing some feature, and this isn't as contrived as it sounds: remember, looking at this diagram, the inputs are already ordered collections of numbers, so everything stored at the colon position already has this vector format. I should mention, though, that just because a thing is an ordered collection of numbers doesn't mean it has to behave linearly. It's a bit confusing, because when something has the vector format of an ordered collection of numbers attached to one entity, it's tempting to conclude that each feature must obviously be a direction. That is not obvious, and I'll show an example of what it looks like when it's not the case. So linearity says these decomposable sub-vectors literally just add together to give you the representation of something. Here we have some other neural network that cares about size and redness in the abstract and has two different directions for them. Given it only cares about two features, it can dedicate a separate direction to each, and those directions simply combine to represent any one given input. What evidence do we have for this in practice? This is a fairly popular example by now, but a paper from a few years ago showed regularities in the differences between pairs of word vectors. The difference between the "man" and "woman" representations in certain language model architectures was consistent: if you do something as simple as subtracting the vector (again, just the ordered collection of numbers) for "uncle" from the vector for "aunt", and add that difference onto "man", you end up almost exactly at the vector for "woman". And you have a bit of this vector algebra on the right with "car": assuming another relationship, plurality, if you take the vector for "cars", subtract the singular "car", and add that to "apple", you get something like "apples". This kind of behavior, literal element-wise subtraction of values, is what you would see in a linear system: a system where the abstract feature "this refers to a masculine entity" is encoded alongside all the other stuff that has to do with royalty in "king", or with siblings of your parents in "uncle" and "aunt", or simply with "man" and "woman". If these multiple things are composed in a linear fashion, then you get to do exactly this kind of vector subtraction and arithmetic. But this doesn't mean everything is linear all the way through; this is just the embedding layer, which, to remind us, is the very bottom of the diagram.
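To make that arithmetic concrete, here is a minimal numpy sketch with made-up three-dimensional embeddings (the values are hypothetical, invented for illustration and not taken from any real model); real word embeddings have hundreds of dimensions, but the point is only that a shared feature shows up as a shared vector difference.

```python
import numpy as np

# Toy, made-up embeddings (hypothetical values, not from a real model).
# The last coordinate plays the role of a "feminine" feature direction.
emb = {
    "man":   np.array([0.9, 0.1, 0.3]),
    "woman": np.array([0.9, 0.1, 0.8]),
    "uncle": np.array([0.2, 0.7, 0.3]),
    "aunt":  np.array([0.2, 0.7, 0.8]),
}

# If the feature is a single linear direction, the difference
# aunt - uncle should match woman - man.
feminine = emb["aunt"] - emb["uncle"]
print(np.allclose(feminine, emb["woman"] - emb["man"]))   # True for these toy vectors

# Analogy arithmetic: man + (aunt - uncle) lands on woman.
print(np.allclose(emb["man"] + feminine, emb["woman"]))   # True
```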
There's still a lot of uncertainty here: sure, maybe for simple things like embedding a word you get this vector algebra, but does that mean that in every layer of the network, all the information it has to encode is in fact composed in this linear fashion? That's why there's still a mystery, even though we have seen some evidence. As I mentioned, it's worth noting that just because a thing is an ordered collection of numbers, which is simply how neural network activations are represented (hence the meme that neural networks are just linear algebra scaled up), doesn't necessarily mean the things it encodes interact linearly. Linearity is a very particular statement about how different entities interact. Here's an example. Imagine a different regime, where a neural network was still able to extract a discrete component for redness and another for blueness, but joined them together by exploiting decimal places: it takes the first value and keeps it in the leading decimal places, then takes the other value and tucks it into the trailing decimal places, and here you have an algorithm to do that. This is an example of a non-linear representation, and the component that makes it non-linear is that it relies on the floor operation (like Python's math.floor, which rounds down); that's how it exploits precision and digit placement to squish two different values together. So this is one dummy example showing that, yes, ordered collections of numbers can act in ways that are not vector-like and don't simply add; other compression schemes exist. What the linear representation hypothesis is saying is that, on the journey from the input "I am a colon" (which is what the embedding produces) to the output, all the information held at that single colon position, as it went through all the layers, was simply added together. There's one vector that represents "there's a bit of a word-and-opposite game going on". And there was an interesting paper showing that not just nouns or discrete facts about inputs can be encoded, but also abstract things like functions. The whole idea that "this is a word-and-opposite game being played between wet and dry, old and new" is itself one vector, and "the word right before me is old" is yet another vector. This is something attention can give us, the ability to look back at previous inputs: the colon token can look immediately behind it and see that the thing that came before it is "old", and it can also look further back to glean the pattern of words and opposites. All these different bits of information are literally just different vectors, different directions, that compose as simple additions to reach the conclusion: surely my next token is "new".
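The decimal-place trick is easy to write down. This is my own toy version of the scheme the slide describes, purely for illustration; it packs two features into one number, yet it is clearly not linear, because it leans on floor.

```python
import math

def pack(red, blue, shift=100):
    """Pack two values in [0, 1) into one number by interleaving decimal
    precision: 'red' keeps the coarse digits, 'blue' is tucked into the
    finer digits. Non-linear because of floor()."""
    red_coarse = math.floor(red * shift) / shift        # keep two decimal places of red
    blue_fine = math.floor(blue * shift) / shift ** 2   # push blue two places further down
    return red_coarse + blue_fine

def unpack(x, shift=100):
    red = math.floor(x * shift) / shift
    blue = (x - red) * shift
    return red, blue

packed = pack(0.73, 0.42)
print(packed)          # ~0.7342
print(unpack(packed))  # ~(0.73, 0.42), up to floating-point error

# The scheme stores two features in one number, but it is NOT linear:
# pack(a, b) + pack(c, d) != pack(a + c, b + d) in general.
```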
Note that representations aren't really concerned with how the network does the combination, that is, with the mechanics inside it that know how to do it: given this vector for "word-opposite" and this vector for "the previous word is old", how does it actually compute the answer? There's another body of work that explores that, basically algorithmic interpretability. Representation is just asking: the variables being used for these computations, what do they look like, and how are they composed? By "how" I mean: in the space of all potential transformations that take redness and blueness together to get purpleness, is it some unknown, arbitrary thing, which would be messy and hard, or is it literally simple vector addition? So that's how representation is distinct from algorithmic interpretability. Again, as we see here, think of linear composition as just a compression scheme for how all this information is packed together. Linearity is great because it narrows things down to one compression algorithm in a very large function space. There are many things these giant networks could be doing in their typically inscrutable fashion, so linearity is pretty helpful in that it narrows it down to one well-known, well-studied compression scheme, basically the entire field of linear algebra, if it happens to be linear. The other thing this gives us is that it aids diagnostics and helps improve our understanding of the models, in ways that would be hard if, for example, every single representation at every single layer used a different type of arbitrary algorithm. So it would be very convenient if this were the case, and we have seen some evidence for it. This is an important point to make: it's a combination of having some evidence, but also a bit of motivated inquiry. If this weren't something we cared about, there are many things about neural networks that seem interesting but that people just haven't dug into; the fact that they seem to show this linear behavior has drawn a very large community of researchers to study why, because it makes the problem a lot more tractable than if it weren't the case. And yes, I put "mind reading to mind control" here: as these tools for understanding what's happening become more mature, we get to do different things, everything from mind reading to mind control. That is, if you get to run a sort of brain on a computer, you have access to all the numbers, and you understand both the algorithms for how the information is represented and how it is transformed, then you can eventually intervene, or at least have a log stream of what's going on. Cool. So that's the motivation for why linear decomposability is great: it would be a tractable scheme for us to use in understanding these transformers.
Okay, here is a bit of the downside: linearity is kind of demanding, in that a lossless compression scheme that composes linearly requires as many dimensions, that is, as many distinct ordered numbers in the vector, as you have qualities you want to encode. As you can see here, we have something for redness, for blueness, for squareness, for triangles, and you end up with a one-hot-vector kind of situation, where the thing that makes one of these directions encode the property of redness is that only the first cell activates for it. I do want to point out a slight nuance: the requirement is on the number of dimensions; it doesn't necessarily mean you would always have this perfect picture of one cell corresponding to one thing. There are infinitely many orthogonal bases that can achieve this; all you really want is as many orthogonal directions as you have features. Just for the sake of easier understanding, we focus on the one member of this infinite set of orthogonal bases that happens to be one-hot. So every time I talk about a neuron or a dimension, imagine I literally mean one neuron, with the caveat that that's not necessarily the case. Cool. So why is this a problem? To remind ourselves where we are: this hypothesis for how representation is done says that large language model representations are linearly decomposable into features. That brings us to the linear representation puzzle, and here's why it's a puzzle, in a couple of steps. First, we have some evidence that LLMs do indeed represent stuff linearly, meaning this claim has some basis in reality; and there are several other arguments in the research suggesting this behavior is likely, for example from looking at the number of FLOPs dedicated to the transformations. So linear stuff is happening. Second, linear combinations require as many dimensions, or neurons (the one-hot case being a special case of an orthogonal basis), as features: if you want to encode redness, blueness, triangleness, and squareness as distinct things, you need literally four different directions. However, and this is where the puzzle comes in, in practice these networks seem to represent way more stuff than they have neurons for. I have a bit of a back-of-the-envelope calculation here: GPT-2 has on the order of hundreds of thousands of neurons. The exact numbers vary with the architecture, but if you look at the number of attention heads, the MLP layers, and the dimensions these components operate with, you're in the range of a couple hundred thousand neurons. And the assumption is that these models encode a lot more than that.
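As a concrete picture of the lossless, orthogonal case, here is a small numpy sketch (my own illustration, not from the paper) of "one direction per feature" and dot-product readout, including the point that the one-hot basis is just the most readable of many valid orthogonal bases.

```python
import numpy as np

# A lossless linear scheme: one orthogonal direction per feature, so
# 4 features (redness, blueness, squareness, triangleness) need 4 dims.
names = ["red", "blue", "square", "triangle"]
features = np.eye(4)          # one-hot basis: row i is the direction for names[i]

# A "red square" is literally the sum of the red and square directions.
x = features[0] + features[2]

# Reading a feature back out is a dot product against its direction.
for name, direction in zip(names, features):
    print(name, x @ direction)   # red 1.0, blue 0.0, square 1.0, triangle 0.0

# The one-hot basis is only the most readable choice; any rotation of it
# is an equally valid orthogonal basis, and the readout still works.
q, _ = np.linalg.qr(np.random.randn(4, 4))   # random orthogonal matrix
for name, direction in zip(names, features @ q):
    print(name, round(float((x @ q) @ direction), 6))
```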
And if you're finding it hard to wrap your head around how a couple hundred thousand features could not be enough, remember that this is encoding all of the English language, or really all of language. If you recall how well GPT-2 was able to perform, it's plausible that it encodes a lot more than a couple hundred thousand features, because again, this is all of language itself. The question is, how is that possible? We seem to have conflicting evidence: we have evidence of linearity, but linearity needs as many neurons as there are features, and at the same time we have models in production that seem to do quite well without having as many neurons as they appear to have features. To appreciate why this is a puzzle, you just have to trust your gut feeling that there are probably more than 200,000 things worth looking for in any given sentence; remember how open-ended all of language is. And this is one of the difficulties in describing what exactly a feature is. One helpful definition is that a feature is a thing a neuron would be dedicated to in a sufficiently large language model, but we'll get to that shortly. Okay, cool. So this is our puzzle: how is this happening? There's a great paper from a team at Anthropic that, building off previous work exploring this puzzle, suggests a way forward on what exactly is going on and a way of disentangling the mess that seems to be happening. Before that paper came out, the same team introduced the idea of superposition. Superposition is a hypothesis that attempts to answer the riddle of how a model can represent more features than it has neurons. It says that neural networks are able to do this because they exploit feature sparsity and relative feature importance. The model does not, in fact, do perfectly lossless compression; it trades that off in exchange for representing more features, thanks to a property called feature sparsity. That means that even though English text, or the set of all possible coherent (or even incoherent) English texts, contains a very large number of features, it turns out not all of those features are active at the same time. This provides an opportunity for a trade-off, where we can say: what if I choose a not-perfectly-orthogonal set of vectors to represent my features, orthogonality being the requirement for lossless compression? What if, instead, I chose n plus m directions, a bit more than the ideal set of perfectly orthogonal vectors, and in exchange each additional feature direction I add contributes some noise? Using the compression analogy, I trade off a little bit of noise for a much wider set of features I can represent.
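The geometric fact being leaned on here is that in d dimensions you can fit far more than d directions whose pairwise overlap is small but nonzero. A quick numpy check, with made-up sizes of my own choosing, just to show the shape of the trade-off:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes for illustration: many more directions than dimensions.
d, n = 256, 2048

# n random unit vectors in d dimensions.
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# "Interference" = cosine similarity between distinct directions.
cos = W @ W.T
np.fill_diagonal(cos, 0.0)
print(f"mean |interference|: {np.abs(cos).mean():.3f}")
print(f"max  |interference|: {np.abs(cos).max():.3f}")

# At most d = 256 of these could be exactly orthogonal, yet the typical
# overlap between any two of the 2048 directions is small: that small
# overlap is the noise superposition trades for extra capacity.
```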
And again, this only works if all features are not present together, because if all features are present all the time, that is, if you have no sparsity, you will have noise in all your outputs. That's what this figure is saying. In the top three boxes, the different dots are meant to be different features your model cares about: maybe one of them is redness, one is blueness, one is squareness, and so on, and you have a two-dimensional surface to represent them in. In the case where there's no sparsity, where every feature is as likely to be important as any other, you effectively get what we expect: the network only has two directions available, so it represents the two most important things it cares about. If they're all equally important, it just randomly chooses two features; maybe it only keeps a distinction between redness and blueness and is unable to distinguish squareness from triangleness, for example. It picks whatever pair is the best it can do to get a good classification loss on the problem set. But it turns out that as you increase sparsity, as you make redness and circleness less common, things change. For example, say sometimes the image has no shape in the middle, so only the color matters, and sometimes you have a completely colorless input where only the shape in the middle matters. That's what sparsity means: you have more than two different properties that don't always coincide. In the 80% sparsity example, as you see here, the network actually chooses to represent more features than it "should" be able to, which is consistent with what we see in practice, and the 90% sparsity case on the right packs in even more directions. That's where interference comes in. The whole point of having an orthogonal basis is that when you try to extract a component feature, you have some vector and, say, two orthogonal feature directions; if you take a dot product against each of them, each value you get says how much the given vector is composed of that direction. So if you have some representation of a red square, taking its dot product against the redness direction tells you "this is really red" or "not that red", and taking its dot product against the squareness direction tells you how much it looks like a square. That's why these need to be orthogonal: they shouldn't interfere with each other; the quality of redness shouldn't bleed into the quality of squareness, if that makes sense. However, if sometimes you just get a colorless square, or you get a shapeless color, that's sparsity: you don't always have cases where these two show up together.
And for your problem set, you only need to care about these in isolation; that is, you only need to be really good at detecting color sometimes, or detecting shape sometimes. It turns out you would actually be fine if you chose directions for squareness and redness that interfered with each other, and that is what the bottom example is trying to show. On the bottom left, let's say the orange vector is the thing we actually want to observe, the true input vector. You can see we have five different directions, five different features, and we take the dot product of this vector against each of the five to see how much value it has along each. It lies along one direction, say the direction that means "how red it is", so that direction gets its value. However, because there's interference, it also picks up tiny little components along the other features it doesn't actually have. Let's imagine the simple case where the input is just a very red image, a red blob with nothing else in it, so it aligns perfectly with one of the features. But because this representation has only two pure dimensions and you're trying to squish five different things inside, it picks up a little bit of a component along other features it doesn't actually have. That's interference. The reason this still works is that neural networks have non-linearities: the activation functions are able to turn off these tiny bits of noise. If they didn't have that, the bits of noise would become quite annoying and would count toward your errors; but in a case where there's very little interference coming from the other values in the dot product, those values can effectively be zeroed out. Now let's look at the bottom right. In this case, imagine the true input, the thing we care about, is the two blue vectors: say you have something that is a really big square and also has a really yellow background, so you have squareness and yellowness. If you look at the vector addition of these two (remember, we're operating in a linear combination regime), it is exactly the same thing as what we got on the left. This is why interference matters: for it to be harmless, only a very small number of features should truly be present at a time, in this case ideally just one. When the two blue directions are both active, it looks to our neural network as if it were actually the single-vector case on the left, and as we've seen, the non-linearity is going to end up chipping those two components away to nothing. The network is going to think that, instead of seeing a yellow square, it's just seeing a circle.
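Here is a small numpy toy of exactly this failure mode, in the spirit of the figure being described; the particular geometry (five equally spaced directions in two dimensions) and the ReLU bias value are my own choices for illustration, not the paper's setup.

```python
import numpy as np

# Five feature directions squeezed into two dimensions.
angles = np.linspace(0, 2 * np.pi, 5, endpoint=False)
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (5, 2)

def encode(feature_values):
    # Linear combination of the (non-orthogonal) feature directions.
    return feature_values @ W

def decode(x, bias=0.4):
    # Dot-product readout followed by a ReLU that eats small interference.
    return np.maximum(x @ W.T - bias, 0.0)

# Sparse case: only feature 0 is active. The interference on the other
# features is small, and the ReLU removes it.
one_hot = np.array([1.0, 0, 0, 0, 0])
print(decode(encode(one_hot)).round(2))   # [0.6 0.  0.  0.  0. ]

# Dense case: features 1 and 3 are active at once. Their interference
# adds up: the two true features are wiped out and an absent feature
# (feature 2, between them) survives -- the "yellow square read as a
# circle" failure described above.
two_hot = np.array([0, 1.0, 0, 1.0, 0])
print(decode(encode(two_hot)).round(2))   # [0.   0.   0.22 0.   0.  ]
```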
Maybe that's what that third direction represents: complete noise, which is completely wrong, so to speak. The network ends up ignoring the components of this vector along those two true directions as just being noise, which is really bad. And that's really why, in the case where there's no sparsity, where all the features are likely to be active, the neural network doesn't even bother trying to do anything funny. It doesn't attempt any funny business, as in the top left square; it simply represents an arbitrary two features. Or, in a case where it has a sense of the relative importance of features, maybe one feature matters far more. To give you an example, let's say one feature of language is: what language is it in? Is it English or Spanish, English or Chinese? Another feature is whether the sentence is in the past tense or the present tense. The past-or-present-tense feature helps you avoid grammatical mistakes, but it's fair to assume that knowing what language the query is in is probably way more important for avoiding errors than detecting whether it's past or present tense. So, on the margin, if the model has to represent one more thing and must choose between language detection and tense, it will most likely use that one extra direction to represent which language it is, English or French or Chinese, as that probably has way more predictive power; it will produce far less error than correctly predicting the tense but in the wrong language. That's a bit of a toy example. Okay, so how do we study this? Given that we suspect this is what models are doing, or rather why they're able to get away with it, exploiting sparsity by compressing stuff, the paper I shared tries to tackle it with a smaller model. They take a one-layer transformer and pick out one component of the architecture. As I explained, in a typical large transformer every single component is doing some version of this; since this information is flowing through the entire network, each discrete component is going to need to do some version of it. They focus on the MLP layer, which comes after the attention heads, and in their model its dimension, that is, how many neurons the layer has, is 512. What they do, as seen here on the right, is use something called a sparse overcomplete autoencoder, and I'll describe what that means, working from right to left. An autoencoder is a neural network whose primary purpose is reconstruction: you have some input, you have something in the middle, which is your network, and the job of that network is to reproduce the input at the output. That seems kind of silly.
Why bother with this identity transform? Because in some cases you might want to do something like compress dimensions. Say you have a very large input and you want to find its most important, critical features by compressing it in the middle and seeing how well you can still reconstruct it. For example, if the input has five dimensions, what are the two most important directions of those five that would still let me do well at reconstructing it? That gives you feature discovery by compression; that's what autoencoders do. "Overcomplete", continuing right to left, does roughly the opposite: instead of compressing, you expand. The layer in the middle, between the input and your reconstruction of the input, is much larger, and you're asking: if this neural network representation had way more room to represent stuff, what would it look like? Remember, the whole point of superposition is that we're assuming the model we see is actually trying to simulate a much larger model. So by using this overcomplete autoencoder, we're saying: whatever representation this MLP layer has for some input, what if we gave it way more neurons to work with? What would it do with them? That's the overcomplete part. Then the sparse component of the description says: sure, but if we just go from a 512-dimensional inscrutable, compressed, complex thing to a 131,072-dimensional inscrutable, complex thing, we're not much better off than we started, and neural networks don't really have any incentive to make things explainable to us. The sparsity constraint says: in addition to giving the network more room to work with, to expand and show what it has learned, we want to force it to narrow its learnings, its features, so that each one is active in only one node at a time. I think I explained before that just because linearity says you must have as many dimensions as you want features, it doesn't mean the basis will always be one-hot. There are infinitely many sets of, say, four vectors that form an orthogonal basis but aren't one-hot; the features are kind of smeared across all the different values. But for our convenience, so we can say "that neuron fires a lot when it sees redness", we impose an extra constraint on the autoencoder: don't just find representations with more nodes to work with; when you do this, isolate what you've learned about each feature to one node at a time. That's basically just for our interpretability benefit. And that is what a sparse overcomplete autoencoder is; usually people ignore the overcomplete part and just call it a sparse autoencoder.
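Here is a minimal PyTorch sketch of that idea, assuming the setup the talk describes (512-dimensional MLP activations reconstructed through a much wider, ReLU-sparse hidden layer). The dimensions, the L1 coefficient, and the overall training details are illustrative assumptions of mine, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse overcomplete autoencoder: reconstruct activations
    through a much wider hidden layer whose code is pushed to be sparse."""

    def __init__(self, d_act: int = 512, d_features: int = 131_072):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_features)
        self.decoder = nn.Linear(d_features, d_act)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # non-negative, hopefully sparse code
        recon = self.decoder(features)
        return recon, features

def loss_fn(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty on the feature activations;
    # the penalty is what encourages one-feature-per-node behavior.
    recon_loss = (recon - acts).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Usage sketch on a random batch standing in for real MLP activations
# (a narrower hidden layer here just to keep the demo light).
sae = SparseAutoencoder(d_act=512, d_features=4096)
acts = torch.randn(32, 512)
recon, features = sae(acts)
loss = loss_fn(acts, recon, features)
loss.backward()
```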
We want to give the network the opportunity to extract what it has learned by reconstructing its representations using more dimensions, and we want this new extraction to be sparse, in such a way that only one node activates at a time for a given feature. Effectively, that looks like this. They ran this training process for the sparse autoencoder on the MLP layer of their one-layer network. On the left there you see "act 512", which is the activation of the MLP layer; it has 512 dimensions. That just means that instead of the four blocks drawn here there would be 512; that's how big the vector is. So they started from 512 and expanded; they ran different versions, and the largest went up to about 131k. So with the autoencoder on the right, the input and output were the 512 neurons, and in the middle you had this giant ~131k-node layer trying to reconstruct the input at the output. And they learned a bunch of stuff. They have a very nice interactive application, which I encourage you all to check out, that shows the model learning really interesting things. One of the discovered "neurons" (and a neuron here simply means one of these 131k nodes, because the sparsity constraint pushes the model to isolate some abstract feature into a single node) fires and screams a lot whenever its feature is present in the input: you can see that this feature is really there. And these features are wide and varied. One of them, for example, detects Arabic characters in the input. Another, as you see here, detects whether a sequence of text is probably a DNA sequence, which I think is pretty wild, because it could just as easily be gibberish, yet there are particular patterns, and particular letters, used for DNA encoding. That seems like such an arbitrary thing for a model to learn, but it did learn it, and you can check out other esoteric features uncovered in this reconstruction. Each of these features was already present in the 512-dimensional MLP activations, but because everything was cooped up together in superposition, it was hard to discern; the whole point of the autoencoder is to extract these features out so that each becomes one isolated thing. Which kind of brings us full circle to the definition of what a feature is. I've been throwing around the idea of a feature as just a distinct thing the model finds interesting. A more particular definition (I don't want to say formal) based on the paradigm we've described is this: a feature is a property that a model would dedicate an entire neuron to, would encode using one neuron, if it had enough neurons. So if there's a property such that a sufficiently large model would give it its own neuron, that property is a feature. But if there's a property that would never get its own neuron no matter how many neurons the model had, and would only ever be part of some other neuron, then it's not a feature.
It seems kind of circular, and yes, the precise definition of features can be kind of gnarly. But for all practical purposes, think of features in the colloquial sense: a thing the model finds to be interesting, like squareness or redness or whatever. The interesting point is that part of what makes a more powerful model more powerful is that it can indeed encode more stuff than a small one can, and, as superposition suggests, the smaller ones actually encode a lot more than their size alone might suggest. Because, again, the whole point of this is that if there were no superposition, if nothing weird were happening, then this MLP layer would only have 512 features, but they were able to extract well over 100,000 reasonable features. So something of the sort is certainly happening. And these features were consistent with experimental validation: they had different evaluation methods, which you can check out in the paper, showing how much confidence they have in each feature's coherence, or at least its explainability to a real human being. The number of explainable, high-quality features definitely exceeds 512, so this compression really is happening, and this is the evidence for it. And the future of this work could look like scaling up these autoencoders to work on much larger models and uncovering more useful features going forward. Awesome, and that is the talk. Thanks for joining, and I encourage you to read more of the papers out there. I think the Anthropic blog posts and papers, informal and formal, are a great place to start, as they basically represent where the frontier is right now. Awesome, thank you for your time, and see you later.
...

Boluwatife Ben-Adeola

Independent AI Researcher



