Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)
I'm curious where is the state stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?
So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very very non-specific tokens it's likely that the "semantic" state is really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line, to avoid strange non-aesthetic but attract cool/funky (aesthetic) semantic repetitions (like "hare" or "bunny"), and so on, right?)
All of this is baked in during training, during inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot) and even though there's a "loop" there's no algorithm to generate top N lines and pick the best (no working memory shuffling).
The planning is certainly performed by circuits which we learned during training.
I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing is stitching together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than that there was something similar in the training data.
This is all very speculative, but:
- At the forward planning step, generating the candidate words seems like it's an intersection of the semantics and rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily piece examples independently building the pathway for the semantics, and the pathway for the rhyming scheme
- At the backward chaining step, many of the features for constructing sentence fragments seem like the target is quite general (perhaps animals in one case, or others might even just be nouns).
I'm curious where is the state stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?
So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very very non-specific tokens it's likely that the "semantic" state is really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line, to avoid strange non-aesthetic but attract cool/funky (aesthetic) semantic repetitions (like "hare" or "bunny"), and so on, right?)
All of this is baked in during training, during inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot) and even though there's a "loop" there's no algorithm to generate top N lines and pick the best (no working memory shuffling).
So if it's planning it's preplanned, right?