Safety
Inference-time Safety
- CoT Approach: OpenAI’s Deliberative Alignment
- Backtracking: Reset token
Alignment
Mechanistic Interpretability
I now think that we can get scalable and useful LLM explainability without making any additional progress on actually understanding them 🤔 There are so many beautiful abstractions like circuits, but the bitter lesson seems to be that we can probably just throw a lot more compute, in slightly clever ways, at autointerp.
It actually doesn’t matter if it recovers “the truth” or not, as long as it has explanatory power for the behaviour of the model we are trying to interpret. The key is that we can spend compute at inference time to search for explanations, and score them on how well they lead to accurate reconstruction of NN activations. We don’t have ground truth in interp, so reconstruction is probably the next best metric to optimise for. Obviously, recent work by Transluce directly does this by spending compute on sampling explanations. But I realised other work was sort of on this path already, like this and this (kinda). --- Dec 26, 2024
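A minimal sketch of that loop, in the spirit of simulate-and-score autointerp: sample candidate explanations, simulate the activations each one predicts, and keep whichever reconstructs the real activations best. `propose_explanation` and `simulate_activations` are hypothetical LLM-backed helpers here, not any real library’s API.

```python
# Sketch: spend inference compute on sampling explanations, score each by how
# well it reconstructs the real activations, and keep the best one.
import numpy as np

def score_explanation(explanation, texts, true_acts, simulate_activations):
    """Score = correlation between simulated and true per-token activations."""
    sim = np.concatenate([np.asarray(simulate_activations(explanation, t)) for t in texts])
    tru = np.concatenate([np.asarray(a) for a in true_acts])
    if sim.std() == 0 or tru.std() == 0:
        return 0.0
    return float(np.corrcoef(sim, tru)[0, 1])

def search_explanations(texts, true_acts, propose_explanation,
                        simulate_activations, n_samples=32):
    """Sample n_samples candidate descriptions and return the best-scoring one."""
    best, best_score = None, -np.inf
    for _ in range(n_samples):
        cand = propose_explanation(texts, true_acts)   # e.g. an LLM drafting a description
        score = score_explanation(cand, texts, true_acts, simulate_activations)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```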
It’s interesting to think of explanations as simple predictive models of behavior, with reconstruction error measuring predictive power. If we search the right space and use more compute to explore it automatically, I think we can make a lot of progress. In some sense, all recent interpretability methods have been trying to do this: (1) the whole idea underlying causal abstraction is that we want to hypothesis-test how well a logical causal graph reconstructs a neural network: https://arxiv.org/abs/2301.04709; (2) SAEs seek to reconstruct the residual stream using a sparse feature basis. Natural language explanations in particular, as the search space, are just a particularly climbable hill for this sort of task, and can bootstrap off of the crazy progress right now on inference-time search.
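For concreteness, the SAE point is literally a reconstruction objective over residual-stream activations. A minimal sketch, assuming PyTorch; the dimensions and sparsity coefficient are placeholders, not values from any particular paper.

```python
# Minimal SAE sketch: reconstruct residual-stream activations x from a sparse
# feature basis, with an L1 penalty encouraging sparsity.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=16384):  # placeholder sizes
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        x_hat = self.dec(f)           # reconstruction of the residual stream
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):  # l1_coeff is a placeholder
    # reconstruction error + sparsity penalty on the feature activations
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```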
The Transluce neuron-labelling work is super impressive to me, for a couple of reasons:
- Obviously (maybe only to linguists??), “concept” is not a well-defined idea at all, nor is the SAE goal of “monosemantic features”. We can’t measure monosemanticity, so we can’t tell how good an SAE is.
- A natural language description of a feature could simply be “concept X and concept Y” (or really anything arbitrarily complex, with enough tokens) and sidestep this whole monosemanticity problem. Transluce shows we can leverage search over such descriptions to get wayyyy better neuron explanations than previously possible (see the sketch below).

Overall, I’m not bullish on SAEs for solving the problem they claim to want to solve. They do seem to be a useful tool in other regards, however! And they will benefit from better autointerp, without a doubt.
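A sketch of why compound descriptions sidestep the monosemanticity issue: a description like “concept X and concept Y” predicts firing whenever either concept is present, so it can be simulated as a max over per-concept match scores and still scored by reconstruction like any other explanation. `concept_match` is a hypothetical helper (an LLM judgment or embedding similarity), not a real API.

```python
# Sketch: simulate a compound (polysemantic) description and score it the same
# way as any other explanation, via reconstruction of the true activations.
import numpy as np

def simulate_compound(concepts, tokens, concept_match):
    """Predicted activation per token = max match over the listed concepts."""
    return np.array([max(concept_match(c, tok) for c in concepts) for tok in tokens])

def reconstruction_score(predicted, true_acts):
    """Correlation between predicted and true activations."""
    predicted, true_acts = np.asarray(predicted, float), np.asarray(true_acts, float)
    if predicted.std() == 0 or true_acts.std() == 0:
        return 0.0
    return float(np.corrcoef(predicted, true_acts)[0, 1])
```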
Dec 27, 2024
It seems like we should approach explainability the way Transluce approaches it. At this point, I am still unclear about the exact differences between SAEs and Transluce’s approach. The key point seems to be reconstruction.
Interesting work! I’m very excited to see more work exploring model diffing, and would love to see this replicated in more “natural” settings than sleeper agents. Looking for latents that change significantly when a model is finetuned on different data, and across models, is interesting.
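A rough sketch of what that diffing could look like, assuming a hypothetical `latent_activations(model, texts)` helper that returns a [n_tokens, n_latents] array of latent activations for a model over some texts: compare per-latent firing frequencies between the base and finetuned model and flag the biggest movers.

```python
# Sketch: run the same latent basis over a base and a finetuned model and
# surface the latents whose firing frequency changes the most.
import numpy as np

def diff_latents(base_model, finetuned_model, texts, latent_activations, top_k=20):
    acts_base = np.asarray(latent_activations(base_model, texts))     # [n_tokens, n_latents]
    acts_ft = np.asarray(latent_activations(finetuned_model, texts))  # same latent basis
    freq_base = (acts_base > 0).mean(axis=0)  # fraction of tokens where each latent fires
    freq_ft = (acts_ft > 0).mean(axis=0)
    delta = np.abs(freq_ft - freq_base)
    top = np.argsort(-delta)[:top_k]
    return [(int(i), float(freq_base[i]), float(freq_ft[i])) for i in top]
```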