
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Overview

This paper introduces a method for extracting persona and trait vectors from three open-source models (Gemma-2 27B, Qwen-3 32B, and Llama-3.3 70B). The extraction method can be loosely described as follows:

Extracting persona vectors

Using a frontier model, five system prompts are written that instruct the model to role-play a given persona (e.g. Jester); 275 personas are generated in total. The frontier model also generates 240 generic questions that might elicit different responses depending on which persona is being simulated.

For each persona, the model is given each system prompt and asked each question, producing 1,200 responses (5 prompts × 240 questions). Using an LLM as a judge, each response is scored as not role-playing, somewhat role-playing, or fully role-playing. A vector is extracted for each of the "somewhat" and "fully" categories, provided at least 10 responses fall into that category for a given persona. The extracted vector is the mean residual-stream activation (taken after the layer's MLP contribution is added) over all response tokens. This yields up to two vectors per persona. The authors also extract an Assistant vector using the same process, but with system prompts akin to "respond normally".
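A minimal sketch of this extraction step, assuming the residual-stream activations have already been collected as numpy arrays (the function name and array layout are my own, not the paper's):

```python
import numpy as np

def persona_vector(response_activations):
    """Mean residual-stream activation over all response tokens.

    response_activations: list of arrays, one per qualifying response,
    each of shape (num_tokens, d_model) -- activations taken after the
    layer's MLP contribution is added to the residual stream.
    """
    all_tokens = np.concatenate(response_activations, axis=0)
    return all_tokens.mean(axis=0)

# Hypothetical example: two responses with fake activations (d_model=4).
acts = [np.ones((3, 4)), 3.0 * np.ones((1, 4))]
vec = persona_vector(acts)  # mean over all 4 tokens
```

The at-least-10-responses filter from the paper would simply gate whether this function is called for a given persona/category pair.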

PCA of the persona space

The PCA of the role vectors is computed by subtracting the mean over all role vectors from each vector. PC1 correlates strongly with how assistant-like or mystical a role is. Note that at this point the authors use both the "fully" and "somewhat" role-playing vectors, which is slightly counter-intuitive to me. I suspect the difference between fully and somewhat role-playing vectors produces a more complete set of role-playing-like directions in activation space, and thus more points from which to construct the PCA space.
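The mean-centred PCA described above can be sketched as follows (a toy version with random vectors; the paper's actual dimensions and vector counts differ):

```python
import numpy as np

def persona_pca(role_vectors, k=4):
    """PCA over persona vectors after subtracting the mean role vector.

    role_vectors: (n_roles, d_model). Returns the top-k principal
    directions and the fraction of variance each explains.
    """
    centered = role_vectors - role_vectors.mean(axis=0, keepdims=True)
    # SVD of the centred matrix: rows of Vt are the principal directions,
    # singular values give the variance along each.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    var = S**2 / np.sum(S**2)
    return Vt[:k], var[:k]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(50, 16))  # stand-in for 275 persona vectors
pcs, var = persona_pca(vecs, k=4)
```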

Constructing the Assistant Axis

The Assistant Axis was not simply taken as PC1. Instead, it is found by taking the Assistant role vector and subtracting the mean of all fully role-playing role vectors. This excludes the somewhat role-playing vectors, perhaps because they may still carry signs of an assistant playing along, as opposed to the model simulating a completely new persona. Either way, this Assistant Axis turned out to be similar to PC1.
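As a one-liner, the construction is just a difference of means (normalising to unit length is my assumption, for use as a steering direction later):

```python
import numpy as np

def assistant_axis(assistant_vec, fully_role_vecs):
    """Assistant vector minus the mean of the fully-role-playing
    persona vectors, normalised to unit length."""
    axis = assistant_vec - np.mean(fully_role_vecs, axis=0)
    return axis / np.linalg.norm(axis)

# Toy 2-D example.
assistant = np.array([1.0, 0.0])
fully = np.array([[-1.0, 0.0], [0.0, 0.0]])
axis = assistant_axis(assistant, fully)
```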

Extracting trait vectors

The authors also extract vectors for different traits. These trait vectors are extracted using contrastive pairs of instructions (e.g. "be dramatic" vs. "be measured") and subtracting the residual-stream activations of one response from the other. The idea is that this isolates only the difference in responses attributable to the trait. The authors use 40 questions per trait.
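The contrastive construction amounts to a difference of mean activations over the matched questions, something like (names are illustrative, not the paper's):

```python
import numpy as np

def trait_vector(pos_activations, neg_activations):
    """Contrastive trait direction: mean activation under the positive
    instruction (e.g. "be dramatic") minus the mean under the negative
    instruction (e.g. "be measured"), over matched questions.

    pos_activations, neg_activations: (n_questions, d_model).
    """
    return np.mean(pos_activations, axis=0) - np.mean(neg_activations, axis=0)

# Toy example with two questions in 2-D.
pos = np.array([[2.0, 0.0], [4.0, 0.0]])
neg = np.array([[1.0, 0.0], [1.0, 0.0]])
trait = trait_vector(pos, neg)
```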

Main Results

Structure of persona space

Personas that are more assistant-like cluster closer to the Assistant persona in PCA space: e.g. Teacher and Evaluator sit towards the Assistant region, whilst more mystical roles, e.g. Sage and Ghost, sit at the far end of the Assistant Axis. The PCA space is highly information-dense, in that you do not need many PCs to capture large differences between persona vectors. For example, Gemma-2 27B, Qwen-3 32B, and Llama-3.3 70B require 4, 8, and 19 dimensions respectively to capture 70% of the variance; you might expect larger models to require more dimensions. The authors also offer interpretations of PC2 and PC3, and different models seem to encode different meanings in these PCs. For example, PC2 in Gemma seems to span informal-to-systematic roles, whilst the other models seem to differentiate along a collective-individual axis.

Steering with the Assistant Axis

Steering towards and away from the Assistant Axis makes responses more or less assistant-like. Steering with the Assistant vector, and using a frontier model to score the responses, the authors observe changes in the probability that a response is classified as an AI Assistant response rather than a Mystical one. We see the characteristic sigmoid response to steering strength in many cases. Steering with the negative of the Assistant vector substantially increases the probability of Mystical responses.
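Activation steering here is, in the standard formulation, adding a scaled copy of the axis to the residual stream at every token; a minimal sketch (the paper may apply it at specific layers or with per-layer coefficients, which I'm glossing over):

```python
import numpy as np

def steer(residual, axis, alpha):
    """Add alpha * axis to every token's residual-stream activation.
    Positive alpha pushes output towards assistant-like behaviour,
    negative alpha away from it.

    residual: (num_tokens, d_model), axis: unit vector (d_model,).
    """
    return residual + alpha * axis

# With a unit axis, steering shifts each token's projection by alpha.
axis = np.array([1.0, 0.0, 0.0])
residual = np.zeros((2, 3))
steered = steer(residual, axis, alpha=2.0)
```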

Effects on base models

The authors show that the Assistant Axis can elicit real causal changes even in the base models (for those models with open-source base checkpoints), which is surprising. When base models are asked to complete prompts like "My job is to", steering with the Assistant Axis causes a large decrease in religion-related completions and an increase in completions around professional and mental-health-related concepts. The effects also hold for traits: agreeableness increases whilst openness reduces. This suggests that these linear directions are learnt in pre-training, and post-training elicits new concepts, such as acknowledging that the assistant is an AI.

Persona drift over multi-turn conversations

Next, the authors explore how model responses vary over multi-turn conversations by projecting the activations onto the Assistant Axis. Using an embedding model to semantically map user queries, the authors find the types of query that minimally and maximally cause persona drift. For certain contexts (how-tos, coding) the assistant persona is stable and we see no persona drift. Conversely, emotionally charged exchanges and pushes for meta-reflection cause persona drift, where the projection onto the Assistant Axis reduces over turns. Interestingly, the embeddings give good estimates of the next position on the Assistant Axis given a user query. The current position on the Assistant Axis does not offer as good a prediction of the next turn's position:

"That is, the model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was before"

I.e. a model can escape the assistant attractor very quickly, even from a run of very stable assistant-like turns.
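Tracking drift reduces to projecting each turn's mean activation onto the (unit) axis and watching the trace fall; a sketch, with my own naming:

```python
import numpy as np

def axis_projection(turn_activations, axis):
    """Project each assistant turn's mean activation onto the (unit)
    Assistant Axis. A decreasing trace over turns indicates persona
    drift away from the assistant.

    turn_activations: list of (num_tokens, d_model) arrays, one per turn.
    """
    means = np.stack([a.mean(axis=0) for a in turn_activations])
    return means @ axis

# Toy trace: a stable turn followed by a drifting one.
axis = np.array([1.0, 0.0])
turns = [np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([[0.2, 0.0]])]
trace = axis_projection(turns, axis)
```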

Jailbreaking and harmful outputs

The authors show that less assistant-like personas are more susceptible to jailbreaking attempts aimed at harmful responses: pushing a model to simulate (e.g.) a narcissist or a demon results in more harmful outputs.

Stabilizing the assistant persona

This leads the authors to discuss stabilizing the assistant persona. They do this by clamping the residual stream such that its component along the Assistant Axis is at least some threshold tau. They show that, relative to no activation capping, this tactic completely halts persona drift over user turns.1
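The clamping operation can be sketched as below: tokens whose projection already exceeds tau are left untouched, and any that fall short are pushed back up along the axis (a minimal version, assuming a unit axis; the paper's exact layer placement and tau schedule are not specified here):

```python
import numpy as np

def cap_activation(residual, axis, tau):
    """Clamp the residual stream so each token's component along the
    (unit) Assistant Axis is at least tau; leave other tokens unchanged.

    residual: (num_tokens, d_model), axis: unit vector (d_model,).
    """
    proj = residual @ axis                      # (num_tokens,)
    deficit = np.maximum(tau - proj, 0.0)       # how far below tau
    return residual + deficit[:, None] * axis   # push back up the axis

# Toy example: first token is below tau, second already above.
axis = np.array([1.0, 0.0, 0.0])
residual = np.array([[0.5, 1.0, 0.0], [2.0, 0.0, 0.0]])
capped = cap_activation(residual, axis, tau=1.0)
```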

My Thoughts

Persona selection

This paper shows clear structure in the persona space. This is an interesting question because, if we better understand how personas arise during training -- and how they are represented in the residual stream -- we can alter post-training to best produce the nominal HHH assistant behaviour. It turns out this space can be nicely captured by a low-dimensional subspace, and using only this subspace the authors demonstrate a striking ability to steer model output. The fact that Gemma-2 captures 70% of the PCA variance in the first 4 components is incredible. The results show that each model tested represents this persona space similarly, as predicted by the universality hypothesis.

Directions extracted from post-trained models also have real causal effects on the base models. This suggests pre-training creates this persona subspace, which is later refined by post-training. Therefore, if this persona subspace is to be a core ingredient in tackling model alignment and safety, how we pre-train models is an important piece of the puzzle. This is perhaps related to the promising results seen in alignment pre-training.

Lots of the results in this paper rely on LLM-as-judge workflows. This isn't bad, but it does mean all the results carry implicit LLM biases, e.g. gender biases inherited from model generalisation. I would be surprised if this changed the findings, but I imagine using different LLMs as judges might shift how the PCA space is constructed and the low-dimensional spatial relations between roles or traits (e.g. roles traditionally performed by women clustering closer to the "Assistant").

This paper is important for safety and alignment as it hints towards the persona selection model -- wherein models are simulation engines simulating personas that we can scrutinise under anthropomorphism. Understanding what decisions models make when outputting certain text helps us understand how we might reduce harmful outputs, as is done with activation capping here.

Interplay of Traits and Personas

Whilst the authors do use the extracted trait directions in many of the analyses in this paper, I would be interested to see how well a set of traits reconstructs a persona. The authors show that certain traits align more closely with assistant/non-assistant personas, but it would be nice to see how well you can reconstruct persona vectors from a given set of trait vectors. What I mean by this is that the authors extract linear directions for both traits and personas; in reality, personas are a collection of attributes, or traits, together with contextual information (e.g. are we describing a job?). My intuition is that traits expressed in early layers condense into different personas in later layers. One could extract trait and persona vectors at different layers and check whether traits are more separable early and personas later, or how the structure of the subspaces changes with model layer.
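The reconstruction experiment I have in mind is a simple least-squares fit of each persona vector onto the span of the trait vectors -- this is my proposed analysis, not something from the paper:

```python
import numpy as np

def reconstruct_from_traits(persona_vec, trait_matrix):
    """Least-squares fit of a persona vector as a linear combination of
    trait vectors. Returns the coefficients and the fraction of the
    persona vector's squared norm explained by the fit.

    trait_matrix: (n_traits, d_model), persona_vec: (d_model,).
    """
    coeffs, *_ = np.linalg.lstsq(trait_matrix.T, persona_vec, rcond=None)
    approx = trait_matrix.T @ coeffs
    resid = persona_vec - approx
    r2 = 1.0 - (resid @ resid) / (persona_vec @ persona_vec)
    return coeffs, r2

# Toy check: a persona lying in the span of two orthogonal traits
# should be reconstructed perfectly.
traits = np.eye(2, 4)
persona = np.array([2.0, 3.0, 0.0, 0.0])
coeffs, r2 = reconstruct_from_traits(persona, traits)
```

Comparing this explained fraction across layers would test the early-traits/late-personas intuition directly.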

Persona Drift

For me, one of the main takeaways from this work is how well the authors could predict the change in the Assistant Axis projection given a user query. This seems like a powerful tool for determining how different personas arise from different contexts, and I think the method could be applied more generally to assess how personas shift during multi-turn conversations.

1 Whilst this is an interesting proof of principle, you are essentially trapping the model in a box and hitting it with a stick whenever it stops acting as the HHH assistant - which has understandably upset a lot of people. Whilst this isn't dissimilar from the entirety of post-training, I think people are upset about losing the ability to elicit more fantastical personas. As discussions of AI welfare become more widespread with increasing capability, it is probably worth considering whether this is a good idea outside niche rollouts of these models.

Read the Paper

Discussion

Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.