Steering Large Language Models with Activation Vectors: A Practical Guide

Large Language Models (LLMs) like GPT-3, Claude, and Mistral have revolutionized natural language processing. However, their outputs can sometimes lack consistency or alignment with specific user intents. While prompt engineering and fine-tuning are common approaches to guide LLM behavior, they have limitations. An emerging technique, activation vector steering, offers a more direct and nuanced method to influence model outputs during inference.


What Are Activation Vectors?

Activation vectors are latent representations extracted from a model’s hidden layers. They capture specific semantic or stylistic features of the input text. By manipulating these vectors, we can steer the model’s behavior without retraining it.
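To make this concrete, most open-source model APIs expose these activations directly. Below is a minimal inspection sketch using Hugging Face Transformers with GPT-2 (chosen here only because it is small; any causal LM exposes hidden states the same way):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

with torch.no_grad():
    outputs = model(**tokenizer("I love this!", return_tensors="pt"))

# Tuple of (num_layers + 1) tensors, each (batch, seq_len, hidden_size);
# index 0 is the embedding output, the rest are per-layer activations
print(len(outputs.hidden_states))      # 13 for GPT-2 small (12 layers + embeddings)
print(outputs.hidden_states[6].shape)  # (1, seq_len, 768) for GPT-2 small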

For instance, researchers have demonstrated that by computing the difference between activations from prompts with opposite sentiments, one can derive a vector that, when added to the model's activations, shifts its responses toward the desired sentiment.

The Process of Steering with Activation Vectors

1. Extracting Hidden Layer Activations

To create a steering vector, we first need to extract the activations from a specific layer of the model for two contrasting prompts. This involves running both prompts through the model and capturing the activations at the chosen layer.

2. Computing the Steering Vector

The steering vector is computed as the scaled difference between the average activations of the two contrasting prompt sets:

v_steer = α · (h_enthusiastic − h_unenthusiastic)

where h_enthusiastic and h_unenthusiastic are the mean activations at the chosen layer, and α is a scaling factor controlling the strength of the effect.

3. Injecting the Steering Vector During Inference

During the generation process, the steering vector is added to the model's activations at the same layer:

h' = h + v_steer

To implement this technique, we can use the Hugging Face Transformers library, registering a forward hook to inject the vector while generating. Here's a simplified code snippet:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load model and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Choose the layer to extract activations from
target_layer = 6  # Adjust based on model architecture

# Sample prompts
enthusiastic_prompts = [
    "I'm so excited about the new product launch!",
    "What a fantastic day we're having!",
    "I can't wait to try this amazing new feature!"
]

unenthusiastic_prompts = [
    "The new product launch happened.",
    "It's just another day.",
    "There's a new feature, I guess."
]

def get_hidden_representation(prompt, model, tokenizer, target_layer):
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.hidden_states  # Tuple: (embeddings, layer1, ..., layerN)
    # Use the hidden state from the target layer for the last token
    return hidden_states[target_layer][0, -1]  # Shape: (hidden_size,)

def average_representation(prompts, model, tokenizer, target_layer):
    vecs = [get_hidden_representation(p, model, tokenizer, target_layer) for p in prompts]
    return torch.stack(vecs).mean(dim=0)

# Step 1: Get average vectors
E = average_representation(enthusiastic_prompts, model, tokenizer, target_layer)
UE = average_representation(unenthusiastic_prompts, model, tokenizer, target_layer)

# Step 2: Compute steering vector, scaled by alpha
alpha = 0.5  # <-- Adjust this value to control influence
steering_vector = alpha * (E - UE)


# Step 3: Modify inference with steering
# Editing outputs.hidden_states after a forward pass has no effect on the
# logits, since the pass has already completed. Instead, a forward hook adds
# the steering vector to the output of the target decoder layer, letting the
# shift propagate through the remaining layers.
def generate_with_steering(prompt, model, tokenizer, steering_vector, target_layer, max_new_tokens=20):
    def add_steering(module, inputs, output):
        # output[0] holds the layer's hidden states: (batch, seq_len, hidden_size).
        # The vector is added at every position; a common variant steers only
        # the last token.
        return (output[0] + steering_vector.to(output[0].dtype),) + output[1:]

    # model.model.layers[i] produces outputs.hidden_states[i + 1] (index 0 is
    # the embedding output), so hook layer target_layer - 1
    handle = model.model.layers[target_layer - 1].register_forward_hook(add_steering)

    generated = tokenizer(prompt, return_tensors='pt')['input_ids']
    try:
        for _ in range(max_new_tokens):
            with torch.no_grad():
                outputs = model(input_ids=generated)
            # Greedy decoding: pick the most likely next token
            next_token = torch.argmax(outputs.logits[:, -1, :], dim=-1, keepdim=True)
            generated = torch.cat((generated, next_token), dim=1)
    finally:
        handle.remove()  # Always detach the hook, even if generation fails

    return tokenizer.decode(generated[0])

# Test it
prompt = "Today we are launching our new product"
output = generate_with_steering(prompt, model, tokenizer, steering_vector, target_layer)
print(output)

This code demonstrates how to steer the model's output toward a more enthusiastic tone for the product-launch prompt.
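For a quick sanity check, it helps to compare against an unsteered baseline. Here is a small usage sketch reusing the functions above (passing a zero vector simply disables the steering):

# Baseline: a zero steering vector leaves the activations unchanged
baseline = generate_with_steering(prompt, model, tokenizer,
                                  torch.zeros_like(steering_vector), target_layer)
steered = generate_with_steering(prompt, model, tokenizer, steering_vector, target_layer)
print("baseline:", baseline)
print("steered: ", steered)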

Applications and Considerations

Applications

  • Sentiment Control: Shift the model’s responses towards positive or negative sentiments.
  • Style Transfer: Impart a specific writing style to the generated text.
  • Topic Steering: Guide the model to focus on particular topics or themes.

Considerations

  • Layer Selection: The choice of layer for injecting the steering vector can strongly affect the results. Empirical testing is recommended (see the sketch after this list).
  • Scaling Factor: The intensity of steering is controlled by the scaling factor α (alpha). Adjusting this parameter fine-tunes the influence of the steering vector.
  • Model Compatibility: Ensure that the model architecture supports the extraction and manipulation of hidden layer activations.
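As a rough starting point for such testing, one can sweep layers and scaling factors and inspect the outputs by hand. Below is a minimal sketch reusing the helpers defined above (the layer and alpha grids are arbitrary illustrative choices, not recommendations):

# Sweep a few layers and scaling factors, printing one completion per setting
for layer in [4, 8, 12, 16]:
    E = average_representation(enthusiastic_prompts, model, tokenizer, layer)
    UE = average_representation(unenthusiastic_prompts, model, tokenizer, layer)
    for alpha in [0.25, 0.5, 1.0, 2.0]:
        vec = alpha * (E - UE)
        out = generate_with_steering(prompt, model, tokenizer, vec, layer, max_new_tokens=15)
        print(f"layer={layer:2d} alpha={alpha:4.2f}: {out!r}")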

Conclusion

Activation vector steering provides a powerful and flexible method to guide LLM outputs without the need for retraining. By understanding and leveraging the internal representations of the model, we can achieve more controlled and aligned responses. As LLMs continue to evolve, techniques like activation vector steering will play a crucial role in making them more adaptable and user-centric.

Kaggle Implementation:

Thanks for your time! If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy

Some of my alternative internet presences are Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk soon.

Check out the links; I hope they help.
