import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from emotion_probe_helpers import (
load_envent,
extract_hidden_states,
train_probes
)

Emotion in LLM Representations
Motivation
Large language models have shown a surprising ability to recognize emotions in text. Since these models already shape the information ecosystem and could influence real-world decisions (e.g., through their use in therapy, or their potential to manipulate users), there is active interest in understanding how exactly this ability emerges.
Tak et al. (2025) studied this question using mechanistic interpretability, an approach we will study in depth during the last two weeks of the course. The main idea is to identify where a model’s knowledge of a fact, or its ability to accomplish a particular task, is localized. An analogy is how fMRI helps show which parts of the human brain are active during particular cognitive processes.
They extracted activations at intermediate layers of several public LLMs when applied to the crowd-enVENT dataset, a collection of 6,800 short texts labeled with one of 13 emotion classes. For each layer, they trained a logistic regression on the extracted activations and measured how well it predicted the emotion label. When a layer led to a large increase in classification accuracy, the authors concluded that important emotion recognition “knowledge” was captured in that layer.
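The inference behind probing is worth making concrete: if a simple classifier trained on a layer's activations beats chance, the layer must encode label-relevant information. The toy sketch below (synthetic data, not the paper's pipeline) trains the same probe on activations that do and do not carry the label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, k = 600, 32, 13                     # samples, hidden size, emotion classes
labels = rng.integers(0, k, size=n)

# "Informative" activations: each class shifts the hidden state in its own direction.
class_dirs = rng.normal(size=(k, d))
informative = class_dirs[labels] + 0.5 * rng.normal(size=(n, d))
# "Uninformative" activations: pure noise, unrelated to the label.
noise = rng.normal(size=(n, d))

def probe_accuracy(X, y):
    # Fit on the first half, evaluate on the second, as a simple holdout.
    half = len(X) // 2
    clf = LogisticRegression(max_iter=1000).fit(X[:half], y[:half])
    return clf.score(X[half:], y[half:])

print(probe_accuracy(informative, labels))  # well above chance (1/13 ≈ 0.077)
print(probe_accuracy(noise, labels))        # near chance
```

The probe on noise hovers near the 1/13 baseline, while the probe on label-shifted activations scores far higher; the paper applies this same contrast across layers of a real model.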
Prepare Data
First, let’s load the packages we will use. The helper functions are defined in the emotion_probe_helpers.py file linked at the top of this notebook.
The main parameters of this script are included below. We refer mainly to tokens at the end of the input text, since the emotional content of the snippet should be recognizable by then.
DATA_PATH = 'https://zenodo.org/records/18371236/files/enVent_gen_Data.csv?download=1'
MODEL_ID = 'meta-llama/Llama-3.2-1B-Instruct'
# last word, second to last word, ...
TOKEN_POSITIONS = [-1, -2, -3, -4, -5]
BATCH_SIZE = 4
DEVICE = 'mps' if torch.backends.mps.is_available() else 'cpu'

The block below reads the data and counts the number of samples per emotion class.
df = load_envent(DATA_PATH)
df['emotion'].value_counts()

emotion
anger 550
boredom 550
disgust 550
fear 550
joy 550
neutral 550
pride 550
relief 550
sadness 550
surprise 550
trust 550
guilt 275
shame 275
Name: count, dtype: int64
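The probes below use an integer `emotion_id` column rather than the raw strings. How `load_envent` builds it is not shown; one plausible reconstruction (illustrative only, the helper may differ) maps each emotion string to a stable integer id with pandas:

```python
import pandas as pd

# Hypothetical reconstruction of the label encoding used by load_envent:
# sort=True assigns ids in alphabetical order of the emotion names.
df = pd.DataFrame({'emotion': ['anger', 'joy', 'anger', 'shame']})
df['emotion_id'] = pd.factorize(df['emotion'], sort=True)[0]
print(df)
```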
We can also print a few random samples.
ix = np.random.choice(len(df), size=10, replace=False)
df.iloc[ix, [1, -1]].values

array([['neutral',
'What are the inferred emotions in the following contexts? Context: When I graduated from college Answer:'],
['trust',
'What are the inferred emotions in the following contexts? Context: I was left alone with my nephew. The parents trusted me with their child Answer:'],
['trust',
'What are the inferred emotions in the following contexts? Context: My friend came back and showed me i can ... him. Answer:'],
['trust',
'What are the inferred emotions in the following contexts? Context: A friend told my to place my trust in him that he would catch me if I fell Answer:'],
['trust',
'What are the inferred emotions in the following contexts? Context: I purchased my home last month Answer:'],
['pride',
'What are the inferred emotions in the following contexts? Context: watching my niece perform in a west end stage show in london Answer:'],
['disgust',
'What are the inferred emotions in the following contexts? Context: My professor made a sexist comment to my friend about her making a sandwich instead of pursuing a degree Answer:'],
['shame',
'What are the inferred emotions in the following contexts? Context: I stole a stuffed rabbit from a store when I was 10. Growing up we didn't have much money. One day on a trip to the store with my dad, I saw a stuffed rabbit that I really wanted. I asked my dad if he could buy it for me, but because we were so short on money, we couldn't afford it. I took it and stuffed in it my shirt. We left the store, but it fell out of my shirt when we got out to the parking lot. My dad made me bring it back in and apologize to the manager of the store. I fell so ... of my actions, and it cemented my beliefs to never steal anything ever again. Answer:'],
['neutral',
'What are the inferred emotions in the following contexts? Context: when I watched a team win a game in a sport I did not particularly like. Answer:'],
['disgust',
'What are the inferred emotions in the following contexts? Context: my colleague told me a man spat on a baby in the library where I work Answer:']],
dtype=object)
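Notice that every sample shares the same prompt wrapper around the raw text. Presumably `load_envent` assembles this; a minimal version of the formatting (the names `TEMPLATE` and `make_prompt` are illustrative, not from the helper) would be:

```python
# Shared wrapper observed in the samples above.
TEMPLATE = ('What are the inferred emotions in the following contexts? '
            'Context: {text} Answer:')

def make_prompt(text):
    """Wrap a raw event description in the emotion-inference prompt."""
    return TEMPLATE.format(text=text)

prompt = make_prompt('I purchased my home last month')
print(prompt)
```

Ending the prompt with "Answer:" is what makes the final token positions informative: by that point the model has read the full context and is about to name the emotion.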
Extract Activations
Next, let’s load the model from the transformers package. Llama is a popular open-weight model pretrained on an internet-scale corpus. Notice that there are 16 transformer blocks, each with self-attention, MLP, and layernorm components, generally matching our discussion from this week.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
device_map=DEVICE,
)
model.eval()

LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
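One detail worth noticing in the printout: `k_proj` and `v_proj` map 2048 features down to 512, smaller than `q_proj`'s 2048. This is grouped-query attention, where several query heads share each key/value head. Assuming the standard Llama head dimension of 64 (an assumption about this checkpoint, though it matches Llama 3.2 defaults), the head counts follow directly:

```python
# Dimensions read off the model printout above.
d_model, kv_dim = 2048, 512
head_dim = 64  # assumed Llama 3.2 head size

num_q_heads = d_model // head_dim   # query heads
num_kv_heads = kv_dim // head_dim   # key/value heads, each shared by a group of queries
group_size = num_q_heads // num_kv_heads

print(num_q_heads, num_kv_heads, group_size)  # 32 query heads, 8 KV heads, groups of 4
```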
Here are the inputs \(x_i\) and targets \(y_i\) of interest.
texts = df['prompt'].tolist()
labels = df['emotion_id'].to_numpy()

Probing requires activations at every hidden layer, not just the final output. The function below extracts these. It takes about 10 minutes to run, since the model has 1 billion parameters and inference over thousands of texts is slow.
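The actual implementation lives in `emotion_probe_helpers.py`; the key mechanism it presumably relies on is that `transformers` models return every layer's activations when called with `output_hidden_states=True`. The sketch below is an illustrative reconstruction, not the helper's real code; it assumes left padding so that negative token positions count back from the end of each text.

```python
import torch

def extract_hidden_states(texts, model, tokenizer, device='cpu',
                          token_positions=(-1,), batch_size=4):
    """Collect hidden states at the given end-of-text token positions.

    Returns a tensor of shape (num_layers + 1, len(texts),
    len(token_positions), d_model); index 0 is the embedding output.
    """
    all_states = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Left padding keeps negative positions aligned with the text end.
        tokenizer.padding_side = 'left'
        inputs = tokenizer(batch, return_tensors='pt', padding=True).to(device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: one (batch, seq, d_model) tensor per layer (+ embeddings).
        stacked = torch.stack(out.hidden_states)          # (L+1, B, T, D)
        picked = stacked[:, :, list(token_positions), :]  # (L+1, B, P, D)
        all_states.append(picked.cpu())
    return torch.cat(all_states, dim=1)
```

The real helper may differ in details (e.g., returning NumPy arrays or dropping the embedding layer), but the `output_hidden_states=True` call is the standard way to get per-layer activations.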
hidden = extract_hidden_states(
texts,
model,
tokenizer,
device=DEVICE,
token_positions=TOKEN_POSITIONS,
batch_size=BATCH_SIZE,
)

Train Probes
Once we have the hidden states, we train a small multiclass logistic regression model on each layer, using the hidden state as the predictor and the emotion label as the response.
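A sketch of what `train_probes` presumably does is below. It is illustrative, not the helper's actual code; it assumes `hidden` has shape (num_layers, n_samples, len(token_positions), d_model) and returns accuracies indexable as `accuracy[layer][position]`, matching how the result is used later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_probes(hidden, labels, token_positions=(-1,)):
    """Fit one logistic-regression probe per (layer, token position).

    Assumes hidden has shape (num_layers, n_samples,
    len(token_positions), d_model). Returns a list of dicts so that
    accuracy[layer][position] is that probe's held-out accuracy.
    """
    accuracy = []
    for layer_states in hidden:
        per_pos = {}
        for i, pos in enumerate(token_positions):
            X = layer_states[:, i, :]
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, labels, test_size=0.2, random_state=0)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            per_pos[pos] = clf.score(X_te, y_te)
        accuracy.append(per_pos)
    return accuracy
```

Holding out a test split matters here: probes should be scored on unseen samples, otherwise a high-dimensional layer could "memorize" labels and inflate the localization signal.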
accuracy = train_probes(hidden, labels, token_positions=TOKEN_POSITIONS)

We can now visualize classification accuracy across layers; this corresponds to Figure 2 in Tak et al. (2025). The main finding is that performance rises sharply in the middle layers and then plateaus, suggesting that emotion-recognition information is concentrated by the middle of the network. The study found this pattern across many public LLMs.
As a sanity check, the classification accuracy here is far above the random-guessing baseline of 1/13 ≈ 0.077.
pos = -1
print(f'Accuracy at token {pos} by layer:')
for l in range(model.config.num_hidden_layers):
    print(f'  Layer {l+1:2d}: {accuracy[l][pos]:.3f}')

Accuracy at token -1 by layer:
  Layer  1: 0.220
Layer 2: 0.281
Layer 3: 0.334
Layer 4: 0.357
Layer 5: 0.408
Layer 6: 0.395
Layer 7: 0.435
Layer 8: 0.470
Layer 9: 0.517
Layer 10: 0.551
Layer 11: 0.559
Layer 12: 0.567
Layer 13: 0.562
Layer 14: 0.564
Layer 15: 0.545
Layer 16: 0.539
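The shape of this curve is easy to summarize programmatically. Using the per-layer accuracies transcribed from the output above:

```python
# Last-token probe accuracy per layer, copied from the printout above.
acc_by_layer = [0.220, 0.281, 0.334, 0.357, 0.408, 0.395, 0.435, 0.470,
                0.517, 0.551, 0.559, 0.567, 0.562, 0.564, 0.545, 0.539]

best = max(range(len(acc_by_layer)), key=acc_by_layer.__getitem__)
print(f'Peak: layer {best + 1} at {acc_by_layer[best]:.3f}')
# → peak at layer 12, after which accuracy plateaus and dips slightly
```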
The actual study covered many models and included intervention experiments. In addition to identifying where emotion knowledge is stored, the study checked whether manipulating small sets of activations (e.g., those of a single attention head) could cause large changes in the model’s predictions, which provides stronger, causal evidence of localization. Nonetheless, even this small analysis shows how we can make progress on a real-world question (how do LLMs recognize emotion?) without simply treating AI models as black boxes to be interrogated.
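Interventions of that kind are typically implemented with PyTorch forward hooks, which let you modify a module's output in place during a forward pass. The toy below (not the paper's code; the module path in the usage comment is illustrative) ablates one attention head's slice of a module's output:

```python
import torch
import torch.nn as nn

def zero_head_hook(head_idx, head_dim):
    """Build a forward hook that zeroes one head's slice of a module's output.

    A toy version of an ablation intervention: silence a head and then
    compare the model's predictions before and after.
    """
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        out = out.clone()
        out[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        # Returning a value from a forward hook replaces the module's output.
        return (out,) + output[1:] if isinstance(output, tuple) else out
    return hook

# Usage sketch (module path illustrative; attention modules may return tuples):
# handle = model.model.layers[8].self_attn.register_forward_hook(zero_head_hook(3, 64))
# ... rerun the emotion prompts and compare predictions ...
# handle.remove()
```

If zeroing a single head noticeably degrades emotion predictions while leaving other behavior intact, that is causal evidence the head matters for emotion recognition, which is a stronger claim than probe accuracy alone.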