Emotion in LLM Representations

Author

Kris Sankaran

Published

March 25, 2026

Notebook, Helpers, Conda Env

Motivation

Large language models have shown a surprising ability to recognize emotions in text. Since these models already shape the information ecosystem and could influence real-world decisions (e.g., through their use in therapy, or their potential to manipulate), there is active interest in understanding how exactly this ability emerges.

Tak et al. (2025) studied this question using mechanistic interpretability, an approach we will study in depth during the last two weeks of the course. The main idea is to identify where a model's knowledge of a fact, or its ability to accomplish a particular task, is localized. An analogy is how fMRI reveals which parts of the human brain are active during particular cognitive processes.

They extracted activations at intermediate layers of several public LLMs when applied to the crowd-enVENT dataset, a collection of 6,800 short texts labeled with one of 13 emotion classes. For each layer, they trained a logistic regression probe on the extracted activations and measured how well it predicted the emotion label. When a layer led to a large jump in classification accuracy, the authors concluded that important emotion recognition "knowledge" was captured at that layer.

Prepare Data

First, let’s load the packages we will use. The helper functions are defined in the emotion_probe_helpers.py file linked at the top of this notebook.

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from emotion_probe_helpers import (
    load_envent,
    extract_hidden_states,
    train_probes
)

The main parameters of this script are included below. We probe tokens at the end of the input text, since the emotional content of the snippet should be recognizable by then.

DATA_PATH = 'https://zenodo.org/records/18371236/files/enVent_gen_Data.csv?download=1'
MODEL_ID = 'meta-llama/Llama-3.2-1B-Instruct'

# last token, second-to-last token, ...
TOKEN_POSITIONS = [-1, -2, -3, -4, -5]
BATCH_SIZE = 4
DEVICE = 'mps' if torch.backends.mps.is_available() else 'cpu'

The block below reads the data and counts the number of samples per emotion class.

df = load_envent(DATA_PATH)
df['emotion'].value_counts()
emotion
anger       550
boredom     550
disgust     550
fear        550
joy         550
neutral     550
pride       550
relief      550
sadness     550
surprise    550
trust       550
guilt       275
shame       275
Name: count, dtype: int64
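
Since two classes are undersampled, the chance level worth comparing probe accuracies against is the majority-class baseline rather than a uniform 1/13. A quick computation from the counts above:

```python
# Class counts from the value_counts() output above
counts = {
    "anger": 550, "boredom": 550, "disgust": 550, "fear": 550, "joy": 550,
    "neutral": 550, "pride": 550, "relief": 550, "sadness": 550,
    "surprise": 550, "trust": 550, "guilt": 275, "shame": 275,
}
total = sum(counts.values())
baseline = max(counts.values()) / total   # always guess the biggest class
print(f"majority-class accuracy: {baseline:.3f}")  # → 0.083
```

So any probe accuracy well above ~8% reflects genuine signal in the activations.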

We can also print a few random samples.

ix = np.random.choice(len(df), size=10, replace=False)
df.iloc[ix, [1, -1]].values
array([['neutral',
        'What are the inferred emotions in the following contexts? Context: When I graduated from college Answer:'],
       ['trust',
        'What are the inferred emotions in the following contexts? Context: I was left alone with my nephew. The parents trusted me with their child Answer:'],
       ['trust',
        'What are the inferred emotions in the following contexts? Context: My friend came back and showed me i can ... him. Answer:'],
       ['trust',
        'What are the inferred emotions in the following contexts? Context: A friend told my to place my trust in him that he would catch me if I fell Answer:'],
       ['trust',
        'What are the inferred emotions in the following contexts? Context: I purchased my home last month Answer:'],
       ['pride',
        'What are the inferred emotions in the following contexts? Context: watching my niece perform in a west end stage show in london Answer:'],
       ['disgust',
        'What are the inferred emotions in the following contexts? Context: My professor made a sexist comment to my friend about her making a sandwich instead of pursuing a degree Answer:'],
       ['shame',
        'What are the inferred emotions in the following contexts? Context: I stole a stuffed rabbit from a store when I was 10.  Growing up we didn\x92t have much money.  One day on a trip to the store with my dad, I saw a stuffed rabbit that I really wanted.  I asked my dad if he could buy it for me, but because we were so short on money, we couldn\x92t afford it.  I took it and stuffed in it my shirt.  We left the store, but it fell out of my shirt when we got out to the parking lot.  My dad made me bring it back in and apologize to the manager of the store.  I fell so ... of my actions, and it cemented my beliefs to never steal anything ever again. Answer:'],
       ['neutral',
        'What are the inferred emotions in the following contexts? Context: when I watched a team win a game in a sport I did not particularly like. Answer:'],
       ['disgust',
        'What are the inferred emotions in the following contexts? Context: my colleague told me a man spat on a baby in the library where I work Answer:']],
      dtype=object)

Extract Activations

Next, let’s load the model from the transformers package. Llama is a popular open-weight model pretrained on an internet-scale corpus. Notice that there are 16 transformer blocks with self-attention, MLP, and layernorm components, generally matching our discussion from this week.

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=DEVICE,
)
model.eval()
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
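
One detail worth a second look in the printout: `k_proj` and `v_proj` project down to 512 dimensions rather than 2048. This is because Llama 3.2 uses grouped-query attention, in which several query heads share each key/value head. Assuming this model's standard head dimension of 64, the head counts work out as:

```python
# Dimensions read off the module printout above
d_model, kv_dim, head_dim = 2048, 512, 64

n_query_heads = d_model // head_dim   # queries span the full model width
n_kv_heads = kv_dim // head_dim       # keys/values use a narrower projection
group_size = n_query_heads // n_kv_heads  # query heads per key/value head
print(n_query_heads, n_kv_heads, group_size)  # → 32 8 4
```

So each key/value head serves a group of 4 query heads, shrinking the KV cache by the same factor.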

Here are the inputs \(x_i\) and targets \(y_i\) of interest.

texts  = df['prompt'].tolist()
labels = df['emotion_id'].to_numpy()

Probing requires activations at every hidden layer, not just the final output. The function below gets these. It takes about 10 minutes to run, since the model is large (1 billion parameters) and inference is slow.

hidden = extract_hidden_states(
    texts,
    model,
    tokenizer,
    device=DEVICE,
    token_positions=TOKEN_POSITIONS,
    batch_size=BATCH_SIZE,
)
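
The `transformers` API exposes per-layer activations via `output_hidden_states=True`, which returns one `(batch, seq_len, d_model)` tensor per layer. The main subtlety a helper like this has to handle is indexing the right token positions when texts in a batch are padded to different lengths. Here is a minimal sketch of that indexing step on random stand-in tensors, assuming right-padding (the helper's actual implementation may differ):

```python
import torch

def gather_token_states(hidden, attention_mask, positions):
    """Pick out hidden states at positions counted from the end of each
    (right-padded) sequence. hidden: (batch, seq, d); mask: (batch, seq)."""
    lengths = attention_mask.sum(dim=1)   # true length of each sequence
    batch_ix = torch.arange(hidden.size(0))
    # lengths + p is the index of the p-th token from the end (p negative)
    return {p: hidden[batch_ix, lengths + p] for p in positions}

# Toy stand-in for one layer's activations: batch of 2, max length 6, d_model 4
hidden = torch.arange(2 * 6 * 4, dtype=torch.float).reshape(2, 6, 4)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],   # first text has 4 real tokens
                     [1, 1, 1, 1, 1, 1]])  # second has 6
states = gather_token_states(hidden, mask, positions=[-1, -2])
print(states[-1].shape)  # one d_model-vector per text
```

In the real pipeline, this gather would be applied to each tensor in the `hidden_states` tuple returned by `model(**inputs, output_hidden_states=True)`.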

Train Probes

Once we have the hidden states, we train a small multiclass logistic regression model on each layer, using the hidden state as the predictor and the emotion label as the response.

accuracy = train_probes(hidden, labels, token_positions=TOKEN_POSITIONS)
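
The probe itself is just a small multinomial logistic regression. A minimal sketch of what `train_probes` might do for a single layer and token position, using scikit-learn on synthetic activations and labels (so accuracy should land near the 1/13 chance level; the helper's actual train/test protocol may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, d_model, n_classes = 500, 64, 13

# Synthetic stand-ins for one layer's last-token activations and labels
X = rng.normal(size=(n_samples, d_model))
y = rng.integers(n_classes, size=n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)   # near 1/13 here, since features are random
print(f"probe accuracy: {acc:.3f}")
```

Repeating this fit over all layers and token positions gives the accuracy grid analyzed next.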

We can now visualize classification accuracy across layers — this corresponds to Figure 2 in Tak et al. (2025). The main finding is that performance rises sharply in the middle layers and then plateaus, suggesting that emotion recognition is concentrated in the middle of the network. The study found this pattern across many public LLMs.
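
A minimal way to draw this curve with matplotlib, assuming `accuracy` is a per-layer list of dictionaries keyed by token position (as the indexing `accuracy[l][pos]` below suggests); stand-in values are substituted here so the sketch runs on its own:

```python
import matplotlib
matplotlib.use("Agg")   # headless backend, so no display is needed
import matplotlib.pyplot as plt

# Stand-in values with the same shape as `accuracy` in this notebook:
# one dict per layer, keyed by token position.
accuracy = [{-1: 0.22 + 0.02 * l} for l in range(16)]

layers = range(1, len(accuracy) + 1)
plt.plot(layers, [a[-1] for a in accuracy], marker="o")
plt.axhline(1 / 13, linestyle="--", label="chance (1/13)")
plt.xlabel("layer")
plt.ylabel("probe accuracy (last token)")
plt.legend()
plt.savefig("probe_accuracy.png")
```

Swapping in the real `accuracy` list reproduces the rise-then-plateau shape described above.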

As a sanity check, the classification accuracy here is much better than the roughly 8% chance level.

pos = -1
print(f'Accuracy at token {pos} by layer:')
for l in range(model.config.num_hidden_layers):
    print(f'  Layer {l+1:2d}: {accuracy[l][pos]:.3f}')
Accuracy at token -1 by layer:
  Layer  1: 0.220
  Layer  2: 0.281
  Layer  3: 0.334
  Layer  4: 0.357
  Layer  5: 0.408
  Layer  6: 0.395
  Layer  7: 0.435
  Layer  8: 0.470
  Layer  9: 0.517
  Layer 10: 0.551
  Layer 11: 0.559
  Layer 12: 0.567
  Layer 13: 0.562
  Layer 14: 0.564
  Layer 15: 0.545
  Layer 16: 0.539

The actual study covered many models and included intervention experiments. In addition to identifying where emotion knowledge is stored, the authors checked whether manipulating small sets of activations (e.g., an attention head) could cause large changes in the model's predictions, which provides stronger evidence of localization. Still, even this small analysis shows how we can make progress on a real-world question about AI models (how do LLMs recognize emotion?) without simply treating them as black boxes to be interrogated from the outside.

References

Tak, Ala N., Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. 2025. “Mechanistic Interpretability of Emotion Inference in Large Language Models.” https://doi.org/10.48550/ARXIV.2502.05489.