Transformers

Author

Kris Sankaran

Published

March 23, 2026

\[ \newcommand{\bs}[1]{\mathbf{#1}} \newcommand{\reals}{\mathbb{R}} \newcommand{\widebar}[1]{\overline{#1}} \newcommand{\E}{\mathbb{E}} \newcommand{\indic}[1]{\mathbb{1}\left\{{#1}\right\}} \newcommand{\Earg}[1]{\mathbb{E}\left[{#1}\right]} \newcommand{\exp}[1]{\text{exp}\left({#1}\right)} \newcommand{\Esubarg}[2]{\mathbb{E}_{#1}\left[{#2}\right]} \]

Reading, Code

Items marked \(^{\dagger}\) are not in the required reading and will not be tested. The required reading covers some methods beyond this handout; those also won’t be tested.

Setup

  1. Definition. We view a single sample as a sequence of \(N\) tokens, \(x_{n} \in \reals^{D}\). For example,

    • In language, we represent a document as a sequence of \(N\) words. The tokenizer below converts each word to its ID in a vocabulary of size 50K.
    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokens = tokenizer(" to be or not to be")
    tokens["input_ids"]
    [284, 307, 393, 407, 284, 307]

    To keep the dimensionality \(D\) of the tokens manageable, we associate each word with a column in an embedding matrix \(x_{n} = W_{\cdot,v}\) where \(v \in \{1, \dots, V\}\) is the word’s ID in the vocabulary and \(W \in \reals^{D \times V}\) for \(D \ll V\).

    import torch
    
    # V = 50257 words, D = 16 dimensions
    embed = torch.nn.Embedding(num_embeddings=50257, embedding_dim=16)
    X = embed(torch.LongTensor(tokens["input_ids"])).T
    torch.round(X, decimals=2) # notice columns 1 - 2 match 5 - 6
    tensor([[-0.2700,  1.3800,  0.0300, -0.8800, -0.2700,  1.3800],
            [-0.7100, -0.3000, -0.2100, -1.0900, -0.7100, -0.3000],
            [-0.8800, -0.1000,  0.0900, -1.6100, -0.8800, -0.1000],
            [ 1.0800, -0.8900, -1.6300,  0.8200,  1.0800, -0.8900],
            [-0.3400,  0.9700, -0.0900, -0.4000, -0.3400,  0.9700],
            [ 0.3000,  1.0300,  1.5700, -0.7000,  0.3000,  1.0300],
            [ 0.2100,  2.1600,  0.9900, -0.7600,  0.2100,  2.1600],
            [-0.4600,  1.4200,  0.9000,  0.3900, -0.4600,  1.4200],
            [-0.8600, -0.3900, -0.1300, -1.3400, -0.8600, -0.3900],
            [ 0.9700,  0.6500,  0.9800,  3.1000,  0.9700,  0.6500],
            [ 1.2100, -1.3100,  1.8400,  1.3700,  1.2100, -1.3100],
            [-0.5800, -0.4300, -0.2500, -2.2100, -0.5800, -0.4300],
            [ 0.2500, -1.0000,  1.8900, -0.0900,  0.2500, -1.0000],
            [-1.2400,  1.1100,  2.0400, -0.2700, -1.2400,  1.1100],
            [ 0.9500, -0.7500,  0.2700, -1.1300,  0.9500, -0.7500],
            [-0.7200,  1.7500,  0.7900,  0.9700, -0.7200,  1.7500]],
           grad_fn=<RoundBackward1>)
    • In vision, we represent an image as a sequence of \(N\) image patches. Here \(D = \text{patch width} \times \text{patch height}\), the total number of pixels in each patch.

    Sequence representations are useful in vision as well. Image from Dosovitskiy et al. (2020)
  2. Goal. Map a token sequence \(X \in \reals^{D \times N}\) to a representation useful for prediction. Specifically, we want a map parameterized by weights \(\theta_{m}\), \(f_{\theta_{m}}: \reals^{D \times N} \to \reals^{D \times N}\), \[\begin{align*} X^{m} = f_{\theta_{m}}(X^{m - 1}) \end{align*}\] that can be applied recursively within a deep model, similar to the “linear layer + nonlinear activation” block used in MLPs.

  3. Requirements

    • The map must act on the full sequence simultaneously, unlike an MLP, which processes one vector at a time.

    • The map must capture long-range dependence. It must “remember” information from earlier in the sequence. This is unlike GPs and traditional statistical models.

    To see why long-range dependence is important, consider patterns like quotes in books or braces in code. These affect the interpretation of the text for a long time after they start. Image from Karpathy, Johnson, and Li (2015).
  4. Approach. We compose transformer blocks \(f_{\theta_1}, \dots, f_{\theta_M}\) to gradually refine the sequence representation, \[\begin{align*} X^{0} \xrightarrow{f_{\theta_{1}}} X^{1} \xrightarrow{f_{\theta_{2}}} \dots \xrightarrow{f_{\theta_{M}}} X^{M} \end{align*}\] Like in MLPs, the parameters \(\theta_{1}, \dots, \theta_{M}\) will be optimized to accomplish some downstream task. We omit optimization details, since our focus is interpretability.

  5. Preview. Each block \(f_{\theta_{m}}\) will have the form,

    A summary of the overall transformer block from the reading. Everything below the first \(\bigoplus\) is Stage 1 and everything above is Stage 2.

    We’ve seen MLP and LayerNorm steps before, though we’ll consider why they are helpful in this context. A new component is the multihead self-attention operation (MHSA). This is what allows the transformation to learn long-range dependence.

Stage 1: Attention across sequence

  1. Attention. In layer \(m\), we want a nonnegative matrix \(A^{m} \in \reals^{N \times N}\) whose columns sum to one. Column \(n'\) is viewed as a distribution over the input tokens \(n\) that are relevant to token \(n'\). Informally, token \(n'\) attends to token \(n\) according to the weight \(A_{nn'}^{m}\).

  2. If we knew \(A^{m}\), we could compute,

    \[\begin{align*} Y^{m} = X^{m - 1}A^{m} \end{align*}\]

    and call it the output of stage 1. This sets the \(n^{th}\) column of \(Y^{m}\) to be a convex combination of columns of \(X^{m - 1}\) where the mixing weights come from \(A^{m}\).

    Attention mixes columns of \(X^{m - 1}\) to get \(Y^m\). Figure from the reading.

    Geometric interpretation of mixing columns of \(X^{m - 1}\).
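    To make the mixing concrete, here is a minimal sketch with a hand-built column-stochastic matrix (the matrix `A` below is illustrative, not learned):

```python
import torch

# D = 2 features, N = 3 tokens
X = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

# column-stochastic attention: each column sums to one
A = torch.tensor([[1.0, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.5]])

Y = X @ A  # column n of Y is a convex combination of columns of X
print(Y)
print(A.sum(dim=0))  # all ones
```

    The second column of \(Y\) averages the first two columns of \(X\), exactly as the convex-combination view predicts.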
  3. There are unfortunately two problems,

    • Where do the attention weights come from?
    • This operation can select which tokens to mix, but not what information to extract from them. This will be addressed by the values matrix \(V_{h}\) below.

    We’ll resolve both issues step-by-step. To simplify notation, we suppress the \(m\) superscripts for now – the remainder of the note focuses on a single layer \(m\).

  4. MHSA (attempt 1: position-only). A first attempt would be to define similarity according to the distance between tokens. For example,

    \[\begin{align*} A_{nn'} &= \frac{\exp{-\left|n - n'\right|}}{\sum_{n^{''} = 1}^{N} \exp{-\left|n^{''} - n^{\prime}\right|}} \end{align*}\] This can’t learn long-range dependence since weights decay with distance.
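    This position-only rule is easy to implement directly; a short sketch (the sequence length \(N = 6\) is arbitrary):

```python
import torch

N = 6
pos = torch.arange(N, dtype=torch.float32)

# logits -|n - n'|, then normalize so each column sums to one
logits = -(pos.unsqueeze(1) - pos.unsqueeze(0)).abs()
A = torch.softmax(logits, dim=0)

# weight decays geometrically with distance, so tokens far apart
# effectively cannot attend to one another
print(A[:, 0])
```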

  5. MHSA (attempt 2: content-based similarity). We use the tokens themselves to define attention. Tokens with similar representations attend to one another. We can measure this similarity with inner products \(x_{n}^{\top} x_{n'}\), \[\begin{align*} A_{nn'} &= \frac{\exp{x_{n}^\top x_{n^{\prime}}}}{\sum_{n^{\prime\prime}= 1}^{N} \exp{x_{n^{\prime\prime}}^\top x_{n^\prime}}} \end{align*}\] Similarity is now content (not position) based.

    Exercise: Can this version of attention learn long-range dependence? Explain your reasoning.

    A = torch.softmax(X.T @ X, dim=0)

    A geometric representation of content-based similarity. A high inner product corresponds to a small angle.
  6. MHSA (attempt 3: subspace-specific similarities). This inner product imposes a single notion of similarity. In reality, multiple types exist. For example, “scale” and “arpeggio” might be similar in the sense of both being musical terms, while “scale” and “fish” might be similar in the sense of being a part of the animal. To this end, we project the tokens \(x_{n}\) onto a subspace spanned by the \(M < D\) rows of \(U \in \reals^{M \times D}\). The projected coordinates are given by \(Ux_{n}\), and their associated similarity is, \[\begin{align*} A_{nn'} &= \frac{\exp{\left(Ux_{n}\right)^\top Ux_{n^{\prime}}}}{\sum_{n^{\prime\prime} = 1}^{N}\exp{\left(Ux_{n^{\prime\prime}}\right)^\top Ux_{n^{\prime}}}} \end{align*}\]

  7. To illustrate, consider a representation where the first and last four dimensions reflect animal and music related meanings, respectively. Scale has high values in both. Subspace-specific \(U\)’s will be able to learn different types of similarity. In a real model, \(U\) is learned, not set manually, but the idea is the same (each projection induces its own similarity).

    test_words = ["scale", "fish", "arpeggio", "music", "random", "other"]
    X = torch.tensor([
     [0.8, 0.9, 0.1, 0.7, 0.7, 0.8, 0.1, 0.2],  # scale: both
     [0.9, 1.0, 0.0, 0.8, 0.0, 0.1, 0.0, 0.1],  # fish: animal only
     [0.0, 0.1, 0.0, 0.0, 0.9, 1.0, 0.8, 0.9],  # arpeggio: music only
     [0.1, 0.0, 0.1, 0.1, 0.8, 0.9, 0.9, 1.0],  # music: music only
     [0.1, 0.2, 0.1, 0.0, 0.1, 0.0, 0.2, 0.1],  # random: low on both
     [0.0, 0.1, 0.2, 0.1, 0.0, 0.2, 0.1, 0.0],  # other: low on both
    ]).T
    
    # projections onto the animal (first four) and music (last four)
    # coordinates; illustrative, not learned
    U_animal = torch.cat([torch.eye(4), torch.zeros(4, 4)], dim=1)
    U_music = torch.cat([torch.zeros(4, 4), torch.eye(4)], dim=1)
    
    # two attention matrices
    A_animal = torch.softmax((U_animal @ X).T @ (U_animal @ X), dim=0)
    A_music = torch.softmax((U_music @ X).T @ (U_music @ X), dim=0)

    A geometric representation of subspace-based similarity. Different \(U_{h}\) correspond to different subspaces.
  8. MHSA (attempt 4: asymmetry). With a single \(U\), similarity is symmetric \(A_{nn'} = A_{n'n}\). Separate projections \(U_k, U_q\) break this symmetry, \[\begin{align*} A_{nn'} &= \frac{\exp{\left(U_{k}x_{n}\right)^\top U_{q}x_{n^{\prime}}}}{\sum_{n^{\prime\prime} = 1}^{N}\exp{\left(U_{k}x_{n^{\prime\prime}}\right)^\top U_{q}x_{n^{\prime}}}} \end{align*}\] In the literature, the quantities \(U_{q}x\) and \(U_{k}x\) are called queries and keys, respectively. Token \(n'\) “asks” the query \(U_q x_{n'}\) and token \(n\) presents a key \(U_k x_n\). Attention is high when the key aligns with the query. In pseudocode, this is,

    Uq = torch.randn(6, 8)
    Uk = torch.randn(6, 8)
    
    A_asym = torch.softmax((Uk @ X).T @ (Uq @ X), dim=0)

    Exercise: Notice that in attempt 3, the example heatmap visualization is not symmetric. Why not?

  9. MHSA (final form). Finally, we can consider many keys and queries. This can capture the different senses of “scale” above. Specifically, for each layer \(m\), we define \(H\) pairs of matrices, \[\begin{align*} U_{kh} \in \reals^{M \times D} \qquad U_{qh} \in \reals^{M \times D} \end{align*}\] which allow us to create \(H\) different versions of the attention matrices \(A_{h}\), \[\begin{align*} \left[A_{h}\right]_{nn'} &= \frac{\exp{\left(U_{kh}x_{n}\right)^\top U_{qh}x_{n'}}}{\sum_{n^{\prime\prime} = 1}^{N}\exp{\left(U_{kh}x_{n^{\prime\prime}}\right)^\top U_{qh}x_{n^{\prime}}}} \end{align*}\] Then at layer \(m\), instead of simply using \(Y^{m} = X^{m - 1}A^{m}\) as defined above, we consider \[\begin{align*} Y^{m} = \sum_{h = 1}^{H} V_{h}^{m} X^{m - 1}A_{h}^{m} \end{align*}\] where \(V_{h}^{m} \in \reals^{D \times D}\) is a linear transformation of the attention-mixed input. Each head \(h\) can specialize. Its \(U_{kh}, U_{qh}\) find which tokens are relevant under head \(h\)’s notion of similarity, and its \(V_h\) determines what to extract from those tokens.
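    The full multihead update can be sketched in a few lines (random matrices stand in for the learned parameters; the dimensions are illustrative):

```python
import torch

torch.manual_seed(0)
D, N, M, H = 8, 5, 4, 2  # token dim, sequence length, head dim, heads

X = torch.randn(D, N)
U_k = [torch.randn(M, D) for _ in range(H)]
U_q = [torch.randn(M, D) for _ in range(H)]
V = [torch.randn(D, D) for _ in range(H)]

Y = torch.zeros(D, N)
for h in range(H):
    # column n' of A_h is the softmax of key/query inner products
    A_h = torch.softmax((U_k[h] @ X).T @ (U_q[h] @ X), dim=0)
    Y = Y + V[h] @ X @ A_h

print(Y.shape)  # (D, N): same shape as the input, as required
```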

  10. In the literature \(V_{h}\) are called “values” – when the query matches a key according to the similarities \(A_{h}\), we return the values \(V_{h}^{m}X^{m - 1}\). \(A_h^m\) selects which tokens to mix and \(V_h^m\) selects what to extract from them. For example, in the sentence “The movie was not good,” the word “good” might attend most closely to the word “not.” The token \(x_{n'}^{m - 1}\) representing “not” might capture many aspects of this word (e.g., it is a short word, it is common, …) besides the part that’s relevant (it negates what follows). \(V\) can learn to extract the negation aspect and suppress the rest.

    Interpretation of \(V_{h}^{m}\) focusing on particular coordinates of \(X^{m - 1}\).
  11. To summarize, we have parameters \(U_{qh}^{m}, U_{kh}^{m}, V_{h}^{m}\) for each layer \(m\), which learn the similarities between pairs of tokens and linearly transform the representations \(X^{m - 1}\) from layer \(m - 1\) into \(X^{m}\) at layer \(m\).

Stage 2: MLP across features

  1. Stage 1 learns dependence across tokens but treats each feature separately (every row of \(X^{m - 1}\) is mixed by the same weights \(A^m\)). Stage 2 applies a per-token nonlinear transformation that learns interactions across feature dimensions.

  2. Specifically, Stage 1 returns, \[\begin{align*} Y^m = X^{m-1} + \text{MHSA}(\bar{X}^{m-1}) \end{align*}\] where \(\bar{X}^{m-1} = \text{LayerNorm}(X^{m-1})\) centers/scales each token’s representation (with learned shift and scale parameters, as discussed in the last note). Stage 2 then modifies each token’s features, \[\begin{align*} x_n^m = y_n^m + \text{MLP}(\bar{y}_n^m) \end{align*}\] where we similarly normalize \(\bar{Y}^m = \text{LayerNorm}(Y^m)\) and use lower case symbols to denote the \(n^{th}\) columns of \(X^m\) and \(Y^m\). Notice only the correction needs LayerNorm.

    Exercise: What are the dimensions of \(x_{n}^{m}\)? For a single layer, how many times do we have to execute the MLP?

  3. Expressing this as pseudocode,

    # Stage 1: mix across tokens
    X_norm = layer_norm(X)
    Y = X + mhsa(X_norm)
    
    # Stage 2: mix across features
    Y_norm = layer_norm(Y)
    X_next = Y + mlp(Y_norm)
  4. Both stages use residual connections, meaning that output = input + correction. The reason for using these connections is that, at initialization, the MLP and MHSA weights are small, so both corrections are near zero and the entire block acts as essentially the identity. Training gradually learns the corrections. This is reminiscent of boosting, where each training step refines the residuals of the current fit.
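    A tiny numerical illustration of this near-identity behavior: with small weights standing in for the freshly initialized MHSA/MLP corrections, the residual output barely moves from the input.

```python
import torch

torch.manual_seed(0)
D, N = 8, 5
X = torch.randn(D, N)

# a "correction" with small weights, as at initialization
W = 0.01 * torch.randn(D, D)
correction = W @ X

X_next = X + correction  # residual connection: output = input + correction
print((X_next - X).norm() / X.norm())  # small: the block is near the identity
```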

  5. Putting everything together, we arrive again at Figure 7 from the reading,

    The same summary from before. We’ve now gone over each component.

    This is the map \(f_{\theta_{m}} : \reals^{D \times N} \to \reals^{D \times N}\) from the setup, with parameters \(\theta_m\) containing the query, key, and value matrices from MHSA, the MLP weights, and the scale/shift parameters for the two LayerNorm steps. Stacking \(M\) of these blocks gives a full transformer model.

    \[\begin{align*} X^0 \xrightarrow{f_{\theta_1}} X^1 \xrightarrow{f_{\theta_2}} \cdots \xrightarrow{f_{\theta_M}} X^M \end{align*}\]

    Each layer enriches token representations by sharing information across tokens (Stage 1) and features (Stage 2). Stacking layers allows the representations to be mixed again so that higher-level, indirect relationships can emerge.

    Exercise: Explain each part of the figure above. What are the variables’ dimensions? What do the arrows represent?
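    As a closing sketch, the whole block can be written compactly in PyTorch. This uses torch.nn.MultiheadAttention as a stand-in for the MHSA derived above, and the rows-as-tokens convention that PyTorch expects (the handout puts tokens in columns); all sizes are illustrative.

```python
import torch

class Block(torch.nn.Module):
    def __init__(self, D, H):
        super().__init__()
        self.ln1 = torch.nn.LayerNorm(D)
        self.ln2 = torch.nn.LayerNorm(D)
        self.mhsa = torch.nn.MultiheadAttention(D, H, batch_first=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(D, 4 * D), torch.nn.GELU(), torch.nn.Linear(4 * D, D)
        )

    def forward(self, X):
        Xn = self.ln1(X)
        Y = X + self.mhsa(Xn, Xn, Xn)[0]  # Stage 1: mix across tokens
        return Y + self.mlp(self.ln2(Y))  # Stage 2: mix across features

# stack M = 3 blocks; the shape (batch, N, D) is preserved throughout
blocks = [Block(D=16, H=4) for _ in range(3)]
X = torch.randn(1, 10, 16)
for f in blocks:
    X = f(X)
print(X.shape)
```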

Code Example

  1. \(^{\dagger}\) Here’s a small example inspecting the attention weights from a pretrained model, downloaded from the HuggingFace repository of pretrained models. The tokenizer object converts free character strings into sequences of tokens.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
    model = AutoModelForSequenceClassification.from_pretrained(
     "textattack/bert-base-uncased-SST-2",
     attn_implementation="eager"
    )

    The model behaves as we expect on a few made-up examples. The two columns of the softmaxed logits correspond to negative and positive sentiment, respectively.

    reviews = [
     "terrible, 1 star",
     "great, changed my life"
    ]
    inputs = tokenizer(reviews, return_tensors="pt", padding=True)
    logits = model(**inputs).logits
    torch.softmax(logits, dim=1)
    tensor([[9.9849e-01, 1.5102e-03],
            [4.0360e-04, 9.9960e-01]], grad_fn=<SoftmaxBackward0>)
  2. \(^{\dagger}\) To inspect the attention matrices, we can use the .attentions attribute of the model output. The way these matrices are stored differs from model to model, but conceptually they always include a collection of heads that reweight the \(N\) sequence inputs relative to one another.

    text = "A wonderful little production. The filming technique is very unassuming, old time BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece."
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_attentions=True)
    A = outputs.attentions[10][0, 11].detach()  # layer 10, batch 0, head 11
  3. \(^{\dagger}\) Just out of interest, let’s study some of the hidden states from this model. The function below loops over the input reviews, converts each into a sequence of tokens, and extracts the final layer’s representation \(X^{M}\).

    import numpy as np
    import pandas as pd
    
    def hidden_states(reviews, tokenizer, model, batch_size=16):
     embeddings = []
     model.eval()
     with torch.no_grad():
         for i in range(0, len(reviews), batch_size):
             # get the current batch of reviews
             batch = list(reviews[i:i + batch_size])
             inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
             # get hidden states
             outputs = model(**inputs, output_hidden_states=True)
             embeddings.append(outputs.hidden_states[-1][:, 0, :].numpy())
     return np.vstack(embeddings)

    We next apply that function to a random sample of 300 reviews. We’ve hidden the code, but we then run PCA to organize those reviews in a 2D plot.

    imdb = pd.read_csv("../data/imdb.csv").sample(300, random_state=2026)
    h = hidden_states(imdb["review"].tolist(), tokenizer, model)
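    The hidden PCA step might look something like the following sketch, which uses numpy’s SVD; the random matrix below stands in for the actual hidden states `h` (BERT’s hidden size is 768).

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(300, 768))  # placeholder for the BERT hidden states

# PCA via SVD of the centered data
h_centered = h - h.mean(axis=0)
U, S, Vt = np.linalg.svd(h_centered, full_matrices=False)
scores = h_centered @ Vt[:2].T  # 2D coordinates for plotting

print(scores.shape)  # (300, 2)
```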

References

Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2020. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv. https://doi.org/10.48550/ARXIV.2010.11929.
Karpathy, Andrej, Justin Johnson, and Fei-Fei Li. 2015. “Visualizing and Understanding Recurrent Networks.” arXiv. https://doi.org/10.48550/ARXIV.1506.02078.