Teaching attention, part 1 / N
Why is it hard to make sense of attention in the deep learning literature?
I’ve been preparing some notes for a tutorial on memory and attention, and while the material for memory seems to be coming together relatively smoothly, designing a coherent discussion around attention has been harder.
The main difficulty seems to be that the same word (“attention”) is used to refer to many distinct things in practice. For example, there are differences between
- Implicit vs. explicit attention
- Location-based vs. content-based attention
- Hard vs. soft attention
- External vs. self attention
It’s not worth going into the distinctions now – the point is that, you have to be able to hold many different definitions in your head, in order to hold a conversation on attention. In the same spirit, if you talk to someone who says that their model “uses attention,” you have to be a bit of a gadfly to figure out what’s actually going on.
Learning many definitions in and of itself wouldn’t be such a big problem, though it does take a concerted effort. But I suspect that the issue runs a bit deeper, and that attention is treated more like a problem solving technique, rather than any concrete computational device. The analogy that comes to mind is the standard machine in analysis: you use it to prove all sorts of theorems, and people become comfortable using it in conversations, but no one ever really bothers to define it.
This can make it complicated to talk attention across problem domains. While it speaks the general utility of attention as a problem solving technique, it can be a little discomforting to have to keep track of many small variants on something that you thought you understood. Here’s a small sampling of problem areas where I’ve seen attention come in handy,
- Text translation
- Image classification in presence of clutter
- Caption generation
- Speech recognition
- Image generation
- Language modeling
- Algorithm learning (from examples)
- Handwriting generation
- Graph modeling
In each domain, you can usually find any number papers claiming to have solved the problem using attention. A jaded researcher (cough cough) would think that people have discovered that a recipe for writing papers is to pick a random combination of types of attention and apply it to a problem area (“hard content-based self-attention for handwriting generation”), but let’s be positive and say that human creativity knows no limits.
Within and across these domains, people propose different attention mechanisms. For example, both DRAW and Self-Attention GANs both have a way of sequentially attending to parts of an image during generation, but that’s about the only thing in common. In contrast, it’s not hard to find attention mechanisms for language translation that differ by things as seemingly inconsequential as concatenating or linearly mixing inputs (turns out choices like this do have an impact on training and performance).
In some domains, certain heuristics are useful, but they are far from universal. For example, the “translation” analogy is useful in modalities like image captioning and speech recognition, not just language. But it doesn’t apply at all for, say, self-attention. Alternatively, consider the cognitive science parallel to foveal attention. This is a nice way to understand models that sequentially process parts of an image, but what to make of attention in settings that have no clear cognitive science parallel (prediction on graphs using graph attention?). Another common analogy is to differentiable computers – attention modules can be thought of as reading from memory on an as-needed basis. But even this analogy (which I prefer) can lead to to some confusion, as architectures inspired by this analogy often explicitly include writing mechanisms to accompany attention, and these can mechanisms can exist independently.
One small quirk, just to add to the confusion, is that attention has a meandering history. Location-based and hard attention appeared first, and while they are no longer dominant, there are still reasons to prefer them in certain circumstances, which make them very relevant in current research. Papers involving attention also have a bad habit of making a splash – I’m sure you can name a few attention papers by name – but as a consequence, attention might seem like a hat trick that you can pull to solve a few specialized problems (e.g., translation and caption generation), rather than problem solving principle that can be applied in more general settings.
But Not All is Lost
Just writing about how key ideas are covered in the overgrowth of novel research doesn’t actually help anyone. It also doesn’t do justice to people who have created excellent explanations of attention – I’m thinking of Alex Graves’ lectures and several posts on the distill blog.
What are strategies that seem useful for learning about attention?
A starting point seems to be understanding a few examples in depth. I mean, being able to describe the problem and architectural context, and then saying what new computational / mathematical operations are introduced which warrant the name “attention.” Visual, geometric, and probabilistic interpretations – even those in toy, sandbox examples – can also build fluency. And being able to implement them in a deep learning library of course doesn’t hurt.
Second, it seems worth building a catalog of examples, appearing across the literature. This gives reference points when encountering new ideas, abstract things become concrete when they can be compared to something you are confident you understand. In my mind, if I could take two random papers talking about attention, and if I could relate them to one another (what is similar? what is different?), only then would I say that I deeply understand attention.
But this is probably not the right way to think of it. It’s probably more fruitful to think of shades of understanding in this situation – in some of its variants, or in some of its contexts, attention might be a familiar friend, in others, it might require a bit more digging to see what is happening and why anyone would have wanted it to have happened in that way. This isn’t a bad thing, it just means you might discover yourself learning afresh something that you thought you had mastered long ago.
Anyways, isn’t it impressive how much I’ve written without actually defining even one example of attention? But this is just part one of N…
Kris at 18:40