Introduction to Deep Learning
Readings: 1 (sections 12.1 - 12.7, but skip 12.5.2 - 12.5.3), 2 (optional), Code
Items marked \(^{\dagger}\) are not in the required reading and will not be tested.
Setup
Goal. Given examples \(\{\left({x_i, y_i}\right)\}_{i = 1}^{N}\), learn a predictor \(f : x \mapsto y\). In fact, the ultimate goal is more ambitious: we want a generic recipe for automatically learning predictors, regardless of whether the inputs are tables, images, sentences, …
Requirements. The recipe must,
- Apply out-of-the-box to heterogeneous input/output structures (see table).
- Handle highly nonlinear relationships.
- Scale to large \(N\).
| Task | \(x_i\) | \(y_i\) |
|---|---|---|
| Image Classification | image | label |
| Object Detection | image | label + bounding box |
| Language Modeling | sequence of words | next word |
| Sentence Translation | sequence of words | sequence of words |
| Style Transfer | image | new image |
Approach. We represent data as \(D\)-dimensional tensors. Define parameterized modules \(f_l(\cdot; \theta_l)\) and compose them into architectures \(f = f_{L} \circ \dots \circ f_{1}\). Intermediate representations \(h_l\) bridge input \(x\) with output \(y\). Fit \(\theta_l\) by minimizing a loss function.
Tensor Data Structures
We can often represent data as \(D\)-dimensional tensors (multidimensional arrays). Vectors and matrices are the \(D = 1\) and \(D = 2\) cases.
This works across many problem contexts. E.g., class labels \(y_i \in \{1, \dots, K\}\) become one-hot encoded vectors, so tensors capture standard statistics problems. RGB images are 3D tensors (Height \(\times\) Width \(\times\) Color Channel), and video adds a Time axis, making its tensors 4D. A document is a Vocabulary \(\times\) Document Length tensor in which each word is one-hot encoded. Document lengths vary, though, so a dataset may contain tensors of different sizes.
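The one-hot encoding above is a one-liner in NumPy; the labels and \(K\) below are illustrative choices, not from the text.

```python
import numpy as np

# One-hot encode class labels y_i in {1, ..., K} as K-dimensional vectors.
K = 4
y = np.array([1, 3, 2])        # three labels, 1-indexed as in the text
one_hot = np.eye(K)[y - 1]     # row i is the standard basis vector e_{y_i}
print(one_hot)
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]]
```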
We can stack a “batch” of \(n\) tensors into a single tensor. For example, 10 RGB images of size \(16 \times 16 \times 3\) can be written as a \(10 \times 16 \times 16 \times 3\) tensor. This is helpful because GPUs perform tensor arithmetic in parallel, so batching lets us optimize over many examples at once.
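The stacking step can be sketched as follows; the image contents are random placeholders.

```python
import numpy as np

# Stack a "batch" of 10 RGB images, each 16 x 16 x 3, into one 4-D tensor.
images = [np.random.rand(16, 16, 3) for _ in range(10)]
batch = np.stack(images)   # new leading batch axis
print(batch.shape)         # (10, 16, 16, 3)
```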
Perceptrons
We want a general recipe for mapping tensors to predictions. This recipe will be made from composable modules. We start with the simplest case: mapping a vector \(x \in \reals^D\) to a binary label \(y \in \{0, 1\}\) using the perceptron (Rosenblatt (1958)).
Layers of a simple perceptron. Definition. A perceptron combines two transformations: \[\begin{align*} x \xrightarrow{f} z \xrightarrow{g} y \end{align*}\] where \[\begin{align} f\left(x\right) &= w^\top x + b \space (\text{linear layer})\\ g\left(z\right) &= \indic{z > 0} \space (\text{activation}). \end{align}\] In other words, take a linear combination of inputs and threshold. The result is a binary classifier. Historically, this was viewed as a model of a neuron that “fires” when its inputs reach a threshold.
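The two transformations are a few lines of code. The example weights below (\(w = [1, 1]\), \(b = -1\)) are one of the choices used in the exercise that follows, so the classifier fires exactly when \(x_1 + x_2 > 1\).

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron: linear layer f(x) = w^T x + b, then threshold activation g."""
    z = w @ x + b               # linear layer
    return 1 if z > 0 else 0    # activation: indicator of z > 0

w = np.array([1.0, 1.0])
b = -1.0
print(perceptron(np.array([1.0, 1.0]), w, b))  # 1, since 1 + 1 - 1 > 0
print(perceptron(np.array([0.0, 0.0]), w, b))  # 0, since 0 + 0 - 1 < 0
```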
\(D = 2\). The vector \(w\) defines a direction in \(\reals^{2}\). The set \(\{x: w^\top x = -b\}\) is the decision boundary between classes. Changing \(w\) rotates the boundary and changing \(b\) translates it. A few choices of \(w\) and \(b\) are shown below.
Interpreting weights and biases in a perceptron. Exercise. Draw the decision boundary when \(w = \begin{bmatrix} 1 \\ 1\end{bmatrix}\) and \(b = -1\). Repeat for \(w = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\) and \(b = 0\).
Exercise. True or false: the decision boundary is always perpendicular to \(w\).
\(D > 2\). This logic holds in higher dimensions. The perceptron assigns \(y = 1\) whenever \(x\) has a large enough projection onto \(w\), i.e., \(w^\top x > -b\). The decision boundary is now a \((D - 1)\)-dimensional hyperplane.
Given \(\{x_i, y_i\}_{i = 1}^{N}\), we learn \(w\) and \(b\) by solving, \[\begin{align*} \hat{w}, \hat{b} &= \arg \min_{w \in \reals^{D}, b \in \reals} \frac{1}{N}\sum_{i = 1}^{N}L\left(w^\top x_i + b, y_i\right). \end{align*}\] Rosenblatt (1958) used \(L\left(\hat{y}, y\right) := \indic{\hat{y} \neq y}\), a nonconvex 0-1 loss which is hard to optimize. Modern approaches replace this loss with smooth surrogates, like cross-entropy, which are amenable to gradient-based optimization.
We adjust \(w\) and \(b\) to minimize the training loss.
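A minimal sketch of this fit, assuming the cross-entropy (logistic) surrogate in place of the 0-1 loss and plain full-batch gradient descent. The toy AND-style dataset, learning rate, and step count are all illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_perceptron(X, y, lr=1.0, steps=2000):
    """Minimize mean cross-entropy over (w, b) by gradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)       # predicted P(y = 1 | x)
        grad_z = (p - y) / N         # d(mean cross-entropy)/dz per example
        w -= lr * (X.T @ grad_z)     # chain rule through the linear layer
        b -= lr * grad_z.sum()
    return w, b

# Linearly separable toy data: label is 1 only when both inputs are 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])
w, b = fit_perceptron(X, y)
preds = (X @ w + b > 0).astype(int)
print(preds)  # recovers y on this separable toy set
```

The surrogate loss is smooth, so every parameter moves a little on each step, unlike the 0-1 loss, whose gradient is zero almost everywhere.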
Multilayer Networks
The perceptron can only draw linear boundaries. To handle nonlinear classification, we compose multiple layers, giving the multilayer perceptron (MLP).
Two Layers (\(L = 2\)). The simplest MLP alternates linear maps with nonlinear activations, \[\begin{align*} x \xrightarrow{f_1} z^1 \xrightarrow{g_1} h^1 \xrightarrow{f_2} z^2 \xrightarrow{g_2} \hat{y} \end{align*}\] where each linear layer is, \[\begin{align} f_{l}\left(v\right) &= W_{l}v + b_{l} \space & \quad\quad \triangleleft \quad \text{layer } l \text{ weights and biases}. \end{align}\] Here, \(W_{l} \in \reals^{D_{l + 1} \times D_{l}}\), \(b_{l} \in \reals^{D_{l + 1}}\), and \(g\) is a nonlinear activation function applied elementwise, such as the ReLU \(\left[g\left(v\right)\right]_{d} = \max(v_{d}, 0)\). Note that the argument of \(f_l\) is a generic \(v \in \reals^{D_{l}}\). At the first layer, this is \(x\), but at deeper layers, it is the current hidden representation \(h^{l - 1}\).
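The forward pass \(g_2 \circ f_2 \circ g_1 \circ f_1\) can be sketched directly from the definitions; the dimensions \(D_1 = 2\), \(D_2 = 4\), \(D_3 = 1\), the random weights, and the threshold output activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D1, D2, D3 = 2, 4, 1                                  # illustrative layer widths
W1, b1 = rng.standard_normal((D2, D1)), np.zeros(D2)  # f1: R^D1 -> R^D2
W2, b2 = rng.standard_normal((D3, D2)), np.zeros(D3)  # f2: R^D2 -> R^D3

def relu(v):
    return np.maximum(v, 0.0)            # elementwise activation g

def mlp(x):
    h1 = relu(W1 @ x + b1)               # z1 = f1(x); h1 = g1(z1)
    z2 = W2 @ h1 + b2                    # z2 = f2(h1)
    return (z2 > 0).astype(int)          # g2: threshold to a binary label

x = np.array([0.5, -1.0])
print(mlp(x))                            # a binary prediction of shape (1,)
```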
Representation of transformations in a two layer MLP. Geometric interpretation (\(L = 2)\). Each row of \(W_1\) defines a linear classifier in the input space \(x\). The second layer \(W_2\) mixes these classifiers. This allows for piecewise linear decision boundaries, which are more flexible than the ordinary perceptron. Increasing \(D_2\) (the number of rows of \(W_1\)) adds more classifiers to the mixture, allowing more complex boundaries.
A two-dimensional example of the geometric interpretation.