Introduction to Deep Learning

Author

Kris Sankaran

Published

March 16, 2026

\[ \newcommand{\bs}[1]{\mathbf{#1}} \newcommand{\reals}{\mathbb{R}} \newcommand{\widebar}[1]{\overline{#1}} \newcommand{\E}{\mathbb{E}} \newcommand{\indic}[1]{\mathbb{1}\left\{{#1}\right\}} \newcommand{\Earg}[1]{\mathbb{E}\left[{#1}\right]} \newcommand{\Esubarg}[2]{\mathbb{E}_{#1}\left[{#2}\right]} \]

Readings: 1 (sections 12.1 - 12.7, but skip 12.5.2 - 12.5.3), 2 (optional), Code

Items marked \(^{\dagger}\) are not in the required reading and will not be tested.

Setup

Goal. Given examples \(\{\left({x_i, y_i}\right)\}_{i = 1}^{N}\), learn a predictor \(f : x \mapsto y\). In fact, the ultimate goal is more ambitious – we want a generic recipe for learning predictors automatically, regardless of whether the inputs are tables, images, sentences, …

Requirements. The recipe must,

  • Apply out-of-the-box to heterogeneous input/output structures (see table).
  • Handle highly nonlinear relationships.
  • Scale to large \(N\).
| Task | \(x_i\) | \(y_i\) | Example |
|------|---------|---------|---------|
| Image Classification | image | label | Task from Helber et al. (2017). |
| Object Detection | image | label + bounding box | Credit to Yagnik B on Wikimedia. |
| Language Modeling | sequence of words | next word | |
| Sentence Translation | sequence of words | sequence of words | |
| Style Transfer | image | new image | Mona Lisa in Starry Night style, from Wikimedia. |

Approach. We represent data as \(D\)-dimensional tensors. Define parameterized modules \(f_l(\cdot; \theta_l)\) and compose them into architectures \(f = f_{L} \circ \dots \circ f_{1}\). Intermediate representations \(h_l\) bridge input \(x\) with output \(y\). Fit \(\theta_l\) by minimizing a loss function.

Tensor Data Structures

  1. We can often represent data as \(D\)-dimensional tensors (multidimensional arrays). Vectors and matrices are the \(D = 1\) and \(D = 2\) cases.

  2. This works across many problem contexts. E.g., class labels \(y_i \in \{1, \dots, K\}\) become one-hot encoded vectors, so tensors capture standard statistics problems. RGB images are 3D tensors (Height \(\times\) Width \(\times\) Color Channel), and video adds a Time axis, so its tensors are 4D. A document is a Vocabulary \(\times\) Document Length tensor where each word in the document is one-hot encoded. Document lengths vary, however, so a dataset may contain tensors of different sizes.

  3. We can stack a “batch” of \(n\) tensors into a single tensor. For example, 10 RGB images of size \(16 \times 16 \times 3\) can be written as a \(10 \times 16 \times 16 \times 3\) tensor. This is helpful because GPUs perform tensor arithmetic in parallel, so batching lets us optimize over many examples at once.
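The stacking step above can be sketched in a couple of lines of numpy; random arrays stand in for real images here:

```python
import numpy as np

# Hypothetical batch of 10 RGB images, each 16 x 16 x 3 (random stand-ins).
images = [np.random.rand(16, 16, 3) for _ in range(10)]

# Stacking along a new leading axis gives a single 10 x 16 x 16 x 3 tensor.
batch = np.stack(images)
print(batch.shape)  # (10, 16, 16, 3)
```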

Perceptrons

  1. We want a general recipe for mapping tensors to predictions. This recipe will be made from composable modules. We start with the simplest case: mapping a vector \(x \in \reals^D\) to a binary label \(y \in \{0, 1\}\) using the perceptron (Rosenblatt (1958)).

    Layers of a simple perceptron.
  2. Definition. A perceptron combines two transformations: \[\begin{align*} x \xrightarrow{f} z \xrightarrow{g} y \end{align*}\] where \[\begin{align} f\left(x\right) &= w^\top x + b \space (\text{linear layer})\\ g\left(z\right) &= \indic{z > 0} \space (\text{activation}). \end{align}\] In other words, take a linear combination of inputs and threshold. The result is a binary classifier. Historically, this was viewed as a model of a neuron that “fires” when its inputs reach a threshold.
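As a sketch, the two transformations \(f\) and \(g\) can be written directly in numpy; the particular weights below are illustrative choices, not values from the text:

```python
import numpy as np

def perceptron(x, w, b):
    """Linear layer f(x) = w^T x + b followed by the threshold activation g."""
    z = w @ x + b        # linear layer
    return int(z > 0)    # activation: "fires" when z exceeds 0

# Illustrative parameters: this perceptron fires when x_1 + x_2 > 1.
w = np.array([1.0, 1.0])
b = -1.0
print(perceptron(np.array([2.0, 0.5]), w, b))  # 1
print(perceptron(np.array([0.2, 0.3]), w, b))  # 0
```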

  3. \(D = 2\). The vector \(w\) defines a direction in \(\reals^{2}\). The set \(\{x: w^\top x = -b\}\) is the decision boundary between classes. Changing \(w\) rotates the boundary and changing \(b\) translates it. A few choices of \(w\) and \(b\) are shown below.

    Interpreting weights and biases in a perceptron.

    Exercise. Draw the decision boundary when \(w = \begin{bmatrix} 1 \\ 1\end{bmatrix}\) and \(b = -1\). Repeat for \(w = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\) and \(b = 0\).

    Exercise. TRUE FALSE The decision boundary is always perpendicular to \(w\).

  4. \(D > 2\). This logic holds in higher dimensions. The perceptron assigns \(y = 1\) whenever \(x\) has a large enough projection onto \(w\), i.e., \(w^\top x > -b\). The decision boundary is now a \(D - 1\)-dimensional hyperplane.

  5. Given \(\{x_i, y_i\}_{i = 1}^{N}\), we learn \(w\) and \(b\) by solving, \[\begin{align*} \hat{w}, \hat{b} &= \arg \min_{w \in \reals^{D}, b \in \reals} \frac{1}{N}\sum_{i = 1}^{N}L\left(w^\top x_i + b, y_i\right). \end{align*}\] Rosenblatt (1958) used \(L\left(\hat{y}, y\right) := \indic{\hat{y} \neq y}\), a nonconvex 0-1 loss which is hard to optimize. Modern approaches replace this loss with smooth surrogates, like cross-entropy, which are amenable to gradient-based optimization.

    We adjust \(w\) and \(b\) to minimize the training loss.
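A minimal numpy sketch of this optimization, assuming we swap the 0-1 loss for the smooth cross-entropy surrogate and run plain gradient descent; the data below are synthetic stand-ins, not the figures from the notes:

```python
import numpy as np

# Synthetic, linearly separable data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -1.0]) > 0.5).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid turns w^T x + b into P(y = 1)
    grad_z = p - y                       # gradient of cross-entropy wrt the linear output
    w -= lr * X.T @ grad_z / len(y)      # average gradient over the N examples
    b -= lr * grad_z.mean()

acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(acc)
```

The smooth surrogate is what makes the gradient steps meaningful; the 0-1 loss has zero gradient almost everywhere.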

Multilayer Networks

  1. The perceptron can only draw linear boundaries. To handle nonlinear classification, we compose multiple layers, giving the multilayer perceptron (MLP).

  2. Two Layers (\(L = 2\)). The simplest MLP alternates linear maps with nonlinear activations, \[\begin{align*} x \xrightarrow{f_1} z^1 \xrightarrow{g_1} h^1 \xrightarrow{f_2} z^2 \xrightarrow{g_2} \hat{y} \end{align*}\] where each linear layer is, \[\begin{align} f_{l}\left(v\right) &= W_{l}v + b_{l} \space & \quad\quad \triangleleft \quad \text{layer } \ell \text{ weights and biases}. \end{align}\] Here, \(W_{l} \in \reals^{D_{l + 1} \times D_{l}}\), \(b_{l} \in \reals^{D_{l + 1}}\) and \(g\) is a nonlinear activation function applied elementwise, like \(\left[g\left(x\right)\right]_{d} =\max(x_{d},0)\). Note that the argument for \(f\) is a generic \(v \in \reals^{D_{l}}\). At the first layer, this is \(x\), but at deeper layers, it is the current hidden representation \(h^{l - 1}\).

    Representation of transformations in a two layer MLP.
  3. Geometric interpretation (\(L = 2)\). Each row of \(W_1\) defines a linear classifier in the input space \(x\). The second layer \(W_2\) mixes these classifiers. This allows for piecewise linear decision boundaries, which are more flexible than the ordinary perceptron. Increasing \(D_2\) (the number of rows of \(W_1\)) adds more classifiers to the mixture, allowing more complex boundaries.

    A two-dimensional example of the geometric interpretation.

    A one-dimensional example of the geometric interpretation, to show how multilayer networks could apply to regression rather than classification.

    Exercise: For each \(i, j\), what is the sign of \(\left[W_{1}\right]_{ij}\)?

    Exercise: TRUE FALSE The dataset visualized below can be perfectly classified by a two layer MLP.

    Is two layers enough?
  4. Deeper models. The \(L\)-layer MLP repeats the pattern, \[\begin{align*} x \xrightarrow{f_1} z^1 \xrightarrow{g_1} h^1 \xrightarrow{f_2} \dots \xrightarrow{g_{L}} h^{L} \xrightarrow{g_{\text{out}}} \hat{y} \end{align*}\] so that we can continue mixing increasingly complex classifiers.
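The repeated pattern can be sketched as a short numpy loop. For simplicity, this sketch applies ReLU at every layer and omits a separate output map \(g_{\text{out}}\); the dimensions are illustrative:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Alternate linear layers f_l with elementwise ReLU activations g_l."""
    h = x
    for W, b in zip(weights, biases):
        z = W @ h + b            # preactivation z^l
        h = np.maximum(z, 0)     # postactivation h^l (ReLU)
    return h

rng = np.random.default_rng(0)
dims = [2, 8, 8, 1]              # input in R^2, two hidden layers, scalar output
weights = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(3)]
biases = [rng.normal(size=dims[l + 1]) for l in range(3)]
out = mlp_forward(np.array([0.5, -1.0]), weights, biases)
print(out.shape)  # (1,)
```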

Parameters and Activations

  1. There are two types of quantities involved with a deep learning model, and they each play a different role in learning and prediction.

    • Parameters \(\theta = \{W_l, b_l\}_{l = 1}^{L}\) are learned from the full data \(\{x_i, y_i\}\). They are shared across all inputs.
    • Activations are intermediate representations of a single input \(x\) as it travels through the network. We distinguish preactivations \(z^l = f^l\left(h^{l - 1}\right)\) from postactivations \(h^l = g(z^l)\) depending on whether the nonlinearity \(g\) has been applied.
  2. Training adjusts \(\theta\) so that the final representations \(h^{L}\) are predictive of \(y\).

    Distinction between parameters and activations (Credit to Torralba et al.).
  3. If \(g\) is differentiable (almost everywhere), the entire network \(f = g_{\text{out}} \circ f_{L} \circ \dots \circ g_{1} \circ f_{1}\) is a composition of nearly smooth functions. The result remains nearly smooth, meaning we can compute \(\nabla_{\theta} L\) and minimize the loss using stochastic gradient descent.
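This is what makes frameworks like torch practical: a single backward pass fills in \(\nabla_{\theta} L\) for every parameter. A minimal sketch with an illustrative toy network and squared-error loss:

```python
import torch
import torch.nn as nn

# Illustrative toy network and batch (names and sizes are our choices).
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
x = torch.randn(8, 2)
y = torch.randn(8, 1)

loss = nn.MSELoss()(model(x), y)
loss.backward()  # populates p.grad with dL/dp for every parameter p

for name, p in model.named_parameters():
    print(name, tuple(p.grad.shape))  # each gradient shares its parameter's shape
```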

Types of Layers

  1. We originally set out to build composable tensor processing modules. The MLP illustrates the principle. We next catalog modules used in practice.

  2. Linear layers. The map \(x_{\text{out}} = f\left(x_{\text{in}}; \theta\right) := W x_{\text{in}} + b\), which has parameters \(\theta = \{W, b\}\). Each row of \(W\) acts as a single neuron, activating on a subset of the input space.

    A linear layer translates and skews the input distribution according to \(W\) and \(b\).
  3. Activation layers. Activations are nonlinearities that let a deep network represent many types of functions. The composition of linear layers is still linear, so without these activations, even a deep composition would collapse to a single linear map. Common choices are given below; note that they are applied elementwise,

    \[\begin{aligned} x_{\text{out}}[d] &= \begin{cases} 1, &\text{if} \quad x_{\text{in}}[d] > 0\\ 0, & \text{otherwise} \end{cases} & \quad\quad \triangleleft \quad \text{threshold}\\ x_{\text{out}}[d] &= \frac{1}{1 + e^{-x_{\text{in}}[d]}} & \quad\quad \triangleleft \quad \text{sigmoid}\\ x_{\text{out}}[d] &= 2\times \text{sigmoid}(2x_{\text{in}}[d])-1 & \quad\quad \triangleleft \quad \text{tanh}\\ x_{\text{out}}[d] &= \max(x_{\text{in}}[d],0) & \quad\quad \triangleleft \quad \text{ReLU}\\ \end{aligned}\]

    Common nonlinearities, as given in reading.
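These four nonlinearities are short enough to implement directly. A numpy sketch (the function names are ours; `tanh_` uses the identity from the table, which agrees with the standard \(\tanh\)):

```python
import numpy as np

def threshold(x): return (x > 0).astype(float)
def sigmoid(x):   return 1 / (1 + np.exp(-x))
def tanh_(x):     return 2 * sigmoid(2 * x) - 1   # identity used in the notes
def relu(x):      return np.maximum(x, 0)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(threshold(x))  # [0. 0. 1.]
```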

    Exercise. The visualization below applies one of these activations to 2D inputs \(x_{\text{in}}\). Which one?

    What nonlinearity is this?
  4. Normalization Layers. These do not appear in classical MLPs but are common in modern architectures. Their role is to stabilize training dynamics and improve the loss landscape for stochastic gradient descent.

    • Batch Normalization. Normalize each coordinate \(d\) across a batch of \(B\) inputs, \[\mu_d = \frac{1}{B} \sum_{b=1}^{B} x_{\text{in}}^{(b)}[d], \qquad \sigma_d^2 = \frac{1}{B} \sum_{b=1}^{B} \left(x_{\text{in}}^{(b)}[d] - \mu_d\right)^2\] \[x_{\text{out}}^{(b)}[d] = \gamma_d \frac{x_{\text{in}}^{(b)}[d] - \mu_d}{\sigma_d} + \beta_d\] The tensor dimension \(D_{\text{in}}\) remains unchanged after normalization. The learnable parameters \(\gamma, \beta \in \reals^{D_{\text{in}}}\) let each neuron have its own scale and shift, which controls the magnitude and frequency of the later activations. Note that \(\mu_d\) and \(\sigma_d\) couple samples within a batch, creating a subtle dependence.
    • \(\ell^2\) normalization. Project each input onto the \(\ell^2\) unit sphere, \[\begin{aligned} x_{\text{out}}[d] = \frac{x_{\text{in}}[d]}{\sqrt{\sum_{d' = 1}^{D} x_{\text{in}}^{2}[d']}} \end{aligned}\] This has no learnable parameters and no batch dependence.

    The effect of \(\ell^2\) normalization on the input tensor’s data distribution.
    • Layer Norm. Like batch normalization, it both centers and scales, but over the \(D\) coordinates of a single input, \[\begin{aligned} \mu &= \frac{1}{D} \sum_{d=1}^D x_{\text{in}}[d]\\ \sigma^2 &= \frac{1}{D} \sum_{d=1}^D \left(x_{\text{in}}[d] - \mu\right)^2\\ x_{\text{out}}[d] &= \gamma_{d} \frac{x_{\text{in}}[d] - \mu}{\sigma} + \beta_{d} \end{aligned}\] Like \(\ell^2\) normalization, it processes each sample independently, without coupling across the batch; like batch normalization, the learnable \(\gamma_d\) and \(\beta_d\) preserve flexibility.

    Batch vs. Layer Norm (credit to Torralba et al.).
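The contrast between batch and layer normalization comes down to which axis is averaged over. A numpy sketch, with an illustrative \(32 \times 4\) batch and \(\gamma = 1, \beta = 0\):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each coordinate d across the batch axis (axis 0)."""
    mu, var = X.mean(axis=0), X.var(axis=0)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize the D coordinates of each single sample (axis 1)."""
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

X = np.random.default_rng(0).normal(size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
# Batch norm: each column has roughly zero mean across the batch.
print(np.allclose(batch_norm(X, gamma, beta).mean(axis=0), 0, atol=1e-6))  # True
# Layer norm: each row has roughly zero mean across its coordinates.
print(np.allclose(layer_norm(X, gamma, beta).mean(axis=1), 0, atol=1e-6))  # True
```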
  5. Output Layers. These convert internal representations \(h^l\) into a form matched to the prediction target \(y\).

    • Softmax Layer. Maps an activation to a probability distribution over \(K\) classes,

      \[\begin{aligned} x_{\text{out}}[d] &= \frac{e^{\tau x_{\text{in}}[d]}} {\sum_{k=1}^K e^{\tau x_{\text{in}}[k]}} \end{aligned}\]

      The output can support either prediction (argmax \(\to\) class label) or sampling (draw from the distribution of plausible outputs). The temperature \(\tau > 0\) controls the concentration of the distribution: large \(\tau\) sharpens the distribution towards the most probable class, while small \(\tau\) flattens it towards the uniform, resulting in more diverse samples.

    • Image rescaling. For image generation, map activations to the pixel range \(\left[0, 255\right]\),

      \[\begin{aligned} x_{\text{out}}[d] &= 255*\text{sigmoid}(x_{\text{in}}[d]) \end{aligned}\]
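The temperature's effect on the softmax is easy to check numerically. A short sketch, following the \(e^{\tau x}\) convention above:

```python
import numpy as np

def softmax(x, tau=1.0):
    """Temperature-scaled softmax, following the e^{tau * x} convention."""
    e = np.exp(tau * x - np.max(tau * x))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.0])
print(softmax(z, tau=1.0))   # moderate concentration on the largest coordinate
print(softmax(z, tau=5.0))   # large tau sharpens towards the argmax
print(softmax(z, tau=0.1))   # small tau flattens towards the uniform
```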
  6. Many other layers exist, like convolutional, gated recurrent unit, and transformer layers. They can be used to encode specific inductive biases (e.g., convolutions pool local spatial features). The main idea is that any layer that takes a tensor in, produces a tensor out, and is differentiable with respect to learnable parameters can be composed into a deep learning architecture. Layers are interchangeable parts, like Lego blocks.

Learning Representations

  1. The reading lists five reasons for deep learning’s success. Summarizing,

    1. High capacity – large networks can learn flexible functions.
    2. Differentiable – parameters can be optimized out-of-the-box with stochastic gradient descent.
    3. Good inductive biases – architectures reflect real-world structure.
    4. Hardware friendly – tensor operations parallelize on GPUs.
    5. Learned abstractions – Networks automatically learn useful intermediate representations.
  2. Point 1 (high capacity) also makes deep networks difficult to interpret. We can’t understand what a network will do with a new input \(x\) by inspecting \(\theta\) directly. We have to run the computation.

  3. \(^\dagger\) Property 5 (learned abstractions) deserves closer consideration. Each layer transforms its input, and by the final layer, the representation \(h^{L}\) should make predicting \(y\) straightforward. In the toy example below, what requires a nonlinear boundary in the input space \(x\) becomes linearly separable in \(h^{L}\).

    Learned representations transform data distributions so that classes are easily separated.
  4. \(^\dagger\) More amazingly, deeper layers learn progressively more abstract patterns. E.g., in image classification, they learn pixels \(\to\) edges \(\to\) parts \(\to\) objects. Traditionally, abstract features were hand-engineered (e.g., SIFT descriptors in computer vision; Lowe (1999)). That networks discover them from data alone is one of the central phenomena associated with deep learning, and we will spend several weeks studying exactly what “learned abstraction” really means.

    Deeper layers learn increasingly abstract representations of the input data. Figure from Bengio (2009).

Code Example

  1. We simulated a simple 2D dataset below. Samples are assigned to classes according to whether they lie inside the circle of radius one centered at the origin; the labels of 5% of samples are then flipped.

    import numpy as np
    import pandas as pd
    import torch
    import torch.nn as nn
    import torch.optim as optim
    
    N = 100
    x = np.random.standard_t(12, size=(N, 2))
    y = (np.sum(x**2, axis=1) < 1).astype(int)
    
    # 5% noise
    flip_ix = np.random.choice(N, N // 20, replace=False)
    y[flip_ix] = 1 - y[flip_ix]
  2. The torch package defines many layers which we can immediately combine into a model using nn.Sequential. In the code below, \(W_{1} \in \reals^{20 \times 2}, W_{2}, W_3 \in \reals^{20 \times 20}\), and \(W_{4} \in \reals^{2 \times 20}\). We use ReLU nonlinearities \(g\) and a softmax output layer to allow classification into the two classes.

    model = nn.Sequential(
     nn.Linear(2, 20), nn.ReLU(),
     nn.Linear(20, 20), nn.ReLU(),
     nn.Linear(20, 20), nn.ReLU(),
     nn.Linear(20, 2), nn.Softmax(dim=1)
    )
  3. Even though our data are already stored as matrices and vectors, torch needs them stored as a tensor data type. This is what allows training to use the GPU when it is available.

    X_train = torch.tensor(x, dtype=torch.float32)
    y_train = torch.tensor(y, dtype=torch.long)
  4. The code below runs a form of stochastic gradient descent (optim.Adam) to learn the weights \(W_{l}\) and biases \(b_{l}\) that minimize a smooth version of the misclassification loss (CrossEntropyLoss). The optimization is run for 100 steps.

    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(100):
     optimizer.zero_grad() # forget previous gradient
     loss = criterion(model(X_train), y_train) # compute loss at current parameters
     loss.backward() # evaluate the gradient
     optimizer.step() # take a gradient step
  5. I’ve hidden some tedious code to create a fine grid along the data domain, stored in the tensor X_grid, but the block below runs the fitted model on each of those test points by calling the model object. The output has two columns, corresponding to predicted probabilities for the two classes; we only visualize one class. Note that we converted the output to a numpy array to make plotting more straightforward. We can see that the output layer separates the classes well, though note the influence of a few noisy observations.

    with torch.no_grad():
     grid_df['y_hat'] = model(X_grid)[:, 1].numpy()
  6. We can access activations before the output layer by indexing into the model object. The code below gets the post-layer-1 activations \(h^{1}\) from the model above. We can see that it learns a linear boundary along one of the sides of the circle separating the classes.

    submodel = model[:2] # layer 1 postactivation
    with torch.no_grad():
     grid_df['h'] = submodel(X_grid).numpy()[:, 2] # neuron 3

    Exercise: Modify the code above to draw the activations for (i) a different neuron in the same layer, and (ii) a neuron in a different layer. Comment on the differences.

References

Bengio, Y. 2009. “Learning Deep Architectures for AI.” Foundations and Trends in Machine Learning 2 (1): 1–127. https://doi.org/10.1561/2200000006.
Helber, Patrick, Benjamin Bischke, Andreas Dengel, and Damian Borth. 2017. “EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification.” arXiv. https://doi.org/10.48550/ARXIV.1709.00029.
Lowe, D. G. 1999. “Object Recognition from Local Scale-Invariant Features.” In Proceedings of the Seventh IEEE International Conference on Computer Vision, 1150–1157 vol.2. IEEE. https://doi.org/10.1109/iccv.1999.790410.
Rosenblatt, F. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386–408. https://doi.org/10.1037/h0042519.