We are making predictions all the time, often without realizing it. For example, imagine we are waiting at a bus stop and want to guess how long it will be before a bus arrives. We can combine many sources of evidence, like the number of people already waiting at the stop or the current weather, to arrive at a reasonable guess.
To think about the process formally, we could imagine a vector \(\mathbf{x}_i \in \mathbb{R}^{D}\) reflecting \(D\) characteristics of our environment. If we collected data about how long we actually had to wait, call it \(y_i\), for every day in a year, then we would have a dataset \[\begin{align*} \left(\mathbf{x}_1, y_1\right) \\ \left(\mathbf{x}_2, y_2\right) \\ \vdots \\ \left(\mathbf{x}_{365}, y_{365}\right) \\ \end{align*}\] and we could try to summarize the relationship \(\mathbf{x}_i \to y_i\). Methods for making this process automatic, based simply on a training dataset, are called supervised learning methods.
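To make the shapes concrete, here is a minimal numpy sketch. The three features and the simulated values are made up purely for illustration; the point is only to fix ideas about the dimensions of the data.

import numpy as np

# Hypothetical features for each of 365 days: number of people already at the
# stop, an indicator for bad weather, and the hour of day. Values are simulated
# placeholders, only meant to illustrate the shapes of X and y.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.poisson(5, size=365),        # count-valued feature
    rng.integers(0, 2, size=365),    # categorical feature, coded 0 / 1
    rng.integers(6, 22, size=365),   # hour of day
])
y = rng.gamma(shape=2.0, scale=4.0, size=365)  # nonnegative waiting times, in minutes

X.shape, y.shape  # ((365, 3), (365,))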
In the above example, the inputs were a mix of count (the number of people at the stop) and categorical (the weather) data types, and our response was a nonnegative continuous value. In general, we could have arbitrary data types for either the input or the response variable. A few types of outputs are so common that they come with their own names: when the response is continuous, the problem is called regression, and when it is categorical, it is called classification.
For example, predicting how long we will wait for the bus is a regression problem, while predicting whether a patient's disease will recur is a classification problem.
There are in fact many other types of responses (ordinal, multiresponse, survival, functional, image-to-image, …), each of which comes with its own name and set of methods, but for our purposes, it’s enough to focus on regression and classification.
There is a nice geometric way of thinking about supervised learning. For regression, think of the inputs on the \(x\)-axis and the response on the \(y\)-axis. Regression then becomes the problem of estimating a one-dimensional curve from data.
In higher dimensions, this curve becomes a surface.
If some of the inputs are categorical (e.g., poor vs. good weather), then the regression function is no longer a continuous curve, but we can still identify group means.
For classification, we can instead color each training point according to its class; the goal is then to find boundaries between regions with clearly distinct colors. In higher dimensions, the view is analogous. For example, for disease recurrence, blood pressure and resting heart rate might be enough to make a good guess about whether a patient will have a recurrence or not.
When we have many input features, the equivalent formula is \[\begin{align*} f_{b}\left(x\right) = b_0 + b_1 x_1 + \dots + b_{D}x_{D} := b^{T}x, \end{align*}\] where I’ve used the dot-product from linear algebra to simplify notation (after having appended a 1). This kind of model is called a linear regression model.
We estimate \(b\) by minimizing the squared-error loss over the training data, \[\begin{align*} L\left(b\right) = \sum_{i = 1}^{N} \left(y_i - b^{T}x_{i}\right)^2. \end{align*}\]
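As a sketch of what this loss looks like in code (not the fitting routine we will use below, and assuming the 1 has already been appended to each \(x_i\)):

import numpy as np

def squared_error_loss(b, X, y):
    # Sum of squared residuals; X is N x (D + 1) with a leading column of ones.
    residuals = y - X @ b
    return np.sum(residuals ** 2)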
To describe such a boundary, we need to define a direction \(b\) perpendicular to it. We will say that whenever \[\begin{align*} f_{b}\left(x\right) = \frac{1}{1 + \exp\left(-b^T x\right)} \end{align*}\] is larger than 0.5, we’re in the red region, and whenever it’s smaller than 0.5, we’re in the purple region. This kind of model is called a logistic regression model.
The coefficients \(b\) are estimated by minimizing the cross-entropy loss, \[\begin{align*} -\left[\sum_{i = 1}^{N} y_i \log\left(f_{b}\left(x_i\right)\right) + \left(1 - y_i\right) \log\left(1 - f_{b}\left(x_i\right)\right)\right]. \end{align*}\]
To understand this loss, note that each term reduces to either the blue or the red curve, depending on whether \(y_i\) is 1 or 0.
If 1 is predicted with probability 1, then there is no loss (and conversely for 0). The loss increases the further the predicted probability is from the true class.
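Here is a minimal numpy sketch of the logistic regression probabilities and this cross-entropy loss, again assuming a leading column of ones in X:

import numpy as np

def predict_proba(b, X):
    # Probability of class 1 under the logistic regression model.
    return 1 / (1 + np.exp(-X @ b))

def cross_entropy_loss(b, X, y):
    # Binary cross-entropy; y contains 0 / 1 labels.
    p = predict_proba(b, X)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

With these pieces in mind, let's see how to fit such models with sklearn, starting with a linear regression on the diabetes dataset that ships with the library.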
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
X, y = datasets.load_diabetes(return_X_y=True)
X[:4, :5]  # first five predictors
array([[ 0.03807591, 0.05068012, 0.06169621, 0.02187235, -0.0442235 ],
[-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872],
[ 0.08529891, 0.05068012, 0.04445121, -0.00567061, -0.04559945],
[-0.08906294, -0.04464164, -0.01159501, -0.03665645, 0.01219057]])
y[:4]  # example response
array([151., 75., 141., 206.])
Let’s now fit a linear model from \(\mathbf{x}_1, \dots, \mathbf{x}_{N}\) to \(y\). The first line tells python that we are using a LinearRegression
model class. The second searches over coefficients \(b\) to minimize the squared-error loss between the \(b^T x_i\) and \(y_i\). The third line prints out the fitted coefficient \(\hat{b}\).
model = linear_model.LinearRegression()
model.fit(X, y)
LinearRegression()
model.coef_  # fitted b coefficients
array([ -10.01219782, -239.81908937, 519.83978679, 324.39042769,
-792.18416163, 476.74583782, 101.04457032, 177.06417623,
751.27932109, 67.62538639])
To see classification in practice, let's fit a logistic regression model to the Palmer Penguins dataset (originally distributed in the palmerpenguins R package). We'll read the data from a public link and print the first few rows.
import pandas as pd
= pd.read_csv("https://uwmadison.box.com/shared/static/mnrdkzsb5tbhz2kpqahq1r2u3cpy1gg0.csv")
penguins penguins.head()
species island bill_length_mm ... body_mass_g sex year
0 Adelie Torgersen 39.1 ... 3750.0 male 2007
1 Adelie Torgersen 39.5 ... 3800.0 female 2007
2 Adelie Torgersen 40.3 ... 3250.0 female 2007
3 Adelie Torgersen NaN ... NaN NaN 2007
4 Adelie Torgersen 36.7 ... 3450.0 female 2007
[5 rows x 8 columns]
We’ll predict species using just bill length and depth. First, let’s make a plot to see how easy / difficult it will be to create a decision boundary.
ggplot(py$penguins) +
geom_point(aes(bill_length_mm, bill_depth_mm, col = species)) +
scale_color_manual(values = c("#3DD9BC", "#6DA671", "#F285D5")) +
labs(x = "Bill Length", y = "Bill Depth")
model = linear_model.LogisticRegression()
penguins = penguins.dropna()
X, y = penguins[["bill_length_mm", "bill_depth_mm"]], penguins["species"]
model.fit(X, y)
LogisticRegression()
"y_hat"] = model.predict(X) penguins[
The plot below compares the predicted class (left, middle, and right panels) with the true class (color). We get most of the samples correct, but have a few misclassifications near the boundaries.
ggplot(py$penguins) +
geom_point(aes(bill_length_mm, bill_depth_mm, col = species)) +
scale_color_manual(values = c("#3DD9BC", "#6DA671", "#F285D5")) +
labs(x = "Bill Length", y = "Bill Depth") +
facet_wrap(~ y_hat)
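As a quick numeric complement to the plot, we could also compute the training accuracy; this is a sketch that reuses the penguins columns created above.

from sklearn.metrics import accuracy_score

# Fraction of training samples whose predicted species matches the true one.
accuracy_score(penguins["species"], penguins["y_hat"])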
Exercise: Repeat this classification, but using at least two additional predictors.
To fit a sparse linear model in sklearn, we can use the ElasticNet class. We'll work with The Office dataset (as in, the TV series, The Office). The task is to predict the IMDB ratings for each episode based on how often certain actors / writers appeared in it and also who the episode writers were.

office = pd.read_csv("https://uwmadison.box.com/shared/static/ab0opn7t8cagr1gj87ty94nomu5s347o.csv")
office.head()
season episode andy ... jeffrey_blitz justin_spitzer imdb_rating
0 1 1 0 ... 0 0 7.6
1 1 2 0 ... 0 0 8.3
2 1 3 0 ... 0 0 7.9
3 1 5 0 ... 0 0 8.4
4 1 6 0 ... 0 0 7.8
[5 rows x 31 columns]
X, y = office.iloc[:, 2:-1], office["imdb_rating"]
y = (y - y.mean()) / y.std()  # standardize y
The block below fits the Elastic Net model and saves the coefficients \(\hat{b}\). Notice that most of them are 0 – only a few of the actors make a big difference in the rating.
model = linear_model.ElasticNet(alpha=1e-1, l1_ratio=0.5)  # in real life, have to tune these parameters
model.fit(X, y)
ElasticNet(alpha=0.1)
y_hat = model.predict(X)

beta_hat = model.coef_  # notice the sparsity
beta_hat
array([ 0.00116889, 0.00618263, 0.00820025, 0.0037306 , 0.00846656,
-0.02521656, 0.00947178, 0.00793831, 0.00140553, 0.00312614,
0.01993542, 0.00238557, 0.00883436, -0.00194228, 0.01054646,
0. , 0.18606752, 0. , 0. , -0. ,
0. , 0. , 0. , -0. , -0. ,
-0. , 0. , 0. ])
We can confirm that the predictions correlate reasonably well with the truth.
<- data.frame(py$X, y = py$y, y_hat = py$y_hat)
office ggplot(office) +
geom_point(aes(y, y_hat)) +
labs(x = "True IMDB Rating", y = "Predicted IMDB Rating")
Trees are another family of supervised learning methods; they make predictions by sending each sample down a sequence of yes / no splits on the input features. Notice that we can use the same logic to do either regression or classification. For regression, we associate each “leaf” at the bottom of the tree with a continuous prediction. For classification, we associate leaves with probabilities for different classes. It turns out that we can train these models using squared error and cross-entropy losses as before, though the details are beyond the scope of these notes.
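As a small sketch of this idea, we could fit a shallow regression tree to the diabetes data from earlier and print its splits; each leaf at the bottom stores a constant prediction. The max_depth value here is chosen only for illustration.

from sklearn import datasets, tree

# Reload the diabetes data and fit a regression tree with only a few splits.
X_diab, y_diab = datasets.load_diabetes(return_X_y=True)
tree_model = tree.DecisionTreeRegressor(max_depth=2)
tree_model.fit(X_diab, y_diab)
print(tree.export_text(tree_model))  # text view of the splits and leaf values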
If we split the same variable deeper, it creates more steps. What if we had two variables? Depending on the order of the splits, we create different axis-aligned partitions.
Q: What would be the diagram if I had switched the order of the splits (traffic before rain)?
A very common variation on tree-based models computes a large ensemble of trees and then combines their curves in some way. How exactly they are combined is beyond the scope of these notes, but this is what random forests and gradient boosted decision trees are doing in the background.
We can implement these models in sklearn using RandomForestRegressor / RandomForestClassifier and GradientBoostingRegressor / GradientBoostingClassifier. Let’s just see an example of a boosting classifier using the penguins dataset. The fitting / prediction code is very similar to what we saw before.
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
X, y = penguins[["bill_length_mm", "bill_depth_mm"]], penguins["species"]
model.fit(X, y)
GradientBoostingClassifier()
"y_hat"] = model.predict(X) penguins[
We use the same visualization code to check predictions against the truth. The boosting classifier makes no mistakes on the training data.
ggplot(py$penguins) +
geom_point(aes(bill_length_mm, bill_depth_mm, col = species)) +
scale_color_manual(values = c("#3DD9BC", "#6DA671", "#F285D5")) +
labs(x = "Bill Length", y = "Bill Depth") +
facet_wrap(~ y_hat)
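For comparison, a random forest can be fit with nearly identical code; the sketch below reuses the X and y from the boosting example.

from sklearn.ensemble import RandomForestClassifier

# Same fit / predict interface as the boosting model above.
rf_model = RandomForestClassifier()
rf_model.fit(X, y)
rf_model.predict(X)[:5]  # predicted species for the first five penguins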
The table below summarizes the strengths and weaknesses of the methods we’ve seen.

| Method | Strengths | Weaknesses |
|---|---|---|
| Linear / Logistic Regression | Often easy to interpret; no tuning parameters; very fast to train | Unstable when there are many features to pick from; can only fit linear curves / boundaries (though, see featurization notes) |
| Sparse Linear / Logistic Regression | Often easy to interpret; stable even when there are many features to pick from; very fast to train | Can only fit linear curves / boundaries |
| Tree-based Classification / Regression | Can fit nonlinear functions of the inputs | Can be slow to train; somewhat harder to interpret |
The answers are,