Choosing models with the right complexity.

Imagine a model complex enough to pass exactly through every training point. It would have perfect performance on the training data. Now imagine that we collected a few more samples, all from the same population. It turns out that the complex fit's error on these new samples is higher than that of a smoother curve that has higher training error: in the accompanying figures, the blue test errors are all shorter for the smoother fit than for the previous one. The same issue arises in classification. A very complicated decision boundary can be perfect on the training dataset, but it will likely perform worse than a simpler boundary on future samples.
This phenomenon is called overfitting. Complex models might appear to do well on a dataset available for training, only to fail when they are released “in the wild” to be applied to new samples. We need to be very deliberate about choosing models that are complex enough to fit the essential structure in a dataset, but not so complex that they overfit to patterns that will not appear again in future samples.
The simplest way to choose a model with the right level of complexity is to use data splitting. Randomly split the available data into a training set and a test set. Fit a collection of models of different complexities on the training set. Then compare the performance of those models on the test set, and choose the one that performs best there.
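As a minimal sketch of this recipe, here is one way it could look in Python; the synthetic dataset and the use of tree depth as a stand-in for model complexity are illustrative assumptions, not part of the penguins example below.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# illustrative data; any labeled dataset would work
X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# fit models of increasing complexity (deeper trees) and compare their
# accuracies on the train and test sets; we would keep the depth that
# does best on the test set
for depth in [1, 2, 4, 8, 16]:
    fit = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, fit.score(X_tr, y_tr), fit.score(X_te, y_te))

Typically the training accuracy keeps climbing with depth, while the test accuracy levels off or drops for the deepest trees, which is the overfitting pattern described above.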
Let's see how to carry this out in sklearn. First, let's just use a train / test split, without cross-validating. We'll consider the penguins dataset from the earlier lecture.

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# prepare data
penguins = pd.read_csv("https://uwmadison.box.com/shared/static/mnrdkzsb5tbhz2kpqahq1r2u3cpy1gg0.csv")
penguins = penguins.dropna()
X, y = penguins[["bill_length_mm", "bill_depth_mm"]], penguins["species"]

# split the rows (and their row indices) into train and test sets
(
    X_train, X_test,
    y_train, y_test,
    indices_train, indices_test
) = train_test_split(X, y, np.arange(len(X)), test_size=0.25)
Let’s double check that the sizes of the train and test sets make sense.
len(indices_train)
249
len(indices_test)
84
Now, we’ll train the model on just the training set.
from sklearn.ensemble import GradientBoostingClassifier

# fit a gradient boosting classifier using only the training set
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

GradientBoostingClassifier()

# predict on every row, train and test alike, so we can compare the two splits
penguins["y_hat"] = model.predict(X)

# keep track of which rows are train vs. test
penguins["split"] = "train"
penguins = penguins.reset_index()
penguins.loc[list(indices_test), "split"] = "test"
Finally, we'll visualize the predictions to see the number of errors on each split. The first row shows the test samples and the second row shows the train samples. Notice that even though prediction on the training set is perfect, there are a few errors on the test set.
ggplot(py$penguins) +
geom_point(aes(bill_length_mm, bill_depth_mm, col = species)) +
scale_color_manual(values = c("#3DD9BC", "#6DA671", "#F285D5")) +
labs(x = "Bill Length", y = "Bill Depth") +
facet_grid(split ~ y_hat)
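To complement the plot, we can count the misclassifications on each split directly. This is a small Python sketch (not part of the original code), using the penguins data frame built above.

# tally misclassified rows within each split
errors = (penguins["species"] != penguins["y_hat"]).groupby(penguins["split"]).sum()
errors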
The cross_val_score function in sklearn handles all the splitting and looping for us. We just have to give it a model and it will evaluate it on a few splits of the data. The scores vector below gives the accuracy on each of \(K = 5\) holdout folds.

from sklearn.model_selection import cross_val_score

model_class = GradientBoostingClassifier()
scores = cross_val_score(model_class, X, y, cv=5)
scores
array([1. , 0.98507463, 0.94029851, 0.90909091, 0.96969697])
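A natural next step, not shown above, is to summarize the folds and to use the same machinery to compare models of different complexities. Here is a hedged sketch; the max_depth grid is an illustrative choice, not something prescribed by the lecture.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# average accuracy (and its spread) across the five folds
print(scores.mean(), scores.std())

# compare complexities by their cross-validated average accuracy
for depth in [1, 2, 3, 5]:
    cv_scores = cross_val_score(GradientBoostingClassifier(max_depth=depth), X, y, cv=5)
    print(depth, cv_scores.mean())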
This phenomenon is often described in terms of bias and variance. Ideally, we would have low values for both, since both contribute to error on the test set. In practice, though, there is a trade-off. Often, the high-powered models that tend to be closest to the truth on average might be far off on any individual run (high variance). Conversely, overly simple models that are very stable from run to run might be consistently incorrect in certain regions (bias).
Models with high variance but low bias tend to be overfit, and models with low variance but high bias tend to be underfit. Models that have good test or cross-validation errors have found a good compromise between bias and variance.
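One way to make the trade-off concrete is a small simulation: refit a simple and a flexible model on many freshly drawn training sets and look at how their predictions at a fixed point scatter around the truth. This sketch is purely illustrative; the true curve, noise level, and polynomial degrees are all assumptions, not part of the notes.

import numpy as np

rng = np.random.default_rng(0)
x0, f = 1.5, np.sin  # evaluation point and "true" function (illustrative choices)

def fit_and_predict(degree):
    # fit a polynomial of the given degree to a fresh noisy sample, then predict at x0
    x = rng.uniform(0, 3, size=50)
    y = f(x) + rng.normal(scale=0.3, size=50)
    return np.polyval(np.polyfit(x, y, degree), x0)

for degree in [1, 9]:
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias = preds.mean() - f(x0)   # systematic offset from the truth
    variance = preds.var()        # run-to-run variability of the prediction
    print(degree, round(bias, 3), round(variance, 3))

With these settings we would expect the degree-1 fit to show noticeable bias but tiny variance, and the degree-9 fit to be nearly unbiased but much more variable.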
In practice, a useful strategy is to,