Canonical Analysis

Characterizing optima in response surfaces.

Kris Sankaran
11-30-2021

Readings 11.3, Rmarkdown

  1. Once we are in the vicinity of a potentially optimal point, we need more subtle methods. There are two essential issues: deciding whether a stationary point of the fitted surface is actually an optimum, and choosing which directions to explore next when it is not.
  1. Canonical analysis helps solve both of these problems. First, we’ll define canonical analysis, and then we’ll illustrate its use in finding optimal points and proposing new directions.

Mathematical Setup

  1. A first-order model never has any isolated optima. It’s either perfectly flat, or increases forever in some direction. A second-order model, however, does potentially have optimal values.

Figure 1: First order models have no isolated optima.

  1. Calculus can help us identify these optima. Recall that, at a function’s maximum, its gradient is zero. This makes it natural to consider the gradient of a fitted second order surface, \[\begin{align*} \nabla_{x}\hat{y}\left(x\right) &= \nabla_{x}\left[x^{T}\hat{b} + x^{T}\hat{B}x\right] \\ &= \hat{b} + 2\hat{B}x \end{align*}\]

so the function has a stationary point at \[\begin{align*} x_{\ast} &= -\frac{1}{2}\hat{B}^{-1}\hat{b}. \end{align*}\]

  1. We’ll denote the value of the surface at this stationary point by \[\begin{align*} \hat{y}_{\ast} &= \hat{b}^{T}x_{\ast} + x_{\ast}^{T}\hat{B}x_{\ast} \\ &= -\frac{1}{2}\hat{b}^{T}\hat{B}^{-1}\hat{b} + \frac{1}{4}\hat{b}^{T} \hat{B}^{-1} \hat{B} \hat{B}^{-1} \hat{b} \\ &= -\frac{1}{4}\hat{b}^{T}\hat{B}^{-1}\hat{b} \end{align*}\]
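As a quick numerical check, here is a minimal sketch of these two formulas in R, using made-up coefficients \(\hat{b}\) and \(\hat{B}\) rather than anything estimated from data.
# Hypothetical second-order coefficients, for illustration only.
b_hat <- c(1, -0.5)                          # first-order coefficients b
B_hat <- matrix(c(-2, 0.3, 0.3, -1), 2, 2)   # symmetric second-order matrix B

# Stationary point and the fitted value there, following the formulas above.
x_star <- -0.5 * solve(B_hat, b_hat)
y_star <- -0.25 * sum(b_hat * solve(B_hat, b_hat))
x_star
y_star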

  2. Just like in ordinary calculus, there are three possibilities: the stationary point may be a maximum, a minimum, or a saddlepoint.


Figure 2: Second order models with a local minimum and with a saddlepoint.

  1. In one-variable calculus, we’d just take second derivatives and see whether the function curves up, curves down, or is flat. We’re in higher dimensions, though, so we’ll use the generalization of the second-derivative test, called the canonical representation.

Canonical Representation


Figure 3: Geometric representation of the eigendecomposition.

  1. Since \(\hat{B}\) is symmetric, we can find an eigendecomposition, \[\begin{align*} \hat{B} &= U \Lambda U^{T}, \end{align*}\] where the columns of \(U\) are orthonormal eigenvectors and \(\Lambda\) is a diagonal matrix containing the eigenvalues \(\lambda_{k}\). Define \(w\left(x\right) = U^{T}\left(x - x_{\ast}\right)\). It turns out that our second-order fit can be equivalently expressed as \[\begin{align*} \hat{y}\left(x\right) &= \hat{y}_{\ast} + w^T\left(x\right)\Lambda w\left(x\right) \\ &= \hat{y}_{\ast} + \sum_{k = 1}^{K} \lambda_{k}w^{2}_{k}\left(x\right) \end{align*}\] which is much simpler, because there are no interaction terms between the \(w_{k}\)’s (like there are between the \(x_{k}\)’s).

Proof¹: The textbook skips the proof of this fact, but if you have seen some linear algebra, the approach is interesting. First, plug in the value of \(\hat{y}_{\ast}\) from above, and expand the definition of \(w\left(x\right)\), \[\begin{align*} \hat{y}\left(x\right) &= \hat{y}_{\ast} + w^{T}\left(x\right) \Lambda w\left(x\right) \\ &= -\frac{1}{4}\hat{b}^{T}\hat{B}^{-1}\hat{b} + \left(x - x_{\ast}\right)^{T}U\Lambda U^{T} \left(x - x_{\ast}\right). \end{align*}\]

Now, use the definition of our original eigendecomposition \(\hat{B} = U \Lambda U^{T}\) and expand the quadratic, \[\begin{align*} \hat{y}\left(x\right) &= -\frac{1}{4}\hat{b}^{T}\hat{B}^{-1}\hat{b} + \left(x - x_{\ast}\right)^{T}\hat{B}\left(x - x_{\ast}\right) \\ &= -\frac{1}{4}\hat{b}^{T}\hat{B}^{-1}\hat{b} + x^{T}\hat{B}x - 2x^{T}\hat{B}x_{\ast} + x^{T}_{\ast}\hat{B}x_{\ast}. \end{align*}\]

To simplify, let’s plug in the expression for the stationary point \(x_{\ast} = -\frac{1}{2}\hat{B}^{-1}\hat{b}\) that we found above, \[\begin{align*} \hat{y}\left(x\right) &= -\frac{1}{4}\hat{b}^{T}\hat{B}^{-1}\hat{b} + x^{T}\hat{B}x + x^{T}\hat{B}\hat{B}^{-1}\hat{b} + \frac{1}{4}\hat{b}^{T}\hat{B}^{-1}\hat{B}\hat{B}^{-1}\hat{b} \\ &= \hat{b}^{T}x + x^{T}\hat{B}x \end{align*}\]

which was exactly our original definition of the second-order model.
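If you would rather check the identity numerically than follow the algebra, here is a short sketch that reuses the hypothetical b_hat, B_hat, x_star, and y_star from the earlier example and compares the canonical representation against the direct second-order formula at an arbitrary point.
# Verify the canonical representation at one arbitrary point x.
eig <- eigen(B_hat, symmetric = TRUE)   # B_hat = U Lambda U^T
U <- eig$vectors
lambda <- eig$values

x <- c(0.2, -0.4)
w <- drop(t(U) %*% (x - x_star))        # w(x) = U^T (x - x_star)

y_canonical <- y_star + sum(lambda * w ^ 2)
y_direct <- sum(b_hat * x) + drop(t(x) %*% B_hat %*% x)
all.equal(y_canonical, y_direct)        # TRUE, up to floating point error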

Implications

  1. The representation of the second-order model by \[\begin{align} \label{eq:canonical} \hat{y}\left(x\right) &= \hat{y}_{\ast} + \sum_{k = 1}^{K} \lambda_{k}w^{2}_{k}\left(x\right) \end{align}\] is an important one, because (a) it tells us immediately whether the stationary point \(x_{\ast}\) is a maximum, a minimum, or a saddlepoint, and (b) it suggests directions to explore when \(x_{\ast}\) turns out not to be an optimum.
  1. To see the first point, suppose the \(\lambda_{k}\)’s were all negative. Then, the expression is maximized when the \(w_{k}\left(x\right)\)’s are all zero, since any change in the \(w_{k}\left(x\right)\)’s from there can only ever decrease \(\hat{y}\left(x\right)\). In other words, \(x_{\ast}\) is a maximum.
  1. Similarly, when the \(\lambda_{k}\)’s are all positive, moving the \(w_{k}\left(x\right)\)’s by any amount can only increase \(\hat{y}\left(x\right)\). This means we’re at a minimum!

  2. Finally, when some of the \(\lambda_{k}\)’s are positive and others are negative, then we’re at a saddlepoint. (The short sketch after this list turns this sign rule into code.)

  1. This last discussion also gives us insight into the second point: at a saddlepoint, the eigenvectors whose eigenvalues are positive point in directions along which the fitted response can still be increased, so they suggest where to experiment next.
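To make the sign rule concrete, here is a small helper, purely a sketch and not part of the rsm package, that classifies a stationary point from the eigenvalues of a fitted \(\hat{B}\).
# Classify a stationary point from the signs of the eigenvalues of B_hat.
# This is an illustrative helper, not an rsm function.
classify_stationary <- function(B_hat, tol = 1e-8) {
  lambda <- eigen(B_hat, symmetric = TRUE)$values
  if (all(lambda < -tol)) {
    "maximum"
  } else if (all(lambda > tol)) {
    "minimum"
  } else {
    "saddlepoint (or near-flat ridge, if some eigenvalues are close to zero)"
  }
}

classify_stationary(matrix(c(-2, 0.3, 0.3, -1), 2, 2))  # "maximum"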

Data Example

  1. We’ll continue the chemical process experiment from last time. Here, the experimenter has performed a central composite design experiment, a refinement of the earlier factorial designs. The hope is that now a configuration that maximizes the yield can be precisely isolated.
library(readr)
library(dplyr)
library(ggplot2)

# Read the chemical process data and plot yield across the design points,
# jittering slightly so that overlapping runs remain visible.
chem <- read_csv("https://uwmadison.box.com/shared/static/nbaj1m8j7tuaqmznjlrsgbzyhp9k61i8.csv")
ggplot(chem) +
  geom_point(
    aes(time, temp, col = yield),
    position = position_jitter(w = 0.3, h = 0.3)
  ) +
  coord_fixed() +
  scale_color_viridis_c()

  1. As before, we will code the data. We use SO() in the model formula to fit a full second-order model. It appears that the linear (FO) and pure quadratic (PQ) components are very strong, and that there is limited, if any, interaction (TWI) between the two variables.
library(rsm)

# Center and scale the natural units into coded units, then fit a full
# second-order (SO) response surface model.
chem_coded <- coded.data(chem, time_coded ~ (time - 35) / 5, temp_coded ~ (temp - 155) / 5)
fit <- rsm(yield ~ SO(temp_coded, time_coded), data = chem_coded)
summary(aov(fit))
                            Df Sum Sq Mean Sq F value   Pr(>F)    
FO(temp_coded, time_coded)   2 10.043   5.021  70.814 2.27e-05 ***
TWI(temp_coded, time_coded)  1  0.250   0.250   3.526    0.103    
PQ(temp_coded, time_coded)   2 17.954   8.977 126.594 3.19e-06 ***
Residuals                    7  0.496   0.071                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
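As an aside, calling summary() on the rsm fit prints the individual second-order coefficients along with the stationary point and an eigenanalysis of \(\hat{B}\), so it previews the canonical analysis we walk through explicitly below (output omitted here).
# Full rsm summary: coefficient table, stationary point, and eigenanalysis.
summary(fit)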
  1. Next, we can perform a canonical analysis, to check whether the stationary point is really a maximum. Since both of the eigenvalues are negative, this is in fact the case. The eigenvectors also tell us the directions along which the fitted yield (in this case) falls away from the optimum, fastest along the eigenvector with the most negative eigenvalue.
analysis <- canonical(fit)
analysis
$xs
temp_coded time_coded 
  4.305847  10.389230 

$eigen
eigen() decomposition
$values
[1] -0.9634986 -1.4142867

$vectors
                 [,1]       [,2]
temp_coded -0.9571122 -0.2897174
time_coded -0.2897174  0.9571122
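To connect this output back to the formulas from earlier, we can rebuild \(\hat{b}\) and \(\hat{B}\) from the fitted coefficients and recompute the stationary point and eigenvalues by hand. This sketch assumes the coefficients are ordered as in the SO() formula, i.e. intercept, the two first-order terms, the interaction, then the two pure quadratics; check names(coef(fit)) before relying on these positions.
# Rebuild b_hat and B_hat from the fitted coefficients (assumed order:
# intercept, FO terms, TWI term, PQ terms -- verify with names(coef(fit))).
beta <- coef(fit)
b_hat <- beta[2:3]
B_hat <- matrix(c(beta[5], beta[4] / 2,
                  beta[4] / 2, beta[6]), 2, 2)

-0.5 * solve(B_hat, b_hat)   # should match analysis$xs
eigen(B_hat)$values          # should match analysis$eigen$values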
  1. We can identify the specific location of the optimum by decoding the canonical analysis back into the natural units.
stationary <- code2val(analysis$xs, codings = codings(chem_coded))
stationary
     temp      time 
176.52923  86.94615 
# Points one coded unit along each canonical direction from the stationary
# point, converted back to natural units for plotting.
w1 <- code2val(analysis$xs + analysis$eigen$vectors[, 1], codings = codings(chem_coded))
w2 <- code2val(analysis$xs + analysis$eigen$vectors[, 2], codings = codings(chem_coded))
  1. Finally, let’s plot the response surface, with the canonical directions overlaid. For this, we can use the contour method provided by the rsm package together with base R’s segments.
# Contours of the fitted surface, with segments drawn from the stationary
# point along each canonical direction.
contour(fit, ~ time_coded + temp_coded, image = TRUE, asp = 1)
segments(stationary[2], stationary[1], w1[2], w1[1], col = "red")
segments(stationary[2], stationary[1], w2[2], w2[1], col = "red")

Figure 4: The response surface, with canonical vectors overlaid.


  1. You can skip this proof without worrying about whether it will appear in later lectures or assignments / exams. But I encourage you to read it; it gives a flavor of more advanced topics in statistics.↩︎