The basic principles of hypothesis testing can be summarised by the four possible outcomes of a test:
| | Test rejected | Test didn’t reject |
|---|---|---|
| Null is true | False alarm | Correct |
| Null is false | Correct | Missed detection |
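Under the null hypothesis, a test run at the 5% level should raise a false alarm in roughly 5% of repeated experiments. A minimal simulation sketch (with made-up normal data and scipy's pooled two-sample t-test, which is introduced below) illustrating this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n, reps = 10, 10_000

# Both groups come from the same distribution, so the null is true and
# every rejection is a false alarm (Type I error).
false_alarms = 0
for _ in range(reps):
    y1 = rng.normal(loc=5.0, scale=1.0, size=n)
    y2 = rng.normal(loc=5.0, scale=1.0, size=n)
    _, p = stats.ttest_ind(y1, y2, equal_var=True)
    false_alarms += p < alpha

print(f"observed false-alarm rate: {false_alarms / reps:.3f}")  # should be close to 0.05
```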
The alternative hypothesis is that the strengths are not equal,
\[ H_1: \mu_1 \neq \mu_2 \] The test statistic compares the observed difference in sample means to its estimated standard error,
\[ t_0 := \frac{\bar{y}_1 - \bar{y}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \] where we define the pooled variance \(S_p^2\) by,
\[ S^2_p = \frac{\left(n_1 - 1\right)S_1^2 + \left(n_2 - 1\right)S_2^2}{n_1 + n_2 - 2} \] and \(S_1\) and \(S_2\) are the usual sample standard deviations for each group individually. (When \(n_1 = n_2 = n\), \(S_p^2\) is simply the average of the two sample variances, and the standard error in the denominator of \(t_0\) becomes \(S_p\sqrt{2/n}\).)
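As a concrete check of the formulas above, here is a short sketch (with made-up data) that computes \(S_p^2\) and \(t_0\) by hand and compares the result with scipy's equal-variance two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y1 = rng.normal(loc=10.0, scale=1.0, size=12)   # made-up group 1 measurements
y2 = rng.normal(loc=10.5, scale=1.0, size=8)    # made-up group 2 measurements
n1, n2 = len(y1), len(y2)

# Pooled variance: a weighted average of the two sample variances.
sp_sq = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)

# Test statistic t_0 as defined above.
t0 = (y1.mean() - y2.mean()) / np.sqrt(sp_sq * (1 / n1 + 1 / n2))

# Matches scipy's equal-variance (pooled) two-sample t-test statistic.
t_scipy, _ = stats.ttest_ind(y1, y2, equal_var=True)
print(t0, t_scipy)
```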
A 95% confidence interval for a parameter \(\theta\) is a data-dependent interval that covers the true value with probability 0.95,
\[ \mathbf{P}\left(\theta \in \left[L, U\right]\right) = 0.95 \] where the endpoints are statistics computed from the observations,
\[ \left[L\left(y_1, \dots, y_n\right), U\left(y_1, \dots, y_n\right)\right] \]
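The probability statement is over repeated sampling: the parameter is fixed, while the interval endpoints are random. A small simulation sketch (using the familiar one-sample \(t\) interval for a mean, with made-up data) showing that such an interval covers the true value about 95% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta = 3.0          # true (but in practice unknown) population mean
n, reps = 20, 10_000
t_right = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    y = rng.normal(loc=theta, scale=2.0, size=n)
    # The endpoints L and U are functions of the data alone.
    half_width = t_right * y.std(ddof=1) / np.sqrt(n)
    L, U = y.mean() - half_width, y.mean() + half_width
    covered += L <= theta <= U

print(f"coverage: {covered / reps:.3f}")  # should be close to 0.95
```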
To construct such an interval for the difference in means, note that when \(n_1 = n_2 = n\) the standardised difference follows a \(t\) distribution with \(2\left(n - 1\right)\) degrees of freedom, so
\[ \mathbf{P}\left(\frac{\left(\bar{y}_1 - \bar{y}_2\right) - \left(\mu_1 - \mu_2\right)}{S_p\sqrt{\frac{2}{n}}} \in \left[t_{0.025, 2\left(n - 1\right)}, t_{0.975, 2\left(n - 1\right)}\right] \right) = 0.95 \]
To simplify the algebra, let
\[ T\left(y\right) := \bar{y}_1 - \bar{y}_2 \\ \theta := \mu_1 - \mu_2 \\ \hat{\sigma} := S_p\sqrt{\frac{2}{n}} \\ t_{\text{left}} := t_{0.025, 2\left(n - 1\right)} \\ t_{\text{right}} := t_{0.975, 2\left(n - 1\right)} \] so that the above expression reduces to,
\[ \mathbf{P}\left(\frac{T\left(y\right) - \theta}{\hat{\sigma}} \in \left[t_{\text{left}}, t_{\text{right}}\right]\right) = 0.95 \]
Rearranging the event inside the probability to isolate \(\theta\) gives,
\[ \mathbf{P}\left(\theta \in \left[T\left(y\right) - \hat{\sigma}t_{\text{right}}, T\left(y\right) - \hat{\sigma}t_{\text{left}}\right]\right) = 0.95 \] We can use the symmetry of the \(t\) distribution, which implies \(t_{\text{left}} = -t_{\text{right}}\), to simplify the expression further to
\[ \mathbf{P}\left(\theta \in \left[T\left(y\right) - \hat{\sigma}t_{\text{right}}, T\left(y\right) + \hat{\sigma}t_{\text{right}}\right]\right) = 0.95 \] This is exactly the property that a confidence interval has to satisfy. Plugging in the original expressions gives the confidence interval for the difference in means, assuming shared variance.
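Putting the pieces together, a short sketch (with made-up data and equal group sizes) that computes this interval directly and can be checked against standard software:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 10
y1 = rng.normal(loc=5.0, scale=1.0, size=n)   # made-up group 1 data
y2 = rng.normal(loc=4.2, scale=1.0, size=n)   # made-up group 2 data

# With equal group sizes the pooled variance is the average of the two
# sample variances, and the standard error is S_p * sqrt(2 / n).
sp = np.sqrt((y1.var(ddof=1) + y2.var(ddof=1)) / 2)
sigma_hat = sp * np.sqrt(2 / n)

t_right = stats.t.ppf(0.975, df=2 * (n - 1))
T = y1.mean() - y2.mean()

lower, upper = T - sigma_hat * t_right, T + sigma_hat * t_right
print(f"95% CI for mu1 - mu2: [{lower:.3f}, {upper:.3f}]")
```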