15  Hypothesis Testing on Parts of Models

Everything should be made as simple as possible, but no simpler. – Albert Einstein

It often happens in studying a system that a single explanatory variable is of direct interest to the modeler. Other explanatory variables may be important in the way the system works, but they are not of primary interest. As in previous chapters, call the explanatory variable of interest simply the “explanatory variable.” The other explanatory variables, the ones that aren’t of direct interest, are the covariates. The covariates are ordinary variables. Their designation as covariates reflects the interests of the modeler rather than any intrinsic property of the variables themselves.

The source of the interest in the explanatory variable might be to discover whether it is relevant to a model of a response variable. For example, in studying people’s heights, one expects that several covariates, for instance the person’s sex or the height of their father or mother, play a role. It would be interesting to find out whether the number of siblings a person has, nkids in Galton’s dataset, is linked to the height. Studying nkids in isolation might be misleading since the other variables are reasonably regarded as influencing height. But doing a whole-model hypothesis test on a model that includes both nkids and the covariates would be uninformative: Of course one can account for a significant amount of the variation in height using sex and mother and father. (See the Example in Section 14.3 on height and genetics.) The question about nkids is whether it contributes something to the explanation that goes beyond the covariates. To answer this question, one needs a way of assessing the contribution of nkids on its own, but in the context set by the other variables.

Sometimes you may choose to focus on a single explanatory variable because a decision may hang on the role of that variable. Keep in mind that the role of the explanatory variable of interest can depend on the context set by covariates. Those covariates might enhance, mask, or reverse the role of the variable of interest. For instance, it’s legitimate that wages might vary according to the type of job and the worker’s level of experience and education. It’s also possible that wages might vary according to sex or race insofar as those variables happen to be correlated with the covariates, that is, correlated with the type of job or level of education and experience. But, if wages depend on the worker’s sex or race even taking into account these other legitimate factors, that’s a different story and suggests a more sinister mechanism at work.

This chapter is about conducting hypothesis tests on a single explanatory variable in the context set by covariates. In talking about general principles, the text will refer to models of the form Y ~ X + A or Y ~ X + A + B where Y is the response variable, X is the explanatory variable of interest, and A and B are the variables that you aren’t directly interested in – the covariates.

Recall the definition of statistics offered in Chapter 1:

Statistics is the explanation of variation in the context of what remains unexplained.

It’s important to include covariates in a hypothesis test because they influence both aspects of this definition: covariates can change how the explained variation is credited to the explanatory variable, and they can shrink what remains unexplained.

15.1 The Sequential ANOVA Table

The ANOVA table introduced in Chapter 14 provides a framework for conducting hypothesis tests that include covariates. The whole-model ANOVA table lets you compare the “size” of the explained part of the model with the “size” of the residuals. Here is a whole-model report for the model height ~ 1 + nkids + sex + father + mother fit to Galton’s data:

Table 15.1: A whole model ANOVA table for height ~ 1 + nkids + sex + father + mother fit to Galton’s data.
df SS MS F p_value
model 4 7377.942 1844.485487 398.1333 0
residuals 893 4137.120 4.632834

RSS is 4137, SSM is 7378. The F test takes into account the number of vectors involved in each – four explanatory vectors (neglecting the intercept term), leaving 893 residual degrees of freedom. The F value, 398.1, is the ratio of the mean square of the model terms to the mean square of the residuals. Since 398.1 is much, much larger than 1 (which is the expected value when the model terms are just random vectors), it’s appropriate to reject the null hypothesis; the p-value is effectively zero since F is so huge.
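Reports like this can be generated in R. Here is a minimal sketch, assuming Galton’s data are available as the Galton data frame in the mosaicData package:

```r
# A sketch, assuming Galton's data live in mosaicData::Galton.
library(mosaicData)
mod <- lm(height ~ 1 + nkids + sex + father + mother, data = Galton)
summary(mod)  # the final line reports the whole-model F: 398.1 on 4 and 893 df
```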

A sequential ANOVA table is much the same, but the report doesn’t just partition variation into “modeled” and “residual,” it goes further by partitioning the modeled variation among the different explanatory terms. There is one row in the report for each individual model term.

Table 15.2: A sequential ANOVA table for height ~ 1 + nkids + sex + father + mother fit to Galton’s data.
term df Sum Sq Mean Sq F p-value
nkids 1 185.4636 185.463648 40.03244 3.946324e-10
sex 1 5766.3339 5766.333920 1244.66674 1.907458e-171
father 1 937.6280 937.628048 202.38759 1.510480e-41
mother 1 488.5163 488.516332 105.44655 1.849975e-23
Residuals 893 4137.1204 4.632834

Notice that the “Residuals” row is exactly the same in the two ANOVA reports. This is because we are fitting the same model and so we have the same residuals. Less obvious, perhaps, is that the sum of squares of the individual model terms adds up across all the terms to give exactly the sum of squares from the whole-model ANOVA report:

\[185.5 + 5766.3 + 937.6 + 488.5 = 7377.9 \;.\]

The same is true for the degrees of freedom of the model terms:

\[ 1 + 1 + 1 + 1 = 4 \;. \]

What’s new in the sequential ANOVA table is that there is a separate mean square, F value, and p-value for each model term. These are calculated in the familiar way: the mean square for any term is just the sum of squares for that term divided by the degrees of freedom for that term. Similarly, the F value for each term is the mean square for that term divided by the mean square of the residuals. For instance, for nkids, the F value is 185.5 / 4.6 = 40. This is translated to a p-value in exactly the same way as in Chapter 14 (using an F distribution with 1 degree of freedom in the numerator and 893 in the denominator – values that come from the corresponding rows of the ANOVA report).
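In R, the sequential table comes from anova(), and any row’s p-value can be reproduced directly from the F distribution. A sketch, again assuming the Galton data frame from mosaicData:

```r
library(mosaicData)
mod <- lm(height ~ nkids + sex + father + mother, data = Galton)
anova(mod)                                          # the sequential ANOVA table
pf(40.03, df1 = 1, df2 = 893, lower.tail = FALSE)   # p-value for the nkids row
```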

15.1.1 Covariates Soak Up Variance

An important role of covariates is to account for variance in the response. This reduces the size of residuals. Effectively, the covariates reduce the amount of randomness attributed to the system and thereby make it easier to see patterns in the data. To illustrate, consider a simple question relating to marriage: Do men tend to be older than women when they get married?

Example 15.1 (Age at marriage) The marriage-license dataset contains information on the ages of brides and grooms at the time of their marriage as well as other data on education, race, number of previous marriages, and so on. These data were collected from the on-line marriage license records made available by Mobile County, Alabama in the US. Three variables play a role in this example: person, age, and bookpageID.

The person and age variables are obvious enough. The bookpageID variable records the location of the marriage license in the Mobile County records as a book number and page number. There is only one license on each page, so this serves as a unique identifier of the couple.

It seems simple enough to test whether men tend to be older than women when they get married: just fit the model age ~ 1 + person. Here it is:

Table 15.3: Coefficients for age ~ 1 + person using the marriage data from Mobile County, Alabama.
term Estimate Std Error t p-value
(Intercept) 33.239195 2.060289 16.1332673 <0.0001
personGroom 2.545541 2.913689 0.8736489 0.3845

The coefficient on personGroom indicates that on average men are 2.5 years older than women when they get married. That’s roughly what one would expect from experience – husbands and wives tend to be roughly the same age, with the husband typically a bit older. The confidence interval suggests a different interpretation, however. It’s \(2.5 \pm 5.8\), giving reason to believe that a different sample of licenses could give a coefficient of zero, indicating no age difference between men and women. ANOVA tells much the same story:

Table 15.4: Whole model ANOVA for age ~ 1 + person using the marriage data from Mobile County, Alabama.
df SS MS F p_value
model 1 158.7546 158.7546 0.7632624 0.3844897
residuals 96 19967.4977 207.9948

The \(F\) value, 0.76, is smaller than one. Correspondingly, the p-value is big: 0.38. Based on these data, it’s not appropriate to reject the null hypothesis that men and women are the same age on average when they get married.

No big deal. Perhaps there just isn’t enough data. With some effort, more data could be collected and this might bring the p-value down to the point where the null could be rejected.

Actually, there’s much more information in this sample than this hypothesis test suggests. The model age ~ person ignores a very important structure in the data: Who is married to whom. A person’s age at marriage is very strongly related to the age of their spouse. That is, if you know the age of the wife, you have a lot of information about the age of the husband, and vice versa. This relationship between the husband and wife is carried by the variable BookpageID. Since the model age ~ person doesn’t include BookpageID as a covariate, the model tends to overstate the unexplained part of age.

To fix things, let’s add in bookpageID as a covariate and examine the model age ~ person + bookpageID. In this model the coefficient on personGroom refers to the difference in age between males and females, holding the couple constant. That is, this model looks at the difference between the husband and wife within each couple. Here’s the sequential ANOVA table:

Table 15.5: A sequential ANOVA table for age ~ 1 + person + bookpageID.
term df Sum Sq Mean Sq F p-value
person 1 158.7546 158.75458 9.069883 4.136755e-03
bookpageID 48 19127.3304 398.48605 22.766095 5.928425e-21
Residuals 48 840.1674 17.50349

Notice that the \(F\) value on person is much larger than before: 9.07 versus 0.76. Correspondingly, the p-value is much smaller: 0.004 versus 0.384. But the sum of squares of person is exactly the same as it was for the previous model. What’s happened is that the inclusion of bookpageID in the model has dramatically reduced the size of the residuals. The residual mean square is now 17.5 compared to 208 in the model that omitted bookpageID as a covariate.

It’s not that bookpageID is a variable that’s directly of interest. But that doesn’t mean that you shouldn’t include bookpageID in a model. It accounts for some of the variation in age and in so doing makes it easier for the role of other variables to be seen above the random variation due to sampling.

The particular hypothesis test conducted here has a specific name, the paired t-test. The word “paired” reflects the natural pairing in the data: At the time these data were collected, each marriage had a wife and a husband. Another context in which pairing arises is before-and-after studies, for example measuring a subject’s blood pressure both before and after a drug is given. The traditional approach to this test uses a different test statistic, the \(t\) statistic. In this situation \(F = t^2\) and the two approaches are equivalent, although the results are often presented differently when doing a t-test.
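To see the equivalence concretely, here is a hedged sketch, assuming the Marriage data frame from the mosaicData package, with variables age, person (coded Bride and Groom), and bookpageID:

```r
library(mosaicData)
# The F test for person, with bookpageID soaking up couple-to-couple variance:
anova(lm(age ~ person + bookpageID, data = Marriage))  # F for person is about 9.07
# The same test as a paired t-test: one row per couple, one column per role.
ages <- with(Marriage, tapply(age, list(bookpageID, person), mean))
t.test(ages[, "Groom"], ages[, "Bride"], paired = TRUE)  # t^2 equals the F above
```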

In the literature, this is often presented even more succinctly as “p = 0.004 (t = -3.01, df = 48)”.

The paired t-test is an example of a more general approach known as repeated measures. In a paired situation there are two measurements on each case, but there can also be situations where there are more than two. (Note that in this example a case is a married couple, not an individual married person.)

More general still is analysis of covariance (ANCOVA), which refers to any situation where covariates are used to soak up the residuals.

Categorical predictors

If you are wondering why this model has 48 degrees of freedom for the book page ID but only one degree of freedom for person (bride or groom), that means you are paying attention! In short: once the intercept is in the model, a categorical variable with \(k\) levels contributes \(k - 1\) model vectors, and these data cover 49 couples, so bookpageID contributes 48. We have mentioned briefly how categorical variables are encoded for linear models, but we won’t devote our full attention to this topic until we get to Chapter 16.

15.1.2 The order of terms matters in a sequential ANOVA table

At a superficial level, interpretation of the sequential ANOVA table is straightforward. Use the p-value on each row to decide whether to reject the null hypothesis for that row. For instance, return to the height-model example from before:

height ~ nkids + sex + father + mother

Table 15.6: A sequential ANOVA table for height ~ nkids + sex + father + mother.
term df Sum Sq Mean Sq F p-value
nkids 1 185.4636 185.463648 40.03244 3.946324e-10
sex 1 5766.3339 5766.333920 1244.66674 1.907458e-171
father 1 937.6280 937.628048 202.38759 1.510480e-41
mother 1 488.5163 488.516332 105.44655 1.849975e-23
Residuals 893 4137.1204 4.632834

The very small p-value on nkids (it’s actually 0.0000000004) prompts a rejection of the null hypothesis on the nkids term. But there are other ANOVA tables on exactly the same model fitted to exactly the same data. Here’s one:

Table 15.7: A sequential ANOVA table for height ~ sex + father + mother + nkids.
term df Sum Sq Mean Sq F p-value
sex 1 5874.57323 5874.573234 1268.03026 1.482261e-173
father 1 1001.10973 1001.109733 216.09015 5.715495e-44
mother 1 490.21737 490.217369 105.81372 1.567502e-23
nkids 1 12.04161 12.041613 2.59919 1.072717e-01
Residuals 893 4137.12042 4.632834

According to this report, nkids is not making a significant contribution to the model.

Why do the two reports give such different results for nkids? Which is right, the second report’s p-value on nkids of 0.107 or the first report’s p-value of 0.0000000004? Obviously, they lead to different conclusions. But they are both right. It’s just that there are two different null hypotheses involved.

  • In the first report, the null is that nkids on its own doesn’t contribute beyond what’s typical for a random vector.

  • In the second report, the null is that nkids doesn’t contribute beyond the contribution already made by mother, father, and sex.

The two reports give us different perspectives on nkids.

To illustrate the process, let’s reconstruct the sequential ANOVA table for the model height ~ 1 + nkids + sex + father + mother. Think of this as a set of nested models, as shown in the table. For each of the nested models a “score” – that is, a sum of squares – is found.

Table 15.8: Reconstructing a sequential ANOVA table for height ~ nkids + sex + father + mother.
Model SSM Term added Additional SSM
height ~ 1 4002376.8 intercept
height ~ 1 + nkids 4002562.3 nkids 185.5
height ~ 1 + nkids + sex 4008328.6 sex 5766.3
height ~ 1 + nkids + sex + father 4009266.2 father 937.6
height ~ 1 + nkids + sex + father + mother 4009754.7 mother 488.5

The “Additional SSM” assigned to each term reflects how much that term increased SSM compared to the previous model. For instance, adding the nkids term in the second line of the table increased the sum of squares from 4002376.8 to 4002562.3, an increase of 185.5. It’s the increase that’s attributed to nkids.

Perhaps you’re wondering, why did nkids go first? ANOVA can take the terms in any order you want. For example:

Table 15.9: Reconstructing a sequential ANOVA table for height ~ mother + father + sex + nkids.
Model Sum of Squares Term added Additional Sum of Sq.
height ~ 1 4002376.8 intercept
height ~ 1 + mother 4002845.1 mother 468.3
height ~ 1 + mother + father 4003630.8 father 785.7
height ~ 1 + mother + father + sex 4009742.7 sex 6112.0
height ~ 1 + mother + father + sex + nkids 4009754.7 nkids 12.0

Notice that with the new arrangement of terms, the sum of squares assigned to each term is different. In particular, the first report gave nkids a fairly large sum of squares: 185.5. The second report gave nkids only 12.0. This change plays out in the p-value on nkids given in the ANOVA table: it’s 0.107 now compared to 0.0000000004.
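The order dependence is easy to see by fitting the same model with the terms in two different orders – a sketch, assuming the Galton data frame:

```r
library(mosaicData)
anova(lm(height ~ nkids + sex + father + mother, data = Galton))  # nkids SS: 185.5
anova(lm(height ~ sex + father + mother + nkids, data = Galton))  # nkids SS: 12.0
```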

15.2 Type II ANOVA tables

The ANOVA table we have presented is sequential: the order of terms matters, and the table tells the story of building up a model term by term. This type of table is sometimes called a type I ANOVA table.

Table 15.10: A type I (sequential) ANOVA table for height ~ nkids + sex + father + mother.
term df Sum Sq Mean Sq F p-value
nkids 1 185.4636 185.463648 40.03244 3.946324e-10
sex 1 5766.3339 5766.333920 1244.66674 1.907458e-171
father 1 937.6280 937.628048 202.38759 1.510480e-41
mother 1 488.5163 488.516332 105.44655 1.849975e-23
Residuals 893 4137.1204 4.632834

This tells a useful story: as we add more terms to our model, we explain more and more of the variation in the response variable, shifting sum of squares from RSS (unexplained) to SSM (explained) while leaving SST fixed.

But the intermediate null hypotheses represented in a type I ANOVA table are often not the ones we are most interested in. Here is another type of ANOVA table (called a type II ANOVA table) for the same model:

Table 15.11: A type II ANOVA table for height ~ nkids + sex + father + mother.
term Sum Sq df F p-value
nkids 12.04161 1 2.59919 1.072717e-01
sex 6045.80555 1 1304.99086 7.583858e-177
father 840.77810 1 181.48247 8.607806e-38
mother 488.51633 1 105.44655 1.849975e-23
Residuals 4137.12042 893

Comparing the two tables, we see that the residuals row is the same in both tables. That makes sense: We are using the same model so the residuals will be the same. But the only other row that matches is the row directly above that, the one labeled mother. Remember, that row is testing the hypothesis that the coefficient on mother is 0, given all the other terms are already in the model. That is, it is asking whether mother’s height helps us predict the child’s height, even if we already know the number of siblings, the sex of the adult child, and the father’s height.

But what is happening in the other rows? One of them should look familiar: we have seen the first row, nkids, before. It was the last non-residual row in our type I ANOVA table for the model height ~ sex + father + mother + nkids, where we put nkids last instead of first.

Table 15.12: A type I (sequential) ANOVA table for height ~ sex + father + mother + nkids.
term df Sum Sq Mean Sq F p-value
sex 1 5874.57323 5874.573234 1268.03026 1.482261e-173
father 1 1001.10973 1001.109733 216.09015 5.715495e-44
mother 1 490.21737 490.217369 105.81372 1.567502e-23
nkids 1 12.04161 12.041613 2.59919 1.072717e-01
Residuals 893 4137.12042 4.632834

That pretty much tells us what is happening in the type II table:

Type I and Type II ANOVA tables

  • In a type I ANOVA table, each row is testing a null hypothesis that one model coefficient is 0 in a context where all the previous terms are in the model. It builds up sequentially.

  • In a type II ANOVA table, each row is testing a null hypothesis that one model coefficient is 0 in a context where all the other terms are in the model. It is like a collection of last non-residual rows of type I tables with each term taking its turn to be “last one in.”

There is even a third type of ANOVA table (called a type III ANOVA table). The difference between type II and type III ANOVA tables arises when the model has interaction terms. To further confuse things, there are some variations on the type II and type III tables and not everyone agrees on what should be called what.

Warning: It is important when reading an ANOVA table to know which type of table it is!
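In R, anova() on a fitted linear model produces a type I table. Type II tables are commonly produced with the Anova() function in the car package – a sketch, assuming car is installed and the Galton data frame is available:

```r
library(car)
library(mosaicData)
mod <- lm(height ~ nkids + sex + father + mother, data = Galton)
Anova(mod, type = "II")   # type = "III" would give a type III table
```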

15.2.1 Type II tests and the linear model summary table

The coefficients table that appears in our regression summary for the height ~ nkids + sex + father + mother model includes the same p-values that appear in the type II ANOVA table for the model, but with a different test statistic, \(t\):

(1) Type II ANOVA table.
term Sum Sq df F p-value
nkids 12.04161 1 2.59919 1.072717e-01
sex 6045.80555 1 1304.99086 7.583858e-177
father 840.77810 1 181.48247 8.607806e-38
mother 488.51633 1 105.44655 1.849975e-23
Residuals 4137.12042 893
(2) Summary of coefficients.
term Estimate Std Error t p-value
(Intercept) 16.18771294 2.79386887 5.794013 9.521525e-09
nkids -0.04381836 0.02717923 -1.612200 1.072717e-01
sexM 5.20994674 0.14422135 36.124657 7.583858e-177
father 0.39830528 0.02956642 13.471543 8.607806e-38
mother 0.32095528 0.03125564 10.268717 1.849975e-23

Table 15.13: Some output for height ~ nkids + sex + father + mother

The connection between the two is as follows:

  • Both are testing the same hypothesis in equivalent ways, so the p-values will match.

  • The test statistics are related by \(t^2 = F\).

  • The sign of \(t\) will be the same as the sign of the coefficient estimate. In fact, \(t\) is easy to compute: \(\displaystyle t = \frac{\mbox{estimate} - 0}{SE}\).

    • If we wanted to test the hypothesis that the coefficient had a value other than 0, we could do that by computing \(\displaystyle t = \frac{\mbox{estimate} - \mbox{hypothesized value}}{SE}\). (A small computational sketch follows this list.)
  • The degrees of freedom used in the \(t\) distribution are the denominator (i.e., residual) degrees of freedom used for \(F\). For linear models this is \(n - p\) where \(n\) is the sample size and \(p\) is the number of coefficients in the model (including the intercept and the coefficients on all the model terms).

  • The coefficient summary table includes one additional row, for the intercept.

    • The null hypothesis here is that the intercept is 0 (in the population), given the other terms in the model.

    • This is in most cases not a very interesting row.

      Recall that the intercept tells us the model value when all the predictors are 0. That might not even be a meaningful situation, and even if it is, we may not be interested in whether this number is 0.

      You will frequently see very small p-values here because it is very clear that the intercept should be very different from 0.

      If you are interested in the intercept – and often you are not – but not interested in whether it is 0, you can use a confidence interval for the intercept to see how precisely you are estimating it given your model and your data.
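Here is the arithmetic for the nkids row, using the estimate and standard error from the coefficient table above:

```r
t_stat <- (-0.04381836 - 0) / 0.02717923   # (estimate - 0) / SE = -1.612
2 * pt(-abs(t_stat), df = 893)             # two-sided p-value, about 0.107
t_stat^2                                   # about 2.6, the F in the type II table
```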

15.2.2 Connection between confidence intervals and p-values

There is also a connection between the confidence intervals we can compute using the standard errors in the coefficient table and the p-values that appear in those same rows.

15.2.2.1 p-values and confidence intervals for model coefficients

  • 0 is inside the 95% confidence interval exactly when the p-value is greater than 0.05.

  • 0 is outside the 95% confidence interval exactly when the p-value is less than 0.05.

  • More generally, the confidence interval contains exactly those values \(b\) that we would not reject in a test of the hypothesis \(H_0: \beta_i = b\).

This is the sense in which a confidence interval can be considered as a range of plausible values for a parameter – it contains the values our data would not cause us to reject in a formal hypothesis test with the corresponding significance level.
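A quick sketch of this correspondence for the nkids coefficient, assuming the Galton data frame:

```r
library(mosaicData)
mod <- lm(height ~ nkids + sex + father + mother, data = Galton)
confint(mod, "nkids", level = 0.95)   # straddles 0, matching p = 0.107 > 0.05
```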

15.3 Model comparison tests

One way to think about the non-residual rows of our ANOVA tables is that they are comparing two nested models. In each type of table, the larger model has one more term than the smaller model.

We can build tests that compare any two nested models in the same way, by seeing how much of RSS moves to SSM when we go from the smaller model to the larger model. This corresponds to a null hypothesis of

  • \(H_0\): In the population, all model coefficients that appear in the larger model but not the smaller model are 0.

For example, we could compare height ~ nkids + sex + father + mother to height ~ nkids + sex to see whether adding (both) parents’ heights to the model does better than adding two random predictors.

If we let \(SSM_1\), \(RSS_1\), and \(RDF_1\) be the sums of squares and residual degrees of freedom for the smaller model and let \(SSM_2\), \(RSS_2\), and \(RDF_2\) be the sums of squares and residual degrees of freedom for the larger model, then our F statistic is

\[ F = \frac{(SSM_2 - SSM_1) / (RDF_1 - RDF_2)}{RSS_2 / RDF_2} = \frac{\Delta SSM/\Delta DF}{RSS_2 / RDF_2} \] Note that \(\Delta SSM = SSM_2 - SSM_1 = RSS_1 - RSS_2\) – the change reflects variation that is unexplained by the smaller model but explained by the larger model. That is, some amount of sum of squares moves from the residuals to the model. Model comparison ANOVA tables focus on this change. Here is an example model comparison ANOVA table:

Table 15.14: A model comparison ANOVA table.
Res.Df RSS Df Sum of Sq F Pr(>F)
896 6441.546
894 6271.515 2 170.032 12.11897 6.410347e-06

It is just a shorter way of presenting the most important information found by comparing two whole model ANOVA tables:

(1) Smaller model.
df SS MS F p_value
model 1 25.8789 25.878900 3.599678 0.05811205
residuals 896 6441.5465 7.189226
(2) Larger model.
df SS MS F p_value
model 3 195.9109 65.303629 9.308987 4.591189e-06
residuals 894 6271.5145 7.015117

Table 15.15: Two whole model ANOVA tables.

Notice that

  • \(\Delta DF = 896 - 894 = 3 - 1 = 2\) because the larger model has two additional terms.
  • \(\Delta SSM = 195.9 - 25.88 = 6441.5 - 6271.5 = 170\) – the amount of variation in the response variable that is accounted for by the larger model but not by the smaller model.
  • \(F\) is again a “ratio of slopes”: it compares this change in the models to the expected change if we added the same number of additional random terms to the smaller model.

Each of the p-values that appear in a type I or type II ANOVA table can be obtained using this sort of model comparison, but the model comparison approach is more general, since it lets us compare models that differ by more than one term.

In our example with Galton’s height data, adding two random terms would do as well as adding both parents’ heights only about 6 times in a million. That’s strong evidence that we can make better predictions knowing the parents’ heights than if we do not, even if we already know the sex of the person and how many siblings they have.
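Here is a minimal sketch of running such a comparison in R, using the pair of height models proposed at the start of this section and assuming the Galton data frame:

```r
library(mosaicData)
small <- lm(height ~ nkids + sex, data = Galton)
large <- lm(height ~ nkids + sex + father + mother, data = Galton)
anova(small, large)   # a single F test for the two added coefficients
```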

The whole model test (model utility test) is also a model comparison test. The comparison model is a model that has only an intercept – our “all-cases-the-same” model. Here is the model comparison table for such a test, alongside the whole-model ANOVA table for the larger model:

Res.Df RSS Df Sum of Sq F Pr(>F)
897 6467.425
894 6271.515 3 195.9109 9.308987 4.591189e-06
df SS MS F p_value
model 3 195.9109 65.303629 9.308987 4.591189e-06
residuals 894 6271.5145 7.015117

Summary

  • All the tests in Type I and Type II ANOVA tables are model comparison tests comparing two models where the larger model has one additional term.

  • The model utility test is a model comparison test where the smaller model has only an intercept term – the “all-cases-the-same” model.

  • Model comparison tests allow us to compare any two nested models using the same sort of F statistic.

  • p-values can be obtained using F-distributions or via randomization simulations. The p-values in standard regression tables come from F-distributions.

15.4 ANOVA, Collinearity, and Multi-Collinearity

Example 15.2 (Height and Siblings) In fitting the model height ~ nkids + sex + father + mother to Galton’s data, the p-value on nkids depended substantially on where the term was placed in the ANOVA table. When nkids came first, the p-value was small, leading to rejection of the null hypothesis. But when nkids came last, the p-value was not small enough to justify rejection of the null.

As always, the order dependence of the sequential ANOVA results stems from collinearity. You can investigate this directly by modeling nkids as a function of the covariates. Here is the regression report on the model nkids ~ sex + father + mother:

Table 15.16: Regression report for nkids ~ sex + father + mother.
term Estimate Std Error t p-value
(Intercept) 19.23743683 3.37721371 5.6962450 1.661656e-08
sexM -0.36524809 0.17704855 -2.0629827 3.940257e-02
father -0.17510345 0.03590810 -4.8764337 1.278199e-06
mother -0.01232026 0.03845896 -0.3203482 7.487793e-01

The model coefficients indicate that boys tend to have fewer siblings than girls. This may sound a bit strange, because obviously within any family every boy has exactly the same number of siblings as every girl. But the model applies across families, so this model is saying that families with boys tend to have fewer children than families with girls. This is a well-known phenomenon. Depending on your point of view, you might interpret it in either of two ways. Perhaps it’s that parents with boys get worn out and stop having children. An alternative is that, particularly in Galton’s time when women were definitely second-class citizens, parents who didn’t have boys continued to have more children in the hopes of producing one. Either way, the model indicates that nkids is collinear with sex. It also appears to be related to father, as you will see later in this example.

A nice way to measure the extent of collinearity is with the model \(R^2\) or, equivalently, with the angle \(\theta\) between the nkids vector and the fitted model vector from predicting nkids with the other predictors. If this angle is close to 0 (\(R^2\) close to 1), then nkids is very similar to what could be predicted using the other variables – there isn’t much new information in nkids. If this angle is close to \(90^{\circ}\) (\(R^2\) close to 0), then nkids provides new information that is essentially independent of the information available in the other predictors (at least in the context of a linear model).

In this example, \(R^2 = 0.03\), corresponding to \(\theta = 80^{\circ}\). That seems quite close to \(90^\circ\), hardly any alignment at all! How can that trivial amount of collinearity lead to the dramatic order-dependence of the ANOVA result on nkids?

To examine this, imagine splitting nkids into two vectors: the part that is exactly aligned with the sex, father, mother subspace and the part that is exactly orthogonal to it. These two vectors correspond to the fitted model values and the residuals of the nkids model. Call them aligned and perp and add them to the height model in place of nkids. It’s important to remember that neither of these vectors individually points in exactly the same direction as nkids, but the sum of the vectors does; in other words,

nkids = aligned + perp.
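Constructing aligned and perp is a matter of fitting nkids to the covariates and keeping the fitted values and residuals – a sketch, assuming the Galton data frame:

```r
library(mosaicData)
decomp  <- lm(nkids ~ sex + father + mother, data = Galton)
galton2 <- transform(Galton,
                     aligned = fitted(decomp),  # lies in the covariate subspace
                     perp    = resid(decomp))   # orthogonal to that subspace
anova(lm(height ~ aligned + perp + sex + father + mother, data = galton2))
# mother is redundant given aligned, sex, and father, so it gets zero df
```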

Here’s an ANOVA table, where aligned and perp have been placed first:

Table 15.17: An ANOVA table, where aligned and perp have been placed first.
term df Sum Sq Mean Sq F p-value
aligned 1 3435.49482 3435.494817 741.55368 2.343569e-119
perp 1 12.04161 12.041613 2.59919 1.072717e-01
sex 1 3528.99924 3528.999239 761.73667 9.705426e-122
father 1 401.40628 401.406280 86.64379 9.804420e-20
Residuals 893 4137.12042 4.632834

There are several interesting aspects to this report.

  • First, aligned shows up as very significant, with a huge sum of squares compared to the original nkids – the original was only 185.5. This is because aligned happens to point in a very favorable direction, closely related to the very important variable sex.

  • Second, perp gets a very small sum of squares: only 12.

  • Third, sex is much reduced compared to the original ANOVA. This is because aligned has already grabbed much of the sum of squares originally associated with sex.

  • Finally, the mother term gets absolutely no sum of squares and, indeed, has zero degrees of freedom. This is because there is one degree of redundancy among the set aligned, sex, father, and mother. Since mother came last in the report, it had absolutely nothing to contribute.

Now consider an ANOVA table on the same terms, but in a different order:

Table 15.18: An ANOVA table, where aligned and perp have been placed last.
term df Sum Sq Mean Sq F p-value
sex 1 5874.57323 5874.573234 1268.03026 1.482261e-173
father 1 1001.10973 1001.109733 216.09015 5.715495e-44
mother 1 490.21737 490.217369 105.81372 1.567502e-23
perp 1 12.04161 12.041613 2.59919 1.072717e-01
Residuals 893 4137.12042 4.632834

Coming after father, mother and sex, the aligned vector is the one discarded as being redundant. Indeed, it was constructed to fit exactly into the subspace spanned by father, mother and sex. Coming before aligned in the ANOVA table, the term sex has regained its large sum of squares, indicating that sex is a strong predictor of height.

Now look at perp. Although it comes at the end of the list of terms, it has exactly the same set of values that it had when it came at the beginning of the list. This is because perp is exactly orthogonal to all the other terms in the report – it was constructed that way as the residual from fitting nkids to the other terms. Whenever a term is orthogonal to the others, the sum of squares for that term will not depend on where the term comes in order with respect to the others.

It’s worth pointing out that the sum of squares, F value, and p-value on perp exactly matches that from the original ANOVA table using nkids when nkids came last. This is not a coincidence. It stems from putting father, mother and sex before nkids in the report – those other variables effectively stripped nkids of any component that was aligned with them, leaving only the part of nkids perpendicular to them.

All this may sound like an abstract exercise in geometrical thinking, but it actually carries a strong implication for interpreting the relationship between nkids and height. Earlier, a possible causal relationship was suggested; perhaps in families with many kids, competition for food and other resources caused the kids to be stunted in growth. Or it might be that families that had lots of kids back then were healthier or rich in resources that caused the kids to grow well.

The results with aligned and perp suggest something different. If nkids really did cause height, then the component of nkids perpendicular to sex and the other covariates ought to have some explanatory power with regard to height. On the other hand, if it’s really sex and the other covariates that cause height, and nkids has nothing to do with it, then the perpendicular component of nkids ought to be no better than a random vector. The ANOVA table is consistent with this second scenario. The p-value of 0.107 isn’t sufficiently small to justify rejecting the null that perp plays no role.

Even a very small alignment with genuinely important variables – nkids is at an angle of \(\theta = 80^\circ\) with respect to the sex, father, mother subspace – can create a sum of squares that’s statistically significant. Keep in mind that “significance” in a hypothesis test is defined with respect not to how strong or substantial or authentic the variable’s role is, but how it compares in size to a completely random variable. With large n, even a very small alignment can be much larger than expected for a random vector. There are \(n=898\) cases in Galton’s data, and in a space of dimension 898 it’s very unlikely that a random vector would show even an alignment as far from perpendicular as 80°. (The 95% confidence interval for the angle that a random vector makes with another vector is 86° to 94°.)
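That parenthetical claim is easy to check by simulation, using nothing beyond base R:

```r
set.seed(1)
angle <- replicate(10000, {
  u <- rnorm(898); v <- rnorm(898)
  acos(sum(u * v) / sqrt(sum(u^2) * sum(v^2))) * 180 / pi  # angle in degrees
})
quantile(angle, c(0.025, 0.975))   # close to 86 and 94 degrees
```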

15.5 Which Comparisons?

The difference between type I and type II ANOVA tables highlights that there are multiple hypothesis tests that involve the same term in a linear model. The differences are related to the context within which the test is conducted (which other terms are considered to be “already in the model”).

The full sequence of tests displayed in the sequential type I ANOVA table is rarely the set of questions we are interested in – almost surely not if we have not carefully chosen the order of terms. The type II table asks about each term in the context of all the others and can be a good place to start.

But if there is multi-collinearity among the predictors, it is possible that none of them will stand out as important in the type II table even though collectively the whole model ANOVA suggests that they are useful. It may well be that some subset of the predictors would be the most appropriate model. Neither the type I nor the type II table addresses this directly. A model comparison test can be useful here, but there are many such comparisons that can be made if we have several predictor terms in our model. This introduces the multiple comparisons problem.

Here are some suggestions for dealing with the flexibility and sometimes apparent ambiguity of ANOVA.

  1. Treat ANOVA as a screening method.

    In medicine, screening tests are ways to get a quick, inexpensive indication of whether a patient has an illness or condition. They tend to have relatively high error rates. Nonetheless, they can be genuinely useful for identifying situations where more invasive or expensive methods can be used to get a more definitive result.

    ANOVA provides a quick, powerful, and effective way to assess the strength of evidence for a claim. You can use it to probe whether variables should be included in a model and what covariates might be included as well. The results shouldn’t be taken as proof that a hypothesis is correct. Indeed, you should always keep in mind that hypothesis testing never offers a proof of a hypothesis. It can only produce a rejection of the null or a failure to reject the null.

    ANOVA can help focus our attention, but it shouldn’t be the only thing we focus our attention on. Small p-values are the beginning of the journey, not the destination.

  2. If possible, arrange the covariates so that they are orthogonal to the explanatory variable.

    If two explanatory vectors are not at all collinear, that is, if they are exactly orthogonal to each other, then the order of the terms has no effect in a type I table, and the type I and type II tables yield the same results. This makes it easier to reason about model terms “one term at a time.”

    If you are collecting data in the context of an experiment, you may well be able to arrange things so that the explanatory variables and covariates are orthogonal to one another. Chapter 19 on experimental design covers some of the techniques to do this. In an observational study, this may be out of your control.

    Chapter 12 explains that collinearity between explanatory vectors increases standard errors, that is, worsens the precision of your coefficient estimates. This inflation of standard errors is another reason to try to create orthogonality, if you have the means to do so.

  3. Adopt a skeptical attitude.

    If you are inclined to be skeptical about a variable, look for the arrangement of terms that produces the smallest sum of squares. If, despite this, the variable still has a significant sum of squares, the ambiguity of order-dependence won’t influence your conclusions.

    But in interpreting your results, keep in mind that there might be other covariates that weren’t included in your models that might reduce your variable to insignificance.

  4. Adopt an enthusiastic attitude of hypothesis formation.

    Although ANOVA is formally a mechanism for hypothesis testing, it can also be used for hypothesis formation. Explore your data using ANOVA. If you find an arrangement of explanatory variables that produces a significant result, that’s a good reason to devote some time and energy to thinking more about that result and how to follow up on it.

    But don’t be complacent. Once you have formed a hypothesis, you will have to test it skeptically at some point.

  5. Don’t forget other ambiguities.

    The ambiguity of order dependence is readily seen in an ANOVA table. But there are other ambiguities and uncertainties that you won’t necessarily be aware of. Why did you choose the particular covariates in your analysis? Why did you choose to measure certain variables and not others? Did you include interaction terms (which often introduce collinearities)? Did you consider transformation terms? The example in the previous section treats the number of kids as a simple linear term. But there is reason to think that its effect might be nonlinear, that in small families height is associated with having more siblings, but in large families the reverse is true.

    Even if there were no ambiguities of order dependence, there would still be ambiguities and uncertainties in your model results.

  6. Think carefully about how your system works, and model the system rather than just the data.

    The source of order dependence is collinearity among the explanatory variables. Such collinearity is itself evidence that the explanatory variables are related in some way. A model like Y ~ X + A + B is about the relationship between Y and the explanatory variables; it doesn’t describe the relationships among the explanatory variables, even though these may be important to understanding how the system works. If you want to understand the system as a whole, you need to model the system as a whole. This involves a set of models. For example, suppose that in the real system, Y is caused by both X and A, and B exerts its influence only because it affects X. For such a system, it’s appropriate to consider a set of models: Y ~ X + A and X ~ B. Things can quickly get more complicated. For example, suppose A influences B. In that case, the model Y ~ X + A would confuse the direct influence of A on Y with the indirect influence that A has on Y via B and therefore X.

    The methods of structural equation modeling (SEM) are designed to deal with such complexity by creating structured sets of models. The structure itself is based in the modeler’s knowledge or assumptions about how the system works.

Example 15.3 (Testing Universities) University students are used to being given tests and graded as individuals. But who grades the universities to find out how good a job the institutions are doing?

This need not be hard. Most universities require an entrance test – the SAT is widely used in the US for undergraduate admissions, the GRE, LSAT, MCAT, etc. for graduate school. Why not give students the entrance test a second time when they leave and compare the two results? If students at Alpha University show a large increase compared to students at Beta University, then there is evidence that Alpha does a better job than Beta.

There are good and bad reasons why such testing is not widely done. University faculty as a group are notoriously independent and don’t like being evaluated. They are suspicious of standardized tests for evaluation of their teaching (although they don’t often oppose using them for admissions purposes). They point out that there are many factors outside of their control that influence the results, such as the students’ preparedness, motivation, and work ethic. (On the other hand, universities are happy to take credit for the students who succeed).

One legitimate obstacle to meaningful testing of universities is that students are not required to take an exit test and not motivated to perform well on it. A student who is applying to university has to take the entrance exam and has good reason to try to do well on it – favorable admissions decisions and scholarship awards often follow good test results. But the student who is soon to graduate has no such motivation, either to take the exit exam or to do well on it. So what does the difference between the entrance and exit scores show: How much the students learned or that the graduating students have something better to do with their time than take an exit exam?

The Collegiate Learning Assessment (CLA) is a standardized test consisting of several essay questions that probe a student’s critical thinking skills (Klein 2007; see www.cae.org). As of 2008, more than 200 universities and colleges have been using the exam. The CLA is given to one group of first-year students and another group of final-year students. The scores of the two groups are then compared.

Ideally, the same students would be tested in both their first and last years. If this were done, each student could be compared to himself or herself and covariates such as previous preparation or overall aptitude would be held constant within each entry-to-exit comparison. Unfortunately, such longitudinal designs were used at only 15% of the CLA schools. At the remaining schools, for reasons of cost and administrative complexity, a cross-sectional design was used, where the two groups consist of different students. When the study is cross-sectional, it can be important to adjust for covariates that may differ from one group to another.

In the case of the CLA, the students’ performance on other standardized tests (such as the entrance SAT) is used for the adjustment. As it happens, there is a strong correlation between the SAT score and other explanatory variables. This collinearity tends to increase the standard error of the estimate of school-to-school differences.

Two studies of the CLA adjustment methodology concluded that the adjustments so increased standard errors that no clear inference can be made of school-to-school differences (Basken 2008). The studies called for longitudinal testing to be done in order to increase the reliability of the results. But, short of requiring students to take both the entrance and exit exams, it’s not clear how the longitudinal approach can be implemented.

15.6 Non-Parametric Statistics

It sometimes happens that one or more of your variables have outliers or a highly skewed distribution. In this situation, the extreme values of the variable can have an unduly strong influence on the fit of the model. As a result, the p-values calculated from ANOVA can be misleading.

Common sense suggests deleting the outliers or extreme values from the data used to fit the model. Certainly this is an appropriate step if you have good reason to think that the extreme values are bogus.

A more moderate approach is to fit two models: one with and one without the extreme values. If the two models give similar results, then you can reasonably conclude that the extreme values are not having an undue effect.

But if the two models give very different results, you have a quandary. Which model should you use? On the one hand, you don’t want some cases to have an “undue” influence. On the other hand, those cases might be telling you something about the system and you need to be able to guard against the criticism that you have altered your data.

There is a simple and effective way to deal with this situation for the purpose of deciding whether the extreme values are dominating p-value calculations. The approach is to transform your data in a way that is faithful to the extreme values, keeping them at the edge of the distribution, but which brings them closer to the mainstream. The easiest such transform is called the rank transform. It replaces each value in a variable with the position of the value in the set of values. For example, suppose you have a variable with these values:

23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3

It’s obvious that the seventh value is quite different from the other six. Applying the rank transform, this variable becomes

2, 1, 3, 5, 4, 6, 7.

That is, 23.4 is the 2nd smallest value, 17.3 is the 1st smallest value, 26.8 is the 3rd smallest, and so on. With the rank transform, the extreme value of 7352.3 becomes nothing special: it has rank 7 because it is the seventh smallest value. There are never outliers in a rank transformed variable. Nevertheless, the rank transformed data are honest to the original data in the sense that each value retains its position relative to the other values.
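In R, the rank transform is the rank() function:

```r
x <- c(23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3)
rank(x)   # 2 1 3 5 4 6 7 -- the extreme value is now nothing special
```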

Analysis of rank-transformed data is called non-parametric statistics. The coefficients from models of rank-transformed data are hard to interpret because they describe how the response value changes due to a one-step increase in rank rather than a one-unit increase in the value of the variable. Still, the p-values can be interpreted in a standard way.

When outliers or extreme values are a concern, it’s worth comparing the p-values from the original (parametric) model and the non-parametric model fitted to the rank-transformed data. If the p-values lead to the same conclusion, you know that the extreme values are not shaping the results. If the p-values for the two models lead to different conclusions then your work is just beginning and you need to think carefully about what is going on.

15.7 Sample Size and Power

If the p-value on your explanatory variable of interest is small enough, you’re entitled to reject the null. If it’s not so small, you are left in the ambiguous situation of “failing to reject.” This failure might be because the explanatory variable is indeed irrelevant to the response variable. That’s a valuable finding. On the other hand, there can be a much more mundane explanation for the failure: your sample size was not large enough.

In ANOVA, the p-value gets smaller as F gets bigger. In turn, F is a product of both the substance of the effect and the sample size n: larger n produces larger F, as described in Equation 14.2.

The fundamental issue, however, is not F, but the power of your study. In picking a sample size n for your study, your goal should be to produce a study with a large enough power so that, if you do fail to reject the null, you will have good reason to prefer the null hypothesis to the alternative hypothesis as a reasonable representation of your system.

Typically, you start a study with some working hypothesis about how the explanatory variable is related to the response. This working hypothesis is given the name alternative hypothesis. (The word “alternative” is simply the statistical vocabulary for distinguishing the working hypothesis from the “null” hypothesis.) The goal of this section is to show how to translate your alternative hypothesis into a decision about the appropriate sample size \(n\) to produce good power.

It may seem odd to be thinking about sample size here, at the end of a chapter on analyzing data. After all, you have to have your data already in hand before you can analyze it! But modeling is a process of asking questions. That process often involves going back and collecting more or different data based on the results of the analysis of earlier data.

When deciding on a sample size, think ahead to the analysis you will eventually do on the data you collect. It’s likely that your analysis will be analysis of covariance or something like it. So you will end up with an ANOVA table, looking perhaps like this:

Table 15.19: Thinking about what an ANOVA table might look like can help us estimate power.
Term Df Sum Sq Mean Sq F value p value
Intercept 1 500
Covariates 3 200
Explanatory Var 1 50 50 2 0.19
Residuals 10 250 25
TOTAL 15 1000

From your knowledge of the covariates and the explanatory variable, you should know the degrees of freedom of each model term even before you collect the data. But until you actually have the data, you won’t be able to fill in the actual values for the sums of squares. Instead, make a guess – hopefully an informed guess based on your working hypothesis. The guesses here have been indicated by italics, as in the guessed 200 sum of squares for the covariates.

What information might you have to work from? Perhaps you have already done a preliminary study and collected just a bit of data – the \(n=15\) cases here. Then you have the ANOVA table from those data. Or perhaps there is an existing study in the literature from a study of the same explanatory variable or an analogous one in a setting similar to the one you are working on. You’ll need to be creative, but realistic.

For the present, assume that the above ANOVA table came from a preliminary study with \(n=15\) cases. The F value of 2 is suggestive, but the p-value is large; you cannot reject the null based on this data set. But perhaps a larger study would be able to reveal what’s merely suggested in the preliminary study.

Imagine what the table would look like for larger \(n\). For simplicity here, consider what would happen if you doubled the number of cases, to \(n=30\). (In reality, you might need to consider some number much larger than the \(n=30\) used in this example.) Of course, you can’t know exactly what the larger \(n\) will give until you actually collect the data. But, for planning, it’s sensible to assume that your extended study might look something like the preliminary study. If that’s so, doubling the number of cases will typically double the total sum of squares. (Why? Because you’re adding up twice as many cases. For instance, the sum of squares of 3 4 5 is 50; the sum of squares of 3 4 5 3 4 5 is 100.)
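The parenthetical arithmetic, for the skeptical:

```r
sum(c(3, 4, 5)^2)            # 50
sum(c(3, 4, 5, 3, 4, 5)^2)   # 100
```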

Increasing the number of cases doesn’t change the degrees of freedom of the explanatory variables. The intercept, the covariates, and the explanatory variable terms still have the same degrees of freedom as before. As a result, for these terms the mean square will double.

In contrast, the degrees of freedom of the residual will increase with the number of cases in the sample. The sum of squares of the residual will also increase, but the net effect is that the mean square of the residual will stay roughly the same.

To illustrate, the following table shows the ANOVA tables side by side for a preliminary study with \(n=15\) and an expanded study with \(n=30\).

Table 15.20: Thinking through the effect of a larger sample size.
Term D.F. Sum Sq. Mean Sq. F   D.F. Sum Sq. Mean Sq. F
Intercept 1 500 500   1 1000 1000
Covariates 3 200 67   3 400 133
Explan. Var. 1 50 50 2   1 100 100 5
Residuals 10 250 25   25 500 20
TOTAL 15 1000   30 2000
Preliminary study \(n=15\) New study \(n=30\)

Overall, by doubling \(n\) you can anticipate that the F value will be more than doubled. If \(n\) were increased even further, the anticipated F value would similarly be increased further, as in Equation 14.2.

The goal in setting the sample size \(n\) is to produce statistical significance – a small p-value. As a rule of thumb for the F distribution, F above 4 or 5 will give a small p-value when the sample size \(n\) is around 25 or bigger. (More precisely, when the degrees of freedom of the denominator is around 25 or bigger.)

But targeting such an F value would be a mistake. The reason is that F itself is subject to sampling fluctuations. If your sample size is chosen to produce a p-value of 0.05 when you hit the target F, then there’s roughly a 50% probability that sampling fluctuations will produce an F that’s too small and consequently a p-value that’s too large.

To avoid suffering from a mishap, it’s appropriate to choose the sample size \(n\) large enough that sampling fluctuations are unlikely to produce an unsatisfactory p-value. Recall that the power is the probability that your study will lead to rejection of the null hypothesis if the alternative hypothesis were true. Your goal in choosing \(n\) is to produce a large power.

The calculation behind power is somewhat complicated, but the table gives an idea of the power for studies with different sample sizes n and different anticipated F.

Table 15.21: Power for various sample sizes given the alternative hypothesis value for F.
Anticipated F \(n=10\) \(n=20\) \(n=100\) \(n \to \infty\)
3 17% 23% 28% 29%
5 33% 42% 49% 51%
10 64% 76% 83% 85%
20 92% 97% 99% 99%

This table was constructed for an explanatory variable with one degree of freedom. This could be a quantitative variable (for example, family income) or a categorical variable with two levels (for example, sex). As the table indicates, the anticipated F must be quite high – 10 or so – to achieve a power of 90% or higher. When the degrees of freedom in the numerator is higher, so long as n is large, high power can be achieved with considerably smaller F. For example, with five degrees of freedom in the numerator, F=5 gives roughly 90% power.
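If you want to experiment beyond the table, here is a hedged sketch of the calculation. It interprets the anticipated F as \(1 + ncp/df_1\), where ncp is the noncentrality parameter of the F distribution under the alternative – an assumption that reproduces the table approximately:

```r
power_from_F <- function(F_anticipated, df1, df2, alpha = 0.05) {
  ncp  <- df1 * (F_anticipated - 1)   # assumed noncentrality (see lead-in)
  crit <- qf(1 - alpha, df1, df2)     # the rejection threshold for the test
  pf(crit, df1, df2, ncp = ncp, lower.tail = FALSE)  # prob. of exceeding it
}
power_from_F(F_anticipated = 5, df1 = 1, df2 = 98)   # roughly 0.5, as in the table
```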

If your study has low power, then failing to reject the null doesn’t give much reason to take the null as reasonable: your study is likely to be inconclusive. When you aim for an F of about 5, your power will be only about 50%. If you want a high power, say 90%, your study should be designed to target a high F.

Studies are often expensive and difficult to conduct and so you want to be careful in designing them. The table above gives only a rough idea about power. Since the power calculations are complicated, you may want to consult with a professional to do the calculation that will be appropriate for your specific study design.