4 Group-wise Models
Seek simplicity and distrust it. – Alfred North Whitehead (1861-1947), mathematician and philosopher
This chapter introduces a simple form of statistical model based on separating the cases in your data into different groups. Such models are very widely used and will seem familiar to you, even obvious. They can be useful in very simple situations, but can be utterly misleading in others. Part of the objective of this chapter is to guide you to understand the serious limitations of such models and to critique the inappropriate, though all too common, use of groupwise models.
4.1 Grand and Group-wise Models
Consider these two statements:
Adults are about 67 inches tall.
Worldwide per-capita income in 2010 was about $16,000 per year.
Such statements lump everybody together into one group. But you can also divide things up into separate groups. For example:
Women are about 64 inches tall; men are about 69 inches tall.
Or consider this familiar-looking information about per-capita income:

Country | Per-capita income |
---|---|
Afghanistan | $1,919 |
Bangladesh | $3,891 |
Haiti | $1,784 |
Luxembourg | $104,003 |
Qatar | $127,660 |
Singapore | $87,855 |
Switzerland | $59,561 |
Tajikistan | $3,008 |
United States | $57,436 |
Zimbabwe | $1,970 |

These sorts of statements, either in the grand mean form that puts everyone into the same group, or the groupwise mean form where people are divided up into several groups, are very common.
People are used to this sort of division by sex or citizenship, so it’s easy to miss the point that these are only some of many possible variables that might be informative. Other variables that might also help account for variation in people’s height are nutrition, parents’ height, ethnicity, etc.
It’s understandable to interpret such statements as giving “facts” or “data,” not as models. But they are representations of the situation which are useful for some purposes and not for others: in other words, models.
The groupwise income model, for instance, accounts for some of the person-to-person variation in income by dividing things up country by country. In contrast, the corresponding “grand” model is simply that per capita income worldwide is $16,000 – everybody in one group! The groupwise model is much more informative because there is so much spread: some people are greatly above the mean, some much, much lower, and a person’s country accounts for a lot of the variation.
The idea of averaging income by countries is just one way to display how income varies. There are many other ways that one might choose to account for variation in income. For example, income is related to skill level, to age, to education, to the political system in force, to the natural resources available, to health status, and to the population level and density, among many other things. Accounting for income with these other variables might provide different insights into the sources of income and to the association of income with other outcomes of interest, e.g., health. The table of country-by-country incomes is a statistical model in the sense that it attempts to explain or account for (some of the) variation in income.
These examples have all involved situations where the cases are people, but in general you can divide up the cases in your data, whatever they may be, into groups as you think best. The simplest division is really no division at all, putting every case into the same group. Such models might descriptively be called all-cases-the-same models, but you will usually hear them referred to by the statistic on which they are often based: the grand mean or the grand median. Here, “grand” is just a way of distinguishing them from groupwise quantities: grand versus group.
All-cases-the-same is a bit of a misnomer. These models don’t say that everyone is the same, but they offer no explanation for why cases differ: any variation is unexplained by these models. Given an interest in using models to account for variation, an all-cases-the-same model seems like a non-contender from the start since it doesn’t provide any explanation of the variation. But grand models are important in statistical modeling; they provide a starting point for measuring how much variation there is in the first place.
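To make the distinction concrete, here is a minimal sketch in Python contrasting the two kinds of models; the heights and the split by sex are made-up illustrative numbers, not values from any survey.

```python
# Grand model: every case gets the same value (the grand mean).
# Group-wise model: every case gets its own group's mean.
heights = [63, 65, 64, 68, 70, 69]        # made-up heights in inches
sex     = ["F", "F", "F", "M", "M", "M"]

grand_mean = sum(heights) / len(heights)   # one value for everybody

group_means = {}                           # one value per group
for g in set(sex):
    members = [h for h, s in zip(heights, sex) if s == g]
    group_means[g] = sum(members) / len(members)

print(grand_mean)    # 66.5
print(group_means)   # {'F': 64.0, 'M': 69.0} (dictionary order may vary)
```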
4.2 Accounting for Variation
Models explain or account for some (and sometimes all) of the case-to-case variation. If cases don’t vary one from the other, there is nothing to model!
It’s helpful to have a way to measure the “amount” of variation in a quantity so that you can describe how much of the overall variation a model accounts for. There are several such standard measures, described in Chapter 3:
- the standard deviation
- the variance (which is just the standard deviation squared)
- the inter-quartile interval
- the range (from minimum to maximum)
- coverage intervals, such as the 95% coverage intervals
Each of these ways of measuring variation has advantages and disadvantages. For instance, the inter-quartile interval is not much influenced by extreme values, whereas the range is completely set by them. So the inter-quartile interval has advantages if you are interested in describing a “typical” amount of variation, but disadvantages if you do not want to leave out even a single case, no matter how extreme or non-typical.
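As a quick illustration, here is a short Python/numpy sketch computing each of these measures for the seven-value data set used in the next section; the particular percentile rule used for the 95% coverage interval is just one reasonable choice.

```python
import numpy as np

y = np.array([7, 12, 11, 13, 16, 20, 19])   # the small example from Section 4.2.1

sd       = y.std(ddof=1)                    # standard deviation
variance = y.var(ddof=1)                    # variance = sd ** 2
q1, q3   = np.percentile(y, [25, 75])
iqr      = q3 - q1                          # inter-quartile interval
rng      = y.max() - y.min()                # range, from minimum to maximum
coverage = np.percentile(y, [2.5, 97.5])    # endpoints of a 95% coverage interval

print(sd, variance, iqr, rng, coverage)
```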
It turns out that the variance in particular has a property that is extremely advantageous for describing how much variation a model accounts for: the variance can be used to partition (split up) the case-to-case variation into the portion that is explained or accounted for by a model and the portion that remains unexplained or unaccounted for. This latter portion is called the residual variation.
4.2.1 A small example
To illustrate, let’s consider a very small, artificial example – one small enough that we can see all the parts. Here is the data set.
y | group |
---|---|
7 | A |
12 | A |
11 | A |
13 | B |
16 | B |
20 | B |
19 | B |
We’ll consider a groupwise model with a mean for each group. Let’s add the grand mean, the group means, and the residuals to the table:
y | group | grand mean | group mean | residual |
---|---|---|---|---|
7 | A | 14 | 10 | -3 |
12 | A | 14 | 10 | 2 |
11 | A | 14 | 10 | 1 |
13 | B | 14 | 17 | -4 |
16 | B | 14 | 17 | -1 |
20 | B | 14 | 17 | 3 |
19 | B | 14 | 17 | 2 |
Now let’s compute three variances:
- The variance of the original response (\(y\)) is \[ \frac{(-7)^2 + (-2)^2 + (-3)^2 + (-1)^2 + 2^2 + 6^2 + 5^2}{6} = 21.33 \]
- The variance of the model values (i.e., the group means) is \[ \frac{3 \cdot (10 - 14)^2 + 4 \cdot (17 - 14)^2}{6} = 14 \]
- The variance of the residuals is \[ \frac{(-3)^2 + 2^2 + 1^2 + (-4)^2 + (-1)^2 + 3^2 + 2^2}{6} = 7.33 \]

Notice that the two smaller variances add up to the variance of the original response. That is, we have partitioned the variance into two pieces:
- The part accounted for by the model. One reason individuals differ is that they belong to different groups, and each group has a different mean. This is measured by taking the variance of the group means. The more the group means differ from one another, the more the groups account for differences from individual to individual.
- The part not accounted for by the model. Individuals within a group also vary: some response values are larger than the group mean, some smaller. This is measured by taking the variance of the residuals.
Variance Partition:
Overall | | Model | | Residual |
---|---|---|---|---|
21.33 | = | 14 | + | 7.33 |
Notice that the same identity would hold if we ignored the denominators in the variances (since they are all the same). That is, we could just sum up the squares in the numerators and get the same sort of partitioning; everything would just be 6 times as big.
Sum of Squares Partition:
Overall | | Model | | Residual |
---|---|---|---|---|
128 | = | 84 | + | 44 |
Importantly, this identity holds for all models of this type, not just for our little example, and for many more general models we will encounter as well.
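For readers who want to check the arithmetic, here is a short Python/numpy sketch that reproduces the partition for the small example above:

```python
import numpy as np

y     = np.array([7, 12, 11, 13, 16, 20, 19])
group = np.array(["A", "A", "A", "B", "B", "B", "B"])

grand_mean = y.mean()                                          # 14
model_vals = np.array([y[group == g].mean() for g in group])   # each case's group mean
residuals  = y - model_vals

# Sums of squares (the numerators of the variances; the common denominator 6 is dropped)
ss_total    = ((y - grand_mean) ** 2).sum()            # 128
ss_model    = ((model_vals - grand_mean) ** 2).sum()   #  84
ss_residual = (residuals ** 2).sum()                   #  44

print(ss_total, ss_model + ss_residual)                # 128.0 128.0
print(ss_total / 6, ss_model / 6, ss_residual / 6)     # 21.33..., 14.0, 7.33...
```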
4.2.2 Heights of men and women
Returning to the heights of men and women, the groupwise mean model says “Women are 64.1 inches tall, while men are 69.2 inches tall.” Let’s see how this model accounts for variation.
Measuring the person-to-person variation in height by the variance gives 12.8 square inches. That’s the total variation to be accounted for.
Now imagine creating a new data set that replaces each person’s actual height with what the model says. So all men would be listed at a height of 69.2 inches, and all women at a height of 64.1 inches. Those model values also have a variation, which can be measured by their variance: 6.5 square inches.
Now consider the residuals, the differences between the actual heights and the heights according to the model. A woman who is 67 inches tall would have a residual of 2.9 inches – she’s taller by 2.9 inches than the model says. Each person has his or her own residual in a model. Since these vary from person to person, they also have a variance, which turns out to be 6.3 square inches for this groupwise model of heights.
So once again we have our simple relationship among the three variances:
Variance Partition:
Overall | | Model | | Residual |
---|---|---|---|---|
12.8 | = | 6.5 | + | 6.3 |
This is the partitioning property of the variance: the overall case-by-case variation in a quantity is the sum of the variance of the model values and the variance of the residuals.
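A minimal pandas sketch of the same computation; the file name heights.csv and the column names height and sex are assumptions for illustration, not the actual data set used above.

```python
import pandas as pd

# Assumed layout: one row per person, columns "height" (inches) and "sex".
heights = pd.read_csv("heights.csv")

grand_mean = heights["height"].mean()

# Model value for each person: the mean height of that person's sex,
# broadcast back onto every row.
heights["model_val"] = heights.groupby("sex")["height"].transform("mean")
heights["residual"]  = heights["height"] - heights["model_val"]

n = len(heights)
var_total    = heights["height"].var()                                # overall variance
var_model    = ((heights["model_val"] - grand_mean) ** 2).sum() / (n - 1)
var_residual = (heights["residual"] ** 2).sum() / (n - 1)

print(var_total, var_model + var_residual)   # the two should agree
```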
4.2.3 Why variance?
You might wonder why it’s the variance – the square of the standard deviation, with its funny units (square inches for height!) – that works for partitioning. What about the standard deviation itself, or the IQR, or other ways of describing variation? As it happens, this is a special property of the variance (or, equivalently, of the sum of squares). It’s possible to calculate any of the other measures of variation, but they won’t generally be such that the variation in the model values plus the variation in the residuals gives exactly the variation in the quantity being modeled. It’s this property that leads to the variance being an important measure, even though the standard deviation contains the same information and has more natural units.
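You can check this numerically with the small example from Section 4.2.1: the variances of the model values and the residuals add up exactly to the variance of the response, but the corresponding standard deviations do not. A quick numpy check:

```python
import numpy as np

model_vals = np.array([10, 10, 10, 17, 17, 17, 17])   # group means, case by case
residuals  = np.array([-3, 2, 1, -4, -1, 3, 2])
y          = model_vals + residuals                    # the observed responses

# Variances partition exactly: 14 + 7.33... = 21.33...
print(np.var(y, ddof=1), np.var(model_vals, ddof=1) + np.var(residuals, ddof=1))

# Standard deviations do not: 4.62... versus 3.74... + 2.71... = 6.45...
print(np.std(y, ddof=1), np.std(model_vals, ddof=1) + np.std(residuals, ddof=1))
```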
The reason the variance is special can be explained in different ways, but for now it suffices to point out an analogous situation that you have seen before. Recall the Pythagorean theorem and the way it describes the relationship between the sides of a right triangle: if \(a\) and \(b\) are the lengths of the sides adjoining the right angle, and \(c\) is the length of the hypotenuse, then \(a^2 + b^2 = c^2\). One way to interpret this is that sides \(a\) and \(b\) partition the hypotenuse, but only when you measure things in terms of squared lengths rather than the lengths themselves. See the geometry section at the end of this chapter if you’re interested in more connections to the Pythagorean theorem – it turns out it isn’t just an analogy; there is actually a right triangle involved.
To be precise about the variance and partitioning: the variance has this property of partitioning for a certain kind of model – groupwise means and the generalization of them called linear, least-squares models, which are the subject of later chapters – but those models are by far the most important. For other kinds of models, such as the logistic models described in Chapter 16, there are other measures of variation that have the partitioning property.
4.3 Group-wise Proportions
It’s often useful to consider proportions broken down, group by group. For example, in examining employment patterns for workers, it makes sense to consider mean or median wages in different groups, mean or median ages, and so on. But when the question has to do with employment termination – whether or not a person was fired – the appropriate quantity is the proportion of workers in each group who were terminated. For instance, in the job termination data, about 10% of employees were terminated. This differs from job level to job level, as seen in the tables of counts and percentages below. For instance, fewer than 2% of Principals (the people who run the company) were terminated. Staff were the most likely to be terminated.
 | Administrative | Manager | Principal | Senior | Staff |
---|---|---|---|---|---|
Retained | 244 | 225 | 119 | 235 | 298 |
Terminated | 22 | 20 | 2 | 26 | 39 |
 | Administrative | Manager | Principal | Senior | Staff |
---|---|---|---|---|---|
Retained | 91.7 | 91.8 | 98.3 | 90 | 88.4 |
Terminated | 8.3 | 8.2 | 1.7 | 10 | 11.6 |
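Here is a short pandas sketch that turns the table of counts above into the table of percentages; the counts are typed in directly from the table, and the variable names are just for illustration.

```python
import pandas as pd

counts = pd.DataFrame(
    {"Administrative": [244, 22], "Manager": [225, 20], "Principal": [119, 2],
     "Senior": [235, 26], "Staff": [298, 39]},
    index=["Retained", "Terminated"])

# Convert the counts in each job-level column into percentages of that column.
percentages = 100 * counts / counts.sum(axis=0)
print(percentages.round(1))
```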
Figure 4.1 shows another way of looking at the termination data: breaking down the groups according to age and, within each age group, showing the fraction of workers who were terminated. The graph suggests that workers in their late 60s were substantially more likely to be fired. This might be evidence for age discrimination, but there might be other reasons for the pattern. For instance, it could be that those employees in their 60s were people who failed to be promoted in the past, or who were making relatively high salaries, or who were planning to retire soon. A more sophisticated model would be needed to take such factors into account.
We’ll return to (logistic regression) models that predict proportions in Chapter 17.
4.4 What’s the Precision?
The main point of constructing groupwise models is to be able to support a claim that the groups are different, or perhaps to refute such a claim. In thinking about differences, it’s helpful to distinguish between two sorts of criteria:
- Whether the difference is substantial or important in terms of the phenomenon that you are studying. For example, Administrative workers in the previous example were terminated at a rate of 8.27%, whereas Managers were terminated at a rate of 8.16%. This hardly makes a difference.
- How much evidence there is for any difference at all. This is a more subtle point.
In most settings, you will be working with a sample from a population rather than with the population itself. In considering differences between groups, you need to take into account the random nature of the sampling process. This randomness in the sampling process leads to randomness in the groupwise statistics: If different cases had been included in the sample, the differences would have been somewhat larger or smaller than the differences we observed.
Quantifying and interpreting this sampling variability is an important component of statistical reasoning. The techniques to do so are introduced in Chapter 5 and then expanded in later chapters.
4.5 Misleading Group-wise Models
Group-wise models appear very widely, and are generally simple to explain to others and to calculate, but this does not mean they serve the purposes for which they are intended. To illustrate, consider a study done in the early 1970s in Whickham, UK, that examined the health consequences of smoking (Appleton, French, and Vanderpump 1996). The method of the study was simple: interview women to find out who smokes and who doesn’t. Then, 20 years later, follow up to find out who is still living.
Examining the Whickham data shows that, after 20 years, 945 women in the study were still alive out of 1315 total: a proportion of 72%. Breaking this proportion down groupwise into smokers and non-smokers gives
- Non-smokers: 68.6% were still alive.
- Smokers: 76.1% were still alive.
Before drawing any conclusions, you should know the precision of those estimates. Using techniques to be introduced in the next chapter, you can calculate a 95% confidence interval on each proportion:
- Non-smokers: 68.6 ± 3.3% were still alive.
- Smokers: 76.1 ± 3.7% were still alive.
The 95% confidence interval on the difference in proportions is 7.5 ± 5.0 percentage points. That is, the data say that smokers were more likely than non-smokers to have stayed alive through the 20-year follow-up period.
Perhaps you are surprised by this. You should be. Smoking is convincingly established to increase the risk of dying (as well as causing other health problems such as emphysema).
The problem isn’t with the data. The problem is with the groupwise approach to modeling. Comparing the smokers and non-smokers in terms of mortality doesn’t take into account the other differences between those groups. For instance, at the time the study was done, many of the older women involved had grown up at a time when smoking was uncommon among women. In contrast, the younger women were more likely to smoke. You can see this in the different ages of the two groups:
- Among non-smokers the average age is 48.7 ± 1.3.
- Among smokers the average age is 44.7 ± 1.4.
You might think that the difference of four years in average ages is too small to matter. But it does, and you can see the difference when you use modeling techniques to incorporate both age and smoking status as explaining mortality.
Since age is related to smoking, the question the groupwise model asks is, effectively, “Are younger smokers different in survival than older non-smokers?” This is probably not the question you want to ask. Instead, a meaningful question would be, “Holding other factors constant, are smokers different in survival than non-smokers?”
You will often see news reports or political claims that attempt to account for or dismiss differences by appealing to “other factors.” This is a valuable form of argument, but it ought to be supported by quantitative evidence, not just an intuitive sense of “small” or “big.” The modeling techniques introduced in the following chapters enable you to consider multiple factors in a quantitative way.
A relatively simple modeling method called stratification can illustrate how this is possible.
Rather than simply dividing the Whickham data into groups of smokers and non-smokers, divide it as well into groups by age. Table 4.6 shows survival proportions computed this way:
smoker | [18,30) | [30,40) | [40,54) | [54,64) | [64,80) |
---|---|---|---|---|---|
No | 0.980 | 0.957 | 0.886 | 0.664 | 0.225 |
Yes | 0.973 | 0.960 | 0.797 | 0.570 | 0.255 |
Within most age groups, and especially the older ones, smokers are less likely than non-smokers to have been alive at the 20-year follow-up. By comparing people of similar ages – stratifying or disaggregating the data by age – the model is effectively “holding age constant.”
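In code, stratification amounts to grouping by two variables at once. Here is a minimal pandas sketch; the file name whickham.csv and the columns age, smoker, and alive (1 if alive at the follow-up, 0 if not) are assumptions for illustration, not the layout of the original data.

```python
import pandas as pd

# Assumed columns: age (years at enrollment), smoker ("Yes"/"No"),
# alive (1 = alive at the 20-year follow-up, 0 = not).
whickham = pd.read_csv("whickham.csv")

# Stratify by the same age bins used in the table above: [18,30), [30,40), ...
bins = [18, 30, 40, 54, 64, 80]
whickham["age_group"] = pd.cut(whickham["age"], bins=bins, right=False)

# Proportion alive within each smoker-by-age-group cell.
survival = whickham.groupby(["smoker", "age_group"], observed=True)["alive"].mean()
print(survival.unstack())
```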
You may rightly wonder whether the specific choice of age groups plays a role in the results. You also might wonder whether it’s possible to extend the approach to more than one stratifying variable, for instance, not just smoking status but overall health status. The following chapters will introduce modeling techniques that let you avoid having to divide variables like age into discrete groups and that allow you to include multiple stratifying variables.
These data are an example of Simpson’s paradox. Simpson’s paradox occurs when an association between two variables in a population emerges, disappears, or reverses when the population is divided into subpopulations. Simpson’s paradox is one of many reasons that it is important to consider multiple variables at once in many situations – ignoring additional variables can lead us to draw the wrong conclusion. Modeling with multiple explanatory variables will be one tool to avoid the problems associated with Simpson’s paradox.
4.6 The geometry of partitioning*
The idea of partitioning is to divide the overall variation into two parts: the part accounted for by the model and the part that remains, the residual.
This is done by assigning to each case a model value. The difference between the observed response value and the model value is the residual. Naturally, this means that the model value plus the residual adds up to the observed response value. This happens for each row of our data.
We can represent this with three vectors, each with \(n\) components, where \(n\) is our sample size:
- \(\mathbf y\) = vector of observed response values
- \(\overline{\mathbf y}\) = vector of the grand mean (repeated \(n\) times)
- \(\hat{\mathbf y}\) = vector of model values (group means in our groupwise mean model)
The first is the sum of the second and third:
\[ \mathbf y = \hat{\mathbf y} + (\mathbf{y} - \hat{\mathbf y}) \] See the small example above (Table 4.3 and Table 4.7) where these three vectors are represented as columns in the table. \(\hat{\mathbf y}\) is labeled “group mean” and \(\mathbf{y} - \hat{\mathbf y}\) is labeled residual.
y | group | grand mean | group mean | residual |
---|---|---|---|---|
7 | A | 14 | 10 | -3 |
12 | A | 14 | 10 | 2 |
11 | A | 14 | 10 | 1 |
13 | B | 14 | 17 | -4 |
16 | B | 14 | 17 | -1 |
20 | B | 14 | 17 | 3 |
19 | B | 14 | 17 | 2 |
Now let’s think about our three variances.
- For the variance of the observed response values (\(y\)), we need to add up the squares of \(y - \overline{y}\).
- For the variance of the model values (\(\hat y\)), we need to add up the squares of \(\hat y - \overline{y}\) (because the average of the model values is the same as the average of the response values).
- For the variance of the residuals (\(y - \hat y\)), we need to add up the squares of \(y - \hat y - 0 = y - \hat y\), because the average of the residuals is always 0.
These sums of squares are the squared lengths of the vectors \(\mathbf y - \overline{\mathbf{y}}\), \(\hat{\mathbf y} - \overline{\mathbf{y}}\), and \(\mathbf y - \hat{\mathbf{y}}\).
These three vectors form a triangle, since \(\mathbf y - \overline{\mathbf{y}} = (\hat{\mathbf y} - \overline{\mathbf{y}}) + (\mathbf y - \hat{\mathbf{y}})\). For reasons to be described in Chapter 8, it turns out that this triangle is a right triangle, so the Pythagorean theorem applies and the squared lengths of the sides add in the familiar way:
\[ \underbrace{|\mathbf y - \overline{\mathbf{y}}|^2}_{\mbox{total variation}} = \underbrace{|\hat{\mathbf y} - \overline{\mathbf{y}}|^2}_{\mbox{model variation}} + \underbrace{|\mathbf y - \hat{\mathbf{y}}|^2}_{\mbox{residual variation}} \]
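A short numpy check of this identity for the small example, where the three vectors are the columns of the table above:

```python
import numpy as np

y    = np.array([7, 12, 11, 13, 16, 20, 19])     # observed responses
ybar = np.full(7, 14.0)                          # grand mean, repeated n times
yhat = np.array([10, 10, 10, 17, 17, 17, 17])    # model values (group means)

total_sq    = np.sum((y - ybar) ** 2)      # squared length of y - ybar:    128
model_sq    = np.sum((yhat - ybar) ** 2)   # squared length of yhat - ybar:  84
residual_sq = np.sum((y - yhat) ** 2)      # squared length of y - yhat:     44

print(total_sq, model_sq + residual_sq)    # 128.0 128.0
```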