Statistical Modeling: A Fresh Approach

Authors

Daniel Kaplan

Randall Pruim

Published

February 25, 2023

Preface

Preface to this electronic edition

This electronic edition was customized by Randall Pruim and Stacy DeRuiter for courses at Calvin University, but we hope the materials will be useful in many other settings as well. It is currently a work in progress, but we hope the work will be mostly completed by January 2022.

Some of the bigger changes in this edition compared to previous editions:

  • The geometry sections have been moved to the end of chapters.
  • There has been some section releveling to more clearly show the outline of each chapter. More sections are numbered (sometimes at a different depth level) for easier reference.
  • Some examples have been added, removed, or modified.

Preface to the electronic 2nd edition

When Statistical Modeling: A Fresh Approach was being drafted, the Amazon Kindle had just been introduced and, at $400, was too pricey for most students. A year after the book appeared in print, the Apple iPad was released. Since then, a generation of students reads in an electronic format as a matter of course. Many have an ebook reader always at hand: a smart phone.

The e-book format has many advantages beyond portability. The books can be much cheaper and less of a drain on natural resources. The text can be searched easily. On many platforms, the display is in color. The reader can adjust the display to suit his or her preferences. All for the good.

The flip side of being able to read on devices of many different configurations is that authors cannot reliably anticipate what the reader will be looking at on the “page.” Traditionally, authors and book designers have worked with a fixed page, allowing them to lay out text, graphics, mathematical notation, computer code, etc. as an integrated whole. If you put two images side by side so that the reader could easily refer back and forth, those images stayed side by side. This isn’t true with e-book formats. Tables, which are all about putting things in a spatial arrangement, can become almost microscopic on a smart phone. You can mitigate some of these deficiencies by being an active reader, for instance by switching between portrait and landscape modes.

The original printed versions of this book had a “computational technique” section at the end of every chapter. That material is now online. The problems of formatting it for an e-book are just too great. Even more important, the web interface allows those materials to become interactive and thereby enhances learning.

The web-based materials are available through the project-mosaic-books.com page for this book.

Preface to the printed 2nd edition

The purpose of this book is to provide an introduction to statistics that gives readers a sufficient mastery of statistical concepts, methods, and computations to apply them to authentic systems. By “authentic,” I mean the sort of multivariable systems often encountered when working in the natural or social sciences, commerce, government, law, or any of the many contexts in which data are collected with an eye to understanding how things work or to making predictions about what will happen.

The world is uncertain and complex. We deal with the complexity and uncertainty with a variety of strategies including the scientific method and the discipline of statistics.

Statistics deals with uncertainty, quantifying it so that you can assess how reliable – how likely to be repeatable – your findings are. The scientific method deals with complexity: reduce systems to simpler components, define and measure quantities carefully, do experiments in which some conditions are held constant but others are varied systematically.

Beyond helping to quantify uncertainty and reliability, statistics provides another great insight of which most people are unaware. When dealing with systems involving multiple influences, it is possible and best to deal with those influences simultaneously. By appropriate data collection and analysis, the confusing tangle of influences can sometimes be straightened out. In other words, statistics goes hand-in-hand with the scientific method when it comes to dealing with complexity and understanding how systems work.

The statistical methods that can accomplish this are often considered advanced: multiple regression, analysis of covariance, logistic regression, among others. With appropriate software, any method is accessible in the sense of being able to produce a summary report on the computer. But a method is useful only when the user has a way to understand whether the method is appropriate for the situation, what the method is telling us about the data, and what the method is not capable of revealing. Computer scientist Richard Hamming (1915-1998) said: “The purpose of computing is insight, not numbers.” Without a solid understanding of the theory that underlies a method, the numbers generated by the computer may not give insight.

Advanced methods of statistics can give tremendous insight. For this reason, these methods need to be accessible both computationally and theoretically to the widest possible audience. Historically, access has been limited because few people have the algebraic skills needed to approach the methods in the way they are usually presented. But there are many paths to understanding and I have undertaken to find one – the “fresh approach” in the title – that takes the greatest advantage of the actual skills that most people already have in abundance.

In trying to meet that challenge, I have made many unconventional choices. Theory becomes simpler when there is a unified framework for treating many aspects of statistics, so I have chosen to present just about everything in the context of models: descriptive statistics as well as inference.

Consequently, algebraic notation and formulas are strongly de-emphasized in this book. The traditional role that formulas have played in providing instructions for how to carry out a calculation is no longer essential for effective use of statistical methods. Software now implements the calculations. What’s needed is not a formula-based description that allows people to reproduce what computers do, but a way to understand the methods at a high level so that the rapidity and reliability of computers in performing calculations can be used to provide insight into real-world problems.
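To make that point concrete, here is a minimal sketch of what “software implements the calculations” looks like in R, the software used throughout this book. The dataset (mtcars, which ships with base R) and the particular model formula are illustrative assumptions, not examples drawn from the book itself:

    # Fit a linear model of fuel economy on weight and number of cylinders.
    # No formula needs to be worked by hand; the software does the arithmetic.
    model <- lm(mpg ~ wt + factor(cyl), data = mtcars)

    coef(model)     # fitted coefficients
    summary(model)  # coefficients, standard errors, R-squared, etc.

The reader's job is not to reproduce these calculations but to decide whether the model is appropriate and to interpret what the output does and does not say.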

I have been fortunate to have the assistance and support of many people. Some of the colleagues who have played important roles are David Bressoud, George Cobb, Dan Flath, Tom Halverson, Gary Krueger, Weiwen Miao, Phil Poronnik, Victor Addona, Alicia Johnson, Karen Saxe, Michael Schneider, and Libby Shoop. Critical institutional support was given by Brian Rosenberg, Jan Serie, Dan Hornbach, Helen Warren, and Diane Michelfelder at Macalester and Mercedes Talley at the Keck Foundation.

I received encouragement from many in the statistics education community, including George Cobb, Joan Garfield, Dick De Veaux, Bob delMas, Julie Legler, Milo Schield, Paul Alper, Dennis Pearl, Jean Scott, Ben Hansen, Tom Short, Andy Zieffler, Sharon Lane-Getaz, Katie Makar, Michael Bulmer, Frank Shaw, and the participants in our monthly “Stat Chat” sessions. Helpful suggestions came from Simon Blomberg, Dominic Hyde, Michael Lavine, Erik Larson, Julie Dolan, and Kendrick Brown. Michael Edwards helped with proofreading. Nick Trefethen and Dave Saville provided important insights about the geometry of fitting linear models.

It’s important to recognize the role played by the developers of the R software – the “core” R team as well as the group of volunteers who have provided numerous packages that extend R’s capabilities. Hadley Wickham, in particular, developed the ggplot2 package used to create many of the graphics in this Second Edition, as well as a remarkable array of other utilities for treating data in a unified way. The design of R (and its progenitor S) is not just a matter of good software design, but of a brilliant understanding and systematization of statistics that makes the underlying logic of statistics accessible to students as well as experts. Further extending the reach of R, J.J. Allaire, Joe Chang, and Joshua Paulson have created the RStudio interface to R, which makes it much easier to teach and learn with R.

Special thanks are due to Randall Pruim and Nicholas Horton who, as mosaic activists, have improved the extensions to R used in this book and provided a wide range of suggestions that have found their way into the Second Edition.

Thanks also go to the hundred or so students at Macalester College who enrolled in the early, experimental sessions of Math 155 where many of the ideas in this book were first tested. Among those students, I want to acknowledge particular help from Alan Eisinger, Caroline Ettinger, Bernd Verst, Wes Hart, Sami Saqer, and Michael Snavely. Approximately 500 Macalester students have used the First Edition of this book, many of whom have helped identify errors and suggested clarifications and other improvements.

Crucial early support for this project was provided by a grant from the Howard Hughes Medical Institute. An important Keck Foundation grant was crucial to the continuing refinement of the approach and the writing of this book. Google provided summer-of-code funding for my student Andrew Rich to develop interactive applets that can be used along with this book.

Finally, my thanks and love to my wife, Maya, and daughters, Tamar, Liat, and Netta, who endured the many, many hours during which I was preoccupied by some or another statistics-related enthusiasm, challenge, or difficulty.