Applied statisticians use a range of tools to derive, assess and improve statistical models fitted to our clients’ data. Visual displays are particularly useful for quickly assessing whether a statistical model is applicable, including how well the model fits the data and whether its assumptions are satisfied.
For example, graphical displays such as Figure 1 are widely used to check two common assumptions made when fitting a linear model: that the random variation is normally distributed, and that its variance is stable.
Linear models are fundamental to many statistical analysis techniques, with simple linear regression being the best-known example. Statistical models typically consist of two parts: expected or average values, which might depend on some independent or explanatory variables, and unavoidable random variation. Without random variation there would be little need for statistical analysis – everything would be known exactly. Statistical analysis involves describing, explaining and accounting for random variation, which can be regarded as “noise”, so that the real “signal” can be seen more clearly. The goal might be a clear understanding of which variables are related to the variable of interest, and how.
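The signal-plus-noise idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are simulated, the intercept, slope and noise level are arbitrary, and NumPy's `polyfit` stands in for whatever software is actually used to fit the model.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is repeatable

# Hypothetical "signal": the expected value depends linearly on one explanatory variable.
x = np.linspace(0.0, 10.0, 50)
true_intercept, true_slope = 2.0, 0.5  # illustrative values only
signal = true_intercept + true_slope * x

# Unavoidable random variation ("noise"), assumed here to be normal with constant spread.
noise = rng.normal(loc=0.0, scale=1.0, size=x.size)
y = signal + noise

# Least-squares straight-line fit; polyfit returns the slope first, then the intercept.
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
print(f"estimated intercept = {intercept_hat:.2f}, estimated slope = {slope_hat:.2f}")
```

The estimates should land near, but not exactly on, the values used to generate the data; the gap between them is the effect of the noise.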
Given several candidate variables that might plausibly be associated with a variable of interest, we use statistical analysis to determine which variables or combinations of variables are important. This is commonly done with formal statistical tests, and all statistical tests rest on assumptions of some kind.
In standard linear models these assumptions might be that the random variation in the data follows a normal or Gaussian (bell-shaped) distribution, and that the size of the variation about the estimated values is similar for all the data. These are the assumptions that can be checked by plots such as those in Figure 1. In these plots, fitted values are the model’s best estimate of the underlying expected or average value, and residuals indicate the random variation present: they are the deviations of the actual observed data from the estimated average value based on the fitted model.
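As a rough numerical companion to such plots, the sketch below computes fitted values and residuals for a straight-line fit and summarises the two checks. It is not the method used to produce Figure 1: the data are simulated, the helper names `fit_line` and `normal_scores` are hypothetical, and the plotting positions (i - 0.5)/n are one common convention among several.

```python
from statistics import NormalDist

import numpy as np


def fit_line(x, y):
    """Return fitted values and residuals from a least-squares straight-line fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    residuals = y - fitted  # observed value minus estimated average value
    return fitted, residuals


def normal_scores(n):
    """Standard-normal quantiles at plotting positions (i - 0.5) / n."""
    std_normal = NormalDist()
    return np.array([std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])


# Simulated data standing in for a real data set (illustrative values only).
rng = np.random.default_rng(seed=2)
x = np.linspace(0.0, 10.0, 60)
y = 1.0 + 0.8 * x + rng.normal(scale=1.5, size=x.size)

fitted, residuals = fit_line(x, y)

# Normality: sorted residuals plotted against normal scores should lie close to
# a straight line, so their correlation should be close to 1.
qq_corr = np.corrcoef(np.sort(residuals), normal_scores(residuals.size))[0, 1]

# Stability of variance: residual spread should be similar for low and high fitted values.
order = np.argsort(fitted)
half = residuals.size // 2
sd_low = residuals[order[:half]].std(ddof=1)
sd_high = residuals[order[half:]].std(ddof=1)

print(f"Q-Q correlation = {qq_corr:.3f}")
print(f"residual SD: low fitted values = {sd_low:.2f}, high fitted values = {sd_high:.2f}")
```

Summaries like these do not replace the plots. A residuals-versus-fitted plot can reveal curvature or a funnel shape that two standard deviations would miss.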
It is also important that observations be independent – i.e. that the random variation associated with one observation has no bearing on any other observation. This assumption cannot be checked by plots such as those in Figure 1. Commonly, attempts are made to ensure independence through the way the data is collected, such as by using randomisation in experiments or in the selection of sample survey participants. It can be difficult to demonstrate independence, but in some cases particular forms of dependence might be hypothesised and can be checked. Time series data and spatial data are examples: the residuals can be displayed against time-point or spatial location, and the plot can be examined for non-random patterns. This is one of the checks illustrated in this article, and it is important because lack of independence can invalidate the usual statistical tests carried out on linear models.
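For residuals ordered in time, one hypothesised form of dependence is that each residual resembles the one before it. The sketch below, which is an illustration rather than the check used in this article, compares two simulated residual series using the lag-1 autocorrelation and the Durbin-Watson statistic. The function names are hypothetical and the dependent series is an assumed first-order autoregressive process with coefficient 0.8.

```python
import numpy as np


def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one immediately before it."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r * r))


def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 when successive residuals are uncorrelated."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(r) ** 2) / np.sum(r * r))


rng = np.random.default_rng(seed=3)
n = 200

# Independent residuals: no observation carries information about any other.
independent = rng.normal(size=n)

# Dependent residuals: each value carries over 80% of the previous one (assumed AR(1)).
shocks = rng.normal(size=n)
dependent = np.empty(n)
dependent[0] = shocks[0]
for t in range(1, n):
    dependent[t] = 0.8 * dependent[t - 1] + shocks[t]
dependent -= dependent.mean()  # residuals from a model with an intercept average zero

for name, series in [("independent", independent), ("dependent", dependent)]:
    print(f"{name}: lag-1 autocorrelation = {lag1_autocorrelation(series):.2f}, "
          f"Durbin-Watson = {durbin_watson(series):.2f}")
```

A lag-1 autocorrelation well away from zero, or a Durbin-Watson statistic well below 2, points to the kind of non-random pattern that a plot of residuals against time would show as long runs above or below zero.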