Mapping as a Diagnostic Tool for Model-fitting

Standard Diagnostic Plots

Figure 1. Diagnostic residuals plots from a simple linear regression model

Applied statisticians use a range of tools to help derive, assess and improve statistical models fitted to our clients’ data.  Visual displays are particularly important for quickly assessing whether a statistical model is applicable, including how well it fits the data and whether its assumptions are satisfied.

For example, graphical displays such as Figure 1 are widely used to check the common assumptions of normally distributed random variation and stable variance when fitting a linear model.

Linear models are fundamental to many statistical analysis techniques, with simple linear regression being the best known.  Statistical models typically consist of expected or average values, which might depend on some independent or explanatory variables, and some random variation, which is unavoidable.  Without random variation there would be little need for statistical analysis, since everything would be known exactly.  Statistical analysis involves describing, explaining and accounting for random variation, which can be regarded as “noise”, so that the real “signal” can be more clearly seen.  The goal might be a clear understanding of which variables are related (and how) to the variable of interest.

Given several candidate variables which might arguably be associated with a variable of interest, we use statistical analysis to determine which variables or combinations of variables are important, and this is commonly done by formal statistical tests.  All statistical tests rest on assumptions.

In standard linear models these assumptions might be that the random variation follows a normal, or Gaussian (bell-shaped), distribution, and that the size of the variation about the estimated values is similar for all the data.  These are the assumptions that can be checked by plots such as those in Figure 1.  In these plots, fitted values are the model’s best estimate of the underlying expected or average value.  Residuals indicate the random variation present:  they are the deviations of the actual observed data from the estimated average value based on the fitted model.
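To make the definitions of fitted values and residuals concrete, the following minimal sketch fits a simple linear regression by least squares and computes the quantities that sit behind plots like those in Figure 1.  The data are synthetic and all variable names are assumptions made for illustration; this is not the software or the census data used for the article.

```python
import random
from statistics import NormalDist, mean

# Synthetic data for illustration only (not the census data used in this article).
random.seed(0)
x = [random.uniform(500, 2500) for _ in range(200)]
y = [50 + 0.15 * xi + random.gauss(0, 40) for xi in x]

# Least-squares estimates for the model y = intercept + slope * x.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Fitted values estimate the expected value; residuals are observed minus fitted.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Coordinates for a normal quantile plot: ordered residuals against the
# normal quantiles expected for a sample of this size.
n = len(residuals)
ordered = sorted(residuals)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
```

Plotting `residuals` against `fitted` gives a check on stability of variance, and plotting `ordered` against `theoretical` gives a check on normality, where a roughly straight line is consistent with the assumption.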

It is also important that observations be independent, that is, that the random variation associated with one observation has no bearing on any other observation.  This assumption cannot be checked by plots such as those in Figure 1.  Commonly, attempts are made to ensure independence by the way the data are collected, such as by using randomisation in experiments or in the selection of sample survey participants.  It can be difficult to demonstrate independence, but in some cases particular forms of dependence might be hypothesised and can be checked.  Time series data and spatial data are examples, where the residuals can be displayed in relation to time-point or spatial location, and the plot can be examined for non-random patterns.  This is one of the checks illustrated in this article, and is important because lack of independence can invalidate the usual statistical tests carried out on linear models.

House Rents Example

The diagnostic plots of residuals presented in Figure 1 arose from a simple example based on data from the Australian Bureau of Statistics’ 2011 Census of Population and Housing, for regions in Western Australia known as Statistical Areas Level 2, or SA2s.  For each SA2 we were interested in whether the median weekly rent paid by households in rented residential premises was related to median household income in that SA2, and fitted a simple linear regression.

The estimated model was: 

 "Median weekly rent" = 52.3 + 0.15 * "median weekly income"

from which we would conclude that an increase of $100 in median weekly income for an SA2 corresponds to an increase of $15 in median weekly rent.  This model accounted for 49% of the variance (adjusted R², which allows for the number of terms in the model).  However, the diagnostic plots in Figure 1 show a few outliers (unusual values, not consistent with the normal-distribution assumption) and some skewness or asymmetry in the distribution.  We would like to improve our model so that the assumptions are better satisfied and we can have greater confidence in the model and associated statistical tests.
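As a rough illustration of how the estimated equation is read, the sketch below evaluates it at two incomes and also shows the usual formula for adjusted R².  The function names and the input values are hypothetical; the article does not report the number of SA2s or the unadjusted R², so the adjusted R² call uses made-up numbers.

```python
def predicted_rent(median_weekly_income: float) -> float:
    """Median weekly rent estimated by the simple model quoted above."""
    return 52.3 + 0.15 * median_weekly_income


def adjusted_r2(r2: float, n_obs: int, n_terms: int) -> float:
    """Adjusted R-squared for a model with n_terms explanatory terms
    plus an intercept, fitted to n_obs observations."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)


# Hypothetical incomes, chosen only to show the $15-per-$100 relationship.
change = predicted_rent(1100) - predicted_rent(1000)

# Hypothetical inputs: the adjustment always pulls R-squared downwards.
example_adjusted = adjusted_r2(r2=0.5, n_obs=101, n_terms=1)
```

Here `change` is 15 (up to floating-point rounding), matching the interpretation given in the text, and `example_adjusted` is slightly below the unadjusted value of 0.5.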

Using Maps as a Diagnostic Tool

When data have a geographical association, such as data from different suburbs or localities, maps can provide an important and enlightening addition to the raft of tools used for model selection.  In Figure 2 we display the residuals for each SA2 on a map: pink shades represent negative residuals (smaller than expected rents relative to median income), and green shades represent positive residuals (larger than expected median rents according to the fitted model).  Darker colours represent bigger departures from the fitted model.
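The colour scheme described above can be sketched as a simple classification of each residual by sign and size.  The break points and the function name below are hypothetical, since the article does not state the class boundaries used for Figure 2.

```python
def residual_shade(residual: float, breaks=(25.0, 75.0)) -> str:
    """Map a residual to a colour class: pink for negative, green for
    positive, darker for larger magnitude.  Break points are hypothetical."""
    small, large = breaks
    size = abs(residual)
    if size < small:
        depth = "light"
    elif size < large:
        depth = "medium"
    else:
        depth = "dark"
    # A residual of exactly zero is arbitrarily grouped with the positives.
    hue = "pink" if residual < 0 else "green"
    return f"{depth} {hue}"


# Hypothetical residuals in dollars per week.
shades = [residual_shade(r) for r in (-100.0, -50.0, 10.0)]
```

Using a diverging scheme of this kind means that both the direction and the size of each departure from the fitted model can be read from the map at a glance.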

Figure 2. Residuals from the simple linear regression model, displayed for each SA2

What can be noted immediately from Figure 2 is a particularly obvious region with large negative residuals (dark pink, low rents relative to median income) in the north-west mining area (Pilbara) around Port Hedland and Newman, suggesting that these should perhaps be considered separately.  Therefore an indicator variable was added to the model, for six SA2s in the Pilbara region (four of them too small to show on the WA map in Figure 2).  
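An indicator variable of this kind is simply a column that takes the value 1 for the SA2s in the group and 0 otherwise, so that its coefficient shifts the intercept for that group.  The sketch below shows one way the model inputs might be built; the SA2 names and incomes are hypothetical, as the article does not list the six Pilbara SA2s.

```python
# Hypothetical SA2 names: the article does not list the six Pilbara SA2s.
PILBARA_SA2S = {"Pilbara example A", "Pilbara example B"}


def model_row(sa2_name: str, median_weekly_income: float) -> list[float]:
    """Return [intercept, income, Pilbara indicator] for one SA2."""
    in_pilbara = 1.0 if sa2_name in PILBARA_SA2S else 0.0
    return [1.0, median_weekly_income, in_pilbara]


rows = [
    model_row("Pilbara example A", 2400.0),      # hypothetical income
    model_row("Metropolitan example", 1500.0),   # hypothetical income
]
```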

Another striking feature of Figure 2 is that the pink shades dominate in the larger SA2s, while the green shades dominate in the small SA2s in the more highly populated regions, especially around Perth (inset).  Similar colours also appear in clusters rather than being randomly scattered, which suggests spatial dependence.  Together these patterns indicate that some other variable, perhaps related to population density, might help explain the median weekly rent levels, something that was not evident from the standard residuals plots in Figure 1.  Variables related to population density were therefore added to the model, the candidate models were compared and a new model was selected.
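The visual impression of clustering can be complemented by a numerical summary.  One common choice is Moran's I, which was not part of the analysis described in this article and is shown here only as an illustration.  The sketch assumes binary adjacency weights, and the chain of six regions and the residual values are hypothetical.

```python
def morans_i(values, neighbours):
    """Moran's I with binary weights.

    neighbours[i] lists the indices of the regions adjacent to region i.
    """
    n = len(values)
    centre = sum(values) / n
    z = [v - centre for v in values]
    # Sum of cross-products over every (region, neighbour) pair.
    cross = sum(z[i] * z[j] for i in range(n) for j in neighbours[i])
    total_weight = sum(len(nb) for nb in neighbours)
    return (n / total_weight) * cross / sum(zi * zi for zi in z)


# Hypothetical chain of six regions, each adjacent to the next one along.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]

clustered = [-3, -2, -1, 1, 2, 3]     # similar residuals sit side by side
alternating = [1, -1, 1, -1, 1, -1]   # neighbours always differ in sign

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)
```

Clearly positive values, as for `clustered`, indicate that neighbouring regions tend to have similar residuals, while values near -1/(n - 1) are what would be expected with no spatial pattern.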

Improved Outcome

Standard residuals plots (Figure 3) from this new model are an improvement over Figure 1, both in terms of patterns and in that the residuals tend to be smaller in magnitude, indicating that the model fits the data better.

The map of residuals (Figure 4) now shows a much more random scatter of colours across the regions, especially in the Perth area, where the clusters of green have disappeared.  There is still a predominance of negative residuals in the large remote areas, suggesting that we haven’t quite got the model right.  Nevertheless, we are heading in the right direction and the maps have greatly facilitated that.

Figure 3. Standard residuals plots for the expanded model, including an indicator for the Pilbara mining area and a term relating to population density.

Figure 4. Residuals from the expanded model, including an indicator for the Pilbara area and a term relating to population density, displayed for each SA2.

Interpretation

The resulting model explained a greater percentage of the variance in the data (adjusted R² of 89%, compared with 49% for the simpler model), and the estimated model equation was:

 "Median weekly rent" = 18.9 + 0.13 * "median weekly income" - 355 * "if Pilbara area" + 15.3 * log("PopDensity")

which suggests that an increase of $100 in median weekly income corresponds to an increase of $13 in median weekly rent in an SA2 with the same population density.  Doubling the population density corresponds to an increase of approximately $10 in median weekly rent, for SA2s with the same median income.  Based on this simplistic model, median weekly rents in the Pilbara were on average $355 lower than would be estimated for other SA2s with the same median income and population density (if any existed).  All terms were statistically significant.
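The arithmetic behind these interpretations can be checked with a short sketch.  It assumes that log() in the equation is the natural logarithm, which is consistent with the quoted effect of roughly $10 for a doubling of density (a base-10 logarithm would give about $4.60 instead).  The function name, the incomes and the density values are hypothetical, and the units of population density are not stated in the article.

```python
import math


def predicted_rent_expanded(income: float, pop_density: float,
                            in_pilbara: bool) -> float:
    """Median weekly rent from the expanded model quoted above.

    Assumes log() in the quoted equation is the natural logarithm.
    """
    indicator = 1 if in_pilbara else 0
    return (18.9 + 0.13 * income - 355 * indicator
            + 15.3 * math.log(pop_density))


# Effect of doubling population density at fixed income: 15.3 * ln(2).
doubling_effect = 15.3 * math.log(2)

# Effect of $100 more income at fixed density (hypothetical inputs).
income_effect = (predicted_rent_expanded(1100, 100, False)
                 - predicted_rent_expanded(1000, 100, False))

# Pilbara shift at fixed income and density (hypothetical inputs).
pilbara_effect = (predicted_rent_expanded(1000, 100, True)
                  - predicted_rent_expanded(1000, 100, False))
```

Because density enters through a logarithm, the doubling effect is the same whatever the starting density, which is why it can be quoted as a single figure.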

This is of course a simplistic example used to illustrate the value of mapping for model-diagnostic purposes.  Many other factors are at play, such as the proportion of dwellings that are owner- or buyer-occupied, the standard of housing, and remoteness, so the interpretation should not be carried too far, and decisions should not be based on this model!  As noted above, this model is not necessarily the end-point of an analysis, but the exercise illustrates the way in which maps can substantially aid the model-building process.

Take-home Message

The take-home message is that, for geographically based data, appropriate use of maps can reveal deficiencies in statistical models which might not have been apparent using only standard methods.  The same conclusions could have been reached in other ways, but displaying the residuals on a map quickly suggested some likely candidates for additional explanatory variables.  Mapping makes use of additional spatial information and enables us to examine spatial dependencies, which is impossible from the standard diagnostic residuals plots.  The outcome is better-fitting and more appropriate models.

Combining Data Analysis Australia’s expertise in mapping and statistical analysis creates a synergy that provides greater insights and better results than either field can yield in isolation.

Data Analysis Australia can be contacted by email at daa@daa.com.au or by phone on 08 9468 2533.


September 2015