Data Analysis Australia
STRATEGIC INFORMATION CONSULTANTS
Copyright © 2012
Linear Mixed Models, REML and the Utilities
Many frequently used statistical analyses come under the generic term "linear models"; these include analysis of variance and linear regression. Underlying the statistical analysis and hypothesis tests is a statistical model with parameters that can be estimated by linear combinations of the observed data. These linear estimators are easy to work with and have some particularly nice properties; hence their popularity.
All statistical tests require certain assumptions to be made about the data - such as its statistical distribution. It is important to be aware of these assumptions and, where possible, to check that they are at least approximately satisfied. Many techniques are somewhat robust to mild departures from these assumptions, such as the assumption of normally distributed data. However, one assumption that can be obscure and difficult to check is the assumption that data values are statistically independent - that is, that the random component of one data value is unrelated to the random component of any other. Violations of this assumption can occur, for example, in data on consumption of utilities by different customers over several billing periods, as described below. This article describes how linear mixed models provide a way to model such dependence.
Analysing Utilities Accounts
Over the years Data Analysis Australia has performed statistical analyses for many projects relating to utilities such as electricity, gas and water. Some of these have involved analysis of customers' accounts or usage patterns, relating them to variables such as types of households (e.g. houses or units) and weather data, and examining trends over time. This can help predict future requirements or inform where best to target programmes to reduce consumption. The data that forms the basis of such an analysis might include customers' consumption readings from a series of billing periods over a number of years.
Lack of Independence Among Data Values
Many statistical analyses require the data to be statistically independent. However, in the utilities bills example, amounts consumed in different periods by the same customer are likely to be more similar to each other than to amounts consumed by different customers. Some customers will have more people in the household, more appliances or larger gardens, spend more time at home, or generally be less economical in their use, and these characteristics are likely to persist from one billing period to the next. For these customers, even after allowing for different household types and other explanatory variables, usage will generally tend to be larger than average and somewhat similar from one period to the next, whereas if successive usage amounts were independent, we might expect them to be above average or below average with equal probability. If we perform a statistical analysis ignoring this correlation between consumption in different periods by the same customer, our results can be misleading.
Standard techniques like analysis of variance and linear regression (collectively falling into the broad class called linear models) typically assume independence between data values. We estimate effects such as the difference between household types and relationships with seasonal weather patterns, specific weather events and many other explanatory variables. Some of these explanatory variables will be the target of the study, while others are nuisance variables which nonetheless must be accounted for to obtain accurate estimates of the effects of interest.
For example, the focus of a study might be on identifying and quantifying the usage for different household types, but we don't want different weather conditions to bias the results. If some customers or some periods experienced more extreme weather than others, we want to estimate typical usage patterns after allowing for or excluding the confounding effect of weather - we might want to estimate what their usage would have been if all had experienced the same weather conditions. The weather variables are called nuisance variables here. In other studies we might be particularly interested in the relationships between weather patterns and consumption patterns, and the household types might be a nuisance variable. In either case we need to allow for other variables that influence the response variable of interest (here consumption). Ignoring these nuisance variables weakens the analysis and makes it harder to detect or estimate the effects of interest.
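As a small illustration of why nuisance variables must be allowed for (a sketch with invented variable names and numbers, not real consumption data), the following Python snippet simulates usage that depends on both household type and temperature, with houses tending to experience hotter weather. Fitting an ordinary regression with and without the weather covariate shows how ignoring the nuisance variable biases the household-type estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: houses (1) vs units (0), with houses tending to
# experience hotter weather, so weather confounds the household effect.
house = rng.integers(0, 2, n)
temperature = 20 + 5 * house + rng.normal(0, 3, n)
usage = 10 + 2.0 * house + 0.5 * temperature + rng.normal(0, 2, n)
df = pd.DataFrame({"usage": usage, "house": house, "temperature": temperature})

raw = smf.ols("usage ~ house", df).fit()                     # ignores weather
adjusted = smf.ols("usage ~ house + temperature", df).fit()  # allows for it

print(f"ignoring weather:      house effect = {raw.params['house']:.2f}")
print(f"adjusting for weather: house effect = {adjusted.params['house']:.2f}")
```

With weather included, the estimated household effect is close to the simulated truth of 2.0; without it, the estimate also absorbs the weather difference between household types.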
In linear models, these effects, both target and nuisance effects, are called fixed effects. They have some notional fixed true values that we do not know but which we estimate from the data, based on a statistical model.
Even after allowing for all the fixed effects, there will always be some random component - in statistical jargon we often call these random deviations "errors", although this does not imply mistakes in the data. Some of this variation might be due to measurement error, but in general it is simply natural variability that occurs everywhere. This inescapable background variation is nuisance variation in that it makes it harder to see clearly the patterns of interest. Just as we wish to allow for fixed effects parameters that are not of central interest, we also wish our statistical model to explain as much of this nuisance random variation as possible, attributing it to known causes. This reduces the unexplained variation and increases the sensitivity of the study to detect or estimate the effects of most interest, which can result in stronger conclusions or can make conclusions possible from a smaller sample size, saving resources.
We have noted in the utilities example that these random deviations (errors) for different periods for the same customer are likely to be positively correlated (they tend to be large together or they tend to be small together, i.e. they tend to vary in the same direction). In this situation, each new observation does not give as much information as a new independent observation would give. If we ignore this fact in our analysis, we will overestimate how much "information" we really have, and might attribute greater strength of evidence to some effect than is really warranted.
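The overstatement of information can be quantified in a short simulation (again a sketch with invented numbers). Here each customer has a persistent personal effect, so bills from the same customer are positively correlated; the standard error of the overall mean computed under an independence assumption is noticeably smaller than the true one:

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers, n_periods = 200, 6
sd_customer, sd_noise = 1.0, 1.0  # illustrative variance components

# Analytic standard error of the overall mean under each assumption.
# Naive: treats all 1200 readings as independent.
naive_var = (sd_customer**2 + sd_noise**2) / (n_customers * n_periods)
# True: the customer effect is only "sampled" 200 times, not 1200.
true_var = sd_customer**2 / n_customers + sd_noise**2 / (n_customers * n_periods)

# Check against simulation: many replicates of the whole data set.
means = []
for _ in range(2000):
    customer_effect = rng.normal(0, sd_customer, n_customers)
    data = customer_effect[:, None] + rng.normal(0, sd_noise, (n_customers, n_periods))
    means.append(data.mean())
empirical_se = np.std(means)

print(f"naive SE (independence assumed): {np.sqrt(naive_var):.4f}")
print(f"true SE  (correlation allowed):  {np.sqrt(true_var):.4f}")
print(f"empirical SE from simulation:    {empirical_se:.4f}")
```

The empirical standard error agrees with the correlation-aware formula and is well above the naive one, so confidence intervals based on the independence assumption would be too narrow.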
Random Effects, Linear Mixed Models and REML
One way to take account of correlations such as these is to incorporate random terms into our statistical models, thus turning linear fixed effects models into linear mixed effects models, models containing both fixed and random effects. The key differences between fixed and random effects are as follows:
While household type (house or unit) is intrinsically of interest - we are interested specifically in houses compared with units, and hence in the estimated mean values for each household type - the random effects differ: we could repeat the study using a different set of customers, who are simply representative of a much wider population. It is the variance of these random effects in the wider population that is the key parameter to estimate.
It is possible to fit a model containing a large number of random effects but we need only estimate a single variance parameter to summarise all those effects. This is distinct from fixed effects where we do estimate a separate effect for each level of any fixed effects factor.
A preferred estimation method for fitting linear mixed models is Residual Maximum Likelihood (REML), sometimes called Restricted Maximum Likelihood. Consequently the analysis of linear mixed models is often loosely referred to as REML analysis. REML estimation allows the fixed effects and parameters of the variance and covariance structure of the random effects to be estimated appropriately from separate parts of the one data set. This is in contrast to Maximum Likelihood (ML) which makes no allowance for the fact that it uses the same data to estimate both the fixed effects and the variance parameters, and consequently gives biased results. This becomes more important when models include more random effects.
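As a sketch of how such a model might be fitted in practice (using Python's statsmodels purely for illustration; the variable names and simulated values are invented), a random intercept for each customer can be specified and estimated by REML, which is the default method, with ML available for comparison:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_customers, n_periods = 100, 4

# Simulated billing data: each customer has a household type, a persistent
# personal effect (variance 4.0), and period-to-period noise (variance 1.0).
customers = np.repeat(np.arange(n_customers), n_periods)
house = np.repeat(rng.integers(0, 2, n_customers), n_periods)
cust_effect = np.repeat(rng.normal(0, 2.0, n_customers), n_periods)
usage = 10 + 3.0 * house + cust_effect + rng.normal(0, 1.0, n_customers * n_periods)
df = pd.DataFrame({"usage": usage, "house": house, "customer": customers})

# Random intercept for each customer; groups defines the correlated clusters.
model = smf.mixedlm("usage ~ house", df, groups=df["customer"])
reml_fit = model.fit(reml=True)   # REML (the default)
ml_fit = model.fit(reml=False)    # ordinary ML, for comparison

print(f"REML house effect (simulated truth 3.0):    {reml_fit.fe_params['house']:.2f}")
print(f"REML customer variance (simulated truth 4.0): {reml_fit.cov_re.iloc[0, 0]:.2f}")
print(f"ML customer variance:                         {ml_fit.cov_re.iloc[0, 0]:.2f}")
```

With only two fixed-effect parameters and 100 customers the ML and REML variance estimates differ little here; the bias of ML grows as models acquire more fixed effects relative to the data.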
REML estimation requires sophisticated iterative methods and specialised software, and has become increasingly feasible over recent years as computing capabilities have improved dramatically.
REML analysis is particularly well-suited to data involving groups with different numbers of subjects or experimental units, such as in animal and medical experiments, or any study that might be analysed by analysis of variance except that it is unbalanced. Care is required with the order of specifying terms in the model as well as finding one's way through the myriad of options available, comparing models, and interpreting the output.
By incorporating a random effect for each customer into our utilities bills example, we model positive correlations between bills for the same customer, but retain independence between different customers. However, the form of this correlation is such that the correlation between one bill and another several periods later is just as strong as the correlation between successive bills. While this is a considerable improvement on assuming no correlation, a still better approximation to reality might be available: we can fit more sophisticated linear mixed models that allow the correlation between bills for the same customer to change as the time gap increases. These models for the correlation structure have elements of time series analysis, which is natural here.
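The constant-correlation property of a single random intercept (often called compound symmetry) can be verified directly in a simulation (illustrative variances only). The implied within-customer correlation is the customer variance divided by the total variance, whatever the gap between billing periods:

```python
import numpy as np

rng = np.random.default_rng(2)
var_c, var_e = 4.0, 1.0          # illustrative customer and residual variances
rho = var_c / (var_c + var_e)    # implied within-customer correlation (0.8)

n_customers, n_periods = 50000, 6
effects = rng.normal(0, np.sqrt(var_c), (n_customers, 1))
data = effects + rng.normal(0, np.sqrt(var_e), (n_customers, n_periods))

# Under a single random intercept, the correlation is the same for any pair
# of periods, however far apart:
adjacent = np.corrcoef(data[:, 0], data[:, 1])[0, 1]
far_apart = np.corrcoef(data[:, 0], data[:, 5])[0, 1]
print(f"implied rho: {rho:.3f}, adjacent periods: {adjacent:.3f}, "
      f"five periods apart: {far_apart:.3f}")
```

Both empirical correlations match the implied value; a time-series-style structure such as an autoregressive model would instead let the correlation decay as the gap grows.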
Other enhancements to statistical models for correlations between data include allowing the variance to differ between different parts of the data set, and a choice of a wide and flexible range of correlation models. We can also incorporate additional random effects. For example, customers from the same region might be more similar than customers from different regions - an additional random effect for regions can make allowance for this. We can also perform statistical tests to determine whether such variation is statistically significant or whether simpler models are sufficient for a particular data set.
We Can Help You!
A fundamental aspect of many statistical analysis methods is to remove unwanted or "nuisance" variability that simply clouds the main picture and obscures the effects of real interest. Statistical methods and models that allow a more realistic representation of that variability provide greater opportunity for the underlying picture to be revealed.
Variability can never be eliminated, but by better explaining the natural variation in data with more flexible and appropriate models we reduce the unexplained variation remaining in the data and this can give greater power to discover the patterns that are our primary interest.
Data Analysis Australia has expertise in recognising where linear mixed models should be used, based on the type of data and the precision in the analysis required by our client, and has the expertise to apply these wide-ranging techniques. Specialist statistical software is required, together with the knowledge and awareness to use these sophisticated techniques appropriately. The result will be more realistic, better-fitting statistical models, leading to greater confidence in the conclusions and inferences made from the data.