# Linear Mixed Models, REML and the Utilities

Many frequently used statistical analyses come under the generic term "linear models".  These include analysis of variance and linear regression.  Underlying the statistical analysis and hypothesis tests is a statistical model whose parameters can be estimated by linear combinations of the observed data.  These linear estimators are easy to work with and have some particularly nice properties; hence their popularity.

All statistical tests require certain assumptions to be made about the data - such as its statistical distribution.  It is important to know and be aware of these assumptions, and where possible check that they are at least approximately satisfied.  Many techniques are somewhat robust to mild departures from these assumptions, such as the assumption of normally distributed data.  However, one assumption that can be obscure and difficult to check is the assumption that data values are statistically independent.  This means that the random component of one data value is unrelated to the random component of any other.  Violations of this assumption can occur, for example, in data on consumption of utilities by different customers over several billing periods, as described below.  This article describes how linear mixed models provide a way to model such dependence.

### Analysing Utilities Accounts

Over the years Data Analysis Australia has performed statistical analyses for many projects relating to utilities such as electricity, gas and water.  Some of these have involved analysis of customers' accounts or usage patterns, relating them to variables such as types of households (e.g. houses or units) and weather data, and examining trends over time.  This can help predict future requirements or inform where best to target programmes to reduce consumption.  The data that forms the basis of such an analysis might include customers' consumption readings from a series of billing periods over a number of years.

### Lack of Independence Among Data Values

Many statistical analyses require the data to be statistically independent.  However, in the utilities bills example, amounts consumed in different periods by the same customer are likely to be more similar to each other than to amounts consumed by different customers.  Some customers will have more people in the household, more appliances or larger gardens, spend more time at home, or generally be less economical in their use, and these characteristics are likely to persist from one billing period to the next.  For these customers, even after allowing for different household types and other explanatory variables, usage will tend to be above average in period after period.  If successive usage amounts were independent, by contrast, we would expect each one to be above or below average with equal probability, regardless of the previous period.  If we perform a statistical analysis ignoring this correlation between consumption in different periods by the same customer, our results can be misleading.

### Linear Models

Standard techniques like analysis of variance and linear regression (collectively falling into the broad class called linear models) typically assume independence between data values.  We estimate effects such as the difference between household types and relationships with seasonal weather patterns or specific weather events and many other explanatory variables.  Some of these explanatory variables will be the target of the study, while others are nuisance variables which nonetheless must be accounted for to obtain accurate estimates of the effects of interest.

For example, the focus of a study might be on identifying and quantifying the usage for different household types, but we don't want different weather conditions to bias the results.  If some customers or some periods experienced more extreme weather than others, we want to estimate typical usage patterns after allowing for or excluding the confounding effect of weather - we might want to estimate what their usage would have been if all had experienced the same weather conditions.  The weather variables are called nuisance variables here. In other studies we might be particularly interested in the relationships between weather patterns and consumption patterns, and the household types might be a nuisance variable.  In either case we need to allow for other variables that influence the response variable of interest (here consumption).  Ignoring these nuisance variables weakens the analysis and makes it harder to detect or estimate the effects of interest.

In linear models, these effects, both target and nuisance effects, are called fixed effects.  They have some notional fixed true values that we do not know but which we estimate from the data, based on a statistical model.
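
In symbols, a fixed effects linear model for the utilities example might be written as below.  This is a sketch only: the notation is introduced here for illustration, with $y_i$ standing for the $i$-th consumption reading, $x_{i1}, \dots, x_{ip}$ for its explanatory variables (household type, weather and so on), $\beta_0, \dots, \beta_p$ for the unknown fixed effects, and $e_i$ for the random deviation left over once the fixed effects have been allowed for.

$$
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i,
\qquad e_i \sim N(0, \sigma^2) \ \text{independently for each } i .
$$

The independence statement on the right is the assumption that the rest of this article relaxes.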

### Random "Errors"

Even after allowing for all the fixed effects, there will always be some random component - in statistical jargon we often call these random deviations "errors", although this does not imply mistakes in the data.  Some of this variation might be due to measurement error, but in general it is simply natural variability that occurs everywhere.  This inescapable background variation is nuisance variation in that it makes it harder to see clearly the patterns of interest, and, just as we wish to allow for fixed effects parameters that are not of central interest, we also wish to allow our statistical model to explain as much of this nuisance random variation as possible, attributing it to known causes.  This reduces the unexplained variation and increases the sensitivity of the study to detect or estimate the effects of most interest.  This can result in stronger conclusions or can make conclusions possible based on a smaller sample size, saving resources.

We have noted in the utilities example that these random deviations (errors) for different periods for the same customer are likely to be positively correlated (they tend to be large together or they tend to be small together, i.e. they tend to vary in the same direction).  In this situation, each new observation does not give as much information as a new independent observation would give.  If we ignore this fact in our analysis, we will overestimate how much "information" we really have, and might attribute greater strength of evidence to some effect than is really warranted.
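
The loss of information can be given a rough number.  The sketch below assumes that every pair of bills from the same customer has the same correlation `rho`, that every customer has the same number of bills, and that the quantity being estimated is an overall mean; under those assumptions the variance of the mean is inflated by the factor 1 + (m - 1) × rho, where m is the number of bills per customer.  The function names and the example figures are hypothetical and chosen only for illustration.

```python
def variance_inflation(bills_per_customer: int, rho: float) -> float:
    """Factor by which equal within-customer correlation rho inflates
    the variance of an overall mean, relative to independent data."""
    return 1.0 + (bills_per_customer - 1) * rho


def effective_sample_size(n_customers: int, bills_per_customer: int, rho: float) -> float:
    """Number of independent observations carrying the same information
    about the overall mean as the correlated bills do."""
    total_bills = n_customers * bills_per_customer
    return total_bills / variance_inflation(bills_per_customer, rho)


# Hypothetical study: 100 customers, 12 bills each, correlation 0.5.
n_eff = effective_sample_size(100, 12, 0.5)
print(f"1200 correlated bills are worth about {n_eff:.0f} independent ones")
```

With no correlation the function returns the full count of bills, and with perfect correlation it returns the number of customers, which matches the intuition that repeated bills from the same customer then add nothing new.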

### Random Effects, Linear Mixed Models and REML

One way to take account of correlations such as these is to incorporate random terms into our statistical models, thus turning linear fixed effects models into linear mixed effects models, models containing both fixed and random effects.  The key differences between fixed and random effects are as follows:

• For fixed effects, e.g. household type, we effectively estimate a parameter for the mean for each household type.  For example, units or apartments might use, on average, 10 units less than is used by houses.
• For random effects, e.g. individual customers, we do not directly estimate an effect for each customer, but instead estimate a single variance parameter representing the variability between customers.
• We assume the effects (differences between customers) arise from a random distribution of possible values, and if we repeated the study we would expect different effects.
• We assume these random effects follow a normal distribution with some fixed but unknown variance.
• We are not so interested in the actual random effects for each individual member of the population, but instead we estimate the value of that unknown variance for these random effects among the population.
• Values of the random effects can be "predicted" (the random effects analogue to estimating fixed effects or means) once the fixed effects and variance parameters have been estimated.
• More sophisticated models can have structured variances and covariances between the random effects and require additional variance and covariance parameters to be estimated.

The household type (house or unit) is intrinsically of interest: we want to compare houses specifically with units, and hence are interested in the estimates of the actual mean values for each household type.  The random effects differ in that we could repeat the study using a different set of customers, since the customers in the study are simply representative of a much wider population.  It is the variance of these random effects in the wider population that is the key parameter to estimate.
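
The simplest model of this kind, with one random effect per customer, can be sketched in symbols as follows.  The notation is an assumption made for illustration: $y_{ij}$ is the consumption of customer $i$ in billing period $j$, $\mathbf{x}_{ij}$ holds the explanatory variables, $\boldsymbol{\beta}$ the fixed effects, $u_i$ the random effect for customer $i$, and $e_{ij}$ the remaining random deviation.

$$
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + e_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2),
$$

with all the $u_i$ and $e_{ij}$ mutually independent.  Two bills from the same customer share the same $u_i$, so under this model

$$
\operatorname{Var}(y_{ij}) = \sigma_u^2 + \sigma_e^2,
\qquad
\operatorname{Corr}(y_{ij}, y_{ij'}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} \quad (j \neq j'),
$$

while bills from different customers remain uncorrelated.  The between-customer variance $\sigma_u^2$ is the key parameter referred to above.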

It is possible to fit a model containing a large number of random effects but we need only estimate a single variance parameter to summarise all those effects.  This is distinct from fixed effects where we do estimate a separate effect for each level of any fixed effects factor.
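
To make this concrete, the sketch below estimates the two variance parameters of a model with one random effect per customer and no explanatory variables other than an overall mean.  It uses the classical analysis of variance (method of moments) estimators rather than an iterative REML routine; for balanced data such as this the two are generally understood to agree whenever the between-customer estimate is not negative.  The function name and the usage figures are hypothetical.

```python
def variance_components(groups):
    """Estimate (between-customer variance, within-customer variance)
    from balanced data: one equal-length list of readings per customer."""
    n = len(groups)        # number of customers
    m = len(groups[0])     # readings per customer
    if n < 2 or m < 2 or any(len(g) != m for g in groups):
        raise ValueError("need at least 2 customers with equal numbers (>= 2) of readings")

    group_means = [sum(g) / m for g in groups]
    grand_mean = sum(group_means) / n

    # Mean squares from the one-way analysis of variance table.
    ms_between = m * sum((gm - grand_mean) ** 2 for gm in group_means) / (n - 1)
    ms_within = sum(
        (v - gm) ** 2 for g, gm in zip(groups, group_means) for v in g
    ) / (n * (m - 1))

    sigma2_within = ms_within
    # Truncate at zero: a variance cannot be negative.
    sigma2_between = max((ms_between - ms_within) / m, 0.0)
    return sigma2_between, sigma2_within


# Hypothetical usage for 3 customers over 4 billing periods.
usage = [
    [10.0, 12.0, 11.0, 13.0],
    [20.0, 19.0, 21.0, 22.0],
    [15.0, 14.0, 16.0, 15.0],
]
sigma2_between, sigma2_within = variance_components(usage)
print(sigma2_between, sigma2_within)
```

However many customers are added to `usage`, the function still returns just two numbers: one variance describing how customers differ from each other and one describing variation within a customer.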

A preferred estimation method for fitting linear mixed models is Residual Maximum Likelihood (REML), sometimes called Restricted Maximum Likelihood.  Consequently the analysis of linear mixed models is often loosely referred to as REML analysis.  REML estimation allows the fixed effects and the parameters of the variance and covariance structure of the random effects to be estimated appropriately from separate parts of the one data set.  This is in contrast to Maximum Likelihood (ML), which makes no allowance for the fact that it uses the same data to estimate both the fixed effects and the variance parameters, and consequently gives biased estimates of the variance parameters, typically too small.  This becomes more important when models include more random effects.
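
The difference between ML and REML is easiest to see in the simplest possible model: independent, normally distributed data with a single fixed effect, the overall mean.  In that special case ML divides the sum of squared deviations by the number of observations, while REML divides by one fewer, allowing for the mean having been estimated from the same data.  The sketch below illustrates only this special case, not a general REML algorithm, and the data values are hypothetical.

```python
def ml_variance(values):
    """ML estimate of the variance when only an overall mean is fitted."""
    n = len(values)
    centre = sum(values) / n
    return sum((v - centre) ** 2 for v in values) / n


def reml_variance(values):
    """REML estimate for the same model: one degree of freedom is
    set aside for the estimated mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values")
    centre = sum(values) / n
    return sum((v - centre) ** 2 for v in values) / (n - 1)


readings = [4.0, 7.0, 9.0, 12.0]  # hypothetical consumption readings
print("ML:  ", ml_variance(readings))
print("REML:", reml_variance(readings))
```

The ML figure is always the smaller of the two, and the gap is most noticeable when there are few observations.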

REML estimation requires sophisticated iterative methods and specialised software, and has become increasingly feasible over recent years as computing capabilities have improved dramatically.

REML analysis is particularly well-suited to data involving groups with different numbers of subjects or experimental units, such as in animal and medical experiments, or any study that might be analysed by analysis of variance except that it is unbalanced.  Care is required with the order of specifying terms in the model as well as finding one's way through the myriad of options available, comparing models, and interpreting the output.

### Further Enhancements

By incorporating a random effect for each customer into our utilities bills example, we model positive correlations between bills for the same customer, but retain independence between different customers.  However, the form of this correlation is such that the correlation between one bill and another several periods later is just as strong as the correlation between successive bills.  While this is a considerable improvement on assuming no correlation, a still better approximation to reality might be available: we can fit more sophisticated linear mixed models that allow the correlation between bills for the same customer to change as the time-gap increases.  These models for the correlation structure have elements of time series analysis, which is natural here.
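
The two kinds of correlation structure can be compared directly by writing out the correlation matrix for one customer's bills.  In the sketch below the first function gives the equal-correlation pattern implied by a single random effect per customer; the second gives a first-order autoregressive pattern, in which the correlation is `phi` raised to the power of the time-gap.  The autoregressive form is only one common choice from time series analysis, used here as an assumption for illustration, and the parameter value 0.6 is hypothetical.

```python
def uniform_correlation(n_periods: int, rho: float):
    """Every pair of different periods has the same correlation rho."""
    return [
        [1.0 if i == j else rho for j in range(n_periods)]
        for i in range(n_periods)
    ]


def ar1_correlation(n_periods: int, phi: float):
    """Correlation decays with the time-gap: phi ** |i - j|."""
    return [
        [phi ** abs(i - j) for j in range(n_periods)]
        for i in range(n_periods)
    ]


uniform = uniform_correlation(4, 0.6)
decaying = ar1_correlation(4, 0.6)

# Correlation between the first bill and each later bill.
print("uniform: ", uniform[0][1:])
print("decaying:", [round(r, 3) for r in decaying[0][1:]])
```

Under the uniform pattern the first row is constant, whereas under the autoregressive pattern it shrinks steadily as the bills get further apart in time.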

Other enhancements to statistical models for correlations between data include allowing the variance to differ between different parts of the data set, and a choice of a wide and flexible range of correlation models.  We can also incorporate additional random effects.  For example, customers from the same region might be more similar than customers from different regions - an additional random effect for regions can make allowance for this.  We can also perform statistical tests to determine whether such variation is statistically significant or whether simpler models are sufficient for a particular data set.
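
One common way to carry out such a test is a likelihood ratio test comparing the model with and without the extra variance parameter.  The sketch below assumes that both models have the same fixed effects, that both were fitted by REML, and that exactly one variance parameter is being tested.  Because a variance cannot be negative, the usual chi-squared p-value on one degree of freedom is conventionally halved in this situation.  The function name and the log-likelihood values are hypothetical; in practice they would come from the output of the fitting software.

```python
import math


def variance_component_lrt(loglik_without: float, loglik_with: float):
    """Likelihood ratio test for one extra variance parameter.

    Returns (test statistic, p-value), using the halved chi-squared(1)
    tail probability appropriate when the null value (zero variance)
    lies on the boundary of the parameter space.
    """
    statistic = max(2.0 * (loglik_with - loglik_without), 0.0)
    if statistic == 0.0:
        return statistic, 1.0
    # Upper tail of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    chi2_tail = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, 0.5 * chi2_tail


# Hypothetical REML log-likelihoods without and with a region effect.
statistic, p_value = variance_component_lrt(-520.00, -518.08)
print(f"statistic = {statistic:.2f}, p-value = {p_value:.3f}")
```

A small p-value suggests the additional random effect is worth keeping; otherwise the simpler model may be sufficient for the data set at hand.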