Statisticians have an in-depth understanding of the properties of different data structures and have developed a suite of tools that enable effective and powerful analyses when the correct tool is applied. When the results of the analysis are this important, the value of a comprehensive understanding of what each model is doing, what assumptions it uses and how well it fits the subtleties of each dataset cannot be overstated. Understanding the properties of the underlying data is key to selecting the correct method of analysis, and this is one of Data Analysis Australia's strengths.

Datasets of this nature, where the outcome can be only one of two values, would naturally be expected to follow the **binomial distribution**, the properties of which a statistician understands well, and hence models specifically based on this distribution were the natural starting point. In some cases, a different distribution, the **Poisson distribution**, can be used as an approximation to the binomial distribution. This approximation is often used for computational convenience. However, computation was not an issue for this analysis, and hence there was no reason to choose the approximation over the exact distribution.
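To illustrate when the Poisson approximation is and is not close to the exact binomial distribution, the following minimal sketch compares the two probability functions. The sample size and the two outcome probabilities are hypothetical values chosen for illustration, not figures from this analysis.

```python
from math import comb, exp, lgamma, log


def binomial_pmf(k, n, p):
    """Exact probability of k positive outcomes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability of k events, computed in log space to avoid overflow."""
    return exp(-lam + k * log(lam) - lgamma(k + 1))


def max_abs_difference(n, p):
    """Largest pointwise gap between Binomial(n, p) and Poisson(n * p)."""
    lam = n * p
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(n + 1)
    )


# Hypothetical settings: a rare outcome versus a common one, same sample size.
rare = max_abs_difference(n=100, p=0.01)
common = max_abs_difference(n=100, p=0.30)

print(f"max difference, p = 0.01: {rare:.4f}")
print(f"max difference, p = 0.30: {common:.4f}")
```

The gap is small when the outcome is rare and grows as the probability increases, which is why the approximation is a convenience rather than a general substitute for the exact distribution.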

The general idea behind a binomial model is to model the probability of a positive outcome as a function of factors that could influence it. The year of collection and an indicator for the current year were included in the models for this analysis to determine whether infection rates have changed over time, or whether this year's infection rates are "different" from those of previous years. Well-understood ideas of "statistical significance" are used to determine which terms should be retained in the model.
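One way such a model could be written down is sketched here. The logit link and all symbols are assumptions made for illustration, since the text does not state the exact specification used: $Y_i$ is the number of positive outcomes among the $n_i$ samples in group $i$, $p_i$ is the underlying probability of a positive outcome, $\mathrm{year}_i$ is the year of collection and $\mathrm{current}_i$ equals 1 for the current year and 0 otherwise.

$$
Y_i \sim \mathrm{Binomial}(n_i,\, p_i), \qquad
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1\,\mathrm{year}_i + \beta_2\,\mathrm{current}_i
$$

Under this form, a statistically significant $\beta_1$ would point to a trend in infection rates over time, while a significant $\beta_2$ would indicate that the current year departs from that pattern.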

A model is considered to fit a dataset well when there is little difference between the predicted and actual values. A large difference indicates a poor model fit. Collecting more data or including further terms in the model should then be considered, and tested to see whether they improve the fit. For example, incorporating an extra term in the binomial model to differentiate between the collection centres in a dataset has helped explain the variation between cases and improved the model fit in some datasets.

Sometimes, however, it is simply that the data itself displays greater variability than would be expected given the underlying distribution, a situation usually termed **overdispersion**. Overdispersion can lead to an understatement of the standard errors associated with the model estimates, which in turn can lead to erroneous conclusions being made about the effect of some variables.

A **quasi-binomial model** can be used when overdispersion is present. This model is similar to the binomial model, but the standard errors of the model estimates are adjusted to take overdispersion into account. Data Analysis Australia uses model deviances (standard statistical outputs from models of this type) to determine whether overdispersion exists, and hence whether a quasi-binomial model is needed or whether the exact binomial model can be used for each dataset to be analysed.
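A deviance-based check of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the counts are made up, the fitted probabilities come from a hypothetical intercept-only model, and the function names are invented for this example. The ratio of residual deviance to residual degrees of freedom should be near 1 for a well-fitting binomial model, and a quasi-binomial adjustment multiplies each standard error by the square root of the estimated dispersion. Statistical software often estimates the dispersion from the Pearson statistic instead, and the deviance-based ratio is used here only because the text refers to deviances.

```python
from math import log, sqrt


def _term(observed, expected):
    """observed * log(observed / expected), with 0 * log(0) taken as 0."""
    return observed * log(observed / expected) if observed > 0 else 0.0


def binomial_deviance(successes, trials, fitted_probs):
    """Residual deviance of a binomial model.

    Assumes every fitted probability lies strictly between 0 and 1.
    """
    total = 0.0
    for y, n, p in zip(successes, trials, fitted_probs):
        mu = n * p  # fitted number of positive outcomes
        total += _term(y, mu) + _term(n - y, n - mu)
    return 2.0 * total


def dispersion_estimate(successes, trials, fitted_probs, n_params):
    """Residual deviance divided by residual degrees of freedom."""
    residual_df = len(successes) - n_params
    return binomial_deviance(successes, trials, fitted_probs) / residual_df


def adjusted_standard_errors(std_errors, dispersion):
    """Quasi-binomial style adjustment: scale each SE by sqrt(dispersion)."""
    scale = sqrt(dispersion)
    return [se * scale for se in std_errors]


# Hypothetical data: positives out of 50 samples at six collection centres.
successes = [2, 9, 1, 12, 3, 10]
trials = [50] * 6

# Hypothetical intercept-only fit: one common probability for all centres.
pooled = sum(successes) / sum(trials)
fitted = [pooled] * 6

dispersion = dispersion_estimate(successes, trials, fitted, n_params=1)
print(f"estimated dispersion: {dispersion:.2f}")
```

A value well above 1, as in this made-up example, suggests more variability than the binomial distribution allows, so unadjusted standard errors would be too small.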

## The Result

Data Analysis Australia recognised that while the underlying data structure was common to each dataset, individual characteristics varied. A decision tree was created so that the most appropriate model could be chosen for each dataset submitted by the client for analysis. This systematic approach provided a quick turnaround from the receipt of data to reporting of the results. When dealing with safety issues such as disease control, timeliness of response is critical.

*February 2010*