Data Analysis AustraliaSTRATEGIC INFORMATION CONSULTANTS
|
|||
Copyright © 2012
Data Analysis Australia |
Significance of Rare Cases in Large Data Sets
The ProblemThe data collected came from many countries in the Australasian region, and as far away as Germany and the USA. The detection of changes or trends in prevailing infection rates in collection regions is of critical importance. This may sound straight forward however the nature of the data where there exists large numbers of cases and very few positive infection rates makes for some interesting modelling and analysis. Data Analysis Australia was asked to review current practices and provide a methodology for analysing the data. Datasets for each country to be analysed, consisted of a large number of independent cases, each of which could record one of two outcomes - in this case a positive or negative result. The number of positives was very low relative to the number of cases. The data analysis had to be statistically sound, demonstrate best practice and follow a well-documented, repeatable process, allowing external review by regulatory bodies. The Data Analysis Australia ApproachStatisticians have an in-depth understanding of the properties of different data structures and have developed a suite of tools, enabling effective and powerful analyses to be undertaken when the correct tool is applied. When the results of the analysis are this important, a comprehensive understanding of what each model is doing, what assumptions it uses and how well the model fits the subtleties of each dataset cannot be underestimated. Understanding the properties of the underlying data is key to selecting the correct method of analysis and this is one of Data Analysis Australia's strengths. Datasets of this nature, where the outcome can be only one of two values, would naturally be expected to follow the bin The general idea behind a binomial model is to model the probability of the outcome being positive based on potential factors that could influence this probability. The year of collection and an indicator for the current year were included in the models for this analysis to determine whether there has been a change in infection rates over time or whether the infection rates this year are "different" to previous years. Well understood ideas of "statistical significance" are used to determine which model terms should be retained in the model. A model is considered to fit a dataset well when there is little difference between the predicted and actual values. A large difference is an indication of a poor model fit. Collecting more data, or including further terms in the model should then be considered and tested to see if these improve the model fit. For example, incorporating an extra term in the binomial model to differentiate between the different collection centres in a dataset has helped explain the variation between cases and improved the model fit in some datasets. Sometimes, however, it is simply that the data itself displays greater variability than would be expected given the underlying distribution, a situation usually termed overdispersion. Overdispersion can lead to an understatement of the standard errors associated with the model estimates, which in turn can lead to erroneous conclusions being made about the effect of some variables. A quasi-Binomial model can be used when overdispersion is present. This model is similar to the Binomial model, but with standard errors of the model estimates adjusted to take overdispersion into account. Data Analysis Australia uses model deviances (standard statistical outputs from models of this type) to determine whether overdispersion exists and hence whether a quasi-Binomial model is needed or whether the exact Binomial model can be implemented for each dataset to be analysed. The ResultData Analysis Australia recognised that while the underlying data structure was common to each dataset, individual characteristics varied. A decision tree was created so that the most appropriate model could be chosen for each dataset submitted by the client for analysis. This systematic approach provided a quick turnaround from the receipt of data to reporting of the results. When dealing with safety issues such as disease control, timeliness of response is critical. |
||