Data Mining: Statistics or Something Else?

Statisticians have very mixed views on data mining.

At one stage it was a term of derision. To mine data suggested digging into data so much that something was bound to be found, without regard to whether it was there purely by chance. The implication was that the mining proceeded until what was found fitted preconceived ideas.

Formal statistics had been developed precisely to overcome such dangers and every undergraduate student of statistics was taught to define hypotheses to be tested before looking at the data.

At the same time, much science has benefited by chance discoveries, observations on the data that prompt further questions. A favourite example in biology was the discovery by Fleming of mould contaminating some bacterial growths and apparently killing the bacteria. This was not a "hypothesis" posed before the experiment, but purely an observation that spurred the question "what might be causing this?". The answer of course was penicillin.

Modern statistics is attempting to find a path between two extremes – the traditional path that is too rigid and a new path that is in danger of losing rigour. The impetus has come from areas where the sheer volume of data means that there is less need to be statistically efficient and a greater need to be time efficient. ^[1]

Traditional Statistical Approaches

In the early years of statistics, probability theory was well developed and this gave great understanding of what randomness in data might be like. If you had a correct probability model then it was possible to predict what data might be reasonably observed. But there was no consensus on how the inverse problem should be handled – given the data what can be said about the model. Decisions were being made based on data but this was not always an objective process.

Into this void came a mathematical theory of decision making, developed by Jerzy Neyman and Egon Pearson in 1928. As mathematicians often do, they considered a simplified situation where they had a well defined possible model and the decision to be made was simply whether this model could possibly be right ^[2]. Since it was already understood how to calculate probabilities from the model, this was a reasonable approach, but it assumed that the model had not been specially chosen to be consistent (or inconsistent) with the data. Ideally the model should have been hypothesised before even seeing the data.

In some areas where critical decisions are being made and it is hard not to be influenced by the data, this style of inference is taken very seriously. For example, many advances in medicine have come from "double blind" clinical trials, where patients are allocated to receive particular treatments at random, and neither the patients nor the doctors know who is allocated to each treatment. Only after the data is gathered and the statistical analysis defined is the data "unblinded" and, even then, only the prespecified analysis method is regarded as legitimate. Authorities such as the Food and Drug Administration set strict standards for both the trials and the statistical analysis.

In this context, what do you do when something unexpected is encountered in the data? The formal answer is to define a question based upon what has been observed and then gather some fresh data to test this question.

This approach may work well in a laboratory and if there is plenty of time, but it can be frustrating in many real situations. It may mean that the original data set is barely used and that many data collection cycles are required before a satisfactory understanding of the data is reached.

But in many contexts this formal approach is not an option. The data may have been collected for a different purpose. Decisions may have to be made in limited time. The data may initially not be well understood. Exploration is required to define the limits of how the data can be helpful. The data may be only part of the information available.

In these situations, the formality can stifle the application of statistics. Modern statistical methods have been developed to allow greater flexibility while maintaining rigour.

Statistical Learning

One modern approach is to have algorithms that automatically learn about structures in the data. In the 1980s this was often presented in the context of artificial intelligence, machine learning, expert systems and other constructs from the computer science community ^[3]. These approaches had as their ideal a program that would act like a human statistician in making judgements, but do so in an automated, fast and objective manner. In retrospect, this must be seen as not very successful, rarely moving beyond testing that the assumptions behind various analyses held, but it did encourage a number of other approaches.

Some of these methods used models that were much more flexible than traditional statistical models and could effectively adapt to a wider class of data. These included:

neural networks that built up predictive models through training simple logic units;
splines and smoothing operators for fitting curves to data instead of the traditional parametric curves such as polynomials;
regression trees that fitted multidimensional step functions to complex data, automatically choosing where the steps should be;
automatic detection of interactions between factors in tables; and
generalised additive models that gave simple ways of building up complex models.
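To make the regression tree idea concrete, here is a minimal sketch of the simplest possible case: a single step whose position is chosen automatically by minimising squared error. The function name `best_step` and the small data set are illustrative assumptions, not taken from any particular package, and a real regression tree would apply the same search recursively over many variables.

```python
def best_step(xs, ys):
    """Find the threshold on x that minimises the total squared error
    of a two-level step function (a one-split regression tree)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a step between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        # Put the step halfway between the two neighbouring x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best

# Illustrative data with an obvious jump between x = 4 and x = 5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
sse, threshold, low_level, high_level = best_step(xs, ys)
```

No position for the step is supplied by the analyst; the algorithm searches every candidate and keeps the one with the smallest error, which is exactly why such methods are flexible and also why they need a separate means of inference.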

These methods were defined as much by their algorithms as by any theory. In that respect they were like the much older stepwise regression algorithms that attempted to find the optimal set of predictive variables. Had that been all there was to them, they would have shared the same problem of a weak theoretical base.

However they were developed in parallel with methods that allowed inference to be carried out in a wide range of algorithmic contexts, based upon fitting models to subsets of the data.

In their simplest form, these new inference methods might involve splitting the data into two randomly selected subsets. The models are developed using just the first set and whatever algorithm is appropriate. The models are then tested using the second subset – since this is independent of the first data subset, the assumptions of the traditional Neyman-Pearson method hold.
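A minimal sketch of this split-sample idea follows. It assumes synthetic data and a deliberately simple straight-line model, neither of which comes from any real analysis; the helper names `split_sample`, `fit_line` and `mean_squared_error` are hypothetical.

```python
import random

def split_sample(records, seed=0):
    """Randomly split records into two halves: (development, holdout)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def fit_line(pairs):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mean_squared_error(model, pairs):
    """Average squared prediction error of the fitted line on pairs."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in pairs) / len(pairs)

# Synthetic data (an assumption for illustration): y = 2 + 3x plus noise
rng = random.Random(42)
data = [(x, 2 + 3 * x + rng.gauss(0, 1)) for x in (i / 10 for i in range(200))]

development, holdout = split_sample(data, seed=1)
model = fit_line(development)                      # built on the first subset only
holdout_mse = mean_squared_error(model, holdout)   # judged on the second subset
```

The important design point is that `holdout` plays no part in choosing or fitting the model, so the error measured on it is not flattered by the fitting process.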

A more complex form of this is the jack-knife developed by John Tukey. In this, many different splits of the data are used and the range of models thus generated is examined. Many of these methods have only become feasible with the advent of low cost but powerful computers.
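As an illustration only, the sketch below shows the classic leave-one-out form of the jack-knife: the statistic is recomputed with each observation omitted in turn, and the spread of those recomputed values gives a standard error. The function name `jackknife` and the sample values are assumptions made for the example.

```python
import math

def jackknife(values, statistic):
    """Return (leave-one-out estimates, jack-knife standard error)."""
    n = len(values)
    # Recompute the statistic n times, each time leaving one observation out
    loo = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    variance = (n - 1) / n * sum((e - mean_loo) ** 2 for e in loo)
    return loo, math.sqrt(variance)

def mean(xs):
    return sum(xs) / len(xs)

# Illustrative sample
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
estimates, std_error = jackknife(sample, mean)
```

For the sample mean the result coincides with the familiar standard error, the sample standard deviation divided by the square root of n. The value of the method is that the same code applies unchanged to statistics for which no such formula is available.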

On-Line Analytical Processing and Data Warehousing

Independently of these statistical developments, system administrators had recognised that large commercial databases held vast amounts of data that could be analysed for commercial benefit. Often these were databases of customer information.

However a problem was faced in that the database structures were optimised for handling large numbers of small transactions, typically affecting the records for one customer at a time. In addition, the emphasis in relational databases was heavily towards using the structures to guarantee correctness and consistency. These structures were often very inefficient for large scale access of all the records in one operation.

This gave rise to the concepts of On-Line Analytical Processing (OLAP) and then data warehousing. Initially OLAP simply meant adding features to database systems to permit more efficient access and query, often by creating multidimensional tables or hypercubes. Then tools were added to give better reporting, typically the ability to cross tabulate and to produce graphs.
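The notion of a multidimensional table can be sketched in a few lines. The example below is not how any OLAP product is implemented; it simply assumes a handful of invented sales records and shows a cube as totals indexed by every combination of dimensions, together with a cross tabulation obtained by collapsing some of them. The names `build_cube` and `roll_up` are hypothetical.

```python
from collections import defaultdict

def build_cube(records, dimensions, measure):
    """Total a measure over every observed combination of the dimensions."""
    cube = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        cube[key] += record[measure]
    return dict(cube)

def roll_up(cube, keep):
    """Collapse the cube onto the dimension positions listed in keep."""
    rolled = defaultdict(float)
    for key, value in cube.items():
        rolled[tuple(key[i] for i in keep)] += value
    return dict(rolled)

# Invented records, for illustration only
sales = [
    {"region": "West", "product": "A", "quarter": "Q1", "amount": 100.0},
    {"region": "West", "product": "B", "quarter": "Q1", "amount": 50.0},
    {"region": "West", "product": "A", "quarter": "Q2", "amount": 70.0},
    {"region": "East", "product": "A", "quarter": "Q1", "amount": 30.0},
    {"region": "East", "product": "B", "quarter": "Q2", "amount": 20.0},
    {"region": "East", "product": "B", "quarter": "Q2", "amount": 10.0},
]

cube = build_cube(sales, ("region", "product", "quarter"), "amount")
by_region = roll_up(cube, keep=(0,))                # one-way table
by_product_quarter = roll_up(cube, keep=(1, 2))     # two-way cross tabulation
```

Once the cube has been built, every smaller table is derived from it without returning to the individual transactions, which is the source of the efficiency gain.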

This led to the realisation that often it was easier to set up a separate copy of the data, optimally structured for analysis. Most importantly, this copy could become a repository for data no longer needed for normal operations but still useful in analysis. This is the data warehouse. ^[4]

The data warehouse also provides the opportunity to clean up the data. Efficient use of a data warehouse often comes from a strict control over what goes into it. Too much emphasis on keeping the data warehouse current through frequent updates can result in letting through errors that limit its use. (We would argue that strategic decision making should never be dependent on the data being absolutely the latest – if one more week's data would change the decision it is highly unlikely to have been strategic.)

Perhaps the most perplexing part of these developments has been the proliferation of terminology. Terms such as data marts, business intelligence, decision support systems and many others each have a range of definitions, some very technical and some very vague. Sometimes it is right to suspect that only the names are changing, while the important ideas beneath are the same.

Statisticians might see multidimensional tables and cubes as very similar to what have for many years been called data matrices, and some of the other reporting tools as familiar tabulation software. This perception has much truth in it, but statistical software often lacked the flexibility to manipulate the truly large volumes of data encountered in business. ^[5]

Data Mining

These two streams of development have both led to what is called data mining today. But they are still largely separate, each with its own strengths and each with areas where it could learn from the other.

Computer scientists have approached this with an emphasis on data structures ^[6]. This often tries to enhance standard SQL database approaches with additional functionality for investigating the apparent relationships in the data. For example, one approach is to effectively aggregate a large data set by replacing groups of similar records with single records plus a count. This permits a large database to be reduced to a size where it is then possible to explore it interactively.
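The aggregation step can be sketched as follows. What counts as "similar" is the critical choice, so here it is passed in as a function; the particular rule used (same state and same ten-year age band), the customer records and the names `condense` and `state_and_age_band` are all assumptions made for illustration.

```python
from collections import Counter

def condense(records, similarity_key):
    """Replace each group of similar records with one entry plus a count."""
    counts = Counter(similarity_key(r) for r in records)
    return [{"group": key, "count": n} for key, n in sorted(counts.items())]

def state_and_age_band(record, band_width=10):
    """Treat records as similar if they share a state and an age band."""
    lower = (record["age"] // band_width) * band_width
    return (record["state"], lower)

# Invented customer records
customers = [
    {"state": "WA", "age": 34},
    {"state": "WA", "age": 37},
    {"state": "WA", "age": 52},
    {"state": "NSW", "age": 31},
    {"state": "NSW", "age": 39},
    {"state": "NSW", "age": 38},
]

summary = condense(customers, state_and_age_band)
```

Six records reduce to three groups here; on a database of millions of customers the same step can reduce the data to a table small enough to explore interactively, at the price of whatever detail the similarity rule discards.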

As part of this approach, computer scientists have developed ideas such as concept hierarchies ^[7] as a means of tackling the notion of what might be "similar". Some of these ideas could well benefit the statistical community where the internal structure of variables and what they mean has not been well developed.
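A concept hierarchy can be represented very simply as a mapping from detailed levels to broader ones. The sketch below borrows the transport example from footnote 7; the entries for "plane" and "ferry", the list of trips and the function name `generalise` are assumptions added to complete the illustration.

```python
from collections import Counter

# Two-level hierarchy: detailed transport mechanism -> top-level category.
# The land entries follow footnote 7; "plane" and "ferry" are assumed.
HIERARCHY = {
    "train": "land",
    "car": "land",
    "bus": "land",
    "bicycle": "land",
    "plane": "air",
    "ferry": "water",
}

def generalise(values, hierarchy):
    """Replace each detailed value with its parent in the hierarchy."""
    return [hierarchy[v] for v in values]

# Invented observations
trips = ["car", "bus", "plane", "car", "ferry", "train", "bicycle"]

detailed = Counter(trips)                          # counts at the lower level
top_level = Counter(generalise(trips, HIERARCHY))  # counts after rolling up
```

Two records that differ at the detailed level ("car" and "bus") become identical once generalised, which gives one workable definition of "similar" for a categorical variable.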

However most statisticians would be very disappointed by the lack of sophistication evident when the computer scientists talk of understanding the variation in the data. Concepts such as "similar" and "significantly different" are rarely well thought out and, with some notable exceptions, ignore the statistical developments of the past century. This means that doubts will always be present on whether the results of some data mining procedures are really meaningful or just chance.

Meanwhile, many data mining software products are appearing in the market, often making promises that sound too good to be true (and probably are). Many imply that all that is needed is to throw the data in and the results will start coming out. This ignores the experience of many years that to find something you have to have some idea of what you might be looking for.

The Necessary Synthesis

To realise the potential of data mining it is clearly necessary to combine the expertise of the statisticians and the computer scientists. The experience of Data Analysis Australia, where analyses are frequently carried out on databases measured in tens of gigabytes, is that the productive partnership is that of the statistician and the database expert. The statistician guides the asking of questions and the evaluation of results while the database expert is able to determine the efficient methods of handling the data so the question-answer cycle is kept short. ^[8]

One might hope that this need would see our universities training graduates with the appropriate combination of skills. Unfortunately that does not seem to be the case – statistical graduates rarely know much about database technologies and computing graduates rarely know much about statistics. The emphasis in the teaching of statistics is still towards the best possible analysis of relatively small data sets, while the teaching of databases concentrates on transaction processing and data integrity rather than exploring data.

A synthesis of expertise is necessary here and this does not come about through simply having the right software package. Rather it is having the right people.

December 2004

Footnotes

[1] It is no accident that one of the main developers of some modern methods is Jerome Friedman working at the Stanford Linear Accelerator Center (SLAC). Particle accelerators can generate gigabytes of data on interactions between sub-atomic particles each day.

[2] It is always worth remembering what they did not talk about. They did not ask what model might best describe the data, not whether in the context. They did not suggest how the model to be tested should be chosen in the first place. They did not consider the costs of making a wrong decision. Many poor decisions have been made through blindly following the Neyman-Pearson approach when some of these other questions should have been asked instead.

[3] A very good survey of this field is the book Artificial Intelligence and Statistics, edited by William Gale in 1986.

[4] A key technology that allowed this to happen was the quantum jump in hard disk capacities in the 1990s following the introduction of magneto-resistive heads that increased data densities. Suddenly data could be effectively archived on disk instead of tape, and then be available for analysis.

[5] An interesting question might be to ask what statisticians have been doing with the vastly increased computing power available to them today, if it is not to analyse larger data sets. Part of the answer is that a major theme in computational statistics has been applying computationally intensive methods to very precise inference problems, often with moderate sized data sets. Markov Chain Monte Carlo (MCMC) is perhaps the dominant such method.

[6] A widely used standard reference with this approach is J. Han and M. Kamber, 2001, Data Mining: Concepts and Techniques, Morgan Kaufmann.

[7] In the statistician's language, a concept hierarchy puts structure on the levels of a factor or categorical variable. An example might be a variable called "transport mechanism" which at the highest level is simply air/land/water; the next level for land transport might have train/car/bus/bicycle etc.

[8] One experience here is that SQL database engines are surprisingly unpredictable in how they handle large queries. Posing a query in a slightly different form can sometimes reduce computation times by orders of magnitude. A good engine is one that is not too much of a "black box", so that it is possible to better guide the computation. This goes against much database dogma that assumes that the engine is smart, an assumption that unfortunately often fails in practice.