Statisticians have very mixed views on data mining.

At one stage it was a term of derision. To mine data suggested digging into the data so much that something was bound to be found, without regard to whether it was there purely by chance. The implication was that the mining proceeded until what was found fitted preconceived ideas.

Formal statistics had been developed precisely to overcome such dangers and every undergraduate student of statistics was taught to define hypotheses to be tested *before* looking at the data.
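The danger of unplanned searching can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the 5% significance level are assumptions, not part of the original discussion. It computes the chance that at least one of several independent tests appears "significant" when nothing real is present, and checks the formula by simulation.

```python
import random


def chance_of_false_finding(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests is
    'significant' at level alpha when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulated_chance(num_tests: int, alpha: float = 0.05,
                     trials: int = 10_000, seed: int = 0) -> float:
    """Estimate the same probability by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under a true null hypothesis a p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(m, round(chance_of_false_finding(m), 3), simulated_chance(m))
```

With twenty unplanned looks at the data, the chance of at least one spurious "finding" is already well over one half, which is why hypotheses were traditionally fixed in advance.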

At the same time, much science has benefited from chance discoveries: observations on the data that prompt further questions. A favourite example in biology was the discovery by Fleming of mould contaminating some bacterial growths and apparently killing the bacteria. This was not a "hypothesis" posed before the experiment, but purely an observation that spurred the question "what might be causing this?". The answer of course was penicillin.

Modern statistics is attempting to find a path between two extremes – the traditional path that is too rigid and a new path that is in danger of losing rigour. The impetus has come from areas where the sheer volume of data means that there is less need to be statistically efficient and a greater need to be time efficient. ^{[1]}

### Traditional Statistical Approaches

In the early years of statistics, probability theory was well developed and this gave great understanding of what randomness in data might be like. If you had a correct probability model then it was possible to predict what data might reasonably be observed. But there was no consensus on how the inverse problem should be handled: given the data, what can be said about the model? Decisions were being made based on data but this was not always an objective process.

Into this void a mathematical theory of decision making was developed by Jerzy Neyman and Egon Pearson in 1928. As mathematicians often do, they considered a simplified situation where they had a well-defined possible model and the decision to be made was simply whether this model could possibly be right ^{[2]}. Since it was already understood how to calculate probabilities from the model, this was a reasonable approach but it made an assumption that the model had not been specially chosen to be consistent with (or not be consistent with) the data. Ideally the model should have been hypothesised before even seeing the data.
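As a rough illustration of this style of decision, the sketch below assumes a very simple model: a coin that lands heads with a stated probability, flipped a fixed number of times. The function names, the 5% level and the equal-tailed rule are assumptions made for the example. The key point is that the rejection rule is computed from the model alone, before any data are seen.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials under the model."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def critical_values(n: int, p0: float, alpha: float = 0.05):
    """Equal-tailed rejection rule fixed in advance: reject the model if the
    count is <= lo or >= hi. Each tail has probability at most alpha / 2."""
    lo, cumulative = -1, 0.0
    for k in range(n + 1):
        cumulative += binomial_pmf(k, n, p0)
        if cumulative > alpha / 2:
            break
        lo = k
    hi, cumulative = n + 1, 0.0
    for k in range(n, -1, -1):
        cumulative += binomial_pmf(k, n, p0)
        if cumulative > alpha / 2:
            break
        hi = k
    return lo, hi


def reject_model(observed: int, n: int, p0: float, alpha: float = 0.05) -> bool:
    """Decide whether the observed count is inconsistent with the model."""
    lo, hi = critical_values(n, p0, alpha)
    return observed <= lo or observed >= hi


if __name__ == "__main__":
    print(critical_values(20, 0.5))   # rule decided before looking at data
    print(reject_model(16, 20, 0.5))  # decision once the data arrive
```

If the model or the rule were instead adjusted after inspecting the counts, the stated error probability would no longer apply.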

In some areas where critical decisions are being made and it is hard not to be influenced by the data, this style of inference is taken very seriously. For example, many advances in medicine have come from "double blind" clinical trials, where patients are allocated to receive particular treatments at random, and neither the patients nor the doctors know who is allocated to each treatment. Only after the data is gathered and the statistical analysis defined is the data "unblinded" and, even then, only the prespecified analysis method is regarded as legitimate. Authorities such as the Food and Drug Administration set strict standards for both the trials and the statistical analysis.

In this context, what do you do when something unexpected is encountered in the data? The formal answer is to define a question based upon what has been observed and *then gather some fresh data to test this question*.

This approach may work well in a laboratory and if there is plenty of time, but it can be frustrating in many real situations. It may mean that the original data set is barely used and that many data collection cycles are required before a satisfactory understanding of the data is reached.

But in many contexts this formal approach is not an option. The data may have been collected for a different purpose. Decisions may have to be made in limited time. The data may initially not be well understood. Exploration is required to define the limits of how the data can be helpful. The data may be only part of the information available.

In these situations, the formality can stifle the application of statistics. Modern statistical methods have been developed to allow greater flexibility while maintaining rigour.

### Statistical Learning

One modern approach is to have algorithms that automatically learn about structures in the data. In the 1980s this was often presented in the context of artificial intelligence, machine learning, expert systems and other constructs from the computer science community ^{[3]}. These approaches had as their ideal a program that would act like a human statistician in making judgements, but do so in an automated, fast and objective manner. In retrospect, this must be seen as not very successful, rarely moving beyond just testing that the assumptions behind various analyses were holding, but it did encourage a number of other approaches.

Some of these methods used models that were much more flexible than traditional statistical models and could effectively adapt to a wider class of data. These included:

- neural networks that built up predictive models through training simple logic units;
- splines and smoothing operators for fitting curves to data instead of the traditional parametric curves such as polynomials;
- regression trees that fitted multidimensional step functions to complex data, automatically choosing where the steps should be;
- automatic detection of interactions between factors in tables; and
- generalised additive models that gave simple ways of building up complex models.
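To give a flavour of how one of the methods listed above chooses its own structure, the following sketch fits a single step function to one predictor, which is the basic building block of a regression tree. It is a simplified illustration under assumed names (`best_split` is hypothetical), not a full tree algorithm: a real regression tree applies this search recursively and over many predictors.

```python
def best_split(xs, ys):
    """Find the single step (threshold) in x that minimises the total squared
    error when y is predicted by its mean on each side of the step.

    Returns (threshold, left_mean, right_mean, squared_error), or None if
    all x values are identical and no split is possible."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a step between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or error < best[3]:
            best = (threshold, left_mean, right_mean, error)
    return best


if __name__ == "__main__":
    xs = [1, 2, 3, 10, 11, 12]
    ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
    print(best_split(xs, ys))
```

Because the position of the step is chosen by searching the data, the fitted model will tend to look better on that same data than it really is, which is the weakness the following paragraphs address.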

These methods were defined as much by their algorithms as by any theory. In that respect they were like the much older stepwise regression algorithms that attempted to find the optimal set of predictive variables. If that was all they were, then they would have had the same problem of a weak theoretical base.

However, these methods were developed in parallel with techniques that allowed inference to be carried out in a wide range of algorithmic contexts, based upon fitting models to subsets of the data.

In their simplest form, these new inference methods might involve splitting the data into two randomly selected subsets. The models are developed using just the first set and whatever algorithm is appropriate. The models are then tested using the second subset – since this is independent of the first data subset, the assumptions of the traditional Neyman-Pearson method hold.
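A minimal sketch of this two-subset idea follows. The straight-line model, the synthetic data and all function names are assumptions chosen to keep the example short; any model-building algorithm could take the place of `fit_line`.

```python
import random


def split_in_two(rows, seed=0):
    """Randomly divide rows into a development half and a test half."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Least-squares slope and intercept for (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, pairs):
    """Average squared prediction error of the fitted line on pairs."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pairs) / len(pairs)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
noise = random.Random(1)
data = [(x, 2.0 * x + 1.0 + noise.gauss(0.0, 1.0)) for x in range(40)]

development, holdout = split_in_two(data)
model = fit_line(development)                       # built on the first subset only
holdout_error = mean_squared_error(model, holdout)  # judged on the second subset
print(model, holdout_error)
```

The error measured on the second subset is an honest assessment because those observations played no part in choosing or fitting the model.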

A more complex form of this is the jack-knife, introduced by Maurice Quenouille and later extended and named by John Tukey. In this, many different splits of the data are used and the range of the models thus generated is compared. Many of these methods have only become feasible with the advent of low cost but powerful computers.
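The sketch below shows the common leave-one-out form of the jack-knife, in which each "split" simply omits one observation. The function name and the sample values are assumptions for illustration. The spread of the recomputed statistics gives an estimate of the standard error of the statistic.

```python
from math import sqrt
from statistics import mean, median, stdev


def jackknife(data, statistic):
    """Leave-one-out jack-knife.

    Returns (replicates, standard_error): the statistic recomputed with each
    observation omitted in turn, and the jack-knife standard error."""
    n = len(data)
    replicates = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    centre = mean(replicates)
    variance = (n - 1) / n * sum((r - centre) ** 2 for r in replicates)
    return replicates, sqrt(variance)


if __name__ == "__main__":
    sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
    _, se_mean = jackknife(sample, mean)
    # For the sample mean this agrees with the textbook formula s / sqrt(n).
    print(se_mean, stdev(sample) / sqrt(len(sample)))
    # The same code works for statistics with no simple formula, although the
    # jack-knife is known to behave poorly for some of them, such as the median.
    print(jackknife(sample, median)[1])
```

Each call recomputes the statistic once per observation, which is why such methods depend on cheap computing.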

### On-Line Analytical Processing and Data Warehousing

Independently of these statistical developments, system administrators had recognised that large commercial databases held vast amounts of data that could be analysed for commercial benefit. Often these were databases of customer information.

However, the database structures were optimised for handling large numbers of small transactions, typically affecting the records for one customer at a time. In addition, the emphasis in relational databases was heavily towards using the structures to guarantee correctness and consistency. These structures were often very inefficient for large scale access to all the records in one operation.

This led to the concepts of On-Line Analytical Processing (OLAP) and then data warehousing. Initially OLAP simply meant adding features to database systems to permit more efficient access and query, often by creating multidimensional tables or hypercubes. Then tools were added to give better reporting, typically the ability to cross tabulate and to produce graphs.

This led to the realisation that often it was easier to set up a separate copy of the data, optimally structured for analysis. Most importantly, this copy could become a repository for data no longer needed for normal operations but still useful in analysis. This is the data warehouse. ^{[4]}

The data warehouse also provides the opportunity to clean up the data. Efficient use of a data warehouse often comes from strict control over what goes into it. Too much emphasis on keeping the data warehouse current through frequent updates can result in letting through errors that limit its use. (We would argue that strategic decision making should never be dependent on the data being absolutely the latest – if one more week's data would change the decision, it is highly unlikely to have been strategic.)

Perhaps the most perplexing part of these developments has been the proliferation of terminology. Terms such as data marts, business intelligence, decision support systems and many others each have a range of definitions, some very technical and some very vague. Sometimes it is right to suspect that only the names are changing, while the important ideas beneath are the same.

Statisticians might see multidimensional tables and cubes as being very similar to what for many years have been called data matrices, and some other reporting tools as being familiar tabulation software. This perception has much truth in it, but statistical software often lacked the flexibility to manipulate the truly large volumes of data that are encountered in business. ^{[5]}
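The multidimensional tables or hypercubes mentioned above can be sketched in a few lines. This is only a toy illustration: the record fields (`region`, `product`, `sales`) and the function names are invented for the example, and real OLAP systems precompute and index such aggregates rather than building them on demand.

```python
from collections import defaultdict


def build_cube(records, dimensions, measure):
    """Aggregate a measure over every combination of the given dimensions.

    Returns a dict mapping a tuple of dimension values to the summed measure."""
    cube = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        cube[key] += record[measure]
    return dict(cube)


def roll_up(cube, dimensions, keep):
    """Collapse a cube onto a subset of its dimensions (a cross tabulation margin)."""
    positions = [dimensions.index(d) for d in keep]
    totals = defaultdict(float)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical transaction records, one small row per sale.
    transactions = [
        {"region": "North", "product": "A", "sales": 10.0},
        {"region": "North", "product": "B", "sales": 5.0},
        {"region": "South", "product": "A", "sales": 7.0},
        {"region": "North", "product": "A", "sales": 3.0},
    ]
    dims = ["region", "product"]
    cube = build_cube(transactions, dims, "sales")
    print(cube)
    print(roll_up(cube, dims, ["region"]))
```

Seen this way, a two-dimensional cube is just a cross tabulation, which is the similarity to familiar statistical tables noted above.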