Databases and Statisticians

Statisticians work with data.  Database professionals manage data.  It would seem that the two groups have much to talk about and much to offer each other.  At Data Analysis Australia we have teams with both statisticians and computer scientists working together, but our experience suggests that our approach is not common.  Most statisticians have little deep knowledge of databases, and most database professionals have limited experience in interpreting the data they manage.  This leads to much inefficiency, with statisticians putting more effort than necessary into manipulating their data before analysis, and database experts often reinventing methods that statisticians have known for decades.

It is natural to ask how and why this separation arose, and what can be done to bridge the chasm between the two fields.

Linked Origins

Statistical work was at the centre of the development of computers.  The modern computer and the company IBM evolved from punched card machines developed by Herman Hollerith (1860-1929) to process the US Census in the late 19th century.  At that stage the cards themselves were the only storage for the data, and the computers - termed tabulating machines - simply counted cards according to the hole patterns punched in them, with almost no computation as we would think of it today.  The original name of IBM was the Computing-Tabulating-Recording Company.

At the same time, statisticians were developing methods of analysis and these often required significant computation.  Mechanical calculators were the norm.  While these had surprising accuracy - often 20 digits - they had little or no storage capacity.  Algorithms were developed to minimise the number of times that data might need to be entered, resulting in the "calculating formulae" for quantities such as the variance that until recently filled elementary texts on statistics.
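
For example, the familiar calculating formula for the sample variance rearranges the definition so that only two running totals need be kept (standard textbook material, reproduced here purely for illustration):

    s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right)

Only the totals of the x_i and the x_i^2 are accumulated, so each observation needs to be entered into the calculator just once.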

Modern electronic computers were initially developed for a different purpose from the older tabulating machines - intensive computation - although some of the old technologies such as punched cards were carried across.  Computer memories were small - larger than those of the old calculators, but not by much.

When these new computers were then used to process data, the early emphasis was on batch processing, with the data initially on cards as before, and later on magnetic tape.  The statistical algorithms from the days of calculators were readily adapted to this environment, with the emphasis on doing as much as possible in a single reading of the data and not having to store it in memory.  Many of the older statistical programs such as SAS [1] and SPSS [2] still have this legacy data matrix structure, enabling them to process large volumes of data on relatively modest computers.  The simple structure does have its limitations, however, as discussed below.
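
A rough sketch of this single-pass style in Python (illustrative only - the records here stand in for a card or tape file, and no particular package is implied):

    # In the spirit of the old batch programs: read each record once and
    # keep only running totals in memory, never the whole data set.
    records = ["12.1", "9.8", "11.4", "10.0", "13.2"]    # stand-in for a card or tape file

    n = 0
    total = 0.0
    total_sq = 0.0
    for line in records:                                 # a single pass over the data
        x = float(line)
        n += 1
        total += x
        total_sq += x * x

    mean = total / n
    variance = (total_sq - total * total / n) / (n - 1)  # the old "calculating formula"
    print(n, mean, variance)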

Divergence

All this changed with the advent of random access mass storage - the hard disk drive - invented by Rey Johnson of IBM in 1956.  This allowed a program to read parts of a data file in any order, at the time a revolutionary concept.  Indeed, some say IBM almost stifled the hard disk's development because it went against the company's marketing approach of sequential processing of large files.

The disk drive generated a flurry of software development exploring how to reliably store data on these new systems, leading to what are now termed database management systems (DBMSs).  After a brief period of experimentation, one model of database design emerged as a clear winner: the relational model of Edgar Codd, especially when paired with a version of the Structured Query Language (SQL).  While this model had a number of strengths, it particularly emphasised maintaining consistency and integrity when a database was handling numerous small updates or transactions.  The benchmark used in measuring the performance of database systems was the number of standard transactions per second they could achieve.  Today these systems dominate business processes - virtually all the bills you receive will have been generated from such systems.

A key feature of the relational model is that data is conceptually stored without redundancy.  If one table stores invoice information and several invoices go to the one customer, then the data on the customer must be held in a separate table so that there is only one version of the information (such as the address) for that customer.  This ensures consistency - all invoices for a customer go to the same address, and changes to that address need only be made in one place - but it does mean that the data is spread across numerous tables.  These principles are often presented as part of Codd's Twelve Rules. [3]
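
A minimal sketch of this separation, using Python's built-in sqlite3 module (the table and column names are invented for illustration, not taken from any real system):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Customer details are stored once, in their own table.
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT,
            address     TEXT
        );
        -- Each invoice refers back to the customer instead of repeating the address.
        CREATE TABLE invoice (
            invoice_id  INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(customer_id),
            amount      REAL
        );
    """)
    con.execute("INSERT INTO customer VALUES (1, 'Acme Pty Ltd', '1 Example St, Perth')")
    con.executemany("INSERT INTO invoice VALUES (?, 1, ?)", [(101, 250.0), (102, 99.5)])

    # Reassembling the full picture requires a join across the tables.
    for row in con.execute("""
            SELECT invoice.invoice_id, customer.name, customer.address, invoice.amount
            FROM invoice JOIN customer ON invoice.customer_id = customer.customer_id"""):
        print(row)

Changing the address is then a single update to the customer table, and every invoice automatically reflects it.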

Database Professionals Discover Analysis

But transactions are not the only purpose for which databases are used.  All businesses need to analyse their data to inform business decisions.  Systems optimised for transactions were often found to be poor at this second task, largely because the data was spread across many tables and detailed knowledge was required to link those tables together.

This led to the development of On-Line Analytical Processing (OLAP) systems, typically extensions to SQL databases.  OLAP systems often work by creating a hypercube - a simple table that combines the data from many of the main tables and is hence easy to query.
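
In practice this often amounts to pre-joining and pre-aggregating the transactional tables into one wide, denormalised table keyed by the dimensions of interest.  A crude sketch in Python with sqlite3 (names invented; a real OLAP engine does far more):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sale (customer TEXT, month TEXT, amount REAL)")
    con.executemany("INSERT INTO sale VALUES (?, ?, ?)", [
        ("Acme", "2009-01", 250.0), ("Acme", "2009-02", 99.5),
        ("Bravo", "2009-01", 40.0), ("Bravo", "2009-02", 60.0),
    ])

    # A crude "cube": one pre-aggregated table that is easy to slice by either dimension.
    con.execute("""
        CREATE TABLE sales_cube AS
        SELECT customer, month, SUM(amount) AS total, COUNT(*) AS n
        FROM sale
        GROUP BY customer, month
    """)
    for row in con.execute("SELECT * FROM sales_cube ORDER BY customer, month"):
        print(row)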

Statisticians, who had little or no need to consider transactional systems, had little interaction with database professionals during this time.  Most statisticians worked with relatively small data sets, so they saw little need for sophisticated storage.  

Many statisticians are somewhat bemused by the development of hypercubes and similar data warehousing techniques - they see it as computer scientists discovering the data matrix that statisticians have been using for years.  The data matrix often lacks the sophisticated storage of the hypercube and is therefore slower to use for some problems, but statisticians had long recognised that its simplicity gives enormous power.  SAS grew in its use in the business world through its emphasis on the data matrix simply because it provided what business analysts needed long before the database systems did.

At the same time, some statisticians have considered that the SQL language is not ideal for their data management.  Used in its most direct way, SQL can easily lead to losing or duplicating data without warning, and some questions, such as finding the median of a set of numbers, can be difficult to answer.  While there are ways around these problems, it is difficult to argue that standard SQL databases are ideal for statisticians.
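
The median is a convenient example: most SQL dialects of the time offered AVG() as a built-in aggregate but no median, so the usual workaround is to sort the values and pick out the middle row or rows.  A sketch with sqlite3 (illustrative only):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE measurement (value REAL)")
    con.executemany("INSERT INTO measurement VALUES (?)",
                    [(x,) for x in [3, 9, 1, 7, 5, 8]])

    # The mean is a one-liner...
    (mean,) = con.execute("SELECT AVG(value) FROM measurement").fetchone()

    # ...but the median needs a sort-and-offset trick, averaging the middle
    # two values when the count is even.
    (median,) = con.execute("""
        SELECT AVG(value) FROM (
            SELECT value FROM measurement ORDER BY value
            LIMIT 2 - (SELECT COUNT(*) FROM measurement) % 2
            OFFSET (SELECT COUNT(*) - 1 FROM measurement) / 2
        )
    """).fetchone()
    print(mean, median)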

Statisticians Discover Data Structures

At the same time that computer scientists have been rediscovering the data matrix, statisticians have had to face real problems in managing data.  In the 1970s statistical programs began to take advantage of increased memory and improved operating systems.  Many of these programs are forgotten today, although two important ones survive - Genstat [4] and S [5].  While taking different approaches, both developed into highly programmable systems that implemented important algorithms and took advantage of modern computers.  Today S has largely been superseded by the free open source system R, which reproduces essentially all of its features and language. [6]

One of the most important features of R is its object orientation.  Objects and classes have been used in computing since the 1960s [7], but their use in statistics is more recent and comes from the work of John Chambers on S.  In R each statistical function can produce complex data structures that are then available for further analytical steps.  Hence the functions become the building blocks for particular analyses.  This makes R and similar systems far more flexible than the older data matrix oriented programs, although they still support the data matrix, since it remains useful for many simpler problems.
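
The flavour of this can be sketched in Python rather than R itself (the class and function names below are invented for illustration): an analysis function returns a rich object that later steps can build on, rather than just a printed number.

    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class FitResult:
        """A rich result object: later steps work on this, not on printed output."""
        slope: float
        intercept: float
        residuals: list

    def fit_line(x, y):
        # Ordinary least squares with a single predictor.
        mx, my = mean(x), mean(y)
        slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
                 / sum((xi - mx) ** 2 for xi in x))
        intercept = my - slope * mx
        residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        return FitResult(slope, intercept, residuals)

    # The result object becomes the building block for the next analytical step.
    fit = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
    print(fit.slope, fit.intercept)
    print(max(abs(r) for r in fit.residuals))   # a simple residual check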

While many statisticians appear to have discovered data structures, it is not so clear that most have discovered how best to manage data.  Perhaps this is best illustrated by the approach to the question "where did the data come from?"  Any statistician involved in data collection knows that this is rarely a simple issue.  Decisions must continually be made on what data to include or exclude.  Corrections are often required.  Missing values must be managed.  Sometimes the reasons that data is missing are informative in themselves.

This is an area where database methodologies can assist, particularly since they often place an emphasis on providing an audit trail.  The ability to "roll back" changes and to document decisions can provide greater confidence in the data before the statistical analysis begins.
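
A small illustration of the "roll back" idea, again with sqlite3 (a sketch only - real audit trails also record who changed what, when and why):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE reading (id INTEGER PRIMARY KEY, value REAL)")
    con.executemany("INSERT INTO reading VALUES (?, ?)",
                    [(1, 10.2), (2, 999.0), (3, 9.8)])
    con.commit()

    try:
        with con:  # a transaction: committed on success, rolled back on error
            con.execute("UPDATE reading SET value = NULL WHERE value > 100")  # flag a suspect value
            raise RuntimeError("correction rejected on review")               # simulate a rejected edit
    except RuntimeError:
        pass

    # The rejected change was rolled back, so the original data is untouched.
    print(con.execute("SELECT * FROM reading").fetchall())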

The Future

While it is not always apparent to the user, these object oriented statistical systems are heavily dependent upon the improved memory size of modern computers.  The size of a data set that R or S can handle with ease is directly related to the available memory.  Low-cost memory of many gigabytes has meant that this is no longer a limitation for most statistical work.

This availability of large memory has also raised questions about the design of database systems.  The existing major databases such as DB2 and Oracle were designed when disk storage provided a major bottleneck to performance.  This meant that while the disk was being accessed for one transaction, it was very likely that more transactions would begin.  Rather than queuing these, the approach was to run them in parallel to the greatest extent possible.  This added complication but optimised disk access.  

Today many large databases can fit into the memory of a computer.  Where this is the case, processing transactions in sequence can be not only feasible but also the most efficient approach.  Removing parallel transactions enables the database to be simplified, increasing performance by a factor of two or three.  It remains to be seen how this changes the accepted wisdom of what good database design is.  (For example, different approaches to reliability are required.)  However, it is likely that significant changes lie ahead.

In this environment, we hope that statisticians and computer scientists will work together.  That said, it is not clear that the two fields are converging.  University courses are usually quite separate.  Few computer science students learn much statistics, and few statistics students learn much real computing.

The experience of Data Analysis Australia is that it pays to keep these groups talking to each other.  Our approach, reinforced through internal training, is to ensure that each group is aware of the strengths of the other and to minimise barriers.  This has led to success in large computational projects that external computing professionals said were impossible and in sophisticated statistical projects where external statisticians said they could not manage the volumes or complexity of the data.


March 2009


Footnotes

[1] SAS (originally the Statistical Analysis System) traces its origins back to North Carolina State University in the mid 1960s, with the first public release in 1971. It was initially developed on the IBM System/360 mainframe, programmed in a combination of IBM Assembler and PL/I, with a number of key routines in Fortran. Over the years it has evolved to run on a number of platforms, the dominant one today being Windows.

[2] SPSS was released as the Statistical Package for the Social Sciences in 1968, running on a number of mainframe platforms. It originally had a very limited language but it emphasised making statistics available to people with limited mathematical and computing backgrounds. It remains in heavy use in industries such as market research.

[3] See, for example, Codd's Twelve Rules.

[4] Genstat was developed by a team under John Nelder at Rothamsted Research Station in the 1970s, although there was an earlier program under the same name developed by Nelder while visiting the CSIRO in the 1960s. Genstat is not so well known today outside applications in agricultural statistics, where its analysis of variance has a sophistication not matched by any other program.

[5] S was developed by John Chambers and colleagues at Bell Laboratories from the mid 1970s as a programming environment for data analysis and graphics. It was later commercialised as S-PLUS, and its language and design strongly influenced R.

[6] R was developed by Ross Ihaka and Robert Gentleman, then at the University of Auckland, in the early 1990s, originally to give S-like functionality at a cost that made it feasible for teaching. Several of its innovations, particularly its excellent memory management and its release under the General Public Licence (GPL), have made it extremely popular, and today a large community of statisticians contribute to it.

[7] The Simula language was a notable early example developed as an extension to Algol 60 by Ole-Johan Dahl and Kristen Nygaard in Norway in the 1960s. While definitions can always lead to some disagreement, it is generally considered the first object oriented language.