Copyright © 2013
Data Analysis Australia

Precision in Recording and Reporting Data

Any organisation using data to inform or direct its activities is faced with the question of how to record and present data when there is a degree of uncertainty or a limit on precision.  This uncertainty is usually unavoidable, arising from how the data was first collected: the data may come from a sample, or the analytical equipment may only be calibrated to a certain accuracy.

To address this uncertainty, it has long been common practice to round or truncate such data to an acceptable level of precision when observations are first recorded, so as not to give a false impression of the precision.  The concern is valid, but the approach can discard valuable information in the data.  At a time when computers have revolutionised data storage, it is appropriate to ask whether a method developed in the pencil and paper days is still the best.

The Problem

When people read a number given to many decimal places, they tend to assume it is accurate to that level of precision.  There are many examples, across every industry, where reporting figures to a large number of decimal places misleads an audience as to the certainty of the figure.

Newspaper articles have given this false impression on numerous occasions, sometimes as a result of converting from one unit of measurement to another.  For example, an article may report that a car accident occurred on a highway "about 1.609 km" north of a particular intersection.  Here an approximate distance from a report, given in an imperial unit (approximately one mile), has been converted to a metric distance with three decimal places, suggesting that the location is known to the nearest metre.  It would certainly be appropriate to round the figure to, say, one decimal place to communicate that the distance is only approximate.
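The conversion itself is trivial, but a short sketch makes the point.  The Python snippet below is a minimal illustration, assuming the standard conversion factor of 1.609344 km per mile: the raw conversion carries spurious decimal places, while rounding for reporting restores an honest level of precision.

    # A minimal sketch of the conversion above, assuming the standard
    # factor of 1.609344 km per mile.
    KM_PER_MILE = 1.609344

    distance_miles = 1.0                         # "about one mile" in the original report
    distance_km = distance_miles * KM_PER_MILE

    print(f"Raw conversion:     {distance_km:.3f} km")   # 1.609 km - falsely precise
    print(f"Reported (rounded): {distance_km:.1f} km")   # 1.6 km   - honestly approximate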

As a communication issue, limiting the number of decimal places when reporting individual observations is important to prevent confusion.  However, limiting decimal places when recording and analysing data only serves to reduce the amount of information that can be gained from statistical investigation of the data.

The traditional way of recording measurements taught across many scientific disciplines is to limit the number of decimal places of each observation to the expected precision of the measurement technique.  Similarly, where there are detection limit issues, observations are often recorded as being below, or above, a particular threshold even though the measurement technique still produces a particular value.  The reasons for traditionally taking this approach are understandable, but collecting and recording data is essentially a statistical issue, and ignoring statistical principles can have serious consequences for the subsequent analysis of the data and the ability to use results to inform effective decision-making.

A simple illustration is two measurements of the arsenic present in soil samples returning raw values of 0.081% and 0.102%.  If both are recorded to one decimal place as 0.1%, the information that the second observation is likely to be greater than the first is lost.
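This can be seen in a few lines of Python - a minimal sketch using the two raw values above, with ordinary rounding standing in for whatever recording convention is in use.

    # Rounding both raw assay values to one decimal place makes them
    # indistinguishable, losing the ordering visible in the raw data.
    raw = [0.081, 0.102]                     # raw values, in percent
    recorded = [round(x, 1) for x in raw]

    print(recorded)                          # [0.1, 0.1]
    print(raw[1] > raw[0])                   # True  - ordering visible in the raw values
    print(recorded[1] > recorded[0])         # False - ordering lost after recording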

These problems grow as the number of rounded or truncated observations increases, and they are further compounded when numerous variables are recorded with limited precision.  In the example illustrated in Figure 1, suppose arsenic levels of water samples taken from a dam over eight days have been observed and then rounded to the expected precision of the assay technique.  The loss of detail in the recorded values becomes apparent when comparing the two data series.  In particular, the observed values suggest that the arsenic levels plateau at a level higher than that indicated by the corresponding recorded values: each single observation suggests the true value is likely to be greater than 0.1, but the combination of observed values suggests that the plateau is very likely to be greater than 0.1.  This is crucial information when trying to determine the likelihood that the arsenic levels have fallen below a threshold level.
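The data behind Figure 1 are not reproduced here, but a sketch with hypothetical values for the eight days shows the effect: every observation rounds to 0.1, so the recorded series gives no hint that the plateau sits above 0.1, while the observed series makes it clear.

    # Hypothetical observed arsenic levels over eight days (values invented
    # for illustration; they are not the data behind Figure 1).
    observed = [0.14, 0.13, 0.12, 0.11, 0.11, 0.12, 0.11, 0.12]
    recorded = [round(x, 1) for x in observed]   # rounded to the assay's expected precision

    mean_observed = sum(observed) / len(observed)
    mean_recorded = sum(recorded) / len(recorded)

    print(recorded)                  # [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
    print(f"{mean_observed:.3f}")    # 0.120 - the plateau is clearly above 0.1
    print(f"{mean_recorded:.3f}")    # 0.100 - no evidence the plateau exceeds 0.1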

A closely related problem is the treatment of observations that are "below the detection limit".  These are often reported simply as "Not Detected" or "ND" even though a number is usually available from the analytical process.  Sometimes the original number is clearly not correct - for example, the concentration of a chemical substance may come out as negative when it is calculated as the difference between two measurements.  (For spectral techniques such as Inductively Coupled Plasma used in chemical analysis, similar results can occur through a more complex but fundamentally identical process.  A problem here is that the equipment sometimes hides these "impossible" values.)  As statisticians we would argue that even nonsensical values should be retained, as they contain information that may prove useful when analysing the larger data set.
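As a simple illustration of why such values are worth keeping, the sketch below uses invented raw results and an assumed detection limit of 0.05, and compares the mean of the raw values with the mean obtained after the common practice of substituting the detection limit for anything below it.

    # Hypothetical raw analytical results (differences of two measurements,
    # so a small negative value is possible) and an assumed detection limit.
    detection_limit = 0.05
    raw = [-0.01, 0.02, 0.03, 0.04, 0.06, 0.08]

    # One common convention: substitute the detection limit for anything below it.
    censored = [max(x, detection_limit) for x in raw]

    print(sum(raw) / len(raw))            # about 0.037 - uses all the information
    print(sum(censored) / len(censored))  # about 0.057 - biased upward by the substitution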

The Solution

The problem is that the two tasks of recording data and reporting data have become confused, essentially due to old paper-based systems, perpetuated by a period of high-cost electronic storage.  It is valid to ask whether these two tasks can be separated and better handled today, when systems are no longer paper-based and storage is cheap.

Spreadsheets provide a good example of how to handle both recording and reporting data.  Most people who use Microsoft Excel will be familiar with the way a number can be displayed on the screen with a limited number of decimal places while the full value remains recorded in the cell.  It is these recorded numbers that are used in any calculations, rather than the rounded numbers displayed on the screen.
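The same separation is easy to reproduce outside a spreadsheet.  A minimal Python sketch (the value 0.1024 is arbitrary) keeps the stored number for calculation and rounds only the displayed string.

    # The stored value is used for calculation; only the display is rounded.
    stored = 0.1024                  # the full value "in the cell"
    displayed = f"{stored:.2f}"      # what appears on screen

    print(displayed)                 # 0.10
    print(stored * 1000)             # 102.4, not 100 - calculations use the stored value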

This spreadsheet approach suggests that with modern systems it is practical to manage imprecise data by separating the recording and reporting steps.  A general rule is to record individual measurements with a level of precision slightly beyond the accuracy of the measurement.  The statistical analysis will still need to consider the precision and any special issues such as "impossible" values.  However, modern statistical methods can do this, extracting all the possible meaning from the data.

In practice this usually means an extra one or two decimal places, although in many data systems there is no penalty in going beyond this.  The results of any subsequent analysis can then be reported with fewer decimal places to prevent confusion as to the precision of the results.
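Putting the rule together, a minimal sketch (with invented assay values) records the raw measurements, analyses them at full precision, and rounds only the reported summary.

    # Record with extra precision, analyse as recorded, round only the report.
    measurements = [0.081, 0.102, 0.094, 0.088]     # hypothetical raw assay values, in percent

    mean = sum(measurements) / len(measurements)    # analysis keeps full precision (about 0.0913)
    report = f"Mean arsenic level: {mean:.2f}%"     # rounding happens only at the reporting step

    print(report)                                   # Mean arsenic level: 0.09%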

The benefit of doing this can be immediate: information is valuable to every organisation, so it makes sense not to waste it.  Physical limits and costs on data storage are no longer an issue - computers and data loggers can store vast amounts of information very cheaply, and storage costs are now insignificant compared with the cost of collecting the data.  Methods that made sense in the pencil and paper days are simply no longer appropriate.