Data, Statistics and COVID-19

The world is facing a pandemic on a scale that has not been seen since the “Spanish Flu” of 1918–19.  Today the world has global travel to spread the virus much more quickly and the Internet to inform (or misinform) everyone.  It is the first pandemic we are seeing in close to real time.

The numbers are stark.  With what we know, if no action is taken then 70% of the population may be infected.  Perhaps 1% of infections might be fatal.  That suggests 175,000 fatalities in Australia.  Such an outcome would be comparable to the Spanish Flu.  Fortunately no country has failed to take any action and the fatalities are much lower than that.
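
The arithmetic behind that figure can be sketched in a few lines.  The population of 25 million is an assumption (it is not stated above, but it is roughly Australia's population in 2020 and is what the quoted figures imply); the two rates are the ones given in the text.

```python
# Assumed population of Australia (not stated in the article; about 25 million in 2020).
POPULATION = 25_000_000
ATTACK_RATE = 0.70     # share of the population infected if no action is taken (from the text)
FATALITY_RATE = 0.01   # share of infections that are fatal (from the text)


def expected_fatalities(population, attack_rate, fatality_rate):
    """Fatalities = population x share infected x share of infections that are fatal."""
    return population * attack_rate * fatality_rate


print(f"{expected_fatalities(POPULATION, ATTACK_RATE, FATALITY_RATE):,.0f}")
```

Both rates are rough, so the result is an order of magnitude rather than a forecast.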

As statisticians and data scientists, we at Data Analysis Australia see this very much from the data.  COVID-19 provides an illustration of data saving lives, in keeping with the tradition of that pioneer of both statistics and nursing, Florence Nightingale.

The models

Epidemiologists – statisticians who study the spread of diseases – need this data so they can apply models to understand what is going on.  At their simplest, these models divide the population into three groups – susceptible, infected and immune (or removed).  Typically a person who is infected and survives becomes immune.  A vaccine can make a susceptible person immune.  A key parameter is the reproduction ratio R, which is the number of persons infected by a single infected person.  R is actually the product of a base reproduction ratio R0, corresponding to the situation where all the population is susceptible, and the proportion S of the population that is susceptible; that is, R = R0 × S.  If R is below 1 then an epidemic will naturally die out, given time.  At the moment the best estimate for most countries is that R is around 3.

Without action, the epidemic will grow until S is less than 1/R0 or, equivalently, the proportion that is immune is greater than 1 − 1/R0.  With R0 around 3 and without a vaccine to achieve this, two thirds of the population, or about 70%, must have contracted the virus.  At that stage “herd immunity” kicks in.
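
As a minimal sketch of this threshold, the snippet below uses the relationship R = R0 × S and the standard result that R falls below 1 once the immune proportion exceeds 1 − 1/R0.  The function names are illustrative only, and R0 = 3 is the rough estimate quoted above rather than a fitted value.

```python
def effective_r(r0, susceptible_fraction):
    """Reproduction ratio R when only a fraction S of the population is susceptible."""
    return r0 * susceptible_fraction


def herd_immunity_threshold(r0):
    """Immune proportion needed to bring R down to 1, namely 1 - 1/R0."""
    if r0 <= 1:
        return 0.0  # the epidemic dies out without any immunity
    return 1.0 - 1.0 / r0


r0 = 3.0  # assumed value, taken from the estimate in the text
threshold = herd_immunity_threshold(r0)
print(f"Immune proportion needed: {threshold:.2f}")
print(f"R at that point: {effective_r(r0, 1.0 - threshold):.2f}")
```

With R0 = 3 the threshold is two thirds, which is where the "about 70%" figure comes from.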

The value of R0 depends upon the virus itself – how infective it is – and social effects such as the way that an infected person might interact with others in the community.  At this stage we cannot do much about the virus itself but we can change behaviour.  Unfortunately for COVID-19 individuals may be infective well before symptoms are evident, so behaviour change cannot just be limited to known cases.  Social distancing is one method of reducing R0, hopefully to less than 1 so the epidemic will just die out.  This should work over several generations of the virus, provided that no new cases are being imported.  In the long run a vaccine will enable us to bring the proportion susceptible below a third (ideally much lower) without relying upon the virus itself to achieve this.  (There were reports that the UK government was planning to rely upon the virus to generate herd immunity.  Fortunately, if that was ever their plan they changed very quickly.)

In the shorter term, if R is reduced but is still greater than one, the epidemic will be delayed, more spread over time and, most importantly, the peak in the number of people requiring treatment will be less, allowing the health systems to better manage them.  This is “squashing the curve”.  A very readable article is at, while the modelling used to guide the UK response is here.  An interesting video (not for the faint hearted) is here.
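
The effect can be illustrated with a small simulation of the three-group model (susceptible, infected, removed).  This is a sketch under stated assumptions, not the modelling used by any health authority: the 10-day infectious period, the initial infected fraction, the simple Euler time-stepping and the value 1.5 for the reduced reproduction ratio are all hypothetical choices made for illustration.

```python
def simulate_sir(r0, infectious_days=10.0, initial_infected=1e-4, days=730, dt=0.1):
    """Euler-integrate a basic SIR model; returns the infected fraction at each step."""
    gamma = 1.0 / infectious_days   # recovery rate per day (assumed)
    beta = r0 * gamma               # transmission rate implied by R0
    s, i = 1.0 - initial_infected, initial_infected
    infected = [i]
    for _ in range(round(days / dt)):
        new_infections = beta * s * i * dt
        recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - recoveries
        infected.append(i)
    return infected


def peak(infected, dt=0.1):
    """Return (largest infected fraction, day on which it occurs)."""
    peak_value = max(infected)
    return peak_value, infected.index(peak_value) * dt


for r0 in (3.0, 1.5):  # no action versus a hypothetical distancing regime
    peak_value, peak_day = peak(simulate_sir(r0))
    print(f"R0 = {r0}: peak of {peak_value:.1%} infected around day {peak_day:.0f}")
```

Under these assumptions the lower reproduction ratio gives a peak that is both much smaller and considerably later, which is the point of the strategy.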

The data

We see data daily, especially numbers of positively diagnosed cases and deaths, from all countries around the world.  Several websites are updating the numbers around the world hour-by-hour.  But what does this tell us?  Unfortunately, the data is invariably incomplete, with key information often lacking.  One of the main problems is that the level of testing varies between countries and has changed over time.  Some countries seem to have applied the old adage “if you don’t want to find something, don’t look”.

Many of these data issues are standard ones faced by statisticians:

  • In almost all countries including Australia testing is focused on those who are more likely to have the virus.  When there is a shortage of resources and the aim is to control the epidemic, this makes sense.  However it is heavily biased “sampling”, and will badly overestimate the proportion of the population who are positive. 
  • Likewise only testing people who have critical symptoms may overestimate the fatality rate, as seems to have occurred early in Italy.
  • It follows from this that young children who tend to show minimal symptoms are rarely tested.  Hence we know little about the prevalence amongst school age children.  This, in addition to lack of knowledge on how infective a child can be, makes it difficult to set policies for opening or closing schools.
  • Some degree of random sampling of the population is required to measure the actual prevalence.  That is, a properly designed survey.  Surveying a rapidly changing situation is not easy, and health authorities must also consider the effect of diverting testing resources away from those at greatest risk.  However, at some stage the need for real information will become paramount.
  • Fatality rates are similarly affected by poor data.  Many estimates of rates use the numbers of reported cases rather than estimates of the number of actual cases.  It seems likely that to varying degrees all countries have undercounted the actual number and hence fatality rates can appear quite high.  For example, in Italy, the number of deaths is around 10% of the number of known cases.
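
Two of these effects can be sketched with a toy simulation.  Every number in it is hypothetical (the true prevalence, the chance of showing symptoms, the number of tests and the detection fraction are invented for illustration and are not estimates for any country); only the 10% ratio of deaths to known cases comes from the text.

```python
import random


def simulate_testing(population=200_000, true_prevalence=0.01,
                     p_symptoms_if_infected=0.6, p_symptoms_if_healthy=0.05,
                     tests=5_000, seed=1):
    """Compare the positive rate from symptom-targeted testing with a random sample."""
    rng = random.Random(seed)
    people = []
    for _ in range(population):
        infected = rng.random() < true_prevalence
        p_symptoms = p_symptoms_if_infected if infected else p_symptoms_if_healthy
        people.append((infected, rng.random() < p_symptoms))

    # Targeted testing: only people with symptoms are tested.
    symptomatic = [person for person in people if person[1]]
    targeted = rng.sample(symptomatic, min(tests, len(symptomatic)))
    # Survey: a simple random sample of the whole population.
    surveyed = rng.sample(people, tests)

    def positive_rate(sample):
        return sum(infected for infected, _ in sample) / len(sample)

    return positive_rate(targeted), positive_rate(surveyed)


def adjusted_fatality_rate(deaths, reported_cases, detection_fraction):
    """Fatality rate per actual infection, if only a fraction of infections is reported."""
    estimated_infections = reported_cases / detection_fraction
    return deaths / estimated_infections


targeted_rate, survey_rate = simulate_testing()
print(f"Positive rate, targeted testing: {targeted_rate:.1%}")
print(f"Positive rate, random survey:    {survey_rate:.1%}")

# Deaths at 10% of known cases (from the text) with a hypothetical 1-in-10 detection rate.
print(f"Adjusted fatality rate: {adjusted_fatality_rate(10, 100, 0.1):.1%}")
```

The targeted positive rate sits far above the assumed true prevalence, while the random survey stays close to it.  The adjusted fatality rate depends entirely on the assumed detection fraction, which is exactly the quantity that is hardest to know.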

At present, statisticians are trying to make do with the information they have.  Key parameters needed for the models are the values of R0 for different social distancing regimes.  At this stage, this often means making assumptions about the proportion of cases that remain undetected.  (We are fortunate that Australia appears to have one of the higher rates of testing.) 

John Henstridge
March 2020