# Sample Sizes for Experiments

In general, there are two types of samples:

• those designed to estimate a population quantity; and

• those designed to compare samples to estimate a difference or effect size.

Our previous Analytical Idea entitled “What Size Sample Do I Need?” focused on how to size samples to answer questions about a population quantity, such as an average or a proportion.  We typically use surveys to collect such sample information.

This article discusses sample sizes when our questions centre on comparisons.  We usually answer these sorts of questions using data collected in experiments, studies or trials, which typically involve two or more sample groups.  A common approach is to use a formal hypothesis testing framework, which is a convenient statistical model for how the result of an experiment should be interpreted.  Under this approach, we assume there is a basic hypothesis (the null hypothesis, or H0) and accept this hypothesis unless the experiment shows enough evidence to the contrary.  The null hypothesis is therefore set to the opposite of what we are trying to prove true.  For example, if we expect reading program A is better than reading program B, a null hypothesis might be “H0: Reading scores from program A are no better than reading scores from program B”.

We design and conduct an experiment to test this hypothesis.  If our experiment has enough power, we hope it will gather enough evidence to let us reject the null hypothesis and hence conclude that reading program A is superior.
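To make this concrete, here is a minimal sketch in Python of how such a comparison might be tested.  The reading scores are entirely hypothetical, and the large-sample z approximation is used only for simplicity; with real samples this small, a t-test would normally be preferred.

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

# Hypothetical reading scores from two groups of students.
scores_a = [74, 81, 69, 77, 85, 79, 72, 80, 76, 83]  # program A
scores_b = [70, 75, 68, 72, 78, 71, 69, 74, 73, 76]  # program B

# One-sided test of H0: program A is no better than program B,
# using a large-sample z approximation for simplicity.
diff = mean(scores_a) - mean(scores_b)
se = sqrt(stdev(scores_a) ** 2 / len(scores_a)
          + stdev(scores_b) ** 2 / len(scores_b))
z = diff / se
p_value = 1 - NormalDist().cdf(z)  # small p => evidence against H0
```

A small p-value here would let us reject H0 and conclude program A is superior; a large one would leave H0 standing.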

All models are wrong, but some are useful
Hypothesis testing is not the only framework for sizing experiment samples. Other approaches, such as Bayesian theory and decision theory, are equally valid in some contexts, but are not further discussed here.

Sample size calculations may appear simple on the surface, but actually require a depth of understanding and expertise not immediately obvious to the layperson. While anyone can Google a formula and plug in numbers to get an estimated sample size, statisticians understand the theory underlying these calculations and the assumptions this theory is based on.  We can choose the right calculation for an application, assess the underlying assumptions for reasonableness and help guide our clients in making informed choices about the inter-connected factors and quantities that feed into the calculation.

This is especially important when designing experiments on people or animals.  Experiments can have real impacts on people’s lives, affecting their time, money, health and well-being.  Ethical issues around choosing a sample size that is too small or too large then become extra considerations.  Using qualified and skilled statisticians to conduct your sample size calculation lends credence to your design and can help your ethics application succeed.  Indeed, sometimes the statistician’s sign-off is a requirement of ethics approval.

## What Information Do I Need For My Sample Size Calculation?

A surprising amount of information feeds into a sample size calculation.  This information can be loosely grouped into two components:

• information needed to choose the correct calculation; and
• information needed to determine calculation inputs.

### Step 1: Find the Information

Consider:

• the research question;
• similar previous or pilot studies;
• available resources (such as time, money, population size, etc.);
• subject-matter expertise; and
• statisticians!

Ideally, the information gathering process is a collaborative and iterative effort between the statistician and the experiment designers.  There is a trade-off between the ideal design and the resources needed – larger and more powerful samples need more resources.  When resources are limited – and when are they not? – a statistician can offer design alternatives and help pinpoint the best places for compromise.

A cautionary tale

Often, through clever design, we can reduce sample sizes.  Other times, we need a larger sample than expected to have the power to achieve anything meaningful.

### Step 2: Select the Right Calculation

There are many different experimental designs, including parallel, cross-over, matched case-control, equivalence, and group sequential.  Different designs need different sample size calculations.  Some designs are more efficient than others, resulting in smaller samples. For example, a cross-over trial is very efficient in terms of sample size, but is inefficient in terms of the total length of the trial, since all the participants try all the treatments.  We can highlight the pros and cons of different designs, helping you to choose the right design for your application, aims and available resources.

Typical outcome types include continuous and categorical measures, proportions, odds and risk ratios, relative risk and survival times.  Sample size calculations change with the outcome type, and some outcome types are more efficient than others.  For example, numerical measurements often contain more information than categorical measurements and result in smaller sample sizes.  The expected value of your outcome can also be an issue.  If an event is rare, its prevalence will be close to 0 and a larger sample will be needed than for a more common event with a prevalence closer to 0.5.  We can help you choose the best outcome type for your study.

Usually, we design an experiment because we’re expecting (or perhaps hoping is a better word) to see a benefit of some kind – a longer survival time, better test scores, or our preferred candidate ranking higher in the polls.  The temptation is to design our experiment with a one-sided hypothesis test; the sample size will be smaller and we can prove things have improved.  But in reality, we’re usually uncertain about the size and sometimes even the direction of the effect we hope to see.  Unexpected and inconclusive results still have value and can be worthy of reporting.  If you want to report the results, no matter which direction they go in, we need to size the sample and design the hypothesis for a two-sided test.

The two-sided penalty

Estimated sample sizes are around 20-30% larger for two-sided than for one-sided tests, assuming typical power levels and a 5% significance level. This difference is smaller for smaller significance levels and larger powers.
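As a sketch of where such figures come from, the standard normal-approximation formula for comparing two group means can be coded directly.  The effect size (delta = 5) and standard deviation (sigma = 10) below are purely illustrative, not recommendations.

```python
from statistics import NormalDist
from math import ceil

def n_per_group(delta, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Normal-approximation sample size per group for comparing two means.
    delta: smallest difference worth detecting; sigma: outcome std dev."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    return ceil(2 * ((z_alpha + z(power)) * sigma / delta) ** 2)

n_two = n_per_group(delta=5, sigma=10, two_sided=True)   # 63 per group
n_one = n_per_group(delta=5, sigma=10, two_sided=False)  # 50 per group
```

Here the two-sided design needs 63 per group against 50 for the one-sided design – about 27% more, in line with the range quoted above.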

### Step 3: Define Your Inputs

We usually design experiments because we expect to see a difference or improvement of a certain size (or we hope to see one no bigger than a certain size in the case of Equivalence studies).  The expected difference is the effect size that feeds into our sample size calculation.  The smaller the difference, the larger the sample needed to detect it.  However, it’s important to consider not only what effect there might be, but also whether this effect is of practical importance.  It would be wasteful to design and conduct an experiment powerful enough to detect a difference of a certain size if policy wouldn’t change for anything less than a difference twice that size.
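Because the required sample size grows with the square of sigma/delta under the usual normal approximation, halving the detectable difference roughly quadruples the sample.  A quick illustration (the delta and sigma values are hypothetical):

```python
from statistics import NormalDist
from math import ceil

z = NormalDist().inv_cdf

def n_per_group(delta, sigma=10, alpha=0.05, power=0.80):
    # Two-sided normal approximation for comparing two means.
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

n_5, n_2p5 = n_per_group(delta=5), n_per_group(delta=2.5)  # 63 vs 252 per group
```

This is why it pays to ask whether a small effect is worth detecting at all before committing the resources to detect it.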

Variance is an input that can be difficult to understand and is often the most difficult to estimate.  It measures how spread out observations are: the more similar the observations are to each other, the smaller the variance and the smaller the sample size needed.  Variance estimates usually come from one of three sources.  In some cases, the variance is known to be a function of the mean or prevalence because of the outcome variable’s distribution.  Therefore, once we have an estimate for the mean or prevalence, we can determine the variance.  For all other cases, we would ideally draw on previous similar or pilot studies to estimate the variance.  However, when such studies aren’t available, statisticians need to work with subject-matter experts to formulate reasonable and appropriate estimates.

Reducing the variance reduces the sample size

When the variance isn’t constrained by the outcome distribution, there are practical ways of reducing it.  These include measuring observations more accurately, using log or other transformations, stratifying the sample so observations within groups are more similar, restricting your population to exclude more variable segments, or choosing a different design such as a cross-over or matched-pairs experiment.  Caution should be used when restricting your population, since this may impact the generalisability of your findings.

The significance level is chosen arbitrarily and the smaller the significance level, the larger the sample size.  A common choice is 5% (a 1 in 20 chance of being wrong), but this is not universally a good choice.  There will be times when a more or less stringent significance level is needed.  Be guided by the consequences of getting your hypothesis test wrong.  For example, incorrectly concluding reading program A gives better scores than reading program B probably won’t have dire consequences; however, wrongly concluding cancer patients are better off using treatment A than treatment B may well have.

Interim analyses

The significance level is also affected by the number of analyses you’re planning to do.  Sometimes, for ethical, administrative or economic reasons, clinical trials are designed to have interim analyses as the data accumulates.  This allows an experiment to stop early to avoid exposing further people to an inferior treatment if we already have enough evidence of a positive or negative outcome.  Repeated tests on the same data inflate the type I error rate.  We need to reduce the significance level in order to maintain the desired overall (or family-wise) error rate and this increases the sample size.
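As a rough illustration, a simple Bonferroni split of the significance level shows the effect on sample size.  Real trials usually use group-sequential boundaries (such as Pocock or O’Brien-Fleming) rather than a plain Bonferroni split, and all the numbers below are hypothetical.

```python
from statistics import NormalDist
from math import ceil

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    # Two-sided normal approximation for comparing two means.
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

analyses = 3                  # e.g. two interim looks plus the final analysis
alpha_each = 0.05 / analyses  # Bonferroni split of the family-wise 5%
n_single = n_per_group(delta=5, sigma=10, alpha=0.05)          # 63 per group
n_adjusted = n_per_group(delta=5, sigma=10, alpha=alpha_each)  # 84 per group
```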

Power is also an arbitrary choice, but should always be greater than 50%.  In fact, experiments are typically designed to achieve a power of around 80-90%, meaning that if there is a difference, there is an 80-90% chance the experiment will detect it.  Sample size increases with increasing power.  We like to provide you with a table of sample sizes for a range of powers.  This gives you an appreciation of the trade-off between power and sample size and allows you to choose the best power achievable with your available resources.
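Such a table is easy to sketch using a standard normal-approximation formula for comparing two means (the delta and sigma inputs below are hypothetical):

```python
from statistics import NormalDist
from math import ceil

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    # Two-sided normal approximation for comparing two means.
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

table = {p: n_per_group(delta=5, sigma=10, power=p)
         for p in (0.70, 0.80, 0.90, 0.95)}
# {0.7: 50, 0.8: 63, 0.9: 85, 0.95: 104} -- per-group sizes
```

Note how the cost of each extra point of power grows as the power approaches 100%.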

In experiments, as in life, things rarely go according to plan.  When people drop out of a study, switch between groups or fail to comply with the experiment protocol, the power of the study is reduced.  In order to compensate for this, the sample size needs to be inflated.  The right time to consider likely rates of losses and non-compliance is at the design stage.  How non-compliers will be handled in the analysis, through an intention-to-treat or a treatment-received approach, should also be factored in.  In this way, the power of the study can be maintained in the face of the unexpected.
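The usual adjustment is straightforward: divide the calculated sample size by the expected retention rate.  A minimal sketch (the 15% dropout rate and the starting size of 63 are made-up figures):

```python
from math import ceil

def inflate_for_dropout(n_required, dropout_rate):
    """Inflate a calculated sample size so that, after the expected
    fraction of dropouts, roughly n_required participants remain."""
    return ceil(n_required / (1 - dropout_rate))

inflate_for_dropout(63, 0.15)  # recruit 75 per group to retain about 63
```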

Absence of evidence isn’t evidence of absence

It’s important to keep in mind that when an experiment fails to show a significant difference, there are two possible reasons for this:

1. There actually is no difference (of the size you expected); or

2. There is a difference but the experiment didn’t have the power to detect it (a Type II error).

The first is not really a failure.  Science is advanced as much by showing something didn’t work as expected, as by showing it did.  The second is minimised through good design practices and the well-considered allocation of resources.

### Step 4: And Once More with Feeling

In many experiments or trials, we are also interested in subgroups of the population and/or secondary outcomes.  For example, we might like to know if a reading program has a different impact in rural versus urban schools (a subgroup analysis) or we might like to assess if spelling test scores are also affected (a secondary analysis).  Each of these subgroups and secondary outcomes needs its own sample size calculation.  The experiment sample size needs to be at least as big as the largest of those calculated.  Subgroup analyses can quickly increase the required sample size, especially if we maintain the same power and significance level inputs as the primary outcome.  We can offset this, at least in part, by accepting a reduced power and/or a less stringent significance level for subgroup and secondary analyses.
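In practice this step is simple once the individual calculations are done: the design takes the maximum.  A sketch with made-up per-group sizes:

```python
# Hypothetical per-group sample sizes from separate calculations for the
# primary outcome, a subgroup analysis and a secondary outcome.
calculated = {
    "reading scores (primary)": 63,
    "rural vs urban subgroup": 142,
    "spelling scores (secondary)": 88,
}
n_required = max(calculated.values())  # the largest, 142, drives the design
```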

Subgroup variance

In some cases, people within subgroups may be more similar to each other than to people in other subgroups.  This leads to reduced variance and hence smaller samples.  Consider whether a different variance estimate should be used in sample size calculations for each subgroup, primary and secondary outcome.

## What Does it All Mean?

It should be pretty clear by now that calculating sample sizes is a complex task which combines a mixture of qualitative and quantitative information.  Sizing a sample for any experiment is a trade-off – we have limited resources (such as time, money, people with the condition under study, etc.) and we want to maximise the information gleaned from each person or unit sampled.  We want to be sure that if the effect we hope to see actually exists, our experiment has the power to detect it and we don’t want to expose more people than absolutely necessary to the possible costs associated with participating in the experiment.

Data Analysis Australia’s statisticians can help ensure your sample is sized accurately, efficiently, and defensibly.  We can provide you with the knowledge necessary to make an informed decision on the feasibility of your experiment, allowing you to address important questions like “Do I have the necessary resources?”, “Does the sample size I can afford have enough power to detect a practically important difference?” and “Will I need to find research collaborators in order to achieve a large enough sample within a reasonable time frame?”

### Some Handy Definitions

If you’re not sure what some of the terms we used mean, the following list describes them.

Population

This is the entire group about which we want to know something.  It’s important to realise that populations are not restricted to people.  For example, populations can also refer to plants, animals, areas of land, hospitals, schools or whatever else is interesting to the experimenter.  The population of interest needs to be clearly defined before the experiment begins.  This can sometimes be difficult.  Examples of populations are:

• All people in Australia with a certain disease;
• All plants of a certain type in the area of land surrounding a proposed mine site;
• All primary school students in Western Australia;
• All woylies in the Upper Warren region of Western Australia's South-West; or
• All babies born in a Sydney Metropolitan Area hospital.

Sample

The sample is the subset of units or people in the population who actually take part in the experiment.  This is often, but not always, a random selection.  There are many different sample designs, ranging from the simple random sample, where each person in the population has the same chance of being selected, through to more structured designs such as stratified, multistage and cluster samples.  Each of these latter designs still involves random selection, but they need slightly different handling at both the design and analysis stages.

Hypothesis

Experiments are ideally designed to answer a specific question.  This question is usually formulated as a hypothesis, which we test using the data collected in the experiment.  For example, the question “Do people with a certain disease survive for longer using a new treatment than using the standard treatment?” could be formulated into the null hypothesis of “H0: The new treatment is no different than the standard treatment”, which would be tested against the alternative hypothesis of “H1: The new treatment is different to the standard treatment”.  This is an example of a two-sided hypothesis test, since we are looking for a difference, but are not restricting the difference to being either positive or negative.

Standard Error

The standard error (the square root of the variance of an estimate) is a measure of the precision of something estimated from a sample, for example a mean or a proportion.  Since, by definition, any sample doesn’t have information from the whole population, anything we estimate from the sample will be an imperfect estimate – it will have sampling error.  The term “error” is something of a misnomer here.  Rather than implying a mistake, it refers to the difference between a sample and the population.  As the sample size increases, the sample becomes more like the population, and so the associated standard error decreases.  The standard error is also smaller if measurements are less variable.
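For a sample mean, the standard error is sigma divided by the square root of n, so quadrupling the sample size halves the standard error.  A quick sketch with a hypothetical sigma:

```python
from math import sqrt

sigma = 10.0  # hypothetical standard deviation of individual measurements
standard_errors = {n: sigma / sqrt(n) for n in (25, 100, 400)}
# {25: 2.0, 100: 1.0, 400: 0.5} -- each quadrupling of n halves the SE
```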

Significance Level

The significance level (α) is the probability or chance that our experiment shows a significant difference when there really isn’t one.  It’s the first of two ways we can get our hypothesis test wrong; hence it’s also known as the Type I error (α).

Power

The power of an experiment (1 - β) is the likelihood the experiment can detect a difference when that difference really exists.  For example, suppose we expect a new cancer treatment to improve the six-month survival rate for patients when compared to the standard treatment, and we want to test this belief by conducting an experiment with a 90% power.  Our experiment would have around a 90% chance of detecting a significant improvement in the six-month survival rate.  However, it would also have around a 10% chance of failing to detect a significant improvement, even if that improvement really exists.  This failure to detect a difference that is actually there is the second way we can get our hypothesis test wrong; hence it’s known as a Type II error (β). When designing an experiment we want to try and minimise the chance of this type of error, which means maximising the power.