What Size Sample Do I Need?

At Data Analysis Australia, one of the most common queries we receive from current and prospective clients is "what size sample do I need?". A common misconception is that 400 is the magic number. However, it is not always this easy - one size does not fit all applications. In fact 400 is rarely the right answer. Not surprisingly the answer depends upon the details of the question and understanding the question is the best starting point.

Survey results are often used to find an answer to a question or to help make an informed decision. Sometimes this is expressed in terms of estimating a number such as the proportion of shoppers who might buy a product, the proportion of customers who are satisfied with a service, the average turnover of companies or the gold grade in a deposit. Major decisions can be based on such survey estimates and clearly reliable decisions need reliable estimates.

However, any results that come from a survey will be subject to some degree of error. This error can be separated into two types:

Sampling error. This type of error is caused by surveying only some of the population rather than surveying all of the population. If you repeat the survey, but randomly choose a different group of units to include in the sample, you would expect to receive a slightly different answer simply by virtue of surveying these different units. Both are equally valid answers but both have a degree of uncertainty.

There is a well developed statistical theory that helps us understand this type of error. The theory is used when setting the sample size and choosing the sampling design. The error can often be readily quantified (often before the survey) which helps when choosing the most appropriate sample size and sampling design.
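To illustrate sampling error, the following Python sketch repeats a hypothetical yes/no survey many times and looks at how much the estimates vary. The true proportion (30%), the sample size (200) and the number of repeats are illustrative assumptions, not figures from any real survey. The spread of the estimates is compared with the textbook standard error for a simple random sample, sqrt(p(1 - p)/n).

```python
import math
import random
import statistics

def simulate_survey_estimates(true_proportion, sample_size, repeats, seed=0):
    """Repeat a simple random survey many times and return each estimate."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(repeats):
        # Each respondent answers "yes" with probability true_proportion,
        # which mimics sampling from a very large population.
        yes_count = sum(rng.random() < true_proportion for _ in range(sample_size))
        estimates.append(yes_count / sample_size)
    return estimates

# Illustrative assumptions: 30% true proportion, 200 respondents per survey.
estimates = simulate_survey_estimates(true_proportion=0.30, sample_size=200, repeats=2000)
observed_spread = statistics.stdev(estimates)
theoretical_se = math.sqrt(0.30 * 0.70 / 200)

print(f"lowest estimate   : {min(estimates):.3f}")
print(f"highest estimate  : {max(estimates):.3f}")
print(f"spread (std dev)  : {observed_spread:.4f}")
print(f"theoretical s.e.  : {theoretical_se:.4f}")
```

Every simulated survey is equally valid, yet the estimates differ from one another. The spread across repeats should sit close to the theoretical standard error, which is why this type of error can be quantified before a survey is run.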

Non-sampling error. This refers to all other sources of error. Examples include leading questions, communication error, ambiguous questions, data entry errors, poorly defined populations, non-response and deliberately false answers.

Usually this type of error can't be quantified, but steps can be taken to minimise its effects. Having a good and clear questionnaire is the first step. It is good practice to have questionnaires tested before the survey begins, so that these sources of error can be identified and fixed.

It is important to consider both types of error when designing a survey. Any benefit achieved from reducing the size of one type of error can very easily be wasted if the other type of error is larger.

Terminology

Before proceeding too far, we need to define a few terms.

Population.

This is the entire group about which answers are to be obtained. It is important to realise that populations are not restricted to people. For example, populations can also refer to businesses, clubs, households, mine deposits or whatever else is of interest to the survey. The population of interest needs to be clearly defined before the survey begins and sometimes it is quite difficult to define accurately. Examples of populations are:

All people in the Perth Metropolitan Area;
All small businesses in Australia;
All employees of an organisation;
All doctors' surgeries in Western Australia's South West;
All tennis clubs in the Southern Suburbs; or
A gold deposit that is about to be mined.

Sample.

The subset of units in the population that are actually surveyed. This is often but not always a random selection. To demonstrate, the figure to the right shows a population of 9 units (people). The sample consists of 3 units (people) as shown by the red circles.

Standard error.

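A measure of the sampling error, that is, the error that results from surveying a subset of the population rather than the entire population. It describes how much an estimate would typically vary if the survey were repeated with different random samples.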

Questions to be Asked

A number of questions need to be asked (and answered) before a suitable sample size can be determined. These include:

What level of accuracy is required? In general, the higher the level of accuracy required, the larger the sample size should be. However, smart sample designs can often be used to reduce the sample size without reducing the accuracy. Also, the change in sample size required to achieve a given change in accuracy is usually not proportional: under simple random sampling, for example, halving the standard error requires roughly four times the sample size.
Are estimates for subgroups also required? In many surveys, specific interest lies in subgroups as well as in the overall population. This is often because one survey may cover many issues, or comparisons between different subpopulations (subgroups) may be of interest. For example, in a survey of people some answers may be required broken down by gender and some questions such as number of pregnancies may only be relevant to one gender. The overall sample size needs to be large enough to ensure that an adequate level of accuracy for these subgroups can also be achieved. Usually, less accurate results are required for subgroups.
What resources are available? The sample size and design need to fit within the available resources. The more time and money that are available to conduct a survey, the more accurate the results you would expect to achieve. In some cases, unrealistic expectations are placed on the survey, and these ideals may need to be relaxed. In other cases, it may become clear that the available budget is not sufficient, and the question may arise of obtaining extra resources or dropping the survey.
What method of data collection? Self-completion questionnaires are often cheaper to implement than personal interviews (either face to face or telephone), but their response rates tend to be lower and results cannot be obtained as quickly. Personal interviews also usually result in higher quality survey data with fewer missing or inconsistent responses, as the interviewer can provide clarification when needed. Self-completion questionnaires therefore need to have a very good design and production in order to be answered to a high standard. This is an upfront cost that may be acceptable for a large survey but not for a small one. A trade-off often needs to be made between the data collection method and sample size that can be afforded.
Should everyone be surveyed? For some surveys, it is more appropriate to survey the entire population, rather than surveying only a sample. In some surveys, to achieve the desired level of accuracy the sample size required is so close to the entire population that it makes more sense to simply survey everyone. Surveying the entire population essentially removes the sampling error component and is often useful for staff or customer satisfaction surveys and has the added benefit of everyone feeling that their views count.
How does the population size affect the sample size? Contrary to popular opinion, the population size plays a relatively small role in determining the required sample size, particularly if only a small percentage of the population is being surveyed. For example, to conduct an opinion poll in Western Australia, approximately the same size sample needs to be selected as for a poll in New South Wales, despite New South Wales having a much larger population.
How variable are the responses expected or known to be? The level of variability between responses has a large impact on the sample size required. The less variable the responses are, the smaller the sample size required to achieve the same level of accuracy. For example, there is a large degree of variability in household income but relatively little variation in the number of jobs that an employed person has. Hence an income survey would need a larger sample size to achieve the same accuracy as a survey on the number of jobs.
Is the burden being placed on respondents too high? Respondent burden is a big issue in today's society, with many surveys being conducted for many different purposes. If people or businesses get surveyed too frequently, they are less likely to take the survey seriously. This means that the sample size should not be larger than necessary to obtain the accuracy needed.
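Two of the questions above, the role of population size and the role of variability, can be sketched numerically. The Python below is a minimal illustration under simple random sampling, not a full design calculation. The first function applies the standard finite population correction to a starting sample size; the second uses the textbook formula n = (z × standard deviation / margin)² for estimating an average. The starting size of 400, the population sizes and the standard deviations are all hypothetical values chosen for illustration.

```python
import math

Z_95 = 1.96  # normal quantile for 95% confidence

def adjust_for_population(initial_size, population):
    """Apply the finite population correction to a starting sample size."""
    return math.ceil(initial_size / (1 + (initial_size - 1) / population))

def sample_size_for_mean(std_dev, margin, z=Z_95):
    """Sample size needed to estimate an average to within +/- margin."""
    return math.ceil((z * std_dev / margin) ** 2)

# Population size: hypothetical populations, starting from a sample of 400
# that would suit a very large population.
for population in (500, 1_000, 2_000_000, 8_000_000):
    print(f"population {population:>9,}: sample of {adjust_for_population(400, population)}")

# Variability: doubling the (hypothetical) standard deviation roughly
# quadruples the sample needed for the same margin of error.
print(sample_size_for_mean(std_dev=10, margin=1))
print(sample_size_for_mean(std_dev=20, margin=1))
```

In this sketch the required sample is essentially unchanged between a population of two million and one of eight million, and only falls noticeably when the sample would be a large share of the population.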

Sample Design

So, determining the "correct sample size" is not a simple task. In fact, a large part of determining the sample size is not simply "how many should we sample", but how cleverly the sample is chosen. A "smarter" sample design can give more accurate estimates with a smaller sample size.

In general, the more complex the survey being conducted, the more effective a smart design can be. It is often more cost effective to spend additional resources on designing the sampling methodology than on simply sampling more units. Techniques such as systematic sampling, stratified sampling, cluster sampling, multi-stage or multi-phase sampling can all be used to improve the sample. Some of these are described below, and in practice a sampling design is likely to have elements of many types of sampling techniques.

Stratified Sampling

Stratified sampling is one of the most common types of survey design. This involves separating the population into distinct groups and then choosing a sample size for each group (for example, males/females, states of Australia or divisions of a company).

There are two main benefits of a stratified sample:

1. Stratified sampling ensures that an adequate number of respondents are gained for each subgroup of interest. This also helps to ensure that a representative sample is achieved.

2. For the same size sample, a superior estimate at the overall level and also at the subgroup level can be obtained by allocating a higher proportion of the sample to the groups with higher variability. To maximise the benefit achieved from using a stratified sample, the distinct groups should be chosen so that units within the same group are as similar to each other as possible.

A good example is for surveys relating to business activity. In many cases, a few very large companies have a big effect on the overall value. However, it is also important to get a good estimate of the combined value of the smaller companies. In such cases, groups (strata) can be formed according to the company size with only a percentage of the smaller companies being surveyed, and all of the larger companies being surveyed. All companies with 0 to 9 employees might be grouped, and 10% of them surveyed, 15% of companies with 10 to 20 employees could be surveyed and all companies with more than 20 employees could be surveyed. Although this means that the responses need to be weighted using statistical techniques to provide meaningful estimates at the overall level, this method provides superior estimates. This sampling design also helps to reduce burden for small companies that often have the greatest difficulty in responding.
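The weighting step in the business survey example can be sketched in Python as follows. The sampling fractions (10%, 15% and 100%) come from the example above; the numbers of companies in each group and the average turnover figures are invented purely for illustration. Each stratum's sample average is scaled up by the number of companies in that stratum, which is equivalent to giving each response a weight of one over its sampling fraction.

```python
# Hypothetical strata following the company-size example above.
# Population counts and sample means ($ million turnover) are invented.
strata = [
    {"name": "0 to 9 employees", "population": 5000, "fraction": 0.10, "sample_mean": 0.4},
    {"name": "10 to 20 employees", "population": 800, "fraction": 0.15, "sample_mean": 2.5},
    {"name": "more than 20 employees", "population": 100, "fraction": 1.00, "sample_mean": 60.0},
]

def summarise(strata):
    """Return sample sizes, the weighted total, and weighted/unweighted means."""
    total_population = sum(s["population"] for s in strata)
    sample_sizes = [round(s["population"] * s["fraction"]) for s in strata]
    # Weighted estimate: scale each stratum mean by the stratum population.
    estimated_total = sum(s["population"] * s["sample_mean"] for s in strata)
    weighted_mean = estimated_total / total_population
    # Naive estimate: treat all responses alike, ignoring the design.
    unweighted_mean = sum(
        n * s["sample_mean"] for n, s in zip(sample_sizes, strata)
    ) / sum(sample_sizes)
    return sample_sizes, estimated_total, weighted_mean, unweighted_mean

sample_sizes, estimated_total, weighted_mean, unweighted_mean = summarise(strata)
print("sample sizes by stratum:", sample_sizes)
print(f"estimated total turnover: {estimated_total:,.0f}")
print(f"weighted mean per company: {weighted_mean:.2f}")
print(f"unweighted mean per company: {unweighted_mean:.2f}")
```

Because every large company is surveyed but only a fraction of the small ones, the unweighted average is pulled strongly towards the large companies. The weighted figure corrects for this, which is why weighting is needed for meaningful overall estimates.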

Cluster Sampling

Another variant is a multistage or cluster sample, which can be used when there is a natural physical grouping in the population that can be exploited to reduce the effort, and hence the cost, of surveying the respondents. In its simplest form, once the groups or clusters have been identified, a number of those groups are chosen at random and then all individuals within the chosen groups are selected. Sometimes only a subset of individuals within the chosen groups are sampled, rather than all of them. This is often called cluster sampling, and may well go beyond just two stages as described here.

An example is a face to face household survey where firstly suburbs are randomly selected and then houses within the chosen suburbs. A saving comes from an interviewer being able to visit all the selected households in a suburb in a few trips, minimising travel time. In a strictly statistical sense, cluster sampling is usually less efficient than a Simple Random Sample, in that for a given total sample size less information is obtained about the population. However, the reduced travel costs involved in sampling from fewer distinct locations may permit an increase in sample size that more than compensates for this. For this reason, cluster sampling is typically preferred when there is a need for personal contact with survey participants or to ease the administrative burden associated with sampling distinct groups.
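The loss of efficiency from clustering is often summarised by a "design effect". The Python sketch below uses a common textbook approximation that is not taken from this article: design effect = 1 + (m - 1) × ICC, where m is the number of units sampled per cluster and ICC (the intraclass correlation) measures how similar units within the same cluster are. The cluster size of 20 and ICC of 0.05 are hypothetical values.

```python
def design_effect(cluster_size, intraclass_correlation):
    """Approximate variance inflation from sampling in equal-sized clusters."""
    return 1 + (cluster_size - 1) * intraclass_correlation

def effective_sample_size(total_sample, cluster_size, intraclass_correlation):
    """Size of a simple random sample giving roughly the same accuracy."""
    return total_sample / design_effect(cluster_size, intraclass_correlation)

# Hypothetical household survey: 1000 interviews, 20 households per suburb,
# with an assumed intraclass correlation of 0.05.
deff = design_effect(cluster_size=20, intraclass_correlation=0.05)
n_eff = effective_sample_size(1000, cluster_size=20, intraclass_correlation=0.05)
print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f}")
```

Under these assumed values the clustered sample carries noticeably less information than a simple random sample of the same size, so the travel savings would need to fund enough extra interviews to make up the difference.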

Another example is a survey of teachers, whereby firstly schools may be randomly chosen and then teachers within those schools. A saving comes from being able to deliver a batch of questionnaires to a single school and being able to identify a point of contact within the school to assist in the administration and promotion of the survey to increase response rates, rather than needing to individually contact randomly selected teachers at different schools. However, perhaps more importantly in this context, it also has the advantage of ensuring sufficient sample in each selected school to analyse similarities and differences between teachers' responses within the same school and in different schools. This can enable, for example, the effects of the department, the effects of the school and the effects of the individual teachers to be separated out and analysed. As such, cluster sampling should not only be considered as a "necessary evil" in terms of being a cost saving measure associated with personal contact, but also when a statistically efficient way of analysing responses both within and between groups is required.

Acceptance Sampling

The above discussion has focussed on sampling to obtain valid estimates for the entire population based on results obtained from surveying a subset of the population only. But consider the special case where the primary question is not to determine the overall population value, but is simply a question of quality, to determine whether or not a batch or group of items is of sufficient quality to accept the whole batch or reject the whole batch. The question then becomes "how many items do I need to sample, and how many can fail, before I deem a batch to be of insufficient quality to return it?"

With a history in military applications, whereby shells had to be highly consistent in trench warfare, and further development through manufacturing, a suite of International Acceptance Testing Standards has been developed to cover just such applications. These Standards cover a range of quality levels, termed "Acceptable Quality Levels" (or AQLs), and cover a range of batch sizes. At the simplest level, acceptance sampling gives the user the number of items they need to test for a given batch size and AQL, and also how many are allowed to fail the test before the whole batch is rejected. For example, with an AQL of 2.5% and a batch size of 1000, 125 items need to be tested and if 7 or fewer fail, the batch passes, otherwise the batch fails. While an AQL of 1%, say, requires higher quality (fewer failures) than an AQL of 2.5%, it should not be confused with allowing a 1% failure rate, compared to a 2.5% failure rate.
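To see why an AQL is not the same thing as an allowed failure rate, the plan quoted above (test 125 items, accept if 7 or fewer fail) can be examined with a short Python sketch. It assumes each tested item fails independently with the same probability, so the number of failures is treated as binomial. That is a simplifying assumption made here for illustration; it ignores the finite batch size and does not reproduce the Standard's own tables or switching rules.

```python
from math import comb

def acceptance_probability(defect_rate, sample_size=125, accept_number=7):
    """Chance that a batch passes, assuming binomially distributed failures."""
    return sum(
        comb(sample_size, k) * defect_rate**k * (1 - defect_rate) ** (sample_size - k)
        for k in range(accept_number + 1)
    )

# True (unknown) defect rates for a batch; purely illustrative values.
for rate in (0.01, 0.025, 0.05, 0.10):
    print(f"true defect rate {rate:.1%}: P(accept) = {acceptance_probability(rate):.3f}")
```

Under this approximation, a batch whose true defect rate equals the AQL is accepted most of the time, and batches that are somewhat worse than the AQL can still pass reasonably often. The plan is a decision rule with risks for both consumer and supplier rather than a guarantee about the failure rate.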

Acceptance sampling is very different to the other types of sampling discussed in this article. Its purpose is not to estimate a population value with desired accuracy, but to provide an unambiguous set of decision rules as to when a batch should be considered of sufficient quality to accept it. It is frequently used, for example, when determining contractual obligations between consumer and supplier, whereby the appropriate balance of protection can be given to both parties - if the batch passes based on the results of the acceptance testing, the consumer must accept the batch and pay for it, whereas if the batch fails, the supplier must fix the problem or replace it. The Standard is therefore designed to operate within the context of continual improvement. It is as much about providing feedback and incentives to the supplier as it is about acting as a gateway to hold back defective items, with rules that impose higher rates of sampling and checking after consistently poor performance and reduced rates of sampling after consistently good performance.

As such, the Standard needs to be defined in a way that is useable "on the floor" without the need for complex calculations for each case. The way that this is implemented is by deriving a series of tables, covering different AQLs, different batch sizes and various other assumptions, with each table providing the relevant sample size for that batch and how many items can fail before the batch is deemed rejected. While these tables are based on statistical principles, their need for simplicity and suitability to be included in tables means that by necessity, they are not as statistically rigorous and specific as the other means of sampling described elsewhere in this article. While as a general rule larger sample sizes are required for larger batches, batch sizes are "grouped" into tables in the Standard, with the same sample size being used for a large number of different batch sizes. For example, whether there are 10,001 or 35,000 items in a batch (or any number in-between), the same size sample is used for a given AQL, but this sample size is greater than if there were, say, between 91 and 150 items in the batch. This is very much based on practicality and decision rules, rather than pure statistical properties.

Even though the Standard in some sense "gives" the required sample sizes, there are still decisions to be made to select the appropriate tables of the Standard to follow. The results and applicability of using Acceptance Sampling techniques and sample sizes must be carefully considered on a case by case basis rather than applying the Standard without proper thought.

What if the "right sample size" is not affordable?

Sometimes, the ideal sample size and design just don't fit within the available time and/or budget. In these cases, a trade-off decision typically needs to be made between the competing priorities of the survey. Some options include:

Accepting a slightly lower level of accuracy, particularly for subgroups of the population;
Diverting resources from another component of the research - for example by reducing the length of the questionnaire;
Changing the implementation methodology to a cheaper alternative, perhaps replacing interviews with a mail back questionnaire;
Redefining the survey aims. Consideration should be given as to whether the right questions are being asked in the right way, or whether unrealistic goals are being set with regards to the practical importance of the accuracy of the survey results; or
Reassessing the budget. If none of the above trade-offs can be made it could be because the budget is just not reasonable or realistic to meet the requirements of the survey. If the purpose of the survey is so important that these trade-offs can't be made then extra resources need to be supplied.
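When weighing up the first of these options, it can help to see what accuracy an affordable sample would actually deliver. The Python sketch below computes the approximate 95% margin of error for a proportion under simple random sampling, using the worst case of a 50/50 split. The candidate sample sizes are hypothetical.

```python
import math

def margin_of_error(sample_size, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(p * (1 - p) / sample_size)

# Hypothetical affordable sample sizes.
for n in (100, 200, 400, 800, 1600):
    print(f"sample of {n:>4}: about +/- {margin_of_error(n):.1%}")
```

Because the margin of error shrinks with the square root of the sample size, cutting a sample from 400 to 100 doubles the margin of error rather than quadrupling it, so a modest loss of accuracy can sometimes buy a large saving.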

So, when is a sample size of 400 appropriate?

So, going back to the magical number of 400 referenced in the opening paragraph, is it ever the right sample size to use? Yes, it can be appropriate, but only with the right assumptions and accuracy requirements. In fact the usual argument for it is based upon:

Asking a question with a yes/no response;
Expecting approximately a 50/50 split in the responses; and
Needing an accuracy of ±5% with 95% confidence.

When are all these assumptions met? Probably rarely. This means that many surveys are carried out with sample sizes that are either unnecessarily large, leading to unnecessary cost, or too small, giving insufficient accuracy to make proper decisions.
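The arithmetic behind the figure of 400 can be sketched in Python using the standard formula for a proportion under simple random sampling, n = z² × p(1 - p) / margin², with z = 1.96 for 95% confidence. The alternative splits and margins in the loop are illustrative values chosen to show how quickly the answer moves once the assumptions change.

```python
import math

def required_sample_size(margin, p=0.5, z=1.96):
    """Sample size for estimating a proportion to within +/- margin."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# The classic case: yes/no question, 50/50 split, +/-5% at 95% confidence.
print("classic assumptions:", required_sample_size(margin=0.05, p=0.5))

# Illustrative alternatives: different expected splits and accuracy needs.
for p in (0.5, 0.2, 0.1):
    for margin in (0.10, 0.05, 0.03):
        n = required_sample_size(margin, p)
        print(f"expected split {p:.0%}, margin +/-{margin:.0%}: n = {n}")
```

The classic assumptions give a figure a little under 400, which is where the rule of thumb comes from. A tighter margin or a different expected split produces a very different answer, before any allowance is made for subgroups, non-response or the sample design.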

Data Analysis Australia has both the statistical expertise on these issues and the practical experience in conducting high quality surveys. Our consultants understand the process from the first steps of formulating the questions that the survey must answer through to the analysis of results, providing professional judgement on what is best for each situation.

September 2010