Data Analysis Australia

The location of a place can be described in many ways - the address, in reference to what it's near, or just by a name. Statisticians use addresses in many ways, including:

• For sampling in household based surveys;
• For summarising information that has a geographical dimension, ranging from the Census to small surveys; and
• For describing personal and freight movements in transport surveys.

The level of accuracy required in the address can vary depending on its planned use. An address that has adequate information for a letter to be delivered, for example, may not necessarily have enough detail to enable an ambulance to arrive at the correct destination in reasonable time. For some statistical work it may be sufficient that just the suburb is correct.

The most common thing being addressed in the statistical context is a dwelling or a block of land. However, it might also be a geographical feature such as Bluff Knoll or simply a point in space where some measurements were made. Each of these needs a different approach to how the location is described or handled. Here we are mainly concerned with blocks of land, houses and buildings.

Addresses can take a number of different forms. We tend to think of "street addresses" that effectively define the street access to a block of land and thus how to get to it. A mail address can be quite different, being a description of how to get to the mailbox for the occupiers of the land. In many cases that may be at the local Post Office. Finally there is the legal description of a block of land, often with somewhat arcane terms such as divisions and folios, that most people only encounter when buying or selling a house.

For statisticians, an address is often a portal to information that can be linked by location. Data from the Census, for example, can be reported by geographic areas such as suburbs, local government areas and collection districts. The address places a household within the boundaries of these geographic areas.

This is all possible through geocoding - placing a point on a map using a common set of coordinates. A simple way of geocoding is to look up an address in a list containing reference coordinates. This is replacing older methods that were more approximate, such as interpolating from just a few addresses on each street segment.
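Both approaches can be sketched in a few lines of Python. This is only an illustration: the reference list, its coordinates and the function names are invented for the example, and a real reference list would hold many more addresses in a carefully standardised form.

```python
from typing import Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)

# Hypothetical reference list; the coordinates are made up for illustration.
REFERENCE: Dict[str, Coordinates] = {
    "1 SMITH STREET, BIGTOWN": (-31.9000, 115.8000),
    "99 SMITH STREET, BIGTOWN": (-31.9000, 115.8100),
}


def geocode_by_lookup(address: str) -> Optional[Coordinates]:
    """Return the reference coordinates for an exact match, else None."""
    return REFERENCE.get(address.strip().upper())


def geocode_by_interpolation(
    number: int,
    low: Tuple[int, Coordinates],
    high: Tuple[int, Coordinates],
) -> Coordinates:
    """Estimate a position by assuming street numbers are evenly spaced
    between the two known ends of a street segment."""
    (n0, (lat0, lon0)), (n1, (lat1, lon1)) = low, high
    t = (number - n0) / (n1 - n0)  # fraction of the way along the segment
    return (lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0))


# Number 50 is not in the reference list, so it can only be interpolated.
exact = geocode_by_lookup("1 Smith Street, Bigtown")
missing = geocode_by_lookup("50 Smith Street, Bigtown")
estimate = geocode_by_interpolation(
    50,
    (1, REFERENCE["1 SMITH STREET, BIGTOWN"]),
    (99, REFERENCE["99 SMITH STREET, BIGTOWN"]),
)
```

The interpolated point is only as good as the assumption of evenly spaced street numbers, which is why a lookup against a complete reference list is preferred where one exists.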

The address itself is also a useful link to other databases, such as electronic telephone listings or other operational/customer databases that store addresses. Here, the link would be used to add information to a database to extend its analytical or operational capabilities.

Geocoding and address matching are straightforward processes if the address being geocoded/considered and the address in the reference list are the same - in format and structure. The problem is that addresses aren't always collected or stored in a way that makes them easily geocodable:

• Street names can be misspelt;
• Abbreviations are not used consistently, for example Av or Ave for Avenue and Mt or Mount;
• Level of detail given can vary; and
• Redundant formatting, such as commas and full stops, is present.
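Some of these problems can be reduced by normalising each address before it is matched. The following Python sketch handles case, punctuation and a handful of abbreviations. The abbreviation table and the function name are assumptions made for the example; it does not attempt to correct misspellings, and a production table would need care with ambiguous abbreviations (St can mean Street or Saint).

```python
import re

# Illustrative abbreviation table; a real one would be much longer.
ABBREVIATIONS = {
    "AV": "AVENUE",
    "AVE": "AVENUE",
    "RD": "ROAD",
    "MT": "MOUNT",
}


def normalise(address: str) -> str:
    """Upper-case the address, drop commas and full stops, and expand
    known abbreviations word by word."""
    text = re.sub(r"[.,]", " ", address.upper())
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


# Two differently written versions of the same (invented) address
# normalise to the same string.
a = normalise("12 Mt. Pleasant Ave, Bigtown")
b = normalise("12 Mount Pleasant Av Bigtown")
```

Here both `a` and `b` come out as `12 MOUNT PLEASANT AVENUE BIGTOWN`, so an exact comparison between them succeeds.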

There are also data structure issues. Street number, street name and suburb may be stored in a single field or in separate fields in a database, there may be a mix of upper and lower case, etc. Many systems use free text fields to record address details, which place no constraints on what is entered. This can be good when the best description of an address may be "at the south west corner of Smith and Jones Streets" but can be hard to use in a systematic manner.

The two dimensions of addresses that must be considered are accuracy and sufficiency. The following different representations of the same address (of Data Analysis Australia) are all accurate in some way depending on the context of its use:

• Technically complete and correct - all information required to identify the address is present, free of errors and in the right format:

NEDLANDS, WA, 6009
• Incomplete but sufficient and correct - the details are free of errors, but not enough information has been given to uniquely describe the location (unless you already know the other details, for example that all addresses are from Western Australia):

NEDLANDS
• Incomplete and incorrect but still sufficient - if the street and suburb combination is uncommon (unique) and the context known (addresses in Western Australia), the location might be still identifiable despite minor misspellings of the suburb or incorrectly recorded street details:

NEDLAND
• Not sufficient but correct in what it has - the address lacks vital details, such as the suburb, that are needed to find it, although the street name, number and state details are all correct:

WA.
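The role of context in the "incorrect but still sufficient" case can be illustrated with approximate string matching. The sketch below uses Python's standard `difflib` module. The short suburb list, the cutoff value and the function name are assumptions for the example, not a description of any particular matching system.

```python
import difflib
from typing import List, Optional

# A short illustrative list of valid suburbs for the known context.
WA_SUBURBS = ["NEDLANDS", "CLAREMONT", "SUBIACO"]


def resolve_suburb(
    raw: str, valid: List[str] = WA_SUBURBS, cutoff: float = 0.85
) -> Optional[str]:
    """Return the valid suburb that the input most plausibly refers to,
    or None if there is no match or more than one close match."""
    candidate = raw.strip().upper()
    if candidate in valid:
        return candidate
    matches = difflib.get_close_matches(candidate, valid, n=2, cutoff=cutoff)
    # Accept a near match only when it is unambiguous.
    return matches[0] if len(matches) == 1 else None


misspelt = resolve_suburb("NEDLAND")   # one letter missing
state_only = resolve_suburb("WA.")     # nothing to match against
```

The misspelt suburb resolves to `NEDLANDS` because only one valid suburb is close to it, while the state-only input returns `None`. The more alternatives there are in the valid list, the more likely a misspelling becomes ambiguous, which is why the known context matters.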

A set of guidelines on what makes a good address might be:

• No obvious errors such as misspelt street names or incorrect suburbs;
• Sufficient information to clearly identify the location; and
• Some redundant information (such as suburb, state and postcode) so that the address can tolerate errors.
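To see how redundancy lets an address tolerate errors, consider the following Python sketch. It compares a recorded suburb, state and postcode against a reference list and accepts a reference row when at least two of the three fields agree and no other row agrees equally well. The reference rows (SMALLTOWN in particular is invented), the two-of-three rule and the function name are all assumptions for illustration.

```python
from typing import List, Optional, Tuple

Locality = Tuple[str, str, str]  # (suburb, state, postcode)

# Hypothetical reference rows for illustration only.
LOCALITIES: List[Locality] = [
    ("NEDLANDS", "WA", "6009"),
    ("BIGTOWN", "WA", "6888"),
    ("SMALLTOWN", "WA", "6999"),
]


def repair_locality(
    suburb: str, state: str, postcode: str, min_agree: int = 2
) -> Optional[Locality]:
    """Return the single reference row that agrees with the input on at
    least min_agree fields, or None if there is no clear winner."""
    given = (suburb.strip().upper(), state.strip().upper(), postcode.strip())
    scored = [
        (sum(g == r for g, r in zip(given, row)), row) for row in LOCALITIES
    ]
    best = max(score for score, _ in scored)
    winners = [row for score, row in scored if score == best]
    if best >= min_agree and len(winners) == 1:
        return winners[0]
    return None


wrong_postcode = repair_locality("NEDLANDS", "WA", "6008")
wrong_suburb = repair_locality("NEDLAND", "WA", "6009")
suburb_only = repair_locality("NEDLANDS", "", "")
```

The first two records are each repaired to `("NEDLANDS", "WA", "6009")` because the two correct fields outvote the wrong one. The third has no redundancy to draw on, so the sketch declines to guess and returns `None`.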

While there are no absolute rules on what constitutes a good address, standards exist that describe how address information can be stored. These standards not only suggest what fields are necessary, but what structure the data should take.

A basic rule in collecting address information for uses other than address labels is to separate the address components as much as possible. From the database management perspective, an address can be broken up and stored in a number of fields where each field stores a different piece of information. For example, 123B Smith Street E, BIGTOWN, WA, 6888 would be stored as

 Street Number   Street Number Suffix   Street Name   Street Type   Street Suffix   Suburb    State   Postcode
 123             B                      Smith         Street        E               BIGTOWN   WA      6888

Other fields that might be useful are building name, floor number, unit/flat/suite indicator, and second street details (for corner street addresses).
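As a rough illustration of this structure, the Python sketch below stores the eight basic fields in a record and splits a well-formed, comma-separated address string into them. The class, the parsing rules and the small sets of street types and suffixes are assumptions made for the example; real addresses vary far more than this parser allows for.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Address:
    street_number: str
    street_number_suffix: str
    street_name: str
    street_type: str
    street_suffix: str
    suburb: str
    state: str
    postcode: str


# Illustrative sets only; real lists of valid values are much longer.
STREET_TYPES = {"STREET", "ROAD", "AVENUE"}
STREET_SUFFIXES = {"N", "S", "E", "W"}


def parse_address(text: str) -> Optional[Address]:
    """Split 'number street, suburb, state, postcode' into fields.
    Returns None when the text does not fit that simple pattern."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 4:
        return None
    street, suburb, state, postcode = parts
    match = re.fullmatch(r"(\d+)([A-Za-z]?)\s+(.+)", street)
    if match is None:
        return None
    number, number_suffix, rest = match.groups()
    words = rest.split()
    # Peel the optional suffix and type off the end, keeping at least
    # one word for the street name.
    street_suffix = ""
    if len(words) > 1 and words[-1].upper() in STREET_SUFFIXES:
        street_suffix = words.pop()
    street_type = ""
    if len(words) > 1 and words[-1].upper() in STREET_TYPES:
        street_type = words.pop()
    return Address(number, number_suffix, " ".join(words), street_type,
                   street_suffix, suburb, state, postcode)


parsed = parse_address("123B Smith Street E, BIGTOWN, WA, 6888")
```

Collecting the components separately at the point of entry is generally more reliable than parsing free text afterwards, since a parser like this one fails as soon as the input departs from the expected pattern.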

The way in which addresses are collected and stored for many applications can be controlled to maximise the 'correctness' of the details. Examples include offering pull down lists of valid street names and street suffixes, and being consistent about how abbreviations are used. Address validation software is also readily available on the market, and provides other business benefits such as a reduction in the time and cost to collect the data and checks on valid billing addresses.

Until recently there had not been a national, consistent source of all 'valid' addresses in Australia. (By comparison, the Australian Gazetteer has provided a very complete list of place names and geographical features for many years.) Most agencies or organisations maintained their own lists of customer addresses, relying on the customers to keep them up to date and accurate. In December 2003, the Public Sector Mapping Agencies released the Geocoded National Address File (G-NAF). G-NAF is a geographically referenced list of all residential and business addresses in Australia. Though still in its infancy, G-NAF has the potential to fulfil this role at a national level. Since it is based on data from state and territory land ownership databases, it is accurate on the legal aspects of addresses, reasonable on the street addresses and does not attempt mail addresses.

Data Analysis Australia has recently used G-NAF on a project with the national telecommunications consultants Gibson Quai to test the accuracy of address information in the Integrated Public Number Database (IPND) for the Australian Communications Authority. One of the uses of the IPND data is by emergency service organisations in locating people who ring the 000 number in life threatening circumstances, so a highly accurate address can mean a difference of critical minutes in providing assistance to someone in need.

In the Perth and Regions Travel Survey (PARTS) conducted by Data Analysis Australia for the Department for Planning and Infrastructure, personal travel information will be collected from over 9,400 households over four years. Along with details about the types of places people go every day, and why and how they get there, a detailed description of the location of each place is also collected. The clearer and more accurate the address information given in the survey, the better our ability to geocode it accurately. The geocoding faces the additional challenge that some locations are not even on land - for example Matilda Bay is technically in the Swan River. For this project Data Analysis Australia developed a sophisticated Bayesian algorithm for matching the descriptions to actual addresses.

Both of these examples demonstrate the effect that various levels of accuracy in address information can have on research and operations.

Address information can be a valuable component of any database, if it is stored in an appropriate format and structure. Adopting address standards has operational benefits as well as enabling analytical processes to happen through geocoding and address matching.