Data integration policy

Introduction

This Data Integration Policy states Statistics New Zealand policy on integrating personal data.

Data integration involves linking together information from different sources. Integration can be used to produce new statistics, to enhance the value of existing statistics, and to enable a greater level of research. This can benefit New Zealand by increasing knowledge of the country’s people, economy and environment.

Individuals have legitimate privacy expectations. Integration of personal data can be privacy intrusive when it uses information for purposes other than those for which the information was originally provided.

New Zealand legislation recognises individual expectations of privacy and the wider benefits of statistical information. The Statistics Act 1975 allows Statistics New Zealand to require responses to its surveys while also requiring that these responses are kept confidential. The Privacy Act 1993 requires adherence to a set of information privacy principles while also recognising exceptions when information is used for statistical or research purposes.

This policy describes how Statistics New Zealand ensures that any integration of personal data is justified. It details the care taken by Statistics New Zealand when integrating personal data to ensure any impact on privacy is minimised. This policy provides strict conditions, often beyond statutory obligations, on how Statistics New Zealand undertakes data integration.

The Government Statistician’s approval is required for all data integration projects. The Government Statistician will only give the go-ahead to a data integration project if satisfied that the principles set out in this policy will be fully observed.

Applicability

This Data Integration Policy applies to all integration of personal data undertaken by Statistics New Zealand for statistical or related research purposes.

Personal data is information about natural persons (including those deceased). This includes data on individuals and data on households. Personal data does not include data on businesses.

Integration of data means the linking of records from one data source to another. This includes exact matching, probabilistic matching and statistical matching. This does not include linking records from one dataset to data aggregated above the level of individual households or persons.

Statistics New Zealand operates under the Statistics Act 1975. In addition to this policy, all relevant provisions of this Act, all relevant provisions of the legislation that the source data was collected under, and all other relevant policies of Statistics New Zealand, apply to integrated data.

Data Integration Principles

The following principles govern when integration of personal data for statistical or related research purposes can occur:

  1. Statistics New Zealand must only undertake data integration if integration will produce or improve official statistics.
  2. Data integration should be considered when it can reduce costs, increase quality or minimise compliance load.
  3. Data integration benefits must clearly outweigh any privacy concerns about the use of data and risks to the integrity of the official statistics system.
  4. Data integration must not occur when it will materially threaten the integrity of the source data collections.
  5. Data must not be integrated where any undertaking has been given to respondents that would preclude this.
  6. Data integration must be approved at an appropriate level by all the agencies involved.

The following principles govern how integration of personal data for statistical or related research purposes will be done:

  7. Integrated data must only be used for approved statistical or related research purposes.
  8. The size and data variables of the linked dataset must be no larger than necessary to support the approved purposes.
  9. Integrated data must be stored apart from other data.
  10. Names and addresses can only be kept in an integrated dataset while necessary for linking.
  11. Unique identifiers assigned by an external agency must not be retained in an integrated dataset.
  12. Data integration must be conducted openly.

Applying the Data Integration Principles

1. Statistics New Zealand must only undertake data integration if integration will produce or improve official statistics.

(a) A data integration business case will be approved before any data integration work is undertaken. The business case will identify how the integration work will produce or improve official statistics.
(b) Statistics New Zealand can undertake a pilot study to determine whether using data integration to produce or improve official statistics is feasible in a particular case.

2. Data integration should be considered when it can reduce costs, increase quality or minimise compliance load.

(a) Before undertaking a new survey for statistical or related research purposes, consideration must be given to how data integration could be used to reduce costs, increase quality, or minimise compliance load.

3. Data integration benefits must clearly outweigh any privacy concerns about the use of data and risks to the integrity of the official statistics system.

(a) The data integration business case must assess the benefits of the data integration work against any privacy concerns about the use of data, and any risks to the integrity of the official statistics system. Data providers and, where appropriate, other stakeholders, must be consulted to ensure that all benefits, privacy concerns and risks are identified.

(b) The data integration business case should list all benefits, including those resulting from any intended research.

(c) The data integration business case must include a privacy impact assessment.

(d) Ongoing data integration work must be reviewed regularly to assess whether the benefits continue to outweigh any privacy concerns and risks. If a review determines that this is no longer the case, then the data integration work must be brought to an end and the integrated dataset destroyed. The data integration business case must specify the frequency of reviews. Reviews should occur at least once every two years.

4. Data integration must not occur when it will materially threaten the integrity of the source data collections.

(a) The data integration business case must include an assessment of any risks to the source data collection posed by the data integration work. This assessment must be conducted in consultation with data providers.

5. Data must not be integrated where any undertaking has been given to respondents that would preclude this.

(a) The data integration business case must investigate what undertakings have been made to respondents. The Government Statistician will not approve a business case if any element of the business case is incompatible with these undertakings.

6. Data integration must be approved at an appropriate level by all the agencies involved.

(a) All data integration business cases must receive the approval of the Government Statistician. When a data integration business case proposes to integrate datasets across agencies, approval will also be required from the chief executives of the agencies involved.

(b) Each data integration business case will be considered for approval on its own merits.

(c) All approved data integration projects will be formally notified to the Minister of Statistics.

7. Integrated data must only be used for approved statistical or related research purposes*.

(a) The data integration business case must specify all proposed statistical or related research use of the integrated data. A new business case must be approved before integrated data is used for purposes not specified in the original business case.

(b) All statistical or related research purposes must be achievable, must be based on a scientifically sound methodology, and must satisfy any appropriate research ethics requirements.

(c) In assessing whether a given purpose is achievable, consideration should be given to the quality of the integrated data and restrictions on data use (eg requirements to maintain respondent confidentiality, requirements to use data collected under the Statistics Act only for statistical purposes, the requirements of the legislation that the source data was collected under).

(d) Statistics New Zealand must not provide information to data providers about individual records in integrated data that could assist the data provider in carrying out any administrative purpose.

(e) Ongoing data integration work must be reviewed regularly to ensure that the data integration work remains consistent with the approval given to integrate data.

(f) Integrated datasets must be destroyed once the purposes for which data was integrated have been accomplished.

* As required by principle 1, at least one of these purposes must produce or improve official statistics if Statistics New Zealand is to undertake the integration.

8. The size and data variables of the linked dataset must be no larger than necessary to support the approved purposes.

(a) The ‘approved purposes’ are those listed in the approved data integration business case. Only data variables necessary to support these purposes can be used in data integration work.

(b) Data variables can be included to support research purposes as long as these are listed in the approved data integration business case.

(c) The number of records integrated must be the minimum necessary to support the approved purposes (eg consideration should be given to integrating a sample of a full-coverage data source).

9. Integrated data must be stored apart from other data.

(a) All integrated data must be stored in their own environment. The environment must only be accessible to the smallest number of Statistics New Zealand employees practicable.

10. Names and addresses can only be kept in an integrated dataset while necessary for linking.

(a) Data that includes personal names and addresses can be received from data providers. These can be used to validate the data, clean the data, impute values (eg sex or geographical information), and link data.

(b) Personal names and addresses must be removed from an integrated dataset immediately, unless approval has been obtained for a limited retention for ongoing linking. Any request to retain personal names and addresses must be included in the data integration business case. This request must specify how long names and addresses can be retained. If approved, then the data integration project must only retain names and addresses for this length of time. This time period applies from when name and address data is received, to when the names and addresses are deleted.

(c) Data analysis cannot be undertaken in an environment that includes personal names and addresses. Either names and addresses must be removed before analysis, or separate environments for linking and analysis must be created. Integrated data, excluding names and addresses, can be transferred to the analysis environment after linking.
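As an illustration only, the step of removing names and addresses before data moves to an analysis environment could be sketched as follows. The field names and the sample record are hypothetical; this policy does not prescribe a schema or a tool.

```python
# Hypothetical field names for direct identifiers; the policy does not define a schema.
DIRECT_IDENTIFIERS = {"first_name", "last_name", "address"}


def to_analysis_record(linked_record: dict) -> dict:
    """Return a copy of a linked record with names and addresses removed."""
    return {
        field: value
        for field, value in linked_record.items()
        if field not in DIRECT_IDENTIFIERS
    }


# Made-up record as it might exist in the linking environment.
linking_record = {
    "link_id": 17,
    "first_name": "Aroha",
    "last_name": "Smith",
    "address": "1 Example Street",
    "birth_year": 1980,
    "income_band": "C",
}

# Only this stripped copy would be transferred to the analysis environment.
analysis_record = to_analysis_record(linking_record)
```

The original record is left unchanged in the linking environment, so the separation between linking and analysis depends on transferring only the stripped copy.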

11. Unique identifiers assigned by an external agency must not be retained in an integrated dataset.

(a) Data can be received that includes unique identifiers assigned by an external agency. These identifiers can be used to verify the integrity of the data, or to clean the data. They can also be used for integration of data, but they must be removed immediately after integration.

(b) When linking needs to occur on an ongoing basis, the externally assigned identifier must be replaced by a new identifier. This new identifier can be used for integration. It must not be possible to derive the externally assigned identifier from the new identifier.
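One possible way to produce a replacement identifier that stays stable across data supplies is a keyed hash. The sketch below is an assumption for illustration, not a method stated in this policy: the function name, the key handling and the sample identifiers are all hypothetical. Whether the result meets the requirement that the external identifier cannot be derived depends on the secret key being held apart from the integrated data.

```python
import hashlib
import hmac
import secrets


def make_link_id(external_id: str, secret_key: bytes) -> str:
    """Derive a replacement identifier from an external one using HMAC-SHA256.

    The same external identifier and key always give the same output, so
    records can be linked across supplies without keeping the external value.
    """
    return hmac.new(secret_key, external_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be generated once and stored
# separately from the integrated dataset.
secret_key = secrets.token_bytes(32)

# Made-up external identifiers.
first_supply = make_link_id("123-456-789", secret_key)
later_supply = make_link_id("123-456-789", secret_key)
other_person = make_link_id("987-654-321", secret_key)
```

A plain unkeyed hash would be weaker here, because anyone could hash every possible external identifier and compare the results. The key is what prevents that, which is why its storage matters.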

12. Data integration must be conducted openly.

(a) No data integration work will be undertaken in secret. A list of data integration work being undertaken must be maintained on Statistics New Zealand’s website. Statistics New Zealand’s annual report must list all data integration projects undertaken in the year of the report.

(b) The primary results of any data integration work must be made publicly available. When data integration work is used to improve the production of official statistics (eg through improving quality), this requirement is met through the publication of the official statistic in question.

(c) Data providers will be asked to ensure that data collection processes inform respondents that their information may be used for statistical or related research purposes.

Definitions

Data providers – Data providers supply Statistics New Zealand with the data to be integrated. This includes the agencies undertaking the collection of the data and any intermediary agencies that handle the data before it is supplied to Statistics New Zealand.

Exact matching – Exact matching involves using a unique identifier (eg a tax number, passport number or driver licence number) that is present on all datasets to integrate records. It is the easiest and most efficient way to match datasets.
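A minimal sketch of exact matching, assuming two small made-up datasets that share a hypothetical identifier field named "uid". It is illustrative only and does not describe any Statistics New Zealand system.

```python
def exact_match(left: list, right: list, key: str) -> list:
    """Link records from two datasets that carry the same unique identifier."""
    # Index the right-hand dataset by identifier for direct lookup.
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        partner = right_by_key.get(row[key])
        if partner is not None:
            # Combine the two records into one linked record.
            linked.append({**row, **partner})
    return linked


# Made-up example data.
dataset_a = [{"uid": "A1", "income": 52000}, {"uid": "B2", "income": 38000}]
dataset_b = [{"uid": "B2", "visits": 3}, {"uid": "C3", "visits": 1}]

exact_links = exact_match(dataset_a, dataset_b, "uid")
```

Only the record with identifier "B2" appears in both datasets, so it is the only one linked.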

Probabilistic matching – Where a unique identifier is not available, probabilistic matching can be employed using variables common to both files (eg first name, date of birth or sex). This process is not totally precise, so the possibility of mismatches is taken into account.
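One common way to take the possibility of mismatches into account is to score each candidate pair with agreement weights, in the style of Fellegi-Sunter record linkage. The sketch below assumes that approach; the policy does not name a method. The m and u probabilities, the thresholds and the sample records are all invented for illustration.

```python
import math

# Assumed probabilities per field: m = chance the field agrees when two
# records are about the same person, u = chance it agrees when they are not.
FIELD_PROBABILITIES = {
    "first_name": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "sex": (0.99, 0.5),
}

# Hypothetical decision thresholds on the total weight.
UPPER_THRESHOLD = 10.0
LOWER_THRESHOLD = 0.0


def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum log2 agreement or disagreement weights over the common fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBABILITIES.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight: float) -> str:
    """Turn a weight into a link decision, leaving a band for clerical review."""
    if weight >= UPPER_THRESHOLD:
        return "link"
    if weight <= LOWER_THRESHOLD:
        return "non-link"
    return "review"


# Made-up records.
person_a = {"first_name": "mere", "date_of_birth": "1980-02-14", "sex": "F"}
person_b = {"first_name": "mere", "date_of_birth": "1980-02-14", "sex": "F"}
person_c = {"first_name": "mary", "date_of_birth": "1975-07-01", "sex": "F"}

likely_same = classify(match_weight(person_a, person_b))
likely_different = classify(match_weight(person_a, person_c))
```

Agreement on a field that rarely agrees by chance, such as date of birth, adds far more weight than agreement on sex, which agrees by chance about half the time.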

Respondents – Respondents are any entities described by the data. This term originates from respondents to a survey, but for data integration could include any individuals or households recorded in an administrative dataset (eg through using a government service).

Statistical matching – Statistical matching is a technique for linking data at the unit-record level. It is similar to probabilistic matching, but does not necessarily aim to link together records about the same person or business. Instead, a record from one dataset is matched to a record in another dataset based on shared characteristics.
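As a rough illustration of statistical matching, the sketch below attaches to each record the extra variables of the most similar record in another dataset, using a simple nearest-neighbour distance over shared numeric characteristics. The distance measure, variable names and data are assumptions made for the example; other matching rules are possible.

```python
def statistical_match(recipients: list, donors: list, shared: list) -> list:
    """Match each recipient to the donor with the closest shared characteristics."""

    def distance(recipient: dict, donor: dict) -> float:
        # Sum of absolute differences over the shared numeric variables.
        return sum(abs(recipient[var] - donor[var]) for var in shared)

    matched = []
    for recipient in recipients:
        donor = min(donors, key=lambda d: distance(recipient, d))
        # Carry across only the variables the recipient does not already have.
        extra = {k: v for k, v in donor.items() if k not in shared}
        matched.append({**recipient, **extra})
    return matched


# Made-up records from two different sources; they need not describe the same people.
source_a = [{"age": 34, "household_size": 2, "hours_worked": 40}]
source_b = [
    {"age": 35, "household_size": 2, "weekly_spend": 610},
    {"age": 61, "household_size": 1, "weekly_spend": 340},
]

fused = statistical_match(source_a, source_b, ["age", "household_size"])
```

The matched donor here is a similar record, not the same person, which is the difference from probabilistic matching noted in the definition.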