Producing anonymous results

This page gives instructions for producing anonymous results. Everyone who processes personal data must follow the instructions given by the Data Permit Authority Findata and provide the results of their analyses in an anonymous form that cannot be used to reveal any data or other details concerning individual participants. According to the Act on the Secondary Use of Health and Social Data, the Data Permit Authority must ensure anonymity. This applies to all materials that have been authorised under said Act.

This page introduces the most common result types. The list is not comprehensive. Keep in mind that the content of the variables also affects the risk of disclosure. In some cases the risk of disclosure is clear, but for many variables even an exact individual value will not expose the person.

Check your results and verify that they follow these instructions for each result type. For some result types anonymity can be verified easily, while others may require closer examination. Following these instructions does not guarantee the anonymity of the results, but it will get you as close to that goal as possible.

What does anonymisation mean?

Anonymisation refers to a process in which the material is processed in a way that

  • it cannot be used to identify any individual persons either directly or indirectly
  • it cannot be used to make any conclusions concerning a specific individual person
  • any information concerning a specific person cannot be linked to any other material

The anonymous material must be impossible or exceedingly difficult to return to a form that could be used to identify any individuals. According to the Act on the Secondary Use of Health and Social Data, the results must be anonymous. If there is a need to publish results that cannot be anonymised, this should be taken into account during the study’s planning process by considering using other criteria for carrying out the study, such as obtaining consent from the research subjects.

This chapter describes the most common analysis types that are usually included in various study results. This is by no means a comprehensive list. It should be noted that the disclosure risk is influenced by the content of the variables. In some cases, the risk of disclosure is obvious, but in the case of many variables, their characteristics are such that even the most detailed individual values cannot be used to determine the identity of any individual participants. Check the results you have produced and ensure that they comply with these instructions in terms of different result types. The anonymity of some result types can be established fairly easily, while other types may require more detailed examination. Please note that while your compliance with these instructions will not necessarily ensure that your results will be completely anonymous, it will help you get as close to this goal as possible.

It is important to note that while an individual result may be anonymous in itself, the risk of disclosure is present when several results are combined together. A typical example of this is the production of several frequency tables that use the same variable classifications. These can usually be combined to create more detailed frequency tables in which the data is defined in relation to several variables. Take into account how your results could be combined with both current and previous analyses. If you are aware of any prior publications whose analyses have utilised the same or nearly the same material or a subset of it, remember to provide at least the links to any such publications.
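The combination risk described above can be illustrated with a small differencing check. This is only a sketch: the function name, the example tables and the default threshold of five are assumptions made for illustration, and it presumes that the two tables use the same classification and that one describes a subset of the other's population.

```python
def differencing_risks(table_a, table_b, min_frequency=5):
    """Find categories where subtracting one table from the other
    reveals a non-zero count below the minimum frequency.

    Both tables are dicts mapping a category to a count, and are
    assumed to share the same variable classification.
    """
    risky = {}
    for category in table_a.keys() & table_b.keys():
        difference = abs(table_a[category] - table_b[category])
        if 0 < difference < min_frequency:
            risky[category] = difference
    return risky


# Hypothetical tables: all patients, and adult patients only.
all_patients = {"North": 40, "South": 25}
adult_patients = {"North": 38, "South": 25}

# Each table is safe on its own, but the difference shows that
# exactly two patients in "North" are minors.
print(differencing_risks(all_patients, adult_patients))  # {'North': 2}
```

A difference of zero is not flagged by this sketch, although it can also be informative about a group and may deserve a manual look.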

To protect study participants’ anonymity, every frequency shown in the results must be at least five. This criterion is used to ensure data protection. For justifiable reasons and on a case-by-case basis, a deviation from the minimum frequency of five may be made and a minimum frequency of three may be used instead, for example in the case of very small target groups, rare disease studies or studies concerning rare phenomena. In that case, the result must represent a significant finding that needs to be reported at this level of accuracy, and the criteria for anonymity must still be met (i.e. the variables cannot be used to identify any specific individuals).
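As a minimal sketch of how the minimum frequency could be applied when building a frequency table, the function below masks every count that falls under the threshold. The function name, the example data and the choice to mask with None are assumptions for illustration, not a procedure prescribed by Findata.

```python
from collections import Counter

MIN_FREQUENCY = 5  # default from the text; a lower value needs justification


def frequency_table(values, min_frequency=MIN_FREQUENCY):
    """Count values and mask counts below the minimum frequency.

    Returns a dict mapping each category to its count, or to None
    when the count is too small to be reported as such.
    """
    counts = Counter(values)
    return {
        category: (count if count >= min_frequency else None)
        for category, count in counts.items()
    }


# Hypothetical diagnosis codes for ten participants.
diagnoses = ["A"] * 7 + ["B"] * 3
print(frequency_table(diagnoses))  # {'A': 7, 'B': None}
```

If the total number of participants is published alongside the table, a single masked cell can be recovered by subtraction, so masking one cell alone may not be sufficient.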

The Data Permit Authority must be provided with access to sufficient background information for each analysis type to ensure their anonymity. This information must be displayed in connection with the result (next to the result or as a separate document so that the result and background information can be easily understood). The Data Permit Authority uses the principles described in Table 1 as the basis for verifying anonymity.

Classification of disclosure risk by result type

| Data type | Result type | General classification |
|---|---|---|
| Descriptive statistics | Frequency table | To be verified |
| | Quantity table | To be verified |
| | Maximum, minimum, percentile, median | To be verified |
| | Mode | Generally safe |
| | Mean, indices, ratios, indicators | To be verified |
| | Degree of concentration | Generally safe |
| | Higher-order moments (such as variance, covariance, kurtosis, skewness) | Generally safe |
| | Graphs: visual representations of the original material | To be verified |
| Correlations and regression-type analyses | Linear regression coefficients | Generally safe |
| | Non-linear regression coefficients | Generally safe |
| | Estimation residuals | To be verified |
| | Estimate summary and test variables (R², χ² etc.) | Generally safe |
| | Correlation coefficients | Generally safe |
| | Factor analysis | Generally safe |
| | Correspondence analysis | Generally safe |

Descriptive indicators and analyses

In the text below, the terms “group” and “target group” refer to the observations from which statistics are calculated.

Minimum, maximum and range

In general, minimum and maximum often refer to individual units, which are the easiest to identify, meaning that they contain a clear disclosure risk. These can be published if the value of the statistics is based on more than one unit. Anonymity of these results can be improved by categorising data, as categories will then include several individuals. Consider using suitable quantiles alongside any minimum and maximum figures.
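The condition that an extreme value is based on more than one unit can be checked mechanically. The sketch below is illustrative only; the function name and the example ages are assumptions.

```python
def extreme_is_shared(values, kind="max", min_units=2):
    """Return True when the minimum or maximum is attained by at
    least min_units observations (more than one unit by default)."""
    extreme = max(values) if kind == "max" else min(values)
    return sum(1 for value in values if value == extreme) >= min_units


# Hypothetical ages of four participants.
ages = [29, 34, 51, 51]
print(extreme_is_shared(ages, "max"))  # True: two participants are 51
print(extreme_is_shared(ages, "min"))  # False: only one participant is 29
```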

Fractiles – quantiles, deciles, percentiles, median

These can be published if the underlying frequency is large enough.

Mean, standard deviation

In rare cases, these may carry a disclosure risk. Check that the result represents a sufficiently large group and that the entire target group does not share the same value. Check that these statistics are not reported for several nearly identical groups or subgroups.
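The first two checks can be sketched as a small guard around the mean. The function name and the threshold of five are assumptions for illustration. The third check, comparing nearly identical groups, depends on which other results are published and is not covered here.

```python
from statistics import mean


def reportable_mean(values, min_frequency=5):
    """Return the mean, or None when one of the basic checks fails."""
    if len(values) < min_frequency:
        return None  # the group is too small
    if len(set(values)) == 1:
        return None  # everyone shares the value, so the mean reveals it
    return mean(values)


print(reportable_mean([1.0, 2.0, 3.0, 4.0, 5.0]))  # 3.0
print(reportable_mean([3.0, 3.0, 3.0, 3.0, 3.0]))  # None
print(reportable_mean([1.0, 2.0]))                 # None
```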

Mode

This can be published in principle, but check that it will not disclose the entire group, i.e. that it does not describe the value of the entire target group.

Higher-order moments, such as variance

These can be published in principle, as the indicator has been clearly converted from the original individual values. Make sure not to publish an excessive number of indicators from a small group, as they could serve to disclose the entire group.

Correlation coefficients

These can be published in principle when the group under consideration contains a sufficient number of observations.

Degrees of concentration

These can be published in principle when the group under consideration contains a sufficient number of observations.

Linear regression, non-linear regression

Coefficients can be published in principle.

Test variables

These can be published in principle.

Factor analysis

These can be published in principle, but make sure that your factors are not based on a single variable.

Principal component analysis

Principal component vectors and their corresponding values can be published in principle. Check any projections onto the principal components (they correspond to scatter plots, see below).

Indices, ratios, indicators

Indices can be published in principle, but the calculation formula must be taken into account. Indices based on more complex formulas (e.g. the Fisher price index) do not usually pose a disclosure risk, while indices based on very simple formulas are more prone to it and must therefore be based on a sufficient number of observations.

Gini coefficients

Gini coefficients must be calculated for a sufficiently large number of observations. Data needed for the verification process: calculation formula and possibly frequencies underlying the figures.
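As an illustration of the kind of calculation formula that could be supplied for verification, the sketch below uses the mean absolute difference form of the Gini coefficient, G = Σᵢ Σⱼ |xᵢ − xⱼ| / (2 n² x̄). The source does not state which formula is expected, so this choice, the function name and the threshold of five are all assumptions.

```python
def gini(values, min_frequency=5):
    """Gini coefficient via the mean absolute difference.

    Raises ValueError when there are too few observations or when
    the values sum to zero (the coefficient is then undefined).
    """
    n = len(values)
    if n < min_frequency:
        raise ValueError("too few observations for a Gini coefficient")
    total = sum(values)
    if total == 0:
        raise ValueError("Gini coefficient is undefined for a zero total")
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    # 2 * n**2 * mean equals 2 * n * total, because mean = total / n.
    return abs_diff_sum / (2 * n * total)


# Hypothetical incomes: one person holds everything.
print(gini([0, 0, 0, 0, 10]))  # 0.8
print(gini([4, 4, 4, 4, 4]))   # 0.0
```

Reporting the number of observations together with the coefficient gives the verifier the underlying frequency mentioned above.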

Graphs

The data protection evaluation process for graphs relies on aggregated tabular presentations, as they make it easier to perceive the frequency of the observations underlying the points or plots in a graph, which would usually be impossible to discern from the graph itself. Therefore, a table specifying the underlying frequencies should be provided with the graph if it is used to depict individual observations or a small target group.

Histogram

When using histograms, the data should be classified so that each individual class contains a sufficient number of observations. This can be particularly challenging for the tails of a distribution, such as those of a normal distribution. This instruction corresponds to the one for descriptive statistics, and it may therefore limit how much of the tail can be depicted.
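One way to handle sparse tails is to merge neighbouring classes until each class is large enough. The sketch below shows this under assumed inputs; the function name, the merging order and the threshold of five are illustrative choices and not part of the instructions.

```python
def merge_sparse_bins(edges, counts, min_frequency=5):
    """Merge adjacent histogram bins until every bin is large enough.

    edges holds len(counts) + 1 bin boundaries. A sparse bin is merged
    into its right neighbour; a sparse last bin is merged into its left
    neighbour. If only one bin remains it is returned as is, so the
    caller still has to check its count.
    """
    edges, counts = list(edges), list(counts)
    i = 0
    while i < len(counts) and len(counts) > 1:
        if counts[i] >= min_frequency:
            i += 1
        elif i + 1 < len(counts):
            counts[i] += counts.pop(i + 1)  # absorb the right neighbour
            del edges[i + 1]                # drop the shared boundary
        else:
            counts[i - 1] += counts.pop(i)  # fold the last bin leftwards
            del edges[i]
            i -= 1
    return edges, counts


# Hypothetical age histogram with thin tails.
edges, counts = merge_sparse_bins([0, 10, 20, 30, 40, 50], [1, 2, 12, 9, 2])
print(edges, counts)  # [0, 30, 50] [15, 11]
```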

Scatter chart or scatter plot

As a rule, each individual point in a scatter chart describes a single unit. Therefore, these types of charts and plots are not publishable without first grouping the data so that each point contains several observations. These can be published only if the data which the chart or plot is based on could be published as a table. However, the assessment process should also take into account whether a combination of the variables you have used would allow for the identification of any individuals. You can improve the anonymity of scatter charts by replacing them with a graph depicting the frequency of observations in grid cells or by adding randomness to your points.
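Replacing individual points with grid-cell frequencies could look roughly like the sketch below. The function name, the cell size and the decision to drop cells under a threshold of five are assumptions for illustration.

```python
import math
from collections import Counter


def grid_counts(points, cell_size, min_frequency=5):
    """Aggregate (x, y) points into square grid cells and keep only
    the cells that contain at least min_frequency observations."""
    cells = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y in points
    )
    return {cell: n for cell, n in cells.items() if n >= min_frequency}


# Hypothetical measurements: six similar points and one isolated point.
points = [(1.0, 2.0)] * 6 + [(15.0, 3.0)]
print(grid_counts(points, cell_size=10))  # {(0, 0): 6}
```

The resulting cell counts can be assessed like any other frequency table, which matches the rule that the chart is publishable only if its underlying data could be published as a table.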

Box plots

Box plots pose a disclosure risk in principle, as they contain dots that pertain to individual observations, and outliers in particular could lead to the disclosure of someone’s identity. Means may also pose a disclosure risk. The instructions for descriptive statistics apply here accordingly.

Residuals

Each residual refers to a single observation. When depicting residuals, it is recommended to use a graph format that is not based on individual points. If a graph based on individual points is used, avoid disclosing the values of the axes.

Survival analysis, Kaplan-Meier curve

These may carry a disclosure risk, depending on how the analysis is defined. They may be published if each step of the curve corresponds to a sufficient number of observations. If it is clear that the data behind the curve cannot be used to determine any exact ages or dates, steps with single observations can also be allowed. Such curves nevertheless need closer consideration, because detailed background information may make it possible to identify individual persons.
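A simple way to see which steps of a curve rest on too few observations is to count the events at each distinct event time. The sketch below does that and also shows one possible mitigation, rounding event times down to coarser intervals. The source does not prescribe that mitigation, and the function names, the example data and the threshold of five are assumptions.

```python
from collections import Counter


def sparse_steps(event_times, min_frequency=5):
    """Return the event times whose step in the curve is based on
    fewer than min_frequency events."""
    events_per_step = Counter(event_times)
    return sorted(t for t, n in events_per_step.items() if n < min_frequency)


def coarsen_times(event_times, interval):
    """Round event times down to multiples of interval so that
    several events share one step (an assumed mitigation)."""
    return [(t // interval) * interval for t in event_times]


# Hypothetical event times in days.
times = [30, 30, 30, 30, 30, 45, 60, 60]
print(sparse_steps(times))                     # [45, 60]
print(sparse_steps(coarsen_times(times, 30)))  # [60]
```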

Spatial analysis

Spatial analysis is particularly challenging in terms of data protection, as location information usually plays a key role in the disclosure of an individual. It usually requires a great deal of reclassification and, preferably, presenting the data as heat maps instead of individual observation points.

Other result types

Photographs and other imaging materials

Imaging materials are verified on a case-by-case basis. It is very difficult to define any general guidelines for imaging materials, as they represent such a diverse group of materials. Naturally, these types of materials must never contain any direct text-based identifiers, and the coarser an image is, the more difficult it will be to identify a person from it. The persons who handle imaging materials are usually the best at determining the disclosure risk presented by each piece of material. For example, an image of a single tooth will rarely reveal the identity of its owner, but an entire tooth chart could be used for that purpose.

Hereditary genetic data

In terms of hereditary genetic data, indicators concerning a sufficiently small number of variants calculated from a sufficiently large group can be considered anonymous. However, these must be verified on a case-by-case basis.

Machine learning

In terms of neural networks and other machine learning models (decision trees, etc.), the actual publication of these materials is rarely needed. However, the need may arise for disclosing such results outside Kapseli, and the verification of these results will be carried out on a case-by-case basis. General instructions will be provided at a later date.

Individual-level data

The anonymity of individual-level data must always be ensured on a case-by-case basis. Contact Findata for more detailed instructions.

References

FAQ about verifying the results

Content will be added later.