Producing anonymous results

Results exported from a secure processing environment must be produced in an anonymous format, so that no individual person can be identified or any personal characteristics disclosed.

On this page, you will find instructions for producing anonymous results, as well as examples to support the application of the guidelines.

Instructions for producing anonymous results

Introduction

If you process personal data, you must produce the results in an anonymous format so that no individual person can be identified or any personal characteristics disclosed.

The permit holder anonymises results that will be exported from the secure processing environment, and Findata verifies the anonymity of the results through a risk-based assessment.

Read more: Exporting results from the processing environment

Findata does not charge a fee for anonymity verification.

Results must be produced in accordance with these guidelines for different types of outputs. The aim of producing anonymous results is to ensure data protection.

This guidance includes a list of common result types and their associated disclosure risks. The list is not exhaustive, and the risk of disclosure is also affected by the properties of the data and the content of its variables. The anonymity of some result types can be assessed fairly easily, while others require more detailed evaluation.

Ensuring the anonymity of results consists of the following components:

  • Findata’s guidance on producing anonymous results
  • The risk assessment completed by the permit holder
  • Verification carried out by Findata based on a risk-based assessment

I consider the results anonymous, but they do not comply with Findata’s guidelines

If you are unable to produce anonymous results in accordance with Findata’s guidelines, but you consider the results to be anonymous, proceed as follows:

  1. Provide detailed justification for the anonymity of the results when submitting them. The justification must relate only to anonymity, not, for example, to the significance of the study or the importance of publishing the results.
  2. Include sufficient background information for each type of analysis to enable an assessment of anonymity. This may include, for example, a description of how the study population was formed and the number of observations.
  3. Please note that verifying the anonymity of results that require case-by-case assessment may take longer than usual.

What does anonymisation mean?

Anonymisation is a process in which data or material is processed so that:

  • an individual person cannot be identified directly or indirectly
  • no conclusions can be drawn about an identifiable individual
  • data relating to an individual cannot be linked with other materials

Anonymised material must be impossible or unreasonably difficult to restore into a form in which an individual can be identified.

Under the Secondary Use Act, results produced from disclosed data must be anonymous (Section 52). If a study requires exporting results that cannot be anonymised, this must be taken into account already at the planning stage. In such cases, it should be considered whether the study can be conducted under legislation other than the Secondary Use Act.

Results must be anonymised in a way appropriate to the result type. This can be done, for example, by:

  • using sufficiently coarse categories,
  • censoring, editing or deleting data involving a disclosure risk, or
  • adding uncertainty to the data.
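As a minimal sketch of the first technique in the list above, the following Python snippet coarsens exact ages into bands. The helper name `age_band`, the 10-year band width, the open-ended top band at 90 and the sample ages are all illustrative assumptions and are not prescribed by this guidance.

```python
from collections import Counter


def age_band(age, width=10, top=90):
    """Map an exact age to a coarse band such as '30-39' (hypothetical helper)."""
    if age >= top:
        # Open-ended top band so that very high ages are not singled out.
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


ages = [31, 37, 38, 42, 45, 47, 91, 93, 96]  # made-up example ages
band_counts = Counter(age_band(a) for a in ages)
print(dict(band_counts))
```

Whether a given band width is coarse enough still has to be judged against the data, since a band can contain too few observations.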

Even if an individual result is anonymous, combining multiple results may still lead to disclosure of information about an individual. A typical example is producing several frequency tables using the same background variable classifications. When combined, these tables may form a more detailed frequency table that could reveal additional information about the observations.

The risk of combining results must be considered across ongoing analyses, as well as past and future analyses. If you are aware of earlier publications based on the same or nearly the same data, or on a subset of it, these must also be taken into account when assessing disclosure risk.
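The combination risk described above can be sketched as a simple differencing check. The snippet below assumes that each published count is described by the set of categories it covers. It flags pairs where one group is nested in another and the difference between the counts is a small non-zero number. The function name, the data layout and the example figures are hypothetical.

```python
from itertools import permutations

MIN_FREQUENCY = 3  # minimum frequency used as the starting point in this guidance


def differencing_risks(published, threshold=MIN_FREQUENCY):
    """Find nested pairs of published counts whose difference is 1..threshold-1."""
    risks = []
    for (name_a, (cats_a, n_a)), (name_b, (cats_b, n_b)) in permutations(published.items(), 2):
        if cats_b < cats_a:  # group b is a proper subset of group a
            diff = n_a - n_b
            if 0 < diff < threshold:
                risks.append((name_a, name_b, sorted(cats_a - cats_b), diff))
    return risks


# Two made-up results, each anonymous on its own (both counts are well above 3).
published = {
    "ages 18-64": (frozenset({"18-29", "30-49", "50-64"}), 40),
    "ages 18-49": (frozenset({"18-29", "30-49"}), 38),
}
print(differencing_risks(published))
```

Here subtracting the second count from the first shows that exactly two persons fall in the 50-64 category, even though neither result contains a small frequency.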

To ensure anonymity, the starting point is that results are based on multiple observations and that the minimum frequency in the results is three.

This means that, for example, each table cell or group must be based on at least three observations.

Using a minimum frequency threshold is a clear and simple way to reduce the risk of disclosure and to ensure data protection in the results. However, you must assess whether the minimum frequency is sufficient, as meeting this threshold does not always guarantee anonymity.

For example, publishing 0% and 100% cells (where all or none of the individuals have a given characteristic) requires case-by-case consideration. If a cell with a low cell-specific frequency is suppressed, it must also be ensured that the suppressed value cannot be derived from other values in the table.
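A minimal sketch of applying the minimum frequency to a set of cells is shown below. The function name and the choice to leave zero cells visible are assumptions made for illustration. As noted above, 0% and 100% cells may still need case-by-case consideration, and suppressing a cell does not by itself prevent the value from being derived from other cells.

```python
MIN_FREQUENCY = 3  # minimum frequency stated in this guidance


def censor_small_counts(counts, threshold=MIN_FREQUENCY):
    """Replace non-zero counts below the threshold with a '<threshold' label."""
    censored = {}
    for cell, n in counts.items():
        if 0 < n < threshold:
            censored[cell] = f"<{threshold}"
        else:
            # Zero cells are left visible here; this is an assumption of the sketch.
            censored[cell] = n
    return censored


table = {"Year 1": 0, "Year 2": 12, "Year 3": 1}  # made-up frequencies
print(censor_small_counts(table))
```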

In addition to cell-specific frequencies, you must also consider the size of the cell-specific population. This refers to the group of individuals who could fall into the cell based on its defining characteristics (e.g. persons of a certain age living in a particular municipality). A small cell-specific population increases the risk of identifying individuals.

Findata requires sufficient background information for each type of analysis to verify anonymity. This information, such as the number of observations that the results are based on, must be included in the summary form.

Descriptive indicators and analyses

The principles described in the table below form the basis of our anonymity verification process.

| Result type | Disclosure risk | Notes |
| --- | --- | --- |
| Descriptive analyses and indicators | | |
| Frequencies | Usually high | Can be published in principle when the minimum frequency in the results is at least three. See also Example 1. |
| Maximum, minimum, range | Usually high | Can be published in principle when the value of the indicator is based on several observations or if disclosure of an individual can be prevented in some other way. The minimum and maximum usually correspond to a single observation, so these values may pose a disclosure risk. You can improve the anonymity of your results by dividing your data into categories, as each category will then include several individuals. Consider using suitable quantiles instead of minimum and maximum figures. See also Example 2. |
| Fractiles (quantiles, deciles, percentiles, median) | Usually high | Can be published in principle when the number of underlying observations is large enough. |
| Mode | Low | Can be published in principle, provided that not all members of the group have the same value. |
| Mean, standard deviation | Usually high | Can be published in principle, but check that the result represents a sufficiently large group and that the entire target group does not have the same value. In addition, make sure not to report indicators from several nearly identical groups or subgroups. |
| Indices, ratios, indicators | Low | Can be published in principle, but remember to take your calculation formula into account. Indices produced with more complex formulas (e.g. the Fisher price index) do not usually pose a disclosure risk, but very simple formulas are more prone to that risk, in which case your figures must be based on a sufficient number of observations. |
| Degrees of concentration | Low | Can be published in principle when the group under consideration contains a sufficient number of observations. |
| Higher-order moments (variance, covariance, kurtosis, skewness) | Low | Can be published in principle because the indicator is a clear transformation of the original individual values. Make sure not to publish an excessive number of indicators from a small group, as together they could disclose the entire group. |
| Graphs: visual representations of the original material | Usually high | For more detailed instructions, see the Graphs section. |
| Correlations and regression-type analyses | | |
| Regression coefficients | Low | Can be published in principle. |
| Estimation residuals | Usually high | Each residual refers to a single observation, so there is a disclosure risk. For more detailed instructions, see the Graphs section. |
| Estimate summary and test statistics (t, F, R², χ², etc.) | Low | Can be published in principle when the group under consideration contains a sufficient number of observations. |
| Correlation coefficients | Low | Can be published in principle when the group under consideration contains a sufficient number of observations. |
| Factor analysis | Low | Can be published in principle, but make sure that your factors are not based on a single variable. |
| Principal component analysis | Low | Principal component vectors and their corresponding values can be published in principle. Make sure to check the projections onto the principal components because they correspond to a scatter plot. For more detailed instructions, see the Graphs section. |
| Correspondence analysis | Low | Can be published in principle. |

Table 1. Classification of result types according to disclosure risk

The terms “group” and “target group” refer to the observations from which indicators are calculated.

Graphs

The data protection assessment process for graphs relies on aggregated tabular presentations, as they make it easier to perceive the frequency of the observations underlying the points or plots in a graph, which would usually be impossible to discern from the graph itself. If necessary, a table specifying the background of your result should be provided with the graph if said graph is used to depict individual observations or a small target group.

Distribution graphs, histogram

Distribution graphs require special attention to outliers, as they may pose a disclosure risk.

This is particularly challenging at the tails of a normal distribution, where it may not be possible to include the entire tail in the graph. For histograms, ensure the material is classified so that each class contains a sufficient number of observations. This instruction corresponds to the principles that apply to descriptive statistics.
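One way to classify histogram data so that no class is too sparse is to merge adjacent bins until each merged bin reaches a chosen minimum count. The sketch below assumes a threshold of three, borrowed from the minimum frequency used elsewhere in this guidance. The function name and the example bins are hypothetical, and the merging strategy is only one of several possible.

```python
def merge_sparse_bins(edges, counts, threshold=3):
    """Merge adjacent histogram bins left to right until each has >= threshold.

    edges has len(counts) + 1 entries. A sparse remainder at the right end is
    folded into the last merged bin.
    """
    merged_edges = [edges[0]]
    merged_counts = []
    running = 0
    for i, n in enumerate(counts):
        running += n
        if running >= threshold:
            merged_edges.append(edges[i + 1])
            merged_counts.append(running)
            running = 0
    if running > 0 or merged_edges[-1] != edges[-1]:
        if merged_counts:
            # Fold the leftover tail into the previous bin.
            merged_counts[-1] += running
            merged_edges[-1] = edges[-1]
        else:
            # Fewer than `threshold` observations in total: one bin that is
            # still below the threshold and needs separate consideration.
            merged_counts.append(running)
            merged_edges.append(edges[-1])
    return merged_edges, merged_counts


edges = [150, 160, 170, 180, 190, 200]  # made-up height classes in cm
counts = [1, 4, 2, 0, 1]
print(merge_sparse_bins(edges, counts))
```

In this example the sparse tail classes are combined into a single class from 170 to 200, so the outlying observation no longer appears as a class of its own.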

Bar chart

Bar charts must be prepared so that each category includes a sufficient number of observations. This instruction corresponds to the principles that apply to descriptive statistics.

Scatter chart, scatter plot

As a rule, each individual point in a scatter chart is based on a single unit. As such, these charts cannot be published unless the data has been grouped so that each point represents multiple observations.

Scatter charts may only be published if the underlying data would meet anonymisation requirements in tabular form. The assessment must also consider whether the combination of variables used could enable individual identification.

You can improve the anonymity of scatter charts by replacing them with a graph depicting the frequency of observations in grid cells or by adding randomness to your points.
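The grid-cell alternative mentioned above can be sketched as follows: each point is assigned to a square grid cell, and only the number of points per cell is kept. The cell size, the threshold of three, the function name and the sample points are assumptions for illustration.

```python
from collections import Counter


def grid_counts(points, cell_size, threshold=3):
    """Count points per square grid cell and censor cells below the threshold."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {
        cell: (n if n >= threshold else f"<{threshold}")
        for cell, n in counts.items()
    }


points = [(1.2, 3.4), (1.8, 3.9), (1.1, 3.1), (7.5, 8.2)]  # made-up data
print(grid_counts(points, cell_size=2.0))
```

The resulting cell counts can then be assessed like any other frequency table, including the check that censored cells cannot be derived from other published figures.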

Box plots

Box plots pose a disclosure risk by default, as they contain points referring to individual observations. Outliers are particularly susceptible to causing identification risk. The use of means may also increase the risk of identification. This instruction corresponds to the principles that apply to descriptive statistics.

Residuals

Residuals refer to individual observations. When visualising residuals, prefer a graph format that does not rely on individual points. If a graph based on individual points is used, avoid displaying the axis values.

Survival analysis, Kaplan-Meier curve

The disclosure risk in survival analysis depends on how the analysis is defined. In principle, the results can be published if each step in the curve corresponds to a sufficient number of observations.

Individual steps may also be allowed if it is clear that the data underlying the curve cannot be used to infer precise ages or calendar dates. In such cases, however, you must assess how well the background information identifies individuals.

See also Example 3.

Spatial analysis

Spatial analysis is particularly challenging in terms of data protection, as location information usually plays a key role in identifying individuals. Publishing spatial analysis results generally requires reclassifying location information and presenting the data as heat maps instead of individual observation points.

Sankey diagrams

The data protection implications of Sankey diagrams (flow diagrams) depend on the type of information presented. If the diagram depicts exact frequencies, the same principles apply as for descriptive statistics.

See also Example 4.

Other result types

Images and other imaging materials

The privacy risks associated with imaging materials must be assessed on a case-by-case basis. Such materials must not include any identifiers or metadata that could be linked to an individual. The risk of identification can be reduced by limiting the data and lowering the image resolution. If imaging materials can be combined with other data, the risk of identification increases.

For more information on the anonymisation and anonymity of imaging and signal data, see the policy outline prepared by the Expert Group on Data Protection appointed by the Ministry of Social Affairs and Health.

Results based on genome data

The anonymity of results based on genome data must always be assessed on a case-by-case basis.

1. Rare individual genetic variations

Publishing rare significant genetic findings is as a rule permitted if the variation is found in at least three individuals. If fewer than three individuals share the variation, the exact number should be concealed.

When publishing such findings, particular attention must be paid to the risk of identification. A minimum frequency threshold of n=3 may help reduce this risk, but its sufficiency should be evaluated in each case. The smaller the target group and the more background information is provided (such as phenotype or geographical origin), the higher the risk of identifying individuals.

When publishing results related to individual subjects, you must not include:

    • exact age,
    • a detailed description of the subject’s phenotype,
    • medical history,
    • the exact geographical area where the subjects were collected, or
    • any other information that could compromise the anonymity of the subjects.

    Details of the genetic variation may be published if they are already available in public variant databases, including:

    • allele frequency,
    • clinical significance of the variant,
    • rsID identifying the variant, or
    • the research method used to identify it.

    Table 2 provides an example of how to publish rare genetic findings.

    | Causal gene | Disorder | Patient N | Zygosity of variant | Reference transcript | DNA variant | Protein variant | Molecular consequence | Pathogenicity classification | dbSNP ID | gnomAD MAF, European | gnomAD MAF, Finnish | SpliceAI | phyloP | CADD |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | EPHA2 | Lung cancer | <3 | Heterozygous | NM_004431.5 | c.1171G>C | p.Gly291Arg | missense variant | Likely pathogenic | rs34192549 | 0.01679 | 0.00947 | 0.0100 | 0.135 | 6.37 |
    | NRAS | Lung cancer | 5 | Heterozygous | NM_002524.5 | c.183A>C | p.Gln61His | missense variant | Pathogenic/Likely pathogenic | rs121913255 | NA | NA | NA | NA | NA |
    | DDR2 | Lung cancer | <3 | Heterozygous | NM_006182.4 | c.716T>G | p.Leu239Arg | missense variant | Likely pathogenic | rs578015216 | 0.000 | 0.000 | 0.000 | 6.32 | 24.5 |

    Table 2. Example of reporting rare genetic findings

    2. Genetic variations not present in the individual’s own genome

    Time-bound genetic mutations that do not occur in the subject’s genome, such as those found in cancer tumours, may be published at an individual level. However, extra caution must be taken to ensure individuals cannot be identified.

    When publishing results related to individual subjects, you must not include:

    • exact age,
    • a detailed description of the subject’s phenotype,
    • medical history,
    • the exact geographical area where the subjects were collected, or
    • any other information that could compromise the anonymity of the subjects.

    Where possible, results should be presented in grouped form based on mutation or person classifications.

    3. Genealogical research

    When publishing genealogical results, the anonymity of extended family members must be ensured. Hereditary genetic changes identified through genealogical studies may be presented as pedigrees, as illustrated in Figure 1.

    When presenting persons in a pedigree, you cannot include any other detailed background information about them, such as:

    • age,
    • time of discovery of the genetic changes,
    • geographical area from which the genealogical history has been collected, or
    • any other information that endangers the anonymity of the family.

    The more detailed the information, the greater the risk of identification. In cases of very rare diseases, it is possible that a pedigree cannot be published in anonymous format.

    Figure 1. Example of pedigree reporting

    Artificial intelligence (AI) and machine learning models

    If an AI or machine learning model/algorithm is the publishable result of your study, you must consider its anonymity already during the development. You must be able to prove the anonymity of the model in order to publish it. Anonymity is assessed on a case-by-case basis.

    When submitting a model for anonymity verification, you must also submit the Anonymity form for AI and machine learning models to Findata. Read the form carefully at the planning stage so that you are familiar with the factors affecting anonymity before you start developing the model.

    Download the form: Anonymity form for AI and machine learning models (Word, 44 kb)

    Anonymity is affected by:

    1. characteristics of the model itself (e.g. model type, number of parameters to be trained)
    2. features of the training data (e.g. quantity, content, format and possible processing, such as anonymisation)
    3. the model’s ability to generalise and how it performs the task for which it is trained.

    If a model is trained exclusively on anonymous training material, the model itself can also be considered to be anonymous.

      The anonymity of a model can be ensured by using differential privacy during training. There are no unequivocal guidelines for selecting the differential privacy parameters, as the appropriate choice depends on the situation and must be justified on a case-by-case basis.

      Small values, such as [formula missing from source], where N is the number of individuals in the material used, produce mathematically strong protection. However, in some cases, even higher values may suffice.
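To illustrate why smaller parameter values give stronger protection, the sketch below applies the Laplace mechanism to a single count. This is a standard textbook construction chosen for illustration only. It is not the procedure for training a model with differential privacy, and the parameter name `epsilon`, the function names and the example values are assumptions that do not come from this guidance.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query, whose sensitivity is 1."""
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for epsilon in (0.1, 1.0, 10.0):
    # A smaller epsilon means a larger noise scale and stronger protection.
    print(epsilon, round(noisy_count(40, epsilon, rng), 2))
```

The noise scale is inversely proportional to the parameter, so a small value hides the contribution of any single individual more thoroughly at the cost of less accurate results.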

      As a rule, training data should not include material containing individual-level direct identifiers. A model with parameters that include information referring to an individual in an understandable format (e.g. text) does not meet anonymity criteria. The development process must also include sufficient measures for assessing the generalisation ability of the model.

      Individual-level result materials

      As a rule, published results must be based on several observations.

      If you need to publish result materials that are individual-level on the whole, the result materials must meet the definition of anonymous data (see What does anonymisation mean?).

      Producing individual-level anonymous data usually requires the use of advanced methods, such as differential privacy. The anonymity of individual-level result materials must always be ensured on a case-by-case basis.

      Synthetic result materials

      The anonymity of synthetic materials depends on:

      1. the method of generation, and
      2. the properties of the personal data used as potential background data.

      The fact that data is synthetic is not, on its own, sufficient to guarantee anonymity.

      Demonstrating anonymity depends largely on the method of generating the data and its complexity. If the aim is to generate anonymous synthetic data, you should consider already when developing the process how the anonymity of the data can be verified. The anonymity of synthetic result materials must always be ensured on a case-by-case basis.

      Results of qualitative research

      Results produced using qualitative research methods must also be anonymous and meet the definition of anonymous data.

      If possible, a minimum frequency of three should be used in published results to ensure data protection. If this is not possible, results must be processed so that individual persons cannot be identified directly or indirectly.

      The method of forming the target group affects the risk of identification. Identification risk is higher if the target group includes all the persons meeting certain inclusion criteria compared to a situation where the target group is selected randomly or by sampling.

      Especially when publishing quotations from text, you as the producer of the results have to ensure that the quotation does not enable the identification of an individual or event. The risk of identification increases if the quotation contains details of an individual or event or if it concerns a rare phenomenon. 

      Examples to support the application of the instructions

      Example 1: Frequency table and secondary disclosure

      In a frequency table, data in each cell must be based on at least three observations. If the number of observations is less than that, the exact numeric value must be censored. Data referring to fewer than three observations can for example be censored with “<3”.

      It should not be possible to calculate the censored value using the other numbers in the table. If the other numbers in the table can be used to determine the exact value of the censored frequency, the other numbers must also be censored.

      Example 1.1.

      In the table below, the Year 3 cell contains an observation based on fewer than three persons, so it has to be censored.

      | | Year 1 | Year 2 | Year 3 | Total |
      | --- | --- | --- | --- | --- |
      | Group a, n | 0 | 12 | 1 | 13 |

      However, simply censoring the small observation value is not enough, since the other numbers can be used to calculate its exact value. You can prevent that calculation by coarsening the total number or the exact numerical value of another observation value as presented below. When doing so, Year 2 and Total can be used to conclude that the value of the Year 3 cell can be 1 or 2, but the exact value is not known.

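      | | Year 1 | Year 2 | Year 3 | Total |
      | --- | --- | --- | --- | --- |
      | Group a, n | 0 | 12 | <3 | 13–14 |

      | | Year 1 | Year 2 | Year 3 | Total |
      | --- | --- | --- | --- | --- |
      | Group a, n | 0 | >10 | <3 | 13 |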

      When the numerical value is zero, you do not usually need to hide it because that data does not refer to any individual person.

      Example 1.2.

      Two cells of the table below contain an observation based on fewer than three persons (n=1), and both need to be censored.

      | | Year 1 | Year 2 | Year 3 | Total |
      | --- | --- | --- | --- | --- |
      | Group b, n | 1 | 3 | 1 | 5 |

      Even if both of the observation values were censored with <3, the values in Year 2 and Total could be used to deduce that the value of both censored cells is 1. In this case, one way you can prevent the calculation of small observation values is by censoring all three observation values with ≤3.

      | | Year 1 | Year 2 | Year 3 | Total |
      | --- | --- | --- | --- | --- |
      | Group b, n | ≤3 | ≤3 | ≤3 | 5 |

      The situation would be different if, for example, the total amount was 6 (as in table 1.3. in the example below). In this case, it would not have been possible to determine which of the two censored cells marked with <3 had the value 1 and which had the value 2.
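The reasoning in Examples 1.1 and 1.2 can be checked mechanically by listing every combination of values for the censored cells that is consistent with the published figures. The sketch below assumes that a cell censored with "<3" holds 1 or 2 (zeros are normally shown openly) and that a cell censored with "≤3" holds 0 to 3. The function name is hypothetical.

```python
from itertools import product


def feasible_values(known_sum, total, censored_ranges):
    """List all value combinations for the censored cells that match the total."""
    remainder = total - known_sum
    return [
        combo for combo in product(*censored_ranges) if sum(combo) == remainder
    ]


below_three = range(1, 3)   # "<3" read as 1 or 2 (assumption)
at_most_three = range(0, 4)  # "≤3" read as 0, 1, 2 or 3 (assumption)

# Example 1.2: Year 2 = 3 and Total = 5 are visible, two cells show "<3".
print(feasible_values(3, 5, [below_three, below_three]))

# Same layout with a total of 6: two solutions, so the exact values stay hidden.
print(feasible_values(3, 6, [below_three, below_three]))

# All three cells censored with "≤3", total 5: many combinations remain.
print(len(feasible_values(0, 5, [at_most_three] * 3)))
```

A single feasible combination means that the censored values are effectively disclosed, so further censoring or coarsening is needed.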

      Example 1.3.

      Observations based on fewer than three persons can also be disclosed through proportional shares. In the example below, even if the small frequencies were censored with <3, it is possible to use the percentages and the total to calculate the exact values of cells Year 1 and Year 3.

      | | Year 1 | Year 2 | Year 3 | Total |
      | --- | --- | --- | --- | --- |
      | Group c, n | 1 | 3 | 2 | 6 |
      | Group c, % | 16.7 | 50 | 33.3 | 100 |

      In this case, the percentages have to be censored as well.

      | | Year 1 | Year 2 | Year 3 | Total |
      | --- | --- | --- | --- | --- |
      | Group c, n | <3 | 3 | <3 | 6 |
      | Group c, % | ≤33 | 50.0 | 17–33 | 100 |
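The back-calculation described in Example 1.3 takes only one line of arithmetic, as the following sketch shows. It recomputes the percentages from the counts 1, 3 and 2 and then recovers the counts from the percentages and the total. The function name is hypothetical.

```python
def counts_from_percentages(percentages, total):
    """Recover cell counts from published percentages and the published total."""
    return [round(p / 100 * total) for p in percentages]


true_counts = [1, 3, 2]
total = sum(true_counts)
# Percentages rounded to one decimal, as they would typically be published.
percentages = [round(100 * n / total, 1) for n in true_counts]

print(percentages)
print(counts_from_percentages(percentages, total))
```

Because the recovered counts equal the original ones, censoring only the frequencies would not protect the small cells. The percentages must be coarsened as well.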

      Example 2: Minimum and maximum

      Minimum and maximum values usually concern an individual observation, so they can involve a disclosure risk, especially in cases where the group size is small and/or individual observations are very abnormal. Minimum and maximum values can be published when the value of the indicator is based on several observations or if the individual’s disclosure risk can be prevented in some other way.

      For example, when presenting indicators on the height of subjects, unusually short or tall persons may stand out from the group, which increases the risk of disclosing the individual. In the group presented in the table below, where the average height is 168 cm, the value of 195 cm stands out as an outlier. People with unusual characteristics are also more memorable, which means that the risk of disclosure is higher.

      | | Height, cm |
      | --- | --- |
      | Person 1 | 162 |
      | Person 2 | 162 |
      | Person 3 | 170 |
      | Person 4 | 157 |
      | Person 5 | 195 |
      | Person 6 | 166 |
      | Person 7 | 171 |
      | Person 8 | 164 |

      You can improve the anonymity of your results by dividing your data into categories, as these categories will then include several individuals. Consider using suitable quantiles instead of minimum and maximum figures.
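As a small illustration using the heights from the table above, the sketch below computes quartiles with Python's standard `statistics` module, in line with the suggestion in Table 1 to consider quantiles rather than the minimum and maximum. The choice of quartiles is an assumption. With a group of only eight persons, the quantiles still rest on few observations, so this does not by itself establish anonymity.

```python
import statistics

heights = [162, 162, 170, 157, 195, 166, 171, 164]  # values from the table above

mean_height = statistics.mean(heights)
quartiles = statistics.quantiles(heights, n=4)  # three cut points: Q1, median, Q3

print(f"mean: {mean_height:.0f} cm")
print("quartiles:", quartiles)
# The maximum would reveal the outlying individual directly.
print("maximum (not to be reported):", max(heights))
```

The quartiles summarise the spread of the group without exposing the 195 cm value that the maximum would reveal.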

      Example 3: Kaplan-Meier graph

      A Kaplan-Meier graph and its background data can be published if the data behind the graph cannot be used to determine exact calendar dates or individuals’ characteristics, such as age. In this case, even individual steps can be allowed in the graph. The graph in Figure 2 can be published because it cannot be used to identify individuals or to determine precise calendar dates.

      Figure 2. Example: Kaplan-Meier graph. Image source: Tutkijaportti (In Finnish, PDF-file, 1 335 kb)

      Example 4: Sankey diagram

      If a Sankey diagram depicts exact numbers, you have to take the minimum frequency into account in the results. The data in Figure 3 cannot be published without anonymisation, such as reclassification, as the figure reveals exact frequencies of n < 3.

      Figure 3. Example: Sankey diagram.

      Example 5: Individual-level material and qualitative research

      As a rule, published results must be based on several observations. This also applies to qualitative studies. However, if you need to publish individual-level result materials, you have to process the results so that no individual person can be identified from the result materials.

      The risk of identification is especially increased in text quotations if the quotation contains details of an individual or event or if it concerns a rare phenomenon. The fictitious quotation below contains the patient’s name and a fair amount of other identifying information.

      Eero Example, 13 years old, was brought to Turku University Hospital after he fell in the courtyard of the Turku Yhteiskoulu School. E.E. suffers from haemophilia A. The gastroenterologist who examined E.E. made a referral to…

      In text quotations, details can be censored by categorisation, which reduces the risk of identification.

      [A patient] [of primary school age] was brought to [a university hospital] for treatment after [they] fell in the courtyard [of their school]. [The patient] has [a severe chronic illness]. [The physician] who examined [the patient] made a referral to…

      Checklist for producing anonymous results

      | Checklist item | Done |
      | --- | --- |
      | I have read Findata’s instructions on producing anonymous results. | |
      | The results meet the definition of anonymous data. | |
      | The results do not contain direct identifiers (e.g. name, personal identity code). | |
      | The results do not contain strong indirect identifiers (e.g. address, registration number). | |
      | The results do not contain pseudo-ID codes (e.g. Findata FID). | |
      | The results do not contain cell-specific frequencies n < 3. If the results contain cell-specific frequencies n < 3, I have justified the anonymity of the results. | |
      | If small frequencies have been censored, the censored values cannot be inferred from other results. | |
      | The results do not contain individual-level data or data that refers to a single person (e.g. minimum, maximum). If the results contain such data, I have justified the anonymity of the results. | |
      | I have assessed the anonymity of the results with particular care if the results include: an artificial intelligence / machine learning model; images or other imaging materials; genetic or genomic data; synthetic result materials. | |
      | I have checked that no hidden data has been left behind in the result files. I have paid particular attention to R files (.rda, .rdata, .rds, .rdm), metadata in imaging materials, Python’s Jupyter notebooks (.ipynb), and hidden sheets in Excel files. | |
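As one example of checking for hidden data, a Jupyter notebook is a JSON file in which code cells can carry stored outputs, and these outputs may contain individual-level rows. The sketch below lists the code cells that still have outputs. The function name and the miniature example notebook are hypothetical, and the check covers only stored cell outputs, not other notebook metadata.

```python
import json


def cells_with_outputs(notebook_text):
    """Return indices of code cells that still carry stored outputs."""
    notebook = json.loads(notebook_text)
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]


# Made-up miniature notebook: the second cell still has printed output.
example_notebook = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Notes"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["df.head()"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["id age\n"]}]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "source": ["x = 1"], "outputs": []},
    ],
})
print(cells_with_outputs(example_notebook))
```

To check a real file, read its contents as text and pass them to the function. Any reported cells should be cleared or reviewed before the notebook is exported.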

      References