Producing anonymous results

On this page you’ll find instructions for producing anonymous results. Everyone who processes personal data must follow the instructions given by Findata and provide the results of their analyses in an anonymous form that cannot be used to reveal any data or aspects concerning individual participants. According to the Act on the Secondary Use of Health and Social Data, Findata must ensure the anonymity of the results. This applies to all materials that have been authorised under said Act.

This page introduces some of the most common result types. The list is not comprehensive. Keep in mind that the content of the variables also affects the risk of disclosure. In some cases the risk of disclosure is clear, but for many variables even an individual exact value won’t expose the person.

You must check your results and verify that they follow the instructions given for each result type. With some result types anonymity can be verified easily, but with others a closer examination may be required. Following these instructions does not by itself guarantee the anonymity of the results, but it will get you as close to that goal as possible.

In addition to complying with these instructions, you must submit a summary form along with the results, and Findata will perform result checks. Do not send the results for anonymity verification until you are certain that the results have been generated in an anonymous format.

Please fill out the form carefully. The verification of anonymity relies heavily on the information provided in the summary form. We can provide you with guidance on producing anonymous results. Generate the results in a format that allows for anonymity verification. See below for tips on expediting the verification of result anonymity.

How to expedite the verification of result anonymity:

  1. Carefully read the instructions on this page. Ensure that the results you produce adhere to the guidelines.
  2. Fill out the summary form diligently. Complete all sections of the form and tick all necessary boxes on the Summary page.
    • If your results do not align with all statements, provide justification for why the results can still be considered anonymous.
    • If the results are being submitted through the transfer service Nextcloud, ensure that the Nextcloud ID is mentioned in the form. Additional instructions on encryption and data submission through Nextcloud can be found on the page Data transfers to Findata.
  3. Generate the results in a format that allows for anonymity verification. Ensure that all variables are labeled with names understandable to individuals outside the research.
    • Clearly indicate the type of results (e.g., frequency, regression coefficient, or another test statistic).
  4. Request results in reasonable-sized packages for export.
    • Avoid sending individual result packages frequently (e.g., every day). Handling result packages through multiple separate submissions consumes more time for data transfer and communication.
    • We recommend submitting results in packages of no more than 50 files. Handling an extremely large result package containing hundreds of files can be labor-intensive, especially if there are uncertainties or comments regarding result anonymity.
  5. If you are requesting other data besides results from the processing environment, ensure that these files do not contain results. Clearly describe the data being transferred in the summary form.
    • Ensure that code files do not contain results or data (e.g., pseudo-ID).

What does anonymisation mean?

Anonymisation refers to a process in which the material is processed in such a way that

  • it cannot be used to identify any individual persons either directly or indirectly
  • it cannot be used to make any conclusions concerning a specific individual person
  • no information concerning a specific person can be linked to any other material

The anonymous material must be impossible or exceedingly difficult to return to a form that could be used to identify any individuals. According to the Secondary Use Act, the results must be anonymous.

If there is a need to publish results that cannot be anonymised, this should be taken into account during the study’s planning process by considering using other criteria for carrying out the study, such as obtaining consent from the research subjects.

Note that while an individual result may be anonymous in itself, the risk of disclosure is present when several results are combined together. A typical example of this is the production of several frequency tables that use the same variable classifications. These can usually be combined to create more detailed frequency tables in which the data is defined in relation to several variables. Take into account how your results could be combined with both current and previous analyses. If you are aware of any prior publications whose analyses have utilised the same or nearly the same material or a subset of it, remember to provide at least the links to any such publications.
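
The combination risk described above can be illustrated with a minimal Python sketch. It assumes two frequency tables that share the same classification, one for a whole group and one for a nearly identical subgroup; the function name `small_differences`, the age classes and the counts are all hypothetical and only serve as an example.

```python
MIN_FREQUENCY = 5  # minimum frequency used on this page


def small_differences(table_a, table_b, threshold=MIN_FREQUENCY):
    """Return the cells whose counts differ by a small, non-zero amount.

    Both tables map a class label to a frequency. Subtracting one table
    from the other yields the frequency table of the people who belong to
    only one of the two groups, so a small difference is a small cell.
    """
    risky = {}
    for cell in table_a.keys() | table_b.keys():
        diff = abs(table_a.get(cell, 0) - table_b.get(cell, 0))
        if 0 < diff < threshold:
            risky[cell] = diff
    return risky


# Hypothetical counts: each table on its own has no cell below five.
all_participants = {"0-17": 40, "18-64": 120, "65+": 35}
without_region_x = {"0-17": 38, "18-64": 110, "65+": 35}

print(small_differences(all_participants, without_region_x))  # {'0-17': 2}
```

Here both tables pass a cell-by-cell check, yet their difference shows that only two participants aged 0-17 come from the excluded region. A check like this does not replace a case-by-case assessment; it only flags pairs of tables that deserve a closer look.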

To protect study participants’ anonymity, every frequency presented in the results must be at least five. This criterion is used to ensure data protection. For justifiable reasons and on a case-by-case basis, the threshold can be lowered so that only frequencies below three (<3), instead of below five (<5), are withheld, for example in the case of very small target groups, rare disease studies or studies concerning rare phenomena. Such a result must represent a significant finding that is necessary to report at this level of accuracy while still ensuring that the criteria for anonymity are met (i.e. the variables cannot be used to identify any specific individuals).
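
A minimum frequency check of this kind is easy to automate before submission. The following is a minimal sketch, assuming the frequency table is held as a Python dictionary; the function name `cells_below_minimum` and the example data are hypothetical. Empty cells (frequency 0) are not flagged here, which is an assumption of the sketch: whether an empty cell is acceptable depends on the variables and should be assessed separately.

```python
def cells_below_minimum(frequency_table, minimum=5):
    """Return the non-empty cells whose frequency is below the minimum."""
    return {
        cell: count
        for cell, count in frequency_table.items()
        if 0 < count < minimum
    }


# Hypothetical frequency table keyed by (sex, age class).
table = {
    ("female", "0-17"): 12,
    ("male", "0-17"): 4,
    ("female", "65+"): 7,
    ("male", "65+"): 0,
}

print(cells_below_minimum(table))             # {('male', '0-17'): 4}
print(cells_below_minimum(table, minimum=3))  # {}
```

The `minimum=3` call corresponds to the justified, case-by-case deviation mentioned above. Flagged cells can then be merged with neighbouring classes or withheld.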

Provide Findata with sufficient background information for each analysis type so that the anonymity of the results can be verified. Present this information next to the result or in a separate document so that the result and its background information can be easily understood together.

Findata uses the principles described in the table below as the basis for verifying anonymity.

Classification of disclosure risk by result type

| Data type | Result type | General classification |
|---|---|---|
| Descriptive statistics | Frequency table | To be verified |
| | Quantity table | To be verified |
| | Maximum, minimum, percentile, median | To be verified |
| | Mode | Generally safe |
| | Mean, indices, ratios, indicators | To be verified |
| | Degree of concentration | Generally safe |
| | Higher-moment indicators (such as variance, covariance, kurtosis, skewness) | Generally safe |
| | Graphs: visual representations of the original material | To be verified |
| Correlations and regression-type analyses | Linear regression coefficients | Generally safe |
| | Non-linear regression coefficients | Generally safe |
| | Estimation residuals | To be verified |
| | Estimate summary and test statistics (R², χ², etc.) | Generally safe |
| | Correlation coefficients | Generally safe |
| | Factor analysis | Generally safe |
| | Correspondence analysis | Generally safe |


Before delivering your results for anonymity verification, make sure that:

  1. your results include no frequencies below five (<5)
  2. your results cannot be used to identify any individuals either directly or indirectly and that the data cannot be combined with other data concerning the same person

Descriptive indicators and analyses

In the text below, the terms “group” and “target group” refer to the observations from which statistics are calculated.

Minimum, maximum and range

Minimum and maximum values often refer to individual units, which are the easiest to identify, so they carry a clear disclosure risk. They can be published if the value of the statistic is based on more than one unit. The anonymity of these results can be improved by categorising the data, as each category will then include several individuals. Consider using suitable quantiles alongside any minimum and maximum figures.
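
As an illustration of the "more than one unit" condition, the sketch below counts how many observations share the minimum and the maximum value. The function name `extreme_value_support` and the example ages are hypothetical.

```python
from collections import Counter


def extreme_value_support(values):
    """Return the minimum and maximum with the number of units sharing each."""
    counts = Counter(values)
    lowest, highest = min(counts), max(counts)
    return {"min": (lowest, counts[lowest]), "max": (highest, counts[highest])}


ages = [34, 34, 41, 52, 52, 67, 88]  # hypothetical data
support = extreme_value_support(ages)
print(support)  # {'min': (34, 2), 'max': (88, 1)}

# Extremes based on a single unit should not be published as such.
single_unit = [name for name, (_, n) in support.items() if n < 2]
print(single_unit)  # ['max']
```

In this example the maximum describes one person only, so it could be replaced with a suitable quantile or reported for a categorised variable instead.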

Fractiles – quantiles, deciles, percentiles, median

These can be published if the underlying frequency is large enough.

Mean, standard deviation

In rare cases, these may pose a disclosure risk. Check that the result represents a sufficiently large group and that the members of the target group do not all have the same value. Check that these statistics are not reported from several nearly identical groups or subgroups.
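
The first two checks can be sketched in a few lines of Python. The function name `mean_sd_checks` is hypothetical, and the default group size of five is an assumption borrowed from the minimum frequency on this page, since the text above only requires a "sufficiently large" group.

```python
def mean_sd_checks(values, minimum_group_size=5):
    """Return a list of reasons why a mean or standard deviation may disclose data."""
    issues = []
    if len(values) < minimum_group_size:
        issues.append("group too small")
    if len(set(values)) == 1:
        # A mean with zero deviation reveals every member's exact value.
        issues.append("all values identical")
    return issues


print(mean_sd_checks([7.0, 7.0, 7.0]))     # ['group too small', 'all values identical']
print(mean_sd_checks([1, 2, 3, 4, 5, 6]))  # []
```

The third check, comparing nearly identical groups, depends on which other results exist and is not covered by this sketch.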

Mode

This can be published in principle, but check that it will not disclose the entire group, i.e. that it does not describe the value of the entire target group.

Higher-moment indicators, such as variance

These can be published in principle, as the indicator has been clearly converted from the original individual values. Make sure not to publish an excessive number of indicators from a small group, as they could serve to disclose the entire group.

Correlation coefficients

These can be published in principle when the group under consideration contains a sufficient number of observations.

Degrees of concentration

These can be published in principle when the group under consideration contains a sufficient number of observations.

Linear regression, non-linear regression

Coefficients can be published in principle.

Test statistics

These can be published in principle.

Factor analysis

These can be published in principle, but make sure that your factors are not based on a single variable.

Principal component analysis

Principal component vectors and their corresponding values can be published in principle. Check any projections onto the principal components, as they correspond to scatter plots (see below).

Indices, ratios, indicators

Indices can be published in principle, but the calculation formula must be taken into account. Indices based on more complex formulas (e.g. the Fisher price index) do not usually pose a disclosure risk. Very simple formulas are more prone to disclosure, so indices calculated with them must be based on a sufficient number of observations.

Gini coefficients

Gini coefficients must be calculated for a sufficiently large number of observations. Data needed for the verification process: calculation formula and possibly frequencies underlying the figures.
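
To make the two pieces of verification data concrete, here is a minimal sketch that returns the coefficient together with the number of observations it is based on. It assumes one common definition of the Gini coefficient for non-negative values sorted in ascending order, G = 2·Σ(i·x_i) / (n·Σx_i) − (n + 1)/n; your own analysis may use a different formula, in which case that formula is the one to report. The function name `gini_with_frequency` is hypothetical.

```python
def gini_with_frequency(values):
    """Return (Gini coefficient, number of observations) for non-negative values."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("Gini coefficient is undefined for empty or all-zero data")
    if any(v < 0 for v in values):
        raise ValueError("this formula assumes non-negative values")
    ordered = sorted(values)
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    gini = 2 * weighted / (n * total) - (n + 1) / n
    return gini, n


print(gini_with_frequency([1, 1, 1, 1]))   # (0.0, 4)
print(gini_with_frequency([0, 0, 0, 10]))  # (0.75, 4)
```

Reporting the returned frequency next to each coefficient makes it straightforward to see whether the number of observations is large enough.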

Graphs

The data protection evaluation process for graphs relies on aggregated tabular presentations, as they make it easier to perceive the frequency of the observations underlying the points or plots in a graph, which would usually be impossible to discern from the graph itself. Therefore, a table specifying the underlying frequencies should be provided with the graph if it is used to depict individual observations or a small target group.

Histogram

When using histograms, classify the data so that each individual class contains a sufficient number of observations. This can be particularly challenging for the tails of a distribution, such as a normal distribution. The same considerations apply as for descriptive statistics, which may limit how much of the tails can be depicted.
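
One way to handle sparse tails is to merge adjacent classes until each merged class is large enough. The sketch below is a simple greedy merge from left to right; the function name `merge_sparse_bins`, the minimum of five per class and the example counts are assumptions made for illustration.

```python
def merge_sparse_bins(edges, counts, minimum=5):
    """Merge adjacent histogram classes until each holds at least `minimum`.

    `edges` has one more entry than `counts`. Returns (new_edges, new_counts).
    If the total count is below `minimum`, a single class is returned and
    still needs to be assessed by the caller.
    """
    new_edges = [edges[0]]
    new_counts = []
    running = 0
    for right_edge, count in zip(edges[1:], counts):
        running += count
        if running >= minimum:
            new_edges.append(right_edge)
            new_counts.append(running)
            running = 0
    if running > 0 or new_edges[-1] != edges[-1]:
        if new_counts:
            # Fold the leftover right tail into the last complete class.
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            new_counts.append(running)
            new_edges.append(edges[-1])
    return new_edges, new_counts


edges = [0, 10, 20, 30, 40, 50, 60]  # hypothetical class limits
counts = [2, 3, 12, 20, 4, 1]        # hypothetical frequencies
print(merge_sparse_bins(edges, counts))
# ([0, 20, 30, 40, 60], [5, 12, 20, 5])
```

The merged classes are wider at the tails, which is the loss of detail that the text above refers to.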

Scatter chart or scatter plot

As a rule, each individual point in a scatter chart describes a single unit. Therefore, these types of charts and plots are not publishable without first grouping the data so that each point contains several observations. These can be published only if the data which the chart or plot is based on could be published as a table. However, the assessment process should also take into account whether a combination of the variables you have used would allow for the identification of any individuals. You can improve the anonymity of scatter charts by replacing them with a graph depicting the frequency of observations in grid cells or by adding randomness to your points.
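
The grid cell alternative mentioned above can be sketched as follows: the points are counted per rectangular cell, and only cells with enough observations are kept for plotting. The function name `grid_cell_counts`, the cell size and the minimum of five per cell are assumptions of this sketch.

```python
import math
from collections import Counter


def grid_cell_counts(points, cell_width, cell_height, minimum=5):
    """Count (x, y) points per grid cell and separate out sparse cells.

    Returns (publishable, suppressed): a dict of cell index -> frequency for
    cells with at least `minimum` points, and a list of the remaining cells.
    """
    counts = Counter(
        (math.floor(x / cell_width), math.floor(y / cell_height))
        for x, y in points
    )
    publishable = {cell: n for cell, n in counts.items() if n >= minimum}
    suppressed = [cell for cell, n in counts.items() if n < minimum]
    return publishable, suppressed


# Hypothetical observations: five close together and one outlier.
points = [(1, 1), (2, 3), (4, 4), (5, 2), (9, 9), (35, 28)]
publishable, suppressed = grid_cell_counts(points, cell_width=10, cell_height=10)
print(publishable)  # {(0, 0): 5}
print(suppressed)   # [(3, 2)]
```

The publishable counts correspond to a frequency table, so they can be assessed in the same way as any other table, including whether the combination of the two variables could identify someone.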

Box plots

Box plots pose a disclosure risk in principle, as they contain dots that pertain to individual observations, and abnormal observations in particular could lead to the disclosure of someone’s identity. Means may also pose a disclosure risk. The same considerations apply as for descriptive statistics, and any outlier observations pose a particular disclosure risk.

Residuals

Each residual refers to a single observation. When depicting residuals, it is recommended to use an aggregated graph format instead of a graph that is based on individual points. If a graph based on individual points is used, avoid disclosing the values of the axes.

Survival analysis, Kaplan-Meier curve

These may include a disclosure risk, depending on the definition of the analysis. They may be published if each step of the curve corresponds to a sufficient number of observations. Steps based on a single observation can also be allowed if it is clear that the data behind the curve cannot be used to determine any exact ages or dates. Curves that include such steps need closer consideration, because detailed background information may make it possible to recognise individual persons.
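
To see which steps of a curve need closer consideration, the events behind each step can be counted. In the minimal sketch below, every distinct event time is treated as one step of the Kaplan-Meier curve; the function names, the minimum of five events per step and the example data are hypothetical.

```python
from collections import Counter


def events_per_step(durations, event_observed):
    """Count observed events at each distinct time (one step per time)."""
    return Counter(t for t, e in zip(durations, event_observed) if e)


def sparse_steps(durations, event_observed, minimum=5):
    """Return the event times whose step is based on fewer than `minimum` events."""
    steps = events_per_step(durations, event_observed)
    return sorted(t for t, n in steps.items() if n < minimum)


# Hypothetical follow-up times in months; 1 = event observed, 0 = censored.
durations = [12, 12, 12, 12, 12, 30, 45, 45]
observed = [1, 1, 1, 1, 1, 1, 0, 1]

print(sparse_steps(durations, observed))  # [30, 45]
```

Coarsening the time scale, for example from days to months or quarters, usually increases the number of events per step.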

Spatial analysis

Spatial analyses are particularly challenging in terms of data protection, as location information usually plays a key role in the disclosure of an individual. They usually require a great deal of reclassification and, preferably, presenting the data as heat maps instead of observation points.

Other result types

Photographs and other imaging materials

Imaging materials are verified on a case-by-case basis. It is very difficult to define any general guidelines for imaging materials, as they represent such a diverse group of materials. Naturally, these types of materials must never contain any direct text-based identifiers, and the coarser an image is, the more difficult it is to identify a person from it. The persons who handle imaging materials are usually the best at determining the disclosure risk presented by each piece of material. For example, an image of a single tooth will rarely reveal the identity of its owner, but an entire tooth chart could be used for that purpose.

Hereditary genetic data

In terms of hereditary genetic data, indicators concerning a sufficiently small number of variants calculated from a sufficiently large group can be considered anonymous. However, these must be verified on a case-by-case basis.

Machine learning

In terms of neural networks and other machine learning models (decision trees, etc.), the actual publication of these materials is rarely needed. However, the need may arise for disclosing such results outside Kapseli, and the verification of these results will be carried out on a case-by-case basis. General instructions will be provided at a later date.

Individual-level data

The anonymity of individual-level data must always be ensured on a case-by-case basis. Contact Findata for more detailed instructions.

Publishing the results

In this context publication means bringing information to the public and spreading it to the surrounding society. Publication is defined as the presentation of results outside your own working group.

Publication may take place in a scientific or other journal, thesis, textbook or manual, conference or other presentation, or in an abstract, report, review or some form of internet publication.

Publishing results from Kapseli

Data processing is done in the Kapseli environment, and only the final analysis results are exported from Kapseli. The user produces the results in an anonymous format and Findata ensures the anonymity of the results in accordance with the Secondary Use Act.

  1. Verify the anonymity of the results intended for publication using the instructions found on this page.
  2. Transfer the results and the summary form to Findata via the Output (O:) drive in Kapseli.
    • The summary form for the verification of the anonymity of results can be found in the Kapseli D-folder, in the folder named “Käyttöohjeet_User_guide_05062023”.
    • Compress the files and the summary form into a zip folder and name it as follows:
      • “Results_[Record_number_of_permit_decision]_[Kapseli_ID]_[Delivery_date]” (e.g., “Results_THL_1234_14.02.00_2020_a01_15032021”).
      • Note: write the date in format ddmmyyyy.
    • Create an empty text file named ZZZ_READY.txt on the Output drive. This will initiate an automatic transfer of the zip folder. Double-check the spelling of the file name ZZZ_READY.txt. Transfers take place every 30 minutes, on the hour and on the half hour. Transferred files will be automatically deleted from the Output drive.
    • No confirmation is sent when a transfer succeeds. If you wish, you can notify Findata of your transfer (data@findata.fi), and we will get back to you if we do not receive it.
  3. Findata will review the requests within 5 working days and submit the results via Nextcloud to the permit holder.
    • If you don’t have a Nextcloud account, fill in the form “Order a new Nextcloud account” in Findata’s E-service (asiointi.findata.fi).
    • Note that if your result files are very large, the verification process may exceed the usual time limit of 5 working days. The time limit applies only to the verification of the anonymity of results, not to the export of other files (e.g. code files) from Kapseli.
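
The packaging in step 2 can be scripted. The following is a minimal sketch under stated assumptions: the function names `package_name` and `build_package` are hypothetical, the location of the Output drive is passed in by the caller, and the list of files is expected to include the filled-in summary form.

```python
import zipfile
from datetime import date
from pathlib import Path


def package_name(record_number, kapseli_id, delivery_date):
    """Build the package name with the date in ddmmyyyy format."""
    return f"Results_{record_number}_{kapseli_id}_{delivery_date:%d%m%Y}"


def build_package(files, record_number, kapseli_id, delivery_date, output_dir):
    """Zip the result files and summary form, then create ZZZ_READY.txt.

    `output_dir` is assumed to point at the Output drive. The ready file is
    created only after the zip has been fully written and closed.
    """
    output_dir = Path(output_dir)
    name = package_name(record_number, kapseli_id, delivery_date)
    zip_path = output_dir / f"{name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            archive.write(file, arcname=Path(file).name)
    (output_dir / "ZZZ_READY.txt").touch()  # empty file that starts the transfer
    return zip_path


print(package_name("THL_1234_14.02.00_2020", "a01", date(2021, 3, 15)))
# Results_THL_1234_14.02.00_2020_a01_15032021
```

Generating the name from a `date` object avoids mistakes in the ddmmyyyy format, and creating ZZZ_READY.txt in code avoids spelling errors in the file name.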

Publishing results from other secure operating environments

If you are processing data in a secure operating environment other than Findata’s Kapseli and are ready to publish the results, follow the instructions below.

  1. Download the summary form and fill in the requested information: Summary form – verifying the anonymity of the results (Word-file, 40 kB).
  2. Compress the files and the summary form into a zip folder and name it as follows:
    • “Results_[Record_number_of_permit_decision]_[Kapseli_ID]_[Delivery_date]” (e.g., “Results_THL_1234_14.02.00_2020_a01_15032021”).
    • Note: write the date in format ddmmyyyy.
  3. You can deliver the results to Findata in two ways:
    • If you have a Nextcloud account, transfer the results via Nextcloud.
    • If you do not have a Nextcloud account, transfer the results via secure e-mail.
    • Note: do not send the result files to Findata as an attachment to a regular, non-secure e-mail.
  4. Contact Findata at data@findata.fi
    • Name the subject of your e-mail as “Ensuring the anonymity of results”
    • Specify in your e-mail whether you are transferring the results via Nextcloud or via secure email.
    • If you are using Nextcloud, please include the record number (diary number) of the data permit and your Nextcloud ID. Findata will provide you with the name of the folder to which you can transfer a zip folder containing your results and the summary form.
    • If you transfer your results by secure email, you will receive a secure e-mail from Findata to which you can reply to securely transfer the zip folder containing your results and the summary form.
    • For more information on encryption and how to transfer data via Nextcloud, see the page Data transfers to Findata.
  5. If there are any concerns about the anonymity of the results, we will be in touch within seven working days of the results being submitted.
    • If you do not hear from us within seven working days of submitting your results, you can proceed with the publication of your results.

See tips on how to speed up the process of verifying the anonymity of your results at the top of this page.