On this page, you’ll find instructions for producing anonymous results. If you process personal data, you must provide the results in an anonymous form that cannot reveal any information about individuals. Findata ensures this anonymity according to the Act on the Secondary Use of Health and Social Data. This applies to all materials that have been authorised under the Act.
This page includes common result types and considerations for ensuring anonymity. Note that the list is not comprehensive, and the content of variables can affect the risk of disclosure. Even if an exact value is present, it may not expose an individual.
You must check your results and verify that they follow these instructions regarding different result types. While some result types can be easily verified for anonymity, others may require more specific examination. Adhering to these instructions doesn’t guarantee anonymity, but it gets you as close as possible to the goal.
In addition to following these instructions, you must submit a summary form along with the results, and Findata will perform result checks. Do not send the results for anonymity verification until you are certain they have been generated in an anonymous format.
Fill out the summary form carefully, as the verification of anonymity relies heavily on the information provided. If you need guidance on producing anonymous results, we can assist you. Generate the results in a format that allows for anonymity verification. Below, you’ll find tips to expedite the verification process.
How to expedite the verification of result anonymity:
- Carefully read the instructions on this page. Ensure that the results you produce adhere to the guidelines.
- If needed, contact Findata’s Help Desk for assistance in applying the instructions.
- Fill out the summary form diligently. Complete all sections of the form and tick all necessary boxes on the Summary page.
- If your results do not align with all statements, provide justification for why the results can still be considered anonymous.
- If you submit the results through the Nextcloud transfer service, make sure the Nextcloud ID is mentioned in the form. Additional instructions on encryption and data submission through Nextcloud can be found on the “Data Transfers to Findata” page.
- Generate the results in a format that allows for anonymity verification. Ensure that all variables are labeled with names understandable to individuals outside the research.
- Clearly indicate the type of results (e.g., frequency, regression coefficient, or another test statistic).
- Request results in reasonable-sized packages for export.
- Avoid sending individual result packages frequently (e.g., every day). Handling result packages through multiple separate submissions consumes more time for data transfer and communication.
- We recommend submitting results in packages of no more than 50 files. Handling an extremely large result package containing hundreds of files can be labor-intensive, especially if there are uncertainties or comments regarding result anonymity.
- If you are requesting other data besides results from the processing environment, ensure that these files do not contain results. Clearly describe the data being transferred in the summary form.
- Ensure that code files do not contain results or data (e.g., pseudo-ID).
What does anonymisation mean?
Anonymisation is the process of ensuring that material:
- Cannot be used to identify any individual directly or indirectly.
- Cannot be used to draw conclusions about a specific individual.
- Cannot be linked to other material concerning a specific person.
The anonymised material must be impossible or extremely difficult to revert to an identifiable form. According to the Secondary Use Act, results must be anonymous. If there is a need to publish results that cannot be anonymised, this should be considered during the study’s planning process by using other criteria, such as obtaining consent from research subjects.
Even if an individual result is anonymous, there is a risk of disclosure when multiple results are combined. For example, several frequency tables using the same variable classifications can often be combined to create more detailed tables. Consider how your results could be combined with both current and previous analyses. If prior publications have used similar material, provide links to these publications.
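The combination risk described above can be illustrated with a minimal sketch. It assumes two hypothetical frequency tables that share the same classification but cover slightly different populations. The function name `difference_cells`, the table contents, and the default threshold of five (taken from the minimum frequency stated on this page) are illustrative assumptions, not part of Findata's verification procedure.

```python
def difference_cells(table_a, table_b, min_count=5):
    """Return categories where the two tables differ by a small non-zero count."""
    risky = {}
    for category, count_a in table_a.items():
        diff = abs(count_a - table_b.get(category, 0))
        # A small non-zero difference describes a small, possibly identifiable group
        if 0 < diff < min_count:
            risky[category] = diff
    return risky


# Hypothetical counts: all patients, and the same patients without one small clinic
all_patients = {"18-39": 120, "40-64": 210, "65+": 95}
without_clinic = {"18-39": 118, "40-64": 200, "65+": 95}

print(difference_cells(all_patients, without_clinic))
```

Here both tables pass a cell-by-cell check on their own, yet subtracting them reveals a group of two in the youngest age class.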
To protect data subjects’ anonymity, the minimum frequency in results is five. For justifiable reasons and on a case-by-case basis, a minimum frequency of three may be used, for example in studies involving very small target groups or rare diseases. This is acceptable only if the result is significant, must be reported at this level of accuracy, and still meets the anonymity criteria.
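As a minimal sketch of how the minimum frequency could be checked before submission, the function below lists the cells of a frequency table that fall under the threshold. The nested-dictionary table layout, the name `small_cells`, and the `flag_zero` switch are assumptions made for this example; whether empty cells need attention depends on the material and is not settled by this sketch.

```python
def small_cells(freq_table, min_count=5, flag_zero=False):
    """List (row, column, count) for cells below min_count.

    Empty cells are skipped unless flag_zero is True (an assumption of
    this sketch, not a Findata rule).
    """
    flagged = []
    for row_label, row in freq_table.items():
        for col_label, count in row.items():
            if count < min_count and (count > 0 or flag_zero):
                flagged.append((row_label, col_label, count))
    return flagged


# Hypothetical table: rows are regions, columns are diagnosis groups
table = {
    "Region A": {"Diagnosis X": 42, "Diagnosis Y": 3},
    "Region B": {"Diagnosis X": 17, "Diagnosis Y": 0},
}

print(small_cells(table))               # default minimum frequency of five
print(small_cells(table, min_count=3))  # case-by-case minimum of three
```

Cells that are flagged would need to be merged with neighbouring classes or suppressed before the results are sent for verification.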
Provide Findata with sufficient background information for each analysis type to ensure anonymity. This information must be displayed with the result or as a separate document so that the result and background information can be easily understood.
Findata uses the principles described in the table below as the basis for verifying anonymity.
Classification of disclosure risk by result type
Data type | Result type | General classification |
---|---|---|
Descriptive statistics | Frequency table | To be verified |
Descriptive statistics | Quantity table | To be verified |
Descriptive statistics | Maximum, minimum, percentile, median | To be verified |
Descriptive statistics | Mode | Generally safe |
Descriptive statistics | Mean, indices, ratios, indicators | To be verified |
Descriptive statistics | Degree of concentration | Generally safe |
Descriptive statistics | Higher moment indicators (such as variance, covariance, kurtosis, skewness) | Generally safe |
Descriptive statistics | Graphs: visual representations of the original material | To be verified |
Correlations and regression-type analyses | Linear regression coefficients | Generally safe |
Correlations and regression-type analyses | Non-linear regression coefficients | Generally safe |
Correlations and regression-type analyses | Estimation residuals | To be verified |
Correlations and regression-type analyses | Estimate summary and test variables (R², χ² etc.) | Generally safe |
Correlations and regression-type analyses | Correlation coefficients | Generally safe |
Correlations and regression-type analyses | Factor analysis | Generally safe |
Correlations and regression-type analyses | Correspondence analysis | Generally safe |
Before delivering your results for anonymity verification, make sure that:
- your results include no frequencies below 5
- your results cannot be used to identify any individuals either directly or indirectly and that the data cannot be combined with other data concerning the same person
Descriptive indicators and analyses
In the text below, the terms “group” and “target group” refer to the observations from which statistics are calculated.
Minimum, maximum and range
Minimum and maximum values often refer to individual units, which are the easiest to identify, so they carry a clear disclosure risk. They can be published if the value of the statistic is based on more than one unit. Categorising the data improves the anonymity of these results, as each category will then include several individuals. Consider using suitable quantiles alongside any minimum and maximum figures.
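A minimal sketch of the "more than one unit" check for extreme values follows. The function names and the example ages are hypothetical; the limit of two units mirrors the wording above and may need to be stricter for sensitive variables.

```python
from collections import Counter


def extreme_value_support(values):
    """Return the minimum and maximum together with how many units share each."""
    counts = Counter(values)
    lowest, highest = min(values), max(values)
    return {"min": (lowest, counts[lowest]), "max": (highest, counts[highest])}


def publishable_extremes(values, min_units=2):
    """True for an extreme only if at least min_units units share that value."""
    support = extreme_value_support(values)
    return {name: n >= min_units for name, (_, n) in support.items()}


# Hypothetical ages: two people share the minimum, one person holds the maximum
ages = [18, 18, 19, 25, 40, 97]
print(extreme_value_support(ages))
print(publishable_extremes(ages))
```

In this example the maximum rests on a single unit, so it would be a candidate for categorisation (for example an open-ended top class) or for replacement with a quantile.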
Fractiles – quantiles, deciles, percentiles, median
These can be published if the underlying frequency is large enough.
Mean, standard deviation
In rare cases, these may carry a disclosure risk. Check that the result represents a sufficiently large group and that the entire target group does not share the same value. Also check that these statistics are not reported from several nearly identical groups or subgroups.
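The first two checks above can be sketched as follows. The function name `check_mean_output` and the default group size of five are assumptions for illustration; the page itself only requires a "sufficiently large" group.

```python
from statistics import mean, stdev


def check_mean_output(values, min_count=5):
    """Return a list of reasons why a mean or standard deviation may be disclosive."""
    problems = []
    if len(values) < min_count:
        problems.append("group too small")
    if len(set(values)) == 1:
        # A mean equal to every member's value reveals each individual's value
        problems.append("all units share the same value")
    return problems


# Hypothetical measurements from one subgroup
measurements = [4.1, 5.0, 3.8, 4.4, 4.9, 5.2]
print(round(mean(measurements), 2), round(stdev(measurements), 2))
print(check_mean_output(measurements))
print(check_mean_output([3, 3, 3, 3, 3, 3]))
```

The third check, reporting from nearly identical groups, depends on comparing outputs with each other and is not covered by this sketch.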
Mode
This can be published in principle, but check that it will not disclose the entire group, i.e. that it does not describe the value of the entire target group.
Higher moment indicators, such as variance
These can be published in principle, as the indicator has been clearly converted from the original individual values. Make sure not to publish an excessive number of indicators from a small group, as they could serve to disclose the entire group.
Correlation coefficients
These can be published in principle when the group under consideration contains a sufficient number of observations.
Degrees of concentration
These can be published in principle when the group under consideration contains a sufficient number of observations.
Linear regression, non-linear regression
Coefficients can be published in principle.
Test variables
These can be published in principle.
Factor analysis
These can be published in principle, but make sure that your factors are not based on a single variable.
Principal component analysis
Principal component vectors and their corresponding values can be published in principle. Check any projections of the principal components (they correspond to scatter plots, see below).
Indices, ratios, indicators
Indices can be published in principle, but the calculation formula must be taken into account. Indices based on more complex formulas (e.g. the Fisher price index) do not usually pose a disclosure risk. Very simple formulas are more prone to disclosure, so indices of that kind must be based on a sufficient number of observations.
Gini coefficients
Gini coefficients must be calculated for a sufficiently large number of observations. Data needed for the verification process: calculation formula and possibly frequencies underlying the figures.
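As an example of supplying the calculation formula together with the underlying frequency, the sketch below uses one common formula for the Gini coefficient of sorted, non-negative values, G = 2·Σ(i·x_i) / (n·Σx_i) − (n + 1)/n. The choice of this formula, the function names, and the default limit of five observations are assumptions for illustration; report whichever formula your analysis actually used.

```python
def gini(values):
    """Gini coefficient of non-negative values, using ranks of the sorted values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    rank_weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * rank_weighted / (n * total) - (n + 1) / n


def gini_report(values, min_count=5):
    """Bundle the coefficient with the number of observations behind it."""
    return {
        "gini": gini(values),
        "n": len(values),
        "enough_observations": len(values) >= min_count,
    }


# Hypothetical incomes
print(gini_report([12, 15, 20, 22, 30, 45, 80]))
```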
Graphs
The data protection evaluation process for graphs relies on aggregated tabular presentations, as they make it easier to perceive the frequency of the observations underlying the points or plots in a graph, which would usually be impossible to discern from the graph itself. Therefore, a table specifying the underlying frequencies should be provided with the graph if it is used to depict individual observations or a small target group.
Histogram
When using histograms, classify the data so that each individual class contains a sufficient number of observations. This can be particularly challenging for the tails of a distribution, such as a normal distribution. The guidance for descriptive statistics applies here as well, and it may limit how much of the tail can be shown.
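One way to handle sparse tails is to merge neighbouring classes until each class is large enough. The sketch below is a simple greedy merge from left to right; the function name, the bin edges, and the default limit of five are assumptions for illustration, and other merging strategies may preserve the shape of the distribution better.

```python
def merge_sparse_bins(edges, counts, min_count=5):
    """Merge adjacent histogram classes until each has at least min_count.

    edges has one more entry than counts. Returns (new_edges, new_counts).
    """
    new_edges = [edges[0]]
    new_counts = []
    running = 0
    for right_edge, count in zip(edges[1:], counts):
        running += count
        if running >= min_count:
            new_edges.append(right_edge)
            new_counts.append(running)
            running = 0
    if running > 0 or new_edges[-1] != edges[-1]:
        if new_counts:
            # Fold the leftover right tail into the last completed class
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            # Everything together is still one (possibly too small) class
            new_counts.append(running)
            new_edges.append(edges[-1])
    return new_edges, new_counts


# Hypothetical histogram with sparse tails on both sides
edges = [0, 10, 20, 30, 40, 50, 60]
counts = [1, 2, 8, 12, 3, 1]
print(merge_sparse_bins(edges, counts))
```

If the whole data set is smaller than the limit, the function returns a single class that is still too small, so the total should be checked separately.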
Scatter chart or scatter plot
As a rule, each individual point in a scatter chart describes a single unit. Therefore, these types of charts and plots are not publishable without first grouping the data so that each point contains several observations. These can be published only if the data which the chart or plot is based on could be published as a table. However, the assessment process should also take into account whether a combination of the variables you have used would allow for the identification of any individuals. You can improve the anonymity of scatter charts by replacing them with a graph depicting the frequency of observations in grid cells or by adding randomness to your points.
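The grid-cell alternative mentioned above can be sketched as follows: points are counted per square cell, and only cells with enough observations are kept. The function name, the cell size, and the default limit of five are assumptions for illustration.

```python
import math
from collections import Counter


def grid_counts(points, cell_size, min_count=5):
    """Count points per square grid cell and separate out sparse cells.

    Returns (publishable, n_suppressed): a dict mapping cell index to count
    for cells with at least min_count points, and the number of cells
    that fell below the limit.
    """
    cells = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points
    )
    publishable = {cell: n for cell, n in cells.items() if n >= min_count}
    n_suppressed = sum(1 for n in cells.values() if n < min_count)
    return publishable, n_suppressed


# Hypothetical (x, y) observations: five close together and one isolated point
points = [(1, 2), (3, 4), (5, 5), (7, 1), (9, 9), (35, 38)]
print(grid_counts(points, cell_size=10))
```

The resulting cell counts can be drawn as a shaded grid and submitted with the corresponding frequency table. The isolated point ends up in a suppressed cell and is not shown.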
Box plots
Box plots pose a disclosure risk in principle, as they contain dots that represent individual observations, and outliers in particular could lead to the disclosure of someone’s identity. Means may also pose a disclosure risk. The guidance for descriptive statistics applies here as well.
Residuals
Each residual refers to a single observation. When depicting residuals, it is recommended to use a graph format that is not based on individual points. If a graph based on individual points is used, avoid disclosing the values of the axes.
Survival analysis, Kaplan-Meier curve
These may include a disclosure risk, depending on the definition of the analysis. They may be published if each step of the curve corresponds to a sufficient number of observations. If it is clear that the data behind the curve cannot be used to determine any exact ages or dates, steps with single observations can also be allowed. Such curves need closer consideration, however, because detailed background information may make it possible to recognise individual persons.
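To see which steps of a Kaplan-Meier curve rest on few observations, the number of events behind each step can be tabulated. The sketch below computes the product-limit estimate with the standard library only; the function names, the example data, and the reading of "observations per step" as the number of events at that time point are assumptions for illustration.

```python
from collections import Counter


def km_steps(durations, events):
    """Return (time, n_events, n_at_risk, survival) for each event time."""
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    steps = []
    survival = 1.0
    for t in sorted(event_counts):
        at_risk = sum(1 for d in durations if d >= t)
        n_events = event_counts[t]
        survival *= 1 - n_events / at_risk
        steps.append((t, n_events, at_risk, survival))
    return steps


def sparse_steps(steps, min_events=5):
    """Times at which the curve steps down on fewer than min_events events."""
    return [t for t, n_events, _, _ in steps if n_events < min_events]


# Hypothetical follow-up times in months; 1 = event observed, 0 = censored
durations = [2, 2, 2, 2, 2, 5, 7, 7]
events = [1, 1, 1, 1, 1, 1, 0, 0]

steps = km_steps(durations, events)
print(steps)
print(sparse_steps(steps))
```

Steps that are flagged could be handled by measuring time in coarser units, so that several events fall on the same step, or by documenting why exact ages or dates cannot be derived from the curve.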
Spatial analysis
Spatial analysis is particularly challenging in terms of data protection, as location information usually plays a key role in the disclosure of an individual. It usually requires a great deal of reclassification and, preferably, presenting the data as heat maps instead of observation points.
Other result types
Photographs and other imaging materials
Imaging materials are verified on a case-by-case basis. It is very difficult to define general guidelines for imaging materials, as they are such a diverse group. These materials must never contain any direct text-based identifiers, and the coarser an image is, the more difficult it is to identify a person from it. The persons who handle imaging materials are usually best placed to determine the disclosure risk of each piece of material. For example, an image of a single tooth will rarely reveal the identity of its owner, but an entire tooth chart could be used for that purpose.
See also the principle guidelines prepared by the high-level expert group appointed by the Ministry of Social Affairs and Health on the anonymisation and anonymity of image and signal data.
Hereditary genetic data
In terms of hereditary genetic data, indicators concerning a sufficiently small number of variants calculated from a sufficiently large group can be considered anonymous. However, these must be verified on a case-by-case basis.
Machine learning
In terms of neural networks and other machine learning models (decision trees, etc.), the actual publication of these materials is rarely needed. However, the need may arise for disclosing such results outside Kapseli, and the verification of these results will be carried out on a case-by-case basis. General instructions will be provided at a later date.
Individual-level data
The anonymity of individual-level data must always be ensured on a case-by-case basis. Contact Findata for more detailed instructions.
References
- Bond et al.: Guidelines for Output Checking (PDF file, 1515 kB)
- Brandt et al. (2009): Guidelines for the checking of output based on microdata research (ec.europa.eu)
- Griffiths, E. et al. (2019). Handbook on Statistical Disclosure Control for Outputs, version 1.0 2019
- Hundepool, Anco; Domingo-Ferrer, Josep; Franconi, Luisa; Giessing, Sarah; Schulte-Nordholt, Eric; Spicer, Keith & de Wolf, Peter-Paul (2012). Statistical Disclosure Control. Wiley
Publishing the results
Publication refers to making information publicly available, which includes presenting results outside your immediate working group. This can be in the form of a scientific journal, thesis, textbook, manual, conference presentation, abstract, report, survey, or internet publication.
Publishing results from Kapseli
- Data processing: Data is processed in the Kapseli environment, and only the final analysis results are exported. Results must be in an anonymous format, with Findata ensuring anonymity as per the Secondary Use Act.
- Verify anonymity: Use the guidelines on the page Producing anonymous results to verify the anonymity of results intended for publication.
- Transfer results: Transfer the results and the summary form to Findata via the Output (O:) drive in Kapseli.
- The summary form for verifying anonymity is located in the Kapseli D-folder under “Käyttöohjeet_User_guide_05062023.”
- Compress the files and the summary form into a zip folder named as follows:
- “Results_[Record_number_of_permit_decision][Kapseli_ID][Delivery_date]” (e.g., “Results_THL_1234_14.02.00_2020_a01_15032021”).
- Note: Date format should be ddmmyyyy.
- Create an empty text file named “ZZZ_READY.txt” in the Output drive. This triggers the automatic transfer of the zip folder. Ensure the file name is correct.
- Transfers occur every 30 minutes, on the hour and on the half hour. Files will be deleted from the Output drive after transfer.
- Notify Findata (optional): Email Findata at data@findata.fi to confirm your transfer. We will follow up if we do not receive your submission. There will be no confirmation of transfer success.
- Review and delivery of results: Findata will review the submission within 5 working days and provide the results via Nextcloud to the permit holder. If additional information is needed, we will contact you.
- For large result files, the review process might exceed the usual 5-day limit. This time limit pertains only to verifying anonymity, not to other file imports from Kapseli (e.g., code files).
- If you don’t have a Nextcloud account, request one via the “Order a new Nextcloud account” form in Findata’s e-service.
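The packaging steps for Kapseli described above can be sketched as a small script. This is an illustration under stated assumptions, not an official Findata tool: the function names are hypothetical, the underscore separators and the replacement of slashes in the record number are inferred from the example name above, and the `.zip` extension is added by the sketch.

```python
import zipfile
from datetime import date
from pathlib import Path


def results_zip_name(record_number, kapseli_id, delivery_date):
    """Build the archive name with the delivery date in ddmmyyyy format."""
    # Assumption: slashes in the record number become underscores, as in the example
    safe_record = record_number.replace("/", "_")
    return f"Results_{safe_record}_{kapseli_id}_{delivery_date:%d%m%Y}.zip"


def package_results(result_files, summary_form, output_dir, zip_name):
    """Zip the results with the summary form, then create the trigger file."""
    output_dir = Path(output_dir)
    zip_path = output_dir / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in [*result_files, summary_form]:
            archive.write(path, arcname=Path(path).name)
    # Create the empty trigger file only after the archive is complete
    (output_dir / "ZZZ_READY.txt").touch()
    return zip_path


name = results_zip_name("THL/1234/14.02.00/2020", "a01", date(2021, 3, 15))
print(name)
```

A call such as `package_results(["table1.csv"], "summary_form.docx", "O:/", name)` would then write the archive and the trigger file to the Output drive; the file names in that call are placeholders.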
Publishing results from other secure operating environments
- Summary form: Download and complete the form for verifying the anonymity of the results
- Compress files: Zip the files and the summary form, naming the folder as follows:
- “Results_[Record_number_of_permit_decision][Kapseli_ID][Delivery_date]” (e.g., “Results_THL_1234_14.02.00_2020_a01_15032021”).
- Note: Date format should be ddmmyyyy.
- Transfer results:
- Nextcloud: If you have a Nextcloud account, transfer results via Nextcloud.
- Secure Email: If you do not have a Nextcloud account, transfer results via secure email. Do not send results via regular, non-secure email.
- Contact Findata: Email Findata at data@findata.fi with the subject “Ensuring the anonymity of results.”
- Indicate whether you are using Nextcloud or secure email for the transfer.
- If using Nextcloud, include the diary number of the data permit and your Nextcloud ID. Findata will provide the folder name for your transfer and a zip folder with the summary form.
- If using secure email, Findata will send a secure email for you to reply with your zip folder containing the results and the summary form.
- Follow-up: If there are concerns about the anonymity of the results, Findata will contact you within seven working days.
- If you do not hear from us within this timeframe, you may proceed with publishing your results.
Report published research results to Findata
Use the form below to report articles and publications that have made use of data authorised by Findata. One of the criteria for the issuing of a data permit for the purpose of scientific research is that the results are published as scientific publications. The form can also be used to report publications of data authorised for other uses.
Reference Guide
If Findata has granted a data permit or made a data request decision for your project, cite Findata in publications as follows: “Sosiaali- ja terveysalan tietolupaviranomainen Findata” or “Finnish Social and Health Data Permit Authority Findata.”
- Follow the writing guidelines of the scientific publication series.
- We recommend that references to Findata be made in accordance with its statutory duties. In data permits, these duties include, for example, pseudonymization and ensuring the anonymity of results, and in data requests, the duties include data integration, aggregation, and anonymization.
- Findata can be referenced in the text, tables, figures, permit lists, acknowledgments, and reference lists.
- Whenever possible, include the diary number(s) of the data permit or data request in the references.
Examples of in-text citations
“Research data was obtained from the Finnish Social and Health Data Permit Authority Findata with data permit THL/XXXX/14.XX.00/20XX. Findata was responsible for the pseudonymization of the data and ensuring the anonymity of the final results.”
“The statistics were produced by Findata, the Finnish Social and Health Data Permit Authority, with data request THL/XXXX/14.XX.00/20XX. Findata was responsible for data integration and producing the anonymized statistics.”
Example of table citation
Data | Source |
---|---|
Research data | Finnish Social and Health Data Permit Authority Findata, data permit THL/XXXX/14.XX.00/20XX |
Example of citation in a reference list
Findata. (Year). Data permit THL/XXXX/14.XX.00/20XX. Finnish Social and Health Data Permit Authority Findata.