Tips for defining the extraction of textual data

Consider necessity: Determine if textual data is essential for conducting your research.
- The need for unstructured data must always be justified. Often, similar information can be extracted in a structured format.
- We mask direct identifiers in textual data, which affects the usability of the data.
- Processing textual data takes time and significantly increases costs.
Request insights from the data controller on appropriate keywords.
- The data controller is best positioned to assess which keywords will yield the most comprehensive results without extraneous information.
- For example, the keyword "pressure" returns every record that mentions pressure, including blood pressure, eye pressure, and so on. If the research focuses on blood pressure, the extracted textual data will contain a substantial amount of unnecessary information.
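
The effect of keyword choice can be illustrated with a small sketch. The notes below are invented for illustration, and `matching_notes` is a hypothetical helper, not part of any data controller's actual extraction tooling. It compares how many notes a broad keyword and a narrower one would return.

```python
import re

# Invented example notes, for illustration only.
notes = [
    "Blood pressure 150/95 at admission.",
    "Eye pressure measured, within normal range.",
    "Patient reports pressure in the chest.",
    "Follow-up: blood pressure medication adjusted.",
]

def matching_notes(keyword, texts):
    """Return the texts that contain the keyword, ignoring case."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]

broad = matching_notes("pressure", notes)         # matches every note above
narrow = matching_notes("blood pressure", notes)  # matches only the two blood pressure notes

print(len(broad), len(narrow))
```

In this toy data, the broad keyword returns twice as many notes as the narrow one, and half of them are irrelevant to a blood pressure study.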
Specify the length of the text snippet to be extracted on a variable-by-variable basis.
- The text snippet should be as short as possible.
- Request the data controller’s view on the length of the text to be extracted.
  - For instance, the keyword plus 50 characters on either side (keyword +/- 50 characters).
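
As a rough sketch of what a "keyword +/- 50 characters" snippet means in practice, the function below cuts a fixed window around each keyword match. The name `extract_snippets` and its parameters are assumptions made for this example. The data controller's actual extraction method may differ, for instance in how it handles word boundaries or overlapping matches.

```python
import re

def extract_snippets(text, keyword, window=50):
    """Return each keyword match with up to `window` characters on either side.

    Matching is case-insensitive. Windows are clipped at the start and end
    of the text, and overlapping windows are returned separately (not merged).
    """
    snippets = []
    for match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end])
    return snippets

# Invented example text, for illustration only.
note = "Patient seen at the outpatient clinic. Blood pressure 150/95, medication adjusted."
for snippet in extract_snippets(note, "blood pressure", window=20):
    print(snippet)
```

A smaller `window` yields shorter snippets and less surrounding text to process, which is why the window is worth agreeing on separately for each variable.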
Ensure the extraction is scoped appropriately.
- If the extraction can be limited to a specific department or field of treatment rather than the entire healthcare district’s database, both the amount of text to be extracted and the processing costs will be significantly lower.