Description of the topic area “Synthetic Data Generation”

Synthetic data is artificially generated data that mimics real-world data. It is designed to have statistical properties and patterns similar to those of the original data while containing no sensitive or personally identifiable information (PII). It is often used in machine learning, data analysis, application development, and data privacy, in industries such as healthcare, finance, automotive, robotics, and insurance.

Synthetic data serves as a privacy-preserving alternative to using real data in situations where privacy concerns or data protection regulations prohibit the use of actual personal or sensitive data. By creating synthetic data that closely resembles the original dataset, organizations can perform analysis, testing, and development activities without exposing sensitive information.

Generating synthetic data involves applying statistical models, algorithms, or machine learning techniques to the original data to create new data points that are statistically similar but do not correspond to any real individuals or entities. Various approaches can be used, such as generative adversarial networks (GANs), variational autoencoders (VAEs), differential privacy techniques, or rule-based algorithms.
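As a minimal illustration of the statistical-model approach, the sketch below fits a multivariate Gaussian to a numeric table and samples new records from it. Everything in it is an assumption made for illustration: the two-column "real" dataset is itself simulated, the function names `fit_gaussian` and `sample_synthetic` are hypothetical, and a plain Gaussian model is far simpler than a GAN or VAE. It also carries no formal privacy guarantee on its own.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real data.

    `real` is a 2-D array with one row per record and one column per feature.
    """
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n, seed=0):
    """Draw n new records from the fitted Gaussian model."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n)


# Stand-in for a real dataset (simulated here; two correlated numeric columns).
rng = np.random.default_rng(42)
real = rng.multivariate_normal(
    mean=[50.0, 120.0],
    cov=[[100.0, 60.0], [60.0, 225.0]],
    size=5000,
)

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n=5000)

# The synthetic rows are fresh draws from the model, not copies of real rows.
print("real mean:     ", real.mean(axis=0))
print("synthetic mean:", synthetic.mean(axis=0))
```

The same fit-then-sample pattern carries over to more expressive generators; only the model that is fitted and sampled changes.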

The advantages of synthetic data include privacy protection, reduced risk of data breaches, and the ability to share data more freely for research and development. It also allows organizations to create larger datasets that capture rare events or edge cases, which may be difficult to obtain from real data alone. However, synthetic data must be carefully validated to ensure that it preserves the desired statistical properties and accurately represents the characteristics of the original data.
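One simple way to approach such validation is to compare the distribution of each column in the real and synthetic data. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) using NumPy only. The helper name `ks_statistic` and the simulated samples are assumptions for illustration; a real evaluation would typically combine several utility and privacy metrics.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic for two 1-D samples.

    Returns the maximum absolute difference between the two empirical
    cumulative distribution functions: 0 means identical, 1 means disjoint.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Simulated stand-ins: one faithful and one biased "synthetic" column.
rng = np.random.default_rng(0)
real_col = rng.normal(loc=0.0, scale=1.0, size=2000)
good_synth = rng.normal(loc=0.0, scale=1.0, size=2000)
bad_synth = rng.normal(loc=1.0, scale=1.0, size=2000)

ks_good = ks_statistic(real_col, good_synth)
ks_bad = ks_statistic(real_col, bad_synth)

print("KS (faithful synthetic):", round(ks_good, 3))
print("KS (shifted synthetic): ", round(ks_bad, 3))
```

A small statistic suggests that a column's marginal distribution is well preserved. It says nothing about relationships between columns or about privacy risk, which need separate checks.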

Objectives

The objective is to expose the student to the current research literature on the topic, combined with a range of use cases from industry or the public sector for experimentation and illustration. Below is a set of knowledge, skills, and competences that may be acquired by working on this topic area:

Knowledge

  1. Understanding of synthetic data and its importance in privacy protection and data analysis.
  2. Knowledge of data privacy regulations and ethical considerations in data handling.
  3. Familiarity with statistical concepts and measures used for data evaluation.
  4. Awareness of different techniques and algorithms for generating synthetic data.
  5. Knowledge of machine learning and deep learning concepts related to synthetic data generation.
  6. Understanding of data utility and the preservation of key data characteristics.

Skills

  1. Skill in applying statistical models and algorithms to generate synthetic data.
  2. Proficiency in using generative adversarial networks (GANs), rule-based algorithms and differential privacy techniques for synthetic data generation.
  3. Ability to evaluate and compare the quality of synthetic data against real data using statistical measures and metrics.
  4. Skill in assessing privacy preservation and risk reduction in synthetic data.

Competence

  1. Competence in selecting and applying appropriate synthetic data generation techniques for specific use cases.
  2. Competence in addressing privacy concerns and ethical considerations associated with synthetic data usage.
  3. Proficiency in using synthetic data for machine learning and AI applications, testing algorithms, and model validation.
  4. Competence in data governance and security practices for handling synthetic data.
  5. Ability to critically analyze the limitations and biases of synthetic data and propose mitigation strategies.
  6. Competence in evaluating emerging techniques and research directions in synthetic data generation.

Suggested reading

A good primer on the topic can be found here

From the health sector perspective, we recommend reading some of these articles as a starter: