Synthetic datasets could change the way developers train AI models in healthcare. They have the potential to increase the size of training datasets for AI models, whilst protecting patient privacy, but is this really a solution or just hype?

We’ve set out 7 things you need to know.

1. What is synthetic data? A synthetic dataset is an “artificial” dataset containing computer-generated data instead of real-world records. In the healthcare setting, the term “synthetic data” is often used to refer to data generated from real data using a specifically designed model. This is done in a manner that maintains certain characteristics of the original data.
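To make this concrete, here is a minimal sketch in Python of one very simple way a model could be fitted to real records and then sampled to produce artificial ones. It is an illustration under stated assumptions, not a description of how any particular tool works: the function names, the column names and the toy values are all invented for the example, and real projects typically use far more sophisticated models.

```python
import random
import statistics

def fit_column_model(records, columns):
    """Estimate the mean and standard deviation of each numeric column."""
    model = {}
    for col in columns:
        values = [r[col] for r in records]
        model[col] = (statistics.mean(values), statistics.stdev(values))
    return model

def sample_synthetic(model, n, seed=0):
    """Draw n artificial records, sampling each column from a normal distribution."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n)
    ]

# Invented toy values standing in for real records (not actual patient data).
toy_records = [
    {"age": 34, "systolic_bp": 118},
    {"age": 51, "systolic_bp": 131},
    {"age": 47, "systolic_bp": 127},
    {"age": 62, "systolic_bp": 140},
    {"age": 29, "systolic_bp": 112},
]

model = fit_column_model(toy_records, ["age", "systolic_bp"])
synthetic = sample_synthetic(model, n=1000)
```

The synthetic records keep the average and spread of each column, which is the sense in which “certain characteristics” are maintained. This sketch samples each column independently, so it would not preserve relationships between columns (for example, between age and blood pressure). Note also that the fitting step still reads the real records.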

2. Why all the hype? When implemented well, synthetic datasets represent the real data closely, are fit for their intended use case and protect sensitive patient information. They facilitate access to diverse yet realistic data, which may be used to train machine learning models.

3. Good for privacy: Using “real” patient data for product development raises data privacy concerns about the anonymity of the patients. Using synthetic datasets is a potential solution to this issue: to the extent that synthetic data do not relate to any identified or identifiable living individuals, they are not personal data and data protection obligations do not apply. Researchers are potentially free to use these datasets without the compliance burdens imposed by the GDPR.

4. Better than “real” patient data? A great advantage of synthetic data is that they can be used to address specific requirements which may not be met with real data. Synthetic datasets may be used as a “simulation”, allowing researchers to account for unexpected results and create a solution if the initial results are not satisfactory. In addition to being complicated and expensive to collect, real patient data can contain inaccuracies or reveal a bias that may affect the quality of the model used for machine learning. Synthetic data can potentially provide balance and variety, and the generation process can automatically fill in missing values and apply labels, enabling more accurate predictions.

Further, conducting clinical trials with few patients often leads to inaccurate results. Synthetic datasets can be used to create control groups for clinical trials related to rare or recently discovered diseases that lack sufficient existing data.

5. The view of the ICO: UK regulators are contemplating the potential impact of synthetic data. The Information Commissioner’s Office (ICO) considers synthetic data to be a ‘privacy-enhancing’ technique which reflects the data minimisation principle, i.e. the principle that a data controller should limit the collection of personal information to what is directly relevant and necessary to accomplish a specified purpose.

6. But there are still underlying privacy risks: The ICO flags that you will generally need to process at least some real data in order to determine realistic parameters for synthetic datasets. Where that real data can be related to identified or identifiable individuals, then you’ll still be processing personal data, and will need to do so in compliance with data protection laws. In other words, the GDPR may still apply to the researchers’ activities when producing synthetic datasets.

The ICO also highlights that where real-world parameters were used to create a synthetic dataset, further alteration of the synthetic data may be necessary to avoid re-identification. For example, if the real data contains a single individual who has a very unusual or rare medical condition, and your synthetic data contains a similar individual (in order to make the overall dataset statistically realistic), it may be possible to infer that the individual was in the real dataset by analysing the synthetic dataset.
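The rare-condition example can be sketched as a simple check in Python. This is an illustrative assumption about how a team might screen a synthetic dataset, not a method prescribed by the ICO: the function name, the `diagnosis` field, the `min_count` threshold and the toy records are all hypothetical.

```python
from collections import Counter

def flag_rare_matches(real_records, synthetic_records, key, min_count=5):
    """Return synthetic records whose value for `key` is rare in the real data.

    A value counts as rare if it appears in the real data at least once
    but fewer than `min_count` times (a hypothetical threshold).
    """
    counts = Counter(r[key] for r in real_records)
    return [s for s in synthetic_records if 0 < counts[s[key]] < min_count]

# Invented toy records: one real individual has an unusual condition.
real = [{"diagnosis": "asthma"} for _ in range(6)]
real.append({"diagnosis": "rare_condition_x"})

synthetic = [
    {"diagnosis": "asthma"},
    {"diagnosis": "rare_condition_x"},
    {"diagnosis": "asthma"},
]

flagged = flag_rare_matches(real, synthetic, key="diagnosis")
```

Here the single synthetic record with the rare condition is flagged, because it mirrors a value held by only one individual in the real data. Flagged records are candidates for the further alteration described above. A check like this only looks at one field at a time, so it would not catch individuals who are identifiable through an unusual combination of otherwise common values.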

7. Practical limitations: Producing synthetic datasets is a resource-intensive task. One common challenge is that a data scientist’s approach to producing synthetic data is usually configured specifically for a dataset. This is a problem because it means a significant amount of work is needed to update an approach for use with a different data source. Additionally, once a dataset has been produced, it is not clear how useful it will be in practice to researchers and AI developers.