Data, sometimes described as the black gold of the 21st century, are the essential fuel of artificial intelligence and are already widely used by the pharmaceutical industry. However, particularly because of the sensitivity of health data, their use faces several limitations. Could synthetic data be one of the solutions to these problems?
What is synthetic data and why use it?
Synthetic data are data created artificially by generative algorithms rather than collected from real events. Originally developed in the 1990s to allow work on U.S. Census data without disclosing respondents’ personal information, synthetic data have since been developed to generate high-quality, large-scale datasets.
These data are generally generated from real data, for example from patient files in the case of health data, and preserve their statistical distribution. It is thus theoretically possible to generate virtual patient cohorts that have no real identity but statistically match real cohorts in every respect. Researchers have succeeded in synthesizing virtual patient records from publicly available demographic and epidemiological data. In this case, we speak of “fully synthetic data”, as opposed to “partially synthetic data”, which are synthetic data created to replace missing values in real datasets collected in the traditional way.
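The core idea of preserving a cohort’s statistical distribution can be illustrated with a minimal sketch (not the method used in the cited studies): estimate the joint distribution of a toy "real" cohort, then sample a fully synthetic one from it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "real" cohort: age (years) and systolic blood pressure (mmHg),
# correlated as they might be in a real patient file.
age = rng.normal(60, 10, size=1000)
sbp = 90 + 0.6 * age + rng.normal(0, 8, size=1000)
real = np.column_stack([age, sbp])

# Estimate the joint distribution of the real cohort...
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
# ...and draw a fully synthetic cohort: no real identities,
# but the same statistical structure.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

# The age/blood-pressure correlation is preserved in the synthetic cohort.
print(np.corrcoef(real.T)[0, 1], np.corrcoef(synthetic.T)[0, 1])
```

Real generators model far richer (and non-Gaussian) dependencies, but the principle is the same: fit, then sample.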
Currently, and despite various initiatives aiming to democratize their use – such as the Health Data Hub in France, to which we will return in future articles – many problems still limit the optimal, large-scale use of patient data, despite their ever-growing volume. Synthetic data are one of the possible solutions.
Health data privacy:
Health data are naturally among the most sensitive in terms of confidentiality. The need to preserve patient anonymity creates a number of problems in terms of accessibility and data-processing costs. Many players do not have easy access to these data, and even those that do gain access face significant regulatory and cybersecurity costs to process them. Access times are also often extremely long, which slows down research projects. For some databases, it is even a regulatory requirement to hire a third-party company accredited to handle the data.
To allow their use, patient data are generally anonymized using methods such as the deletion of identifying variables, their modification by the addition of noise, or the grouping of categorical variables so that no category contains too few individuals. However, the effectiveness of these methods has been regularly called into question by studies showing that it is often possible to re-identify patients by matching records (probabilistically or deterministically) against other databases. In this context, synthetic data generation can be used as a safe and easy-to-use alternative.
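The three classical anonymization techniques mentioned above can be sketched in a few lines of pandas (a deliberately simplified illustration on invented toy records, not a compliant anonymization pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
patients = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],       # invented toy records
    "age": [34, 47, 71, 29],
    "zip": ["75001", "75002", "13001", "13002"],
    "glucose": [5.1, 6.8, 7.2, 4.9],
})

anonymized = (
    patients
    .drop(columns=["name"])                          # 1. delete identifying variables
    .assign(glucose=lambda d: d["glucose"]           # 2. perturb values with noise
            + rng.normal(0, 0.2, len(d)))
    .assign(zip=lambda d: d["zip"].str[:2])          # 3. group categories (2-digit region)
)
print(anonymized)
```

Each step degrades either identifiability or utility, which is exactly the trade-off the re-identification studies exploit.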
Synthetic data generation is also commonly used to fill in missing values in real datasets that would be impossible or very costly to collect again. The new data reproduce the statistical distribution of the variables in the real dataset.
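This "partially synthetic" use case is what statisticians call imputation. As a minimal sketch, scikit-learn's IterativeImputer models each variable as a function of the others and fills gaps with statistically plausible values:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
# Toy dataset with a strong linear relation between two variables.
x = rng.normal(0, 1, size=(200, 1))
y = 2 * x + rng.normal(0, 0.1, size=(200, 1))
data = np.hstack([x, y])

# Knock out 20% of the second variable to simulate missing measurements.
mask = rng.random(200) < 0.2
data[mask, 1] = np.nan

# Model each variable as a function of the others and draw plausible values.
imputed = IterativeImputer(random_state=0).fit_transform(data)
```

The imputed values are synthetic in the partial sense: generated, but anchored to the real records around them.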
Health datasets are often too small to be exploited by artificial intelligence:
Training Machine Learning or Deep Learning models sometimes requires large volumes of data to obtain satisfactory predictions: as a rule of thumb, a minimum of about 10 times as many examples as the model has degrees of freedom is required. In health care, however, the available data often do not allow good results, for example for rare, poorly documented pathologies or for sub-populations comprising few individuals. In such cases, the use of synthetic data is part of the data scientist’s toolbox.
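The rule of thumb above is simple arithmetic; as a hypothetical illustration, a model with 30 clinical features plus an intercept:

```python
def min_examples(degrees_of_freedom: int, factor: int = 10) -> int:
    """Rule-of-thumb minimum sample size: ~10 examples per model parameter."""
    return factor * degrees_of_freedom

# Hypothetical model with 30 clinical features plus an intercept:
print(min_examples(30 + 1))  # → 310
```

For a rare disease with only a few dozen documented patients, this threshold is out of reach without synthetic augmentation.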
The use of synthetic data is an emerging field, and some experts believe it will help overcome some of the current limitations of AI. Among the advantages synthetic data bring to AI, we can mention that it is fast and inexpensive to create as much data as needed, without the hand-labelling that real data often require, and that these data can be regenerated repeatedly to make the model as efficient as possible at processing real data.
The different techniques for generating synthetic data
The generation of synthetic data involves several phases:
- The preparation of the sample data from which the synthetic data will be generated: to obtain a satisfactory result, the data must be cleaned and harmonized if they come from different sources
- The actual generation of the synthetic data; some of the techniques are detailed below
- The verification and the evaluation of the confidentiality offered by the synthetic data
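The three phases above can be sketched as a minimal pipeline (illustrative function names and a toy Gaussian-mixture generator, chosen for brevity):

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def prepare(raw: np.ndarray) -> np.ndarray:
    """Phase 1: clean the sample (here, simply drop rows with missing values)."""
    return raw[~np.isnan(raw).any(axis=1)]

def generate(sample: np.ndarray, n: int) -> np.ndarray:
    """Phase 2: fit a generative model and sample n synthetic records."""
    gm = GaussianMixture(n_components=2, random_state=0).fit(sample)
    return gm.sample(n)[0]

def evaluate(sample: np.ndarray, synthetic: np.ndarray) -> float:
    """Phase 3: compare marginal distributions (Kolmogorov-Smirnov p-value)."""
    return stats.ks_2samp(sample[:, 0], synthetic[:, 0]).pvalue

rng = np.random.default_rng(3)
raw = rng.normal(size=(500, 2))
raw[rng.random(500) < 0.05] = np.nan   # some missing values to clean

sample = prepare(raw)
synthetic = generate(sample, 500)
print(evaluate(sample, synthetic))
```

Real pipelines replace each phase with far more sophisticated tooling, but the clean → generate → evaluate loop is the same.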
Figure 1 – Synthetic Data Generation Schema
The methods of data generation are numerous, and the choice depends on the objective and the type of data to be created: should we generate data from already existing data, and thus follow their statistical distributions? Or fully virtual data that follow rules making them realistic (such as text)? “Data-driven” methods, which take advantage of existing data, rely on generative Deep Learning models. “Process-driven” methods, in which mathematical models generate data from underlying physical processes, rely on what is called agent-based modelling.
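To make the "process-driven" idea concrete, here is a deliberately tiny agent-based sketch (invented parameters, SIR-style rules): the simulation's event log is itself a synthetic dataset, produced by a model of the underlying process rather than derived from any real patients.

```python
import random

def simulate_epidemic(n_agents=200, days=60, p_contact=0.02, p_recover=0.1, seed=4):
    """Agent-based simulation whose event log is a synthetic line-listing."""
    rng = random.Random(seed)
    state = ["S"] * n_agents   # Susceptible / Infected / Recovered
    state[0] = "I"             # one initial case
    records = []               # synthetic records: (day, agent_id, event)
    for day in range(days):
        infected = [i for i, s in enumerate(state) if s == "I"]
        # Each susceptible agent may be infected by contact with an infected one.
        for i, s in enumerate(state):
            if s == "S" and any(rng.random() < p_contact for _ in infected):
                state[i] = "I"
                records.append((day, i, "infection"))
        # Each infected agent may recover.
        for i in infected:
            if rng.random() < p_recover:
                state[i] = "R"
                records.append((day, i, "recovery"))
    return records

log = simulate_epidemic()
print(len(log), log[:3])
```

Data-driven methods learn the distribution from data; process-driven methods like this one encode it in the rules.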
Operationally, synthetic data are usually created in the Python language – very well known to Data Scientists. Different Python libraries are used, such as: Scikit-Learn, SymPy, Pydbgen and VirtualDataLab. A future Resolving Pharma article will follow up this introduction by presenting how to create synthetic health data using these libraries.
Evaluation of synthetic data
Anonymized patient data are commonly evaluated according to two main criteria: the usefulness of the data and the quality of the anonymization achieved. It has been shown that the more the data are anonymized, the more limited their use becomes, since important but identifying features are removed, or precision is lost by grouping classes of values. A balance must be found between the two, depending on the intended use of the data.
Synthetic data are evaluated according to three main criteria:
- The fidelity of the data to the base sample
- The fidelity of the data to the distribution of the general population
- The level of anonymization allowed by the data
Different methods and metrics exist to evaluate these criteria.
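As a minimal sketch of two such metrics (illustrative choices, not a standard): a Kolmogorov-Smirnov test for fidelity of the marginal distributions, and nearest-neighbour distances between synthetic and real records as a crude privacy check.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import cdist

rng = np.random.default_rng(5)
real = rng.normal(0, 1, size=(300, 2))
synthetic = rng.normal(0, 1, size=(300, 2))  # stand-in for generated data

# Fidelity: do the marginal distributions match? (Kolmogorov-Smirnov test)
ks_pvalues = [stats.ks_2samp(real[:, j], synthetic[:, j]).pvalue
              for j in range(real.shape[1])]

# Privacy: distance from each synthetic record to its nearest real record.
# Near-zero distances would suggest memorized, re-identifiable records.
nn_dist = cdist(synthetic, real).min(axis=1)

print(ks_pvalues, nn_dist.min(), nn_dist.mean())
```

High KS p-values suggest good fidelity; a cluster of near-zero nearest-neighbour distances would be a privacy red flag.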
By ensuring that the quality of the generated data is sufficient for their intended use, evaluation is an essential and central element of the synthetic data generation process.
Which use cases for synthetic data in the pharmaceutical industry?
A few months ago, Accenture Life Sciences and Phesi, two companies providing services to pharmaceutical companies, co-authored a report urging them to integrate more synthetic-data techniques into their activities. The use case mentioned in this report concerns synthetic control arms, which, however, generally use real data from different clinical trials that are statistically reworked.
Outside the pharmaceutical industry, in the world of Health, synthetic data are already used to train visual recognition models in imaging: researchers can artificially add pathologies to images of healthy patients and thus test their algorithms on their ability to detect the pathologies. Based on this use-case, it is also possible to create histological section data that could be used to train AI models in preclinical studies.
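The imaging use case can be sketched with a toy example (invented intensities and a trivial threshold "detector", purely illustrative): paint a synthetic lesion onto a healthy scan, then check that a detector responds to the pathological image but not the healthy one.

```python
import numpy as np

def add_synthetic_lesion(image, center, radius, intensity=0.6):
    """Paint a bright circular 'lesion' onto a copy of a healthy scan."""
    yy, xx = np.mgrid[:image.shape[0], :image.shape[1]]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    out = image.copy()
    out[mask] = np.clip(out[mask] + intensity, 0.0, 1.0)
    return out

rng = np.random.default_rng(6)
healthy = rng.uniform(0, 0.3, size=(64, 64))        # toy grayscale scan
pathological = add_synthetic_lesion(healthy, (32, 32), radius=5)

# A trivial "detector": does any region exceed a brightness threshold?
print(pathological.max() > 0.5, healthy.max() > 0.5)  # → True False
```

Real pipelines use generative models rather than geometric stamps, but the evaluation logic, testing detectors against images with known synthetic ground truth, is the same.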
There is no doubt that the burgeoning synthetic data industry is well on its way to improving artificial intelligence as we currently know it, and its use in the health industry. This is particularly true for sensitive and difficult-to-access data. We can imagine, for example, a world where it is easier and more efficient for manufacturers to create their own synthetic data than to seek access to medical or medico-administrative databases. This technology would then be one of those that reshape the organization of innovation in the health industries, by giving real data a less central place.
To go further:
- Is synthetic data the key to healthcare clinical and business intelligence? ; Healthcare IT News ; https://www.healthcareitnews.com/news/synthetic-data-key-healthcare-clinical-and-business-intelligence
- Elise Devaux – Statice ; Guide d’introduction à la donnée synthétique ; https://elise-deux.medium.com/guide-dintroduction-%C3%A0-la-donn%C3%A9e-synth%C3%A9tique-ec37dc2fa244
- Faster and cheaper clinical trials, the benefit of synthetic data ; Accenture and Phesi ; https://www.accenture.com/_acnmedia/PDF-148/Accenture-Insilico-Faster-And-Cheaper.pdf#zoom=40
- Andre Goncalves, Priyadip Ray, Braden Soper, Jennifer Stevens, Linda Coyle, and Ana Paula Sales ; Generation and Evaluation of synthetic patient data ; BMC Medical Research Methodology (2020) 20:108 ; https://doi.org/10.1186/s12874-020-00977-1
- Synthetic Data Generation: Techniques, Best Practices and Tools ; Cem Dilmegani ; AI Multiple (2021) ; https://research.aimultiple.com/synthetic-data-generation/
- Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data ; Khaled El Emam, Lucy Mosquera, Richard Hoptroff, O’Reilly, ISBN-13: 978-1492072744
- Dube K., Gallagher T. (2014) Approach and Method for Generating Realistic Synthetic Electronic Healthcare Records for Secondary Use. In: Gibbons J., MacCaull W. (eds) Foundations of Health Information Engineering and Systems. FHIES 2013. Lecture Notes in Computer Science, vol 8315. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53956-5_6
- Chen, R.J., Lu, M.Y., Chen, T.Y. et al. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 5, 493–497 (2021). https://doi.org/10.1038/s41551-021-00751-8