
Topic: Digital Doppelgängers: Ethics of Digital Twins and Synthetic Data
Abstract: Clinical trial design, conduct, and analysis are benefiting from the increased use of AI, rendering it essential to address the ethical considerations that arise when new (or newer) modalities are introduced. In the last few years, AI-enabled synthetic data, retaining the characteristics of original individual-level datasets but containing no actual personal health information, has been used to conduct preliminary hypothesis generation and testing, to model eligibility criteria, and to overcome privacy concerns, particularly in rare disease research. Digital twins are AI-generated models of individual patients that simulate disease progression and treatment response, updated dynamically with real or inferred data from the physical (actual) twin. They have been used for predictive modeling and to modify trial design, the number of enrollees, and/or power calculations, increasing efficiency. They can be used prospectively (e.g., augmenting control arms) or for decision support. In both synthetic data and digital twin settings, key ethical questions arise, but each operates at a different level of abstraction.
Synthetic data are artificially generated datasets created to mimic, statistically, the characteristics and distributions of patient populations, but they do not correspond to identifiable individuals. Synthetic data can be used, for example, to train generative AI, minimizing privacy concerns early in AI development, and/or for trial simulation. Synthetic data, however, do not support inference at the individual level. The utility of synthetic data depends on the dataset from which they were generated; assessing bias in the source dataset is important but challenging in practice. In ultrarare diseases, for example, sufficient and representative data may not even be available. The more limited a dataset is, and the less it captures the diversity of the patient population (as in rare and ultrarare diseases), the less generalizable its derivatives and outputs. Does the use of synthetic data reproduce and embed bias and perpetuate health inequalities? How certain are we that synthetic data are, and remain, anonymous?
In contrast, digital twins are constructed from individual participant data, are “personalized,” and acquire, ingest, and use real-time data to refine and update their attributes over time. Do participants have an implicit right to, or expectation of, being asked for consent before their data are used to construct “their” digital twin? Does the creation of a digital twin increase the risk of reidentification and downstream privacy harms? What are the responsibilities for protecting digital twin data, and what is the range of permissible future uses of those data? For example, developing a digital twin may yield diagnostic, therapeutic, or other useful information about a person’s future health. Should that be disclosed or communicated? What degree of certainty is necessary for that disclosure to be warranted or permissible? Do the considerations or processes of ethics review bodies need to evolve for the review of protocols involving digital twins or synthetic data, or should additional protections be considered as a condition of approval?
In clinical research, there is an expectation that the data that inform a trial’s results will be available for independent reanalysis and validation of the results, and to facilitate discovery. The future use of synthetic data and digital twin data, however, is complicated. At a minimum, synthetic data represent a stable dataset that can benefit from a persistent data object identifier. But do the metadata always reflect that the dataset is derived, and should the original training dataset be available for query, which would reintroduce privacy risks? Digital twin data, by contrast, are continuously updated based on the acquisition of dynamic, real-world data. How will that be reflected in the trial data? Should the original informed consent include permission for potential secondary use of digital twin data, particularly as more mature digital twin data become increasingly identifiable?
This meeting is open to sponsors of the MRCT Center Bioethics Collaborative and select invited guests.