Generating Synthetic Medical Data
A major use of federated learning is the (virtual) aggregation of distributed datasets in order to have enough data to train deep networks. The need for such aggregation could be circumvented by generating synthetic data and using this dataset as the basis for model training; the synthetic data is then no longer private, personal data in the sense that it does not correspond to any particular individual. Generative models such as Variational Autoencoders (VAEs) [7] or Generative Adversarial Nets (GANs) [8] have shown promise in creating high-quality image data from noise. Utilising the federated learning framework, it is possible to train these models without violating medical data privacy.
My focus so far is on GAN models, which consist of two networks working against each other. The generator takes a noise vector as input and produces an output in the data space. The discriminator receives both generated samples and actual data and outputs the probability that the presented sample comes from the real dataset. During training the two components play a min-max game, in which the discriminator gets better at identifying real samples and the generator improves at generating data that is mistaken for real data. Formally, this is expressed as $$\min_G\max_D V(G, D) = \mathbb{E}_{\mathbf{x}\sim p_{data}(\mathbf{x})}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim \mathcal{N}(0,1)}[\log (1-D(G(\mathbf{z})))]$$
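To make the alternating optimisation concrete, the following is a minimal sketch of one training iteration, assuming PyTorch with placeholder `generator` and `discriminator` networks (the discriminator is assumed to end in a sigmoid); the generator update uses the common non-saturating variant rather than the literal min-max loss above:

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, noise_dim=128):
    """One alternating update of discriminator and generator."""
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: push D(x) towards 1 for real and 0 for generated samples.
    fake = generator(torch.randn(batch, noise_dim)).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real), ones)
              + F.binary_cross_entropy(discriminator(fake), zeros))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator (non-saturating variant): maximise log D(G(z)).
    fake = generator(torch.randn(batch, noise_dim))
    g_loss = F.binary_cross_entropy(discriminator(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```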
In practice, I am using an optimised GAN formulation called Wasserstein GAN with Gradient Penalty (WGAN-GP) [9], which has more stable training behaviour. Regular GANs can suffer from vanishing gradients, which halts training progress because the gradients are close to zero. Another problem of GAN training is mode collapse, meaning the generator is unable to generate samples with high variance, so that the output looks almost the same for any noise input. WGAN-GP circumvents these issues by changing the components' loss functions and restructuring the discriminator into a so-called critic, where the only difference is that the output layer has no activation function (instead of a sigmoid). A positive output then means that the critic considers the sample real, and a negative output corresponds to a fake sample.
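For illustration, the penalty term of [9] can be sketched as follows, assuming PyTorch and image-shaped inputs; `critic` is a placeholder for a network without a final activation:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP term: penalise critic gradient norms deviating from 1
    on random interpolates between real and generated samples."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, 1, device=real.device)  # per-sample mixing
    interp = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
    )[0].view(batch, -1)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss:    critic(fake).mean() - critic(real).mean() + gradient_penalty(...)
# Generator loss: -critic(fake).mean()
```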
For federated training of the model, there are two approaches. Either both components are trained locally and aggregated centrally, as shown by [10]. Alternatively, one can exploit the fact that generator training does not require real data but relies only on the critic: then only the critic has to be trained in a federated manner, and the generator can be trained solely by the server [11]. The two approaches have not yet been evaluated against each other, which is what I am currently working on.
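A rough sketch of the second, server-side approach [11], assuming PyTorch: `fedavg` and `server_round` are hypothetical helper names, client-side critic training and the communication layer are omitted, and all critic parameters are assumed to be floating point.

```python
import copy
import torch

def fedavg(state_dicts):
    """Element-wise (unweighted) average of the client critic parameters."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

def server_round(generator, critic, client_states, g_opt, noise_dim=128, steps=5):
    """Aggregate the federated critics, then train the generator centrally.
    No real data is needed: the generator's loss depends only on the critic."""
    critic.load_state_dict(fedavg(client_states))
    for _ in range(steps):
        z = torch.randn(64, noise_dim)
        g_loss = -critic(generator(z)).mean()  # WGAN generator objective
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```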
The first use case is the generation of high-resolution chest X-ray images. After successful federated training of a WGAN model, it can be further parameterised with a data label, making it possible to generate images of healthy patients as well as of patients with different diseases (see the sketch below). I opted for image generation since this is the main application domain of GAN models and works well as a proof of concept; in terms of usability, however, it is better to generate tabular medical data, so-called Electronic Health Records (EHRs). Large public databases of medical image data already exist, because images are comparably easy to anonymise. EHRs, on the other hand, are highly sensitive and patient-specific. An additional challenge is the interdependence of EHR columns: in order to be plausible, a record's values need to match each other. A person cannot be 20 years old and also have been a smoker for 30 years (as a very simplified example). Thus, a model architecture that takes these constraints into consideration has to be used, such as the previously proposed TableGAN [12].
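The label parameterisation follows the conditional-GAN idea: the generator receives the class label (e.g. a disease) alongside the noise vector. Below is a simplified sketch, assuming PyTorch; a real chest X-ray generator would be convolutional, and all layer sizes here are placeholders:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Generator conditioned on a class label via an embedding that is
    concatenated to the noise vector."""
    def __init__(self, noise_dim=128, n_classes=15, embed_dim=32, out_dim=64 * 64):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 512),
            nn.ReLU(),
            nn.Linear(512, out_dim),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, labels):
        return self.net(torch.cat([z, self.embed(labels)], dim=1))
```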
The evaluation of the approach has multiple facets. First and foremost, there needs to be a guarantee that no exact matches between real patient data and generated data exist. This can be achieved by employing the aforementioned differential privacy method, which involves a trade-off between privacy and usability of the generated samples that will be investigated. On the other hand, the generator has to produce samples with sufficient variance, so that a large enough dataset (without duplicate samples) can be generated for machine learning.
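Two simple checks along these lines could look as follows (a NumPy sketch on flattened records; the function names and the tolerance are my own placeholders, not an established evaluation protocol):

```python
import numpy as np

def exact_match_rate(real, generated, tol=1e-6):
    """Fraction of generated samples that (numerically) coincide with a
    real sample -- a crude proxy for memorisation of training data."""
    hits = 0
    for g in generated:
        if np.linalg.norm(real - g, axis=1).min() < tol:
            hits += 1
    return hits / len(generated)

def duplicate_rate(generated, tol=1e-6):
    """Fraction of generated samples duplicating another generated sample,
    a simple indicator of insufficient generator variance."""
    dists = np.linalg.norm(generated[:, None, :] - generated[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    return float((dists.min(axis=1) < tol).mean())
```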