Hasso-Plattner-Institut
Prof. Dr.-Ing. Bert Arnrich
 

Privacy-Preserving Federated Learning

Machine learning algorithms, and deep models in particular, benefit significantly from large datasets. Often these datasets do not exist centrally but are scattered across many databases. In some application domains it is unproblematic to combine datasets from different locations in a central data store for model training; in others, such as medicine, this is not permitted. The reason is the rights that individuals have over their personal data, as defined for EU citizens in the General Data Protection Regulation (GDPR) [1]. Federated learning [2] was proposed as a possible solution: the sensitive data stays where it was collected, and only models are shared between the parties. This preserves data privacy while still enabling machine learning.

Federated Learning

The process of federated learning relies on a client-server architecture with the data owners as clients and the so-called parameter server as the server. The training process works as follows (also shown in the figure below). First, in the preparation step, everyone agrees on a model architecture to be trained, which is then initialised and sent to all participants (0). Then, the training process starts. The server selects a subset of clients to participate in the current global update round and sends them the current global model (1). The selected clients then train the model on their own data for a small number of update steps (not until model convergence) (2). Afterwards, model updates (the difference between optimised and received model) are sent back to the parameter server (3). Finally, all received model updates are aggregated by the server, for instance using a simple mean, and applied to the global model (4). This concludes a round of global training and steps 1-4 are repeated.
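The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the setup used in our experiments: the model is a linear regression, every client holds noise-free synthetic data, and the names `local_update` and `federated_round` as well as all hyperparameters are hypothetical.

```python
import numpy as np

def local_update(global_weights, features, targets, lr=0.1, steps=5):
    """Step 2: a few gradient steps of linear regression on one client's data."""
    w = global_weights.copy()
    for _ in range(steps):
        grad = 2.0 * features.T @ (features @ w - targets) / len(targets)
        w -= lr * grad
    # Step 3: only the difference to the received model leaves the client.
    return w - global_weights

def federated_round(global_weights, clients, rng, clients_per_round=2):
    """Steps 1-4 of one global round; returns the new global model."""
    # Step 1: the server samples the participants of this round.
    chosen = rng.choice(len(clients), size=clients_per_round, replace=False)
    updates = [local_update(global_weights, *clients[i]) for i in chosen]
    # Step 4: aggregate with a simple mean and apply to the global model.
    return global_weights + np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])      # assumed ground truth for the toy data
clients = []
for _ in range(4):
    x = rng.normal(size=(50, 2))
    clients.append((x, x @ true_w))  # each client keeps its own (x, y) locally

weights = np.zeros(2)               # Step 0: agreed-upon, initialised model
for _ in range(30):
    weights = federated_round(weights, clients, rng)
```

Only the model differences cross the client boundary; the `(x, y)` pairs are never passed to the server-side aggregation.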

Federated Learning with Differential Privacy

Although federated learning improves data privacy for its participants, it is still possible to infer sensitive data from the transmitted model updates, for instance with so-called reconstruction attacks [3][4]. To prevent such attacks, a number of defence concepts are commonly employed in federated learning systems. We are most interested in the addition of differential privacy; other approaches use (additively) homomorphic encryption schemes, such as the Paillier cryptosystem [5], or secure multi-party computation protocols.
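To illustrate what "additively homomorphic" means, the following toy sketch of the Paillier scheme [5] shows that multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a server could aggregate values without decrypting them. It is a sketch only: the primes are far too small to be secure, messages are plain non-negative integers (real model updates would first need a fixed-point encoding), and the function names are hypothetical.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+), valid for g = n + 1
    return n, (lam, mu)

def encrypt(n, message, rng):
    """Encrypt an integer 0 <= message < n with generator g = n + 1."""
    n_sq = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n) * mu % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts corresponds to adding the plaintexts."""
    return (c1 * c2) % (n * n)
```

In a federated setting, the clients would hold the private key, and the parameter server would only ever see and combine ciphertexts.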

Differential privacy was developed in the data science and database domain. It describes the introduction of noise into a system so that repeatedly querying a database does not allow conclusions about particular samples in it [6]. It has been transferred to the federated learning domain to hide the impact of client-specific data on the updated model weights. Specifically, participants clip their model updates to a pre-defined L2-norm and add (usually Gaussian) noise that has been scaled for their local data. This makes it possible to calculate and track the privacy spending over the course of training. The privacy budget determines how many rounds of training are possible and how much noise has to be added in order to guarantee data privacy.
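The clip-and-noise step on the client side can be sketched as follows. The sketch assumes that the noise standard deviation is the product of a noise multiplier and the clipping norm, as in the Gaussian mechanism of [6]; the function name `privatise_update` and its parameters are hypothetical, and the privacy accounting itself is not shown.

```python
import numpy as np

def privatise_update(update, clip_norm, noise_multiplier, rng):
    """Clip a model update to an L2 norm of at most clip_norm, then add Gaussian noise."""
    norm = np.linalg.norm(update)
    # Scale down only if the update is longer than clip_norm.
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    # Assumption: noise standard deviation = noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise
```

Clipping bounds how much any single client can move the global model, which is what allows the added noise to be calibrated to a fixed sensitivity.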

Current Research

Generating Synthetic Medical Data

A major use of federated learning is the (virtual) aggregation of distributed datasets in order to have enough data to train deep networks. This could possibly be circumvented by generating synthetic data, which is not personal data in the sense that it does not correspond to any particular individual. Such a dataset can then be used as the sole basis for centralised model training, or to augment existing, private datasets. Generative models such as Variational Autoencoders (VAEs) [7] or Generative Adversarial Nets (GANs) [8] have shown promise in creating high-quality image data from noise. Utilising the federated learning framework, it is possible to train these models without violating medical data privacy.

Our focus so far has been on VAE models, which consist of two sub-networks: the encoder takes some data and transforms it into a latent representation, while the decoder reconstructs the original image from the latent representation. Unlike regular Autoencoders (AEs), VAEs use a Gaussian distribution as the latent representation, meaning the encoder emits a mean and covariance and the decoder receives a sample from the corresponding distribution. This allows the VAE to generate new data, instead of merely being able to reconstruct training samples.
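The sampling step between encoder and decoder can be sketched as follows. The sketch assumes a diagonal covariance that the encoder emits as a log-variance vector, and a standard normal prior for generating new data; `toy_decoder` is a hypothetical stand-in (a fixed linear map with a sigmoid) for a trained decoder network.

```python
import numpy as np

def sample_latent(mean, log_var, rng):
    """Draw z ~ N(mean, diag(exp(log_var))) as mean + std * eps."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

# Hypothetical decoder: maps a 2-dimensional latent vector to 4 "pixels" in (0, 1).
DECODER_WEIGHTS = np.random.default_rng(1).normal(size=(2, 4))

def toy_decoder(z):
    return 1.0 / (1.0 + np.exp(-(z @ DECODER_WEIGHTS)))

def generate(decoder, latent_dim, n_samples, rng):
    """Generate new data by decoding samples from the assumed N(0, I) prior."""
    z = rng.standard_normal((n_samples, latent_dim))
    return decoder(z)
```

During training the decoder sees `sample_latent(mean, log_var, rng)`; for generation only the decoder is needed, since the latent vectors are drawn directly from the prior.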

Usually, both components are trained and synchronised using federated learning, with differential privacy added to ensure that no exact replicas of patient data are synthesised. Depending on the number of clients and the amount of data, however, the addition of differential privacy considerably weakens the model and makes the synthesised data less usable.

That is why we developed DPD-fVAE, a federated VAE training procedure with a differentially private decoder. We propose to keep the encoder private on each client and to synchronise only the decoder with the other participants. As we have shown in our paper (see below under Publications, or check out the corresponding poster), this approach reduces the negative impact of differential privacy on the model performance and thus enables more effective generative models.
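Synchronising only one of the two sub-networks can be sketched as follows. This is a minimal illustration, not the procedure from the paper: parameters are assumed to be stored in a dictionary whose keys start with the name of their sub-network, `synchronise_shared` and `shared_prefix` are hypothetical names, and the differential-privacy step (clipping and noising what is shared) is omitted.

```python
import numpy as np

def synchronise_shared(client_params, shared_prefix):
    """Average only parameters whose key starts with shared_prefix.

    client_params: list with one dict (name -> array) per client.
    Every other parameter keeps its client-specific value.
    """
    shared_keys = [k for k in client_params[0] if k.startswith(shared_prefix)]
    averaged = {
        k: np.mean([params[k] for params in client_params], axis=0)
        for k in shared_keys
    }
    # Each client receives the averaged shared part and keeps the rest.
    return [{**params, **averaged} for params in client_params]
```

Because only part of the model leaves the clients, only that part consumes privacy budget in each round.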

References

  1. "General Data Protection Regulation (GDPR) – Official Legal Text", General Data Protection Regulation (GDPR), 2020. [Online]. Available: https://gdpr-info.eu/. [Accessed: 28-Sep-2020]
  2. H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. 2016. Federated Learning of Deep Networks using Model Averaging. CoRR abs/1602.05629 (2016). arXiv:1602.05629 http://arxiv.org/abs/1602.05629

  3. Zhibo Wang, Mengkai Song, Zhifei Zhang, Yang Song, Qian Wang, and Hairong Qi. 2018. Beyond Inferring Class Representatives: User-Level Privacy Leakage From Federated Learning. CoRR abs/1812.00535 (2018). arXiv:1812.00535 http://arxiv.org/abs/1812.00535

  4. Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. 2017. Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. CoRR abs/1702.07464 (2017). arXiv:1702.07464 http://arxiv.org/abs/1702.07464

  5. Pascal Paillier. 1999. Public-key Cryptosystems Based on Composite Degree Residuosity Classes. In Proceedings of the 17th International Conference on Theory and Application of Cryptographic Techniques (Prague, Czech Republic) (EUROCRYPT’99). Springer-Verlag, Berlin, Heidelberg, 223–238. http://dl.acm.org/citation.cfm?id=1756123.1756146

  6. Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (Vienna, Austria) (CCS ’16). ACM, New York, NY, USA, 308–318. https://doi.org/10.1145/2976749.2978318

  7. Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. 2016. Variational Autoencoder for Deep Learning of Images, Labels and Captions. In Advances in Neural Information Processing Systems. 2352–2360. https://papers.nips.cc/paper/6528-variational-autoencoder-for-deep-learning-of-images-labels-and-captions.pdf
  8. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

Publications

  • DPD-fVAE: Synthetic Data Generation Using Federated Variational Autoencoders With Differentially-Private Decoder. Pfitzner, Bjarne; Arnrich, Bert (2022).
     
  • Defending against Reconstruction Attacks through Differentially Private Federated Learning for Classification of Heterogeneous Chest X-ray Data. Ziegler, Joceline; Pfitzner, Bjarne; Schulz, Heinrich; Saalbach, Axel; Arnrich, Bert in Sensors (F. Marulli; L. Verde, eds.) (2022). 22(14)
     
  • Computational Approaches to Alleviate Alarm Fatigue in Intensive Care Medicine: A Systematic Literature Review. Chromik, Jonas; Klopfenstein, Sophie Anne Ines; Pfitzner, Bjarne; Sinno, Zeena-Carola; Arnrich, Bert; Balzer, Felix; Poncette, Akira-Sebastian in Frontiers in Digital Health (2022). 4
     
  • Forecasting Thresholds Alarms in Medical Patient Monitors using Time Series Models. Chromik, Jonas; Pfitzner, Bjarne; Ihde, Nina; Michaelis, Marius; Schmidt, Denise; Klopfenstein, Sophie; Poncette, Akira-Sebastian; Balzer, Felix; Arnrich, Bert (2022). 26–34.
     
  • Extracting Alarm Events from the MIMIC-III Clinical Database. Chromik, Jonas; Pfitzner, Bjarne; Ihde, Nina; Michaelis, Marius; Schmidt, Denise; Klopfenstein, Sophie; Poncette, Akira-Sebastian; Balzer, Felix; Arnrich, Bert (2022). 328–335.
     
  • Implicit Model Specialization through Dag-Based Decentralized Federated Learning. Beilharz, Jossekin; Pfitzner, Bjarne; Schmid, Robert; Geppert, Paul; Arnrich, Bert; Polze, Andreas in Middleware ’21 (2021). 310–322.
     
  • Perioperative Risk Assessment in Pancreatic Surgery Using Machine Learning. Pfitzner, Bjarne; Chromik, Jonas; Brabender, Rachel; Fischer, Eric; Kromer, Alexander; Winter, Axel; Moosburner, Simon; Sauer, Igor M.; Malinka, Thomas; Pratschke, Johann; Arnrich, Bert; Maurer, Max M. (2021). 2211–2214.
     
  • Sensor-Based Obsessive-Compulsive Disorder Detection With Personalised Federated Learning. Kirsten, Kristina; Pfitzner, Bjarne; Löper, Lando; Arnrich, Bert (2021). 333–339.
     
  • Differentially Private Federated Learning for Anomaly Detection in EHealth Networks. Cholakoska, Ana; Pfitzner, Bjarne; Gjoreski, Hristijan; Rakovic, Valentin; Arnrich, Bert; Kalendar, Marija in UbiComp ’21 (2021). 514–518.
     
  • Data Augmentation of Kinematic Time-Series From Rehabilitation Exercises Using GANs. Albert, Justin; Glöckner, Pawel; Pfitzner, Bjarne; Arnrich, Bert (2021). 1–6.
     
  • Tangle Ledger for Decentralized Learning. Schmid, R.; Pfitzner, B.; Beilharz, J.; Arnrich, B.; Polze, A. (2020). 852–859.
     
  • Federated Learning in a Medical Context: A Systematic Literature Review. Pfitzner, Bjarne; Steckhan, Nico; Arnrich, Bert in ACM Transactions on Internet Technology (TOIT) Special Issue on Security and Privacy of Medical Data for Smart Healthcare (2020).
     
  • Poisoning Attacks with Generative Adversarial Nets. Muñoz-González, Luis; Pfitzner, Bjarne; Russo, Matteo; Carnerero-Cano, Javier; Lupu, Emil C (2019).
     
  • Unobtrusive Measurement of Blood Pressure During Lifestyle Interventions. Morassi Sasso, Ariane; Datta, Suparno; Pfitzner, Bjarne; Zhou, Lin; Steckhan, Nico; Boettinger, Erwin; Arnrich, Bert (2019).