Cross-silo and cross-device federated learning on Google Cloud
This document describes two reference architectures that help you create a federated learning platform on Google Cloud using Google Kubernetes Engine (GKE). The reference architectures and associated resources that are described in this document support the following:
- Cross-silo federated learning
- Cross-device federated learning, building upon the cross-silo architecture
The intended audiences for this document are cloud architects and AI and ML engineers who want to implement federated learning use cases on Google Cloud. It's also intended for decision-makers who are evaluating whether to implement federated learning on Google Cloud.
Architecture
The diagrams in this section show a cross-silo architecture and a cross-device architecture for federated learning. To learn about the different applications for these architectures, see Use cases.
Cross-silo architecture
The following diagram shows an architecture that supports cross-silo federated learning:

The preceding diagram shows a simplified example of a cross-silo architecture. In the diagram, all of the resources are in the same project in a Google Cloud organization. These resources include the local client model, the global client model, and their associated federated learning workloads.
This reference architecture can be modified to support several configurations for data silos. Members of the consortium can host their data silos in the following ways:
- On Google Cloud, in the same Google Cloud organization and the same Google Cloud project.
- On Google Cloud, in the same Google Cloud organization, but in different Google Cloud projects.
- On Google Cloud, in different Google Cloud organizations.
- In private on-premises environments, or in other public clouds.
For participating members to collaborate, they need to establish secure communication channels between their environments. For more information about the role of participating members in the federated learning effort, how they collaborate, and what they share with each other, see Use cases.
The architecture includes the following components:
- A Virtual Private Cloud (VPC) network and subnet.
- A private GKE cluster that helps you to do the following:
  - Isolate cluster nodes from the internet.
  - Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorized networks.
  - Use shielded cluster nodes that use a hardened operating system image.
  - Enable Dataplane V2 for optimized Kubernetes networking.
- Dedicated GKE node pools: You create a dedicated node pool to exclusively host tenant apps and resources. The nodes have taints to ensure that only tenant workloads are scheduled onto the tenant nodes. Other cluster resources are hosted in the main node pool.
- Data encryption (enabled by default):
  - Data at rest.
  - Data in transit.
  - Cluster secrets at the application layer.
  - In-use data encryption, by optionally enabling Confidential Google Kubernetes Engine Nodes.
- VPC firewall rules that apply the following:
  - Baseline rules that apply to all nodes in the cluster.
  - Additional rules that apply only to nodes in the tenant node pool. These firewall rules limit ingress to and egress from tenant nodes.
- Cloud NAT to allow egress to the internet.
- Cloud DNS records to enable Private Google Access so that apps within the cluster can access Google APIs without going over the internet.
- Service accounts, as follows:
  - A dedicated service account for the nodes in the tenant node pool.
  - A dedicated service account for tenant apps to use with Workload Identity Federation.
- Support for using Google Groups for Kubernetes role-based access control (RBAC).
- A Git repository to store configuration descriptors.
- An Artifact Registry repository to store container images.
- Config Sync and Policy Controller to deploy configuration and policies.
- Cloud Service Mesh gateways to selectively allow cluster ingress and egress traffic.
- Cloud Storage buckets to store global and local model weights.
- Access to other Google and Google Cloud APIs. For example, a training workload might need to access training data that's stored in Cloud Storage, BigQuery, or Cloud SQL.
Cross-device architecture
The following diagram shows an architecture that supports cross-device federated learning:

The preceding cross-device architecture builds upon the cross-silo architecture with the addition of the following components:
- A Cloud Run service that simulates devices connecting to the server
- A Certificate Authority Service instance that creates private certificates for the server and clients
- A Vertex AI TensorBoard instance to visualize the results of the training
- A Cloud Storage bucket to store the consolidated model
- A private GKE cluster that uses confidential nodes as its primary node pool to help secure the data in use
The cross-device architecture uses components from the open source Federated Compute Platform (FCP) project. This project includes the following:
- Client code for communicating with a server and executing tasks on the devices
- A protocol for client-server communication
- Connection points with TensorFlow Federated to make it easier to define your federated computations
The FCP components shown in the preceding diagram can be deployed as a set of microservices. These components do the following:
- Aggregator: This job reads device gradients and calculates the aggregated result with differential privacy.
- Collector: This job runs periodically to query active tasks and encrypted gradients. This information determines when aggregation starts.
- Model uploader: This job listens to events and publishes results so that devices can download updated models.
- Task-assignment: This frontend service distributes training tasks to devices.
- Task-management: This job manages tasks.
- Task-scheduler: This job either runs periodically or is triggered by specific events.
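To illustrate the aggregator's role, the following sketch shows one common way to combine client gradients with differential privacy: clip each update to bound any single client's influence, average, and add Gaussian noise. This is a minimal, plain-Python illustration, not the FCP implementation; the clipping bound and noise scale are assumed values.

```python
import math
import random

def aggregate_with_dp(client_updates, clip_norm=1.0, noise_scale=0.1, seed=None):
    """Average clipped client updates and add Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    dim = len(client_updates[0])
    total = [0.0] * dim
    for update in client_updates:
        # Clip each update so its L2 norm is at most clip_norm, which
        # bounds the influence of any one client on the aggregate.
        norm = math.sqrt(sum(x * x for x in update))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(update):
            total[i] += x * scale
    n = len(client_updates)
    # Average the clipped updates, then add noise calibrated to the clip bound.
    return [t / n + rng.gauss(0.0, noise_scale * clip_norm / n) for t in total]
```

With `noise_scale=0.0`, the function reduces to a plain clipped average, which is a useful sanity check before tuning the privacy parameters.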
Products used
The reference architectures for both federated learning use cases use the following Google Cloud components:
- Google Kubernetes Engine (GKE): GKE provides the foundational platform for federated learning.
- TensorFlow Federated (TFF): TFF provides an open-source framework for machine learning and other computations on decentralized data.
GKE also provides the following capabilities to your federated learning platform:
- Hosting the federated learning coordinator: The federated learning coordinator is responsible for managing the federated learning process. This management includes tasks such as distributing the global model to participants, aggregating updates from participants, and updating the global model. GKE can be used to host the federated learning coordinator in a highly available and scalable way.
- Hosting federated learning participants: Federated learning participants are responsible for training the global model on their local data. GKE can be used to host federated learning participants in a secure and isolated way. This approach can help ensure that participants' data is kept local.
- Providing a secure and scalable communication channel: Federated learning participants need to be able to communicate with the federated learning coordinator in a secure and scalable way. GKE can be used to provide a secure and scalable communication channel between participants and the coordinator.
- Managing the lifecycle of federated learning deployments: GKE can be used to manage the lifecycle of federated learning deployments. This management includes tasks such as provisioning resources, deploying the federated learning platform, and monitoring the performance of the federated learning platform.
In addition to these benefits, GKE also provides a number of features that can be useful for federated learning deployments, such as the following:
- Regional clusters: GKE lets you create regional clusters, helping you to improve the performance of federated learning deployments by reducing latency between participants and the coordinator.
- Network policies: GKE lets you create network policies, helping to improve the security of federated learning deployments by controlling the flow of traffic between participants and the coordinator.
- Load balancing: GKE provides a number of load balancing options, helping to improve the scalability of federated learning deployments by distributing traffic between participants and the coordinator.
TFF provides the following features to facilitate the implementation of federated learning use cases:
- The ability to declaratively express federated computations, which are a set of processing steps that run on a server and a set of clients. These computations can be deployed to diverse runtime environments.
- The ability to build custom aggregators using TFF open source.
- Support for a variety of federated learning algorithms, including the following algorithms:
  - Federated averaging: An algorithm that averages the model parameters of participating clients. It's particularly well-suited for use cases where the data is relatively homogeneous and the model is not too complex. Typical use cases are as follows:
    - Personalized recommendations: A company can use federated averaging to train a model that recommends products to users based on their purchase history.
    - Fraud detection: A consortium of banks can use federated averaging to train a model that detects fraudulent transactions.
    - Medical diagnosis: A group of hospitals can use federated averaging to train a model that diagnoses cancer.
  - Federated stochastic gradient descent (FedSGD): An algorithm that uses stochastic gradient descent to update the model parameters. It's well-suited for use cases where the data is heterogeneous and the model is complex. Typical use cases are as follows:
    - Natural language processing: A company can use FedSGD to train a model that improves the accuracy of speech recognition.
    - Image recognition: A company can use FedSGD to train a model that can identify objects in images.
    - Predictive maintenance: A company can use FedSGD to train a model that predicts when a machine is likely to fail.
  - Federated Adam: An algorithm that uses the Adam optimizer to update the model parameters. Typical use cases are as follows:
    - Recommender systems: A company can use federated Adam to train a model that recommends products to users based on their purchase history.
    - Ranking: A company can use federated Adam to train a model that ranks search results.
    - Click-through rate prediction: A company can use federated Adam to train a model that predicts the likelihood that a user clicks an advertisement.
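The core distinction between federated averaging and FedSGD can be sketched in a few lines of plain Python. In this illustrative example (not TFF code; the quadratic loss, learning rate, and local step count are assumed for the sketch), FedSGD takes a single server step from averaged client gradients per round, while federated averaging lets each client take several local steps before the resulting parameters are averaged.

```python
def grad(w, data):
    """Gradient of the mean squared error loss 0.5 * (w - x)^2 over a client's data."""
    return sum(w - x for x in data) / len(data)

def fedsgd_round(w, clients, lr=0.1):
    """FedSGD: average one gradient per client, then take a single server step."""
    avg_grad = sum(grad(w, c) for c in clients) / len(clients)
    return w - lr * avg_grad

def fedavg_round(w, clients, lr=0.1, local_steps=5):
    """Federated averaging: each client trains locally, then parameters are averaged."""
    local_weights = []
    for c in clients:
        w_local = w
        for _ in range(local_steps):
            w_local -= lr * grad(w_local, c)
        local_weights.append(w_local)
    return sum(local_weights) / len(local_weights)
```

Because each federated averaging round performs more local work per client, it typically moves the global model further per communication round than FedSGD does, at the cost of possible divergence when client data is highly heterogeneous.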
Use cases
This section describes use cases for which the cross-silo and cross-device architectures are appropriate choices for your federated learning platform.

Federated learning is a machine learning setting where many clients collaboratively train a model. This process is led by a central coordinator, and the training data remains decentralized.

In the federated learning paradigm, clients download a global model and improve the model by training locally on their data. Then, each client sends its calculated model updates back to the central server where the model updates are aggregated and a new iteration of the global model is generated. In these reference architectures, the model training workloads run on GKE.
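The aggregation step of this paradigm can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: it assumes clients return weight deltas (the change between locally trained weights and the downloaded global weights) together with their example counts, and the server combines the deltas weighted by those counts.

```python
def server_round(global_weights, client_results):
    """One aggregation step: combine client deltas, weighted by example counts.

    client_results is a list of (delta, num_examples) pairs, where delta is the
    change each client computed between its locally trained weights and the
    global weights that it downloaded.
    """
    total_examples = sum(n for _, n in client_results)
    dim = len(global_weights)
    aggregated = [0.0] * dim
    for delta, n in client_results:
        # Clients with more training examples contribute proportionally more.
        for i in range(dim):
            aggregated[i] += delta[i] * (n / total_examples)
    # Apply the combined delta to produce the next iteration of the global model.
    return [w + d for w, d in zip(global_weights, aggregated)]
```

Weighting by example count keeps a client with very little data from pulling the global model as hard as a client that trained on a large local dataset.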
Federated learning embodies the privacy principle of data minimization by restricting what data is collected at each stage of computation, limiting access to data, and processing and then discarding data as early as possible. Additionally, the problem setting of federated learning is compatible with additional privacy-preserving techniques, such as using differential privacy (DP) to improve model anonymization and to help ensure that the final model doesn't memorize individual users' data.

Depending on the use case, training models with federated learning can have additional benefits:
- Compliance: In some cases, regulations might constrain how data can be used or shared. Federated learning might be used to comply with these regulations.
- Communication efficiency: In some cases, it's more efficient to train a model on distributed data than to centralize the data. For example, the datasets that the model needs to be trained on might be too large to move to a central location.
- Making data accessible: Federated learning lets organizations keep the training data decentralized in per-user or per-organization data silos.
- Higher model accuracy: Training on real user data (while helping to ensure privacy) rather than on synthetic data (sometimes referred to as proxy data) often results in higher model accuracy.
There are different kinds of federated learning, which are characterized by where the data originates and where the local computations occur. The architectures in this document focus on two types of federated learning: cross-silo and cross-device. Other types of federated learning are out of scope for this document.

Federated learning is further categorized by how the datasets are partitioned, which can be as follows:
- Horizontal federated learning (HFL): Datasets with the same features (columns) but different samples (rows). For example, multiple hospitals might have patient records with the same medical parameters but different patient populations.
- Vertical federated learning (VFL): Datasets with the same samples (rows) but different features (columns). For example, a bank and an ecommerce company might have customer data with overlapping individuals but different financial and purchasing information.
- Federated transfer learning (FTL): Partial overlap in both samples and features among the datasets. For example, two hospitals might have patient records with some overlapping individuals and some shared medical parameters, but also unique features in each dataset.
Cross-silo federated computation is where the participating members are organizations or companies. In practice, the number of members is usually small (for example, within one hundred members). Cross-silo computation is typically used in scenarios where the participating organizations have different datasets, but they want to train a shared model or analyze aggregated results without sharing their raw data with each other. For example, participating members can have their environments in different Google Cloud organizations, such as when they represent different legal entities, or in the same Google Cloud organization, such as when they represent different departments of the same legal entity.

Participating members might not be able to consider each other's workloads as trusted entities. For example, a participating member might not have access to the source code of a training workload that they receive from a third party, such as the coordinator. Because they can't access this source code, the participating member can't ensure that the workload can be fully trusted.

To help you prevent an untrusted workload from accessing your data or resources without authorization, we recommend that you do the following:
- Deploy untrusted workloads in an isolated environment.
- Grant untrusted workloads only the strictly necessary access rights and permissions to complete the training rounds assigned to the workload.
To help you isolate potentially untrusted workloads, these reference architectures implement security controls, such as configuring isolated Kubernetes namespaces, where each namespace has a dedicated GKE node pool. Cross-namespace communication and cluster inbound and outbound traffic are forbidden by default, unless you explicitly override this setting.
Example use cases for cross-silo federated learning are as follows:
- Fraud detection: Federated learning can be used to train a fraud detection model on data that is distributed across multiple organizations. For example, a consortium of banks could use federated learning to train a model that detects fraudulent transactions.
- Medical diagnosis: Federated learning can be used to train a medical diagnosis model on data that is distributed across multiple hospitals. For example, a group of hospitals could use federated learning to train a model that diagnoses cancer.
Cross-device federated learning is a type of federated computation where the participating members are end-user devices such as mobile phones, vehicles, or IoT devices. The number of members can reach a scale of millions or even tens of millions.

The process for cross-device federated learning is similar to that of cross-silo federated learning. However, you must also adapt the reference architecture to accommodate some of the extra factors to consider when you're dealing with thousands to millions of devices. For example, you must deploy administrative workloads to handle scenarios that are encountered in cross-device federated learning use cases, such as the need to coordinate the subset of clients that participate in a given round of training. The cross-device architecture provides this ability by letting you deploy the FCP services. These services have workloads with connection points to TFF. You use TFF to write the code that manages this coordination.
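At its simplest, coordinating which clients take part in a round amounts to sampling a cohort from the devices that are currently available. The following plain-Python sketch (not FCP code; the cohort size is an assumed parameter) illustrates the idea.

```python
import random

def select_cohort(available_devices, cohort_size, seed=None):
    """Pick a random subset of the currently available devices for one round."""
    rng = random.Random(seed)
    if len(available_devices) <= cohort_size:
        # Fewer available devices than the target cohort: use them all.
        return list(available_devices)
    return rng.sample(available_devices, cohort_size)
```

In a real deployment, selection also accounts for device eligibility criteria such as battery state, network conditions, and whether a device has already contributed recently, so random sampling is only the starting point.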
Example use cases for cross-device federated learning are as follows:
- Personalized recommendations: You can use cross-device federated learning to train a personalized recommendation model on data that's distributed across multiple devices. For example, a company could use federated learning to train a model that recommends products to users based on their purchase history.
- Natural language processing: Federated learning can be used to train a natural language processing model on data that is distributed across multiple devices. For example, a company could use federated learning to train a model that improves the accuracy of speech recognition.
- Predicting vehicle maintenance needs: Federated learning can be used to train a model that predicts when a vehicle is likely to need maintenance. This model could be trained on data that is collected from multiple vehicles. This approach lets the model learn from the experiences of all the vehicles, without compromising the privacy of any individual vehicle.
The following table summarizes the features of the cross-silo and cross-device architectures, and shows you how to categorize the type of federated learning scenario that is applicable for your use case.

| Feature | Cross-silo federated computations | Cross-device federated computations |
|---|---|---|
| Population size | Usually small (for example, within one hundred members) | Scalable to thousands, millions, or hundreds of millions of devices |
| Participating members | Organizations or companies | Mobile devices, edge devices, vehicles |
| Most common data partitioning | HFL, VFL, FTL | HFL |
| Data sensitivity | Sensitive data that participants don't want to share with each otherin raw format | Data that's too sensitive to be shared with a central server |
| Data availability | Participants are almost always available | Only a fraction of participants are available at any time |
| Example use cases | Fraud detection, medical diagnosis, financial forecasting | Fitness tracking, voice recognition, image classification |
Design considerations
This section provides guidance to help you use this reference architecture to develop one or more architectures that meet your specific requirements for security, reliability, operational efficiency, cost, and performance.
Cross-silo architecture design considerations
To implement a cross-silo federated learning architecture in Google Cloud, you must implement the following minimum prerequisites, which are explained in more detail in the following sections:
- Establish a federated learning consortium.
- Determine the collaboration model for the federated learning consortium to implement.
- Determine the responsibilities of the participant organizations.
In addition to these prerequisites, there are other actions that the federation owner must take which are outside the scope of this document, such as the following:
- Manage the federated learning consortium.
- Design and implement a collaboration model.
- Prepare, manage, and operate the model training data and the model that the federation owner intends to train.
- Create, containerize, and orchestrate federated learning workflows.
- Deploy and manage federated learning workloads.
- Set up the communication channels for the participant organizations to securely transfer data.
Establish a federated learning consortium
A federated learning consortium is the group of organizations that participate in a cross-silo federated learning effort. Organizations in the consortium only share the parameters of the ML models, and you can encrypt these parameters to increase privacy. If the federated learning consortium allows the practice, organizations can also aggregate data that doesn't contain personally identifiable information (PII).
Determine a collaboration model for the federated learning consortium
The federated learning consortium can implement different collaboration models, such as the following:
- A centralized model that consists of a single coordinating organization, called the federation owner or orchestrator, and a set of participant organizations or data owners.
- A decentralized model that consists of organizations that coordinate as a group.
- A heterogeneous model that consists of a consortium of diverse participating organizations, all of which bring different resources to the consortium.
This document assumes that the collaboration model is a centralized model.
Determine the responsibilities of the participant organizations
After choosing a collaboration model for the federated learning consortium, the federation owner must determine the responsibilities for the participant organizations.

The federation owner must also do the following when they begin to build a federated learning consortium:
- Coordinate the federated learning effort.
- Design and implement the global ML model and the ML models to share with the participant organizations.
- Define the federated learning rounds: the approach for iterating the ML training process.
- Select the participant organizations that contribute to any given federated learning round. This selection is called a cohort.
- Design and implement a consortium membership verification procedure for the participant organizations.
- Update the global ML model and the ML models to share with the participant organizations.
- Provide the participant organizations with the tools to validate that the federated learning consortium meets their privacy, security, and regulatory requirements.
- Provide the participant organizations with secure and encrypted communication channels.
- Provide the participant organizations with all the necessary non-confidential, aggregated data that they need to complete each federated learning round.
The participant organizations have the following responsibilities:
- Provide and maintain a secure, isolated environment (a silo). The silo is where participant organizations store their own data, and where ML model training is implemented. Participant organizations don't share their own data with other organizations.
- Train the models supplied by the federation owner using their own computing infrastructure and their own local data.
- Share model training results with the federation owner in the form of aggregated data, after removing any PII.
The federation owner and the participant organizations can use Cloud Storage to share updated models and training results.

The federation owner and the participant organizations refine the ML model training until the model meets their requirements.
Implement federated learning on Google Cloud
After establishing the federated learning consortium and determining how the federated learning consortium will collaborate, we recommend that participant organizations do the following:
- Provision and configure the necessary infrastructure for the federated learning consortium.
- Implement the collaboration model.
- Start the federated learning effort.
Provision and configure the infrastructure for the federated learning consortium
When provisioning and configuring the infrastructure for the federated learning consortium, it's the responsibility of the federation owner to create and distribute the workloads that train the federated ML models to the participant organizations. Because a third party (the federation owner) created and provided the workloads, the participant organizations must take precautions when deploying those workloads in their runtime environments.

Participant organizations must configure their environments according to their individual security best practices, and apply controls that limit the scope and the permissions granted to each workload. In addition to following their individual security best practices, we recommend that the federation owner and the participant organizations consider threat vectors that are specific to federated learning.
Implement the collaboration model
After the federated learning consortium infrastructure is prepared, the federation owner designs and implements the mechanisms that let the participant organizations interact with each other. The approach follows the collaboration model that the federation owner chose for the federated learning consortium.
Start the federated learning effort
After implementing the collaboration model, the federation owner implements the global ML model to train, and the ML models to share with the participant organizations. After those ML models are ready, the federation owner starts the first round of the federated learning effort.

During each round of the federated learning effort, the federation owner does the following:
- Distributes the ML models to share with the participant organizations.
- Waits for the participant organizations to deliver the results of the training of the ML models that the federation owner shared.
- Collects and processes the training results that the participant organizations produced.
- Updates the global ML model when they receive appropriate training results from participating organizations.
- Updates the ML models to share with the other members of the consortium when applicable.
- Prepares the training data for the next round of federated learning.
- Starts the next round of federated learning.
Security, privacy, and compliance
This section describes factors that you should consider when you use this reference architecture to design and build a federated learning platform on Google Cloud. This guidance applies to both of the architectures that this document describes.

The federated learning workloads that you deploy in your environments might expose you, your data, your federated learning models, and your infrastructure to threats that might impact your business.

To help you increase the security of your federated learning environments, these reference architectures configure GKE security controls that focus on the infrastructure of your environments. These controls might not be enough to protect you from threats that are specific to your federated learning workloads and use cases. Given the specificity of each federated learning workload and use case, security controls aimed at securing your federated learning implementation are out of the scope of this document. For more information and examples about these threats, see Federated Learning security considerations.
GKE security controls
This section discusses the controls that you apply with these architectures to help you secure your GKE cluster.
Enhanced security of GKE clusters
These reference architectures help you create a GKE cluster that implements the following security settings:
- Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorized networks.
- Use shielded nodes that use a hardened node image with the containerd runtime.
- Increase isolation of tenant workloads using GKE Sandbox.
- Encrypt data at rest by default.
- Encrypt data in transit by default.
- Encrypt cluster secrets at the application layer.
- Optionally encrypt data in use by enabling Confidential Google Kubernetes Engine Nodes.
For more information about GKE security settings, see Harden your cluster's security and About the security posture dashboard.
VPC firewall rules
Virtual Private Cloud (VPC) firewall rules govern which traffic is allowed to or from Compute Engine VMs. The rules let you filter traffic at VM granularity, depending on Layer 4 attributes.

You create a GKE cluster with the default GKE cluster firewall rules. These firewall rules enable communication between the cluster nodes and the GKE control plane, and between nodes and Pods in the cluster.

You apply additional firewall rules to the nodes in the tenant node pool. These firewall rules restrict egress traffic from the tenant nodes. This approach can increase isolation of tenant nodes. By default, all egress traffic from the tenant nodes is denied. Any required egress must be explicitly configured. For example, you create firewall rules to allow egress from the tenant nodes to the GKE control plane, and to Google APIs using Private Google Access. The firewall rules are targeted to the tenant nodes by using the service account for the tenant node pool.
Namespaces
Namespaces let you provide a scope for related resources within a cluster—for example, Pods, Services, and replication controllers. By using namespaces, you can delegate administration responsibility for the related resources as a unit. Therefore, namespaces are integral to most security patterns.

Namespaces are an important feature for control plane isolation. However, they don't provide node isolation, data plane isolation, or network isolation.

A common approach is to create namespaces for individual applications. For example, you might create the namespace myapp-frontend for the UI component of an application.

These reference architectures help you create a dedicated namespace to host the third-party apps. The namespace and its resources are treated as a tenant within your cluster. You apply policies and controls to the namespace to limit the scope of resources in the namespace.
Network policies
Network policies enforce Layer 4 network traffic flows by using Pod-level firewall rules. Network policies are scoped to a namespace.

In the reference architectures that this document describes, you apply network policies to the tenant namespace that hosts the third-party apps. By default, the network policy denies all traffic to and from Pods in the namespace. Any required traffic must be explicitly added to an allowlist. For example, the network policies in these reference architectures explicitly allow traffic to required cluster services, such as the cluster internal DNS and the Cloud Service Mesh control plane.
Config Sync
Config Sync keeps your GKE clusters in sync with configs stored in a source of truth. The Git repository acts as the single source of truth for your cluster configuration and policies. Config Sync is declarative. It continuously checks cluster state and applies the state declared in the configuration file to enforce policies, which helps to prevent configuration drift.

You install Config Sync into your GKE cluster. You configure Config Sync to sync cluster configurations and policies from a source of truth. The synced resources include the following:
- Cluster-level Cloud Service Mesh configuration
- Cluster-level security policies
- Tenant namespace-level configuration and policy, including network policies, service accounts, RBAC rules, and Cloud Service Mesh configuration
Policy Controller
Policy Controller is a dynamic admission controller for Kubernetes that enforces CustomResourceDefinition-based (CRD-based) policies that are executed by the Open Policy Agent (OPA).
Admission controllers are Kubernetes plugins that intercept requests to the Kubernetes API server before an object is persisted, but after the request is authenticated and authorized. You can use admission controllers to limit how a cluster is used.
You install Policy Controller into your GKE cluster. These reference architectures include example policies to help secure your cluster. You automatically apply the policies to your cluster using Config Sync. You apply the following policies:
- Selected policies to help enforce Pod security. For example, you apply policies that prevent pods from running privileged containers and that require a read-only root file system.
- Policies from the Policy Controller template library. For example, you apply a policy that disallows services with type NodePort.
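As an illustration, the template library's K8sBlockNodePort constraint rejects NodePort services cluster-wide; the constraint name below is illustrative.

```yaml
# Reject any Service of type NodePort, cluster-wide.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockNodePort
metadata:
  name: block-node-port
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Service"]
```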
Cloud Service Mesh
Cloud Service Mesh is a service mesh that helps you simplify the management of secure communications across services. These reference architectures configure Cloud Service Mesh so that it does the following:
- Automatically injects sidecar proxies.
- Enforces mTLS communication between services in the mesh.
- Limits outbound mesh traffic to only known hosts.
- Limits inbound traffic only from certain clients.
- Lets you configure network security policies based on service identity rather than based on the IP address of peers on the network.
- Limits authorized communication between services in the mesh. For example, apps in the tenant namespace are only allowed to communicate with apps in the same namespace, or with a set of known external hosts.
- Routes all inbound and outbound traffic through mesh gateways where you can apply further traffic controls.
- Supports secure communication between clusters.
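Two of these behaviors can be sketched with standard Istio-style resources that Cloud Service Mesh accepts. The tenant-a namespace is a placeholder; the first object enforces mesh-wide STRICT mTLS, the second restricts tenant apps to same-namespace callers.

```yaml
# Require mTLS for all workloads in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Allow tenant apps to receive traffic only from their own namespace.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: same-namespace-only
  namespace: tenant-a   # placeholder tenant namespace
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["tenant-a"]
```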
Node taints and affinities
Node taints and node affinity are Kubernetes mechanisms that let you influence how pods are scheduled onto cluster nodes.
Tainted nodes repel pods. Kubernetes won't schedule a Pod onto a tainted node unless the Pod has a toleration for the taint. You can use node taints to reserve nodes for use only by certain workloads or tenants. Taints and tolerations are often used in multi-tenant clusters. For more information, see the dedicated nodes with taints and tolerations documentation.
Node affinity lets you constrain pods to nodes with particular labels. If a pod has a node affinity requirement, Kubernetes doesn't schedule the Pod onto a node unless the node has a label that matches the affinity requirement. You can use node affinity to ensure that pods are scheduled onto appropriate nodes.
You can use node taints and node affinity together to ensure tenant workload pods are scheduled exclusively onto nodes reserved for the tenant.
These reference architectures help you control the scheduling of the tenant apps in the following ways:
- Creating a GKE node pool dedicated to the tenant. Each node in the pool has a taint related to the tenant name.
- Automatically applying the appropriate toleration and node affinity to any Pod targeting the tenant namespace. You apply the toleration and affinity using Policy Controller mutations.
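The effect of the mutation can be pictured as the following Pod spec fragment. The taint key, value, and namespace are placeholders; in the reference architectures, Policy Controller injects these fields automatically rather than you writing them by hand.

```yaml
# A tenant Pod after the toleration and affinity are applied.
apiVersion: v1
kind: Pod
metadata:
  name: tenant-app
  namespace: tenant-a   # placeholder tenant namespace
spec:
  # Tolerate the taint on the tenant's dedicated node pool.
  tolerations:
    - key: tenant
      operator: Equal
      value: tenant-a
      effect: NoSchedule
  # Require scheduling onto nodes labeled for this tenant.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: tenant
                operator: In
                values: ["tenant-a"]
  containers:
    - name: app
      image: us-docker.pkg.dev/example-project/apps/tenant-app:latest  # placeholder
```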
Least privilege
It's a security best practice to adopt a principle of least privilege for your Google Cloud projects and resources like GKE clusters. By using this approach, the apps that run inside your cluster, and the developers and operators that use the cluster, have only the minimum set of permissions required.
These reference architectures help you use least privilege service accounts in the following ways:
- Each GKE node pool receives its own service account. For example, the nodes in the tenant node pool use a service account dedicated to those nodes. The node service accounts are configured with the minimum required permissions.
- The cluster uses Workload Identity Federation for GKE to associate Kubernetes service accounts with Google service accounts. This way, the tenant apps can be granted limited access to any required Google APIs without downloading and storing a service account key. For example, you can grant the service account permissions to read data from a Cloud Storage bucket.
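The Kubernetes side of that association is an annotation on the Kubernetes service account; the service account names and project ID below are placeholders. The referenced Google service account additionally needs a roles/iam.workloadIdentityUser binding for this Kubernetes service account.

```yaml
# Kubernetes service account linked to a Google service account
# through Workload Identity Federation for GKE.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant-app
  namespace: tenant-a   # placeholder tenant namespace
  annotations:
    iam.gke.io/gcp-service-account: tenant-app@example-project.iam.gserviceaccount.com
```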
These reference architectures help you restrict access to cluster resources in the following ways:
- You create a sample Kubernetes RBAC role with limited permissions to manage apps. You can grant this role to the users and groups who operate the apps in the tenant namespace. By applying this limited role to users and groups, those users only have permissions to modify app resources in the tenant namespace. They don't have permissions to modify cluster-level resources or sensitive security settings like Cloud Service Mesh policies.
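A minimal sketch of such a namespace-scoped role and its binding follows. The role name, group, and resource list are illustrative assumptions, not the exact role shipped with the reference architectures.

```yaml
# Namespace-scoped role that can manage app workloads but nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-app-operator
  namespace: tenant-a   # placeholder tenant namespace
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Bind the role to the group that operates the tenant apps.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-app-operators
  namespace: tenant-a
subjects:
  - kind: Group
    name: tenant-a-operators@example.com   # placeholder group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-app-operator
  apiGroup: rbac.authorization.k8s.io
```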
Binary Authorization
Binary Authorization lets you enforce policies that you define about the container images that are deployed in your GKE environment. Binary Authorization allows only container images that conform to your defined policies to be deployed. It disallows the deployment of any other container images.
In this reference architecture, Binary Authorization is enabled with its default configuration. To inspect the Binary Authorization default configuration, see Export the policy YAML file.
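For orientation, a stricter policy than the default might require attestations for all images, in the style of the exported policy YAML. The project and attestor names are placeholders, and this is a sketch rather than the policy used by the reference architectures.

```yaml
# Require an attestation for every deployed image; keep Google-maintained
# system images exempt via the global policy.
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
    - projects/example-project/attestors/built-by-trusted-ci
globalPolicyEvaluationMode: ENABLE
```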
For more information about how to configure policies, see the following specific guidance:
- Google Cloud CLI
- Google Cloud console
- The REST API
- The google_binary_authorization_policy Terraform resource
Cross-organization attestation verification
You can use Binary Authorization to verify attestations generated by a third-party signer. For example, in a cross-silo federated learning use case, you can verify attestations that another participant organization created.
To verify the attestations that a third party created, you do the following:
- Receive the public keys that the third party used to create the attestations that you need to verify.
- Create the attestors to verify the attestations.
- Add the public keys that you received from the third party to the attestors you created.
For more information about creating attestors, see the following specific guidance:
- Google Cloud CLI
- Google Cloud console
- The REST API
- The google_binary_authorization_attestor Terraform resource
Federated learning security considerations
Despite its strict data sharing model, federated learning isn't inherently secure against all targeted attacks, and you should take these risks into account when you deploy either of the architectures described in this document. There's also the risk of unintended information leaks about ML models or model training data. For example, an attacker might intentionally compromise the global ML model or rounds of the federated learning effort, or they might execute a timing attack (a type of side-channel attack) to gather information about the size of the training datasets.
The most common threats against a federated learning implementation are as follows:
- Intentional or unintentional training data memorization. Your federated learning implementation or an attacker might intentionally or unintentionally store data in ways that might be difficult to work with. An attacker might be able to gather information about the global ML model or past rounds of the federated learning effort by reverse engineering the stored data.
- Extraction of information from updates to the global ML model. During the federated learning effort, an attacker might reverse engineer the updates to the global ML model that the federation owner collects from participant organizations and devices.
- The federation owner might compromise rounds. A compromised federation owner might control a rogue silo or device and start a round of the federated learning effort. At the end of the round, the compromised federation owner might be able to gather information about the updates that it collects from legitimate participant organizations and devices by comparing those updates to the one that the rogue silo produced.
- Participant organizations and devices might compromise the global ML model. During the federated learning effort, an attacker might attempt to maliciously affect the performance, the quality, or the integrity of the global ML model by producing rogue or inconsequential updates.
To help mitigate the impact of the threats described in this section, we recommend the following best practices:
- Tune the model to reduce the memorization of training data to a minimum.
- Implement privacy-preserving mechanisms.
- Regularly audit the global ML model, the ML models that you intend to share, the training data, and the infrastructure that you implemented to achieve your federated learning goals.
- Implement a secure aggregation algorithm to process the training results that participant organizations produce.
- Securely generate and distribute data encryption keys using a public key infrastructure.
- Deploy infrastructure to a confidential computing platform.
Federation owners must also take the following additional steps:
- Verify the identity of each participant organization and the integrity of each silo in the case of cross-silo architectures, and the identity and integrity of each device in the case of cross-device architectures.
- Limit the scope of the updates to the global ML model that participant organizations and devices can produce.
Reliability
This section describes design factors that you should consider when you use either of the reference architectures in this document to design and build a federated learning platform on Google Cloud.
When designing your federated learning architecture on Google Cloud, we recommend that you follow the guidance in this section to improve the availability and scalability of the workload, and help make your architecture resilient to outages and disasters.
GKE: GKE supports several different cluster types that you can tailor to the availability requirements of your workloads and to your budget. For example, you can create regional clusters that distribute the control plane and nodes across several zones within a region, or zonal clusters that have the control plane and nodes in a single zone. Both cross-silo and cross-device reference architectures rely on regional GKE clusters. For more information on the aspects to consider when creating GKE clusters, see cluster configuration choices.
Depending on the cluster type and how the control plane and cluster nodes are distributed across regions and zones, GKE offers different disaster recovery capabilities to protect your workloads against zonal and regional outages. For more information on GKE's disaster recovery capabilities, see Architecting disaster recovery for cloud infrastructure outages: Google Kubernetes Engine.
Google Cloud Load Balancing: GKE supports several ways of load balancing traffic to your workloads. The GKE implementations of the Kubernetes Gateway and Kubernetes Service APIs let you automatically provision and configure Cloud Load Balancing to securely and reliably expose the workloads running in your GKE clusters.
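For example, a Gateway API sketch that exposes a workload through a GKE-managed external load balancer might look like the following; the gateway class, names, and port are assumptions for illustration.

```yaml
# Provision an external Application Load Balancer via the GKE Gateway controller.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: external-gateway
  namespace: tenant-a   # placeholder namespace
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
# Route inbound traffic to the tenant app's Service.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: app-route
  namespace: tenant-a
spec:
  parentRefs:
    - name: external-gateway
  rules:
    - backendRefs:
        - name: tenant-app   # placeholder Service
          port: 8080
```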
In these reference architectures, all the ingress and egress traffic goes through Cloud Service Mesh gateways. These gateways let you tightly control how traffic flows inside and outside your GKE clusters.
Reliability challenges for cross-device federated learning
Cross-device federated learning has a number of reliability challenges that are not encountered in cross-silo scenarios. These include the following:
- Unreliable or intermittent device connectivity
- Limited device storage
- Limited compute and memory
Unreliable connectivity can lead to issues such as the following:
- Stale updates and model divergence: When devices experience intermittent connectivity, their local model updates might become stale, representing outdated information compared to the current state of the global model. Aggregating stale updates can lead to model divergence, where the global model deviates from the optimal solution due to inconsistencies in the training process.
- Imbalanced contributions and biased models: Intermittent communication can result in an uneven distribution of contributions from participating devices. Devices with poor connectivity might contribute fewer updates, leading to an imbalanced representation of the underlying data distribution. This imbalance can bias the global model towards the data from devices with more reliable connections.
- Increased communication overhead and energy consumption: Intermittent communication can lead to increased communication overhead, as devices might need to resend lost or corrupted updates. This issue can also increase the energy consumption on devices, especially for those with limited battery life, as they might need to maintain active connections for longer periods to ensure successful transmission of updates.
To help mitigate some of the effects caused by intermittent communication, the reference architectures in this document can be used with the FCP.
A system architecture that executes the FCP protocol can be designed to meet the following requirements:
- Handle long-running rounds.
- Enable speculative execution (rounds can start before the required number of clients has assembled, in anticipation of more clients checking in soon).
- Enable devices to choose which tasks they want to participate in. This approach can enable features like sampling without replacement, which is a sampling strategy where each sample unit of a population has only one chance to be selected. This approach helps to mitigate unbalanced contributions and biased models.
- Be extensible for anonymization techniques like differential privacy (DP) and trusted aggregation (TAG).
To help mitigate limited device storage and compute capabilities, the following techniques can help:
- Understand the maximum capacity available to run the federated learning computation.
- Understand how much data can be held at any particular time.
- Design the client-side federated learning code to operate within the compute and RAM available on the clients.
- Understand the implications of running out of storage, and implement a process to manage this.
Cost optimization
This section provides guidance to optimize the cost of creating and running the federated learning platform on Google Cloud that you establish by using this reference architecture. This guidance applies to both of the architectures that this document describes.
Running workloads on GKE can help you make your environment more cost-optimized by provisioning and configuring your clusters according to your workloads' resource requirements. It also enables features that dynamically reconfigure your clusters and cluster nodes, such as automatically scaling cluster nodes and Pods, and right-sizing your clusters.
For more information about optimizing the cost of your GKE environments, see Best practices for running cost-optimized Kubernetes applications on GKE.
Operational efficiency
This section describes the factors that you should consider to optimize efficiency when you use this reference architecture to create and run a federated learning platform on Google Cloud. This guidance applies to both of the architectures that this document describes.
To increase the automation and monitoring of your federated learning architecture, we recommend that you adopt MLOps principles, which are DevOps principles in the context of machine learning systems. Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management. For more information about MLOps, see MLOps: Continuous delivery and automation pipelines in machine learning.
Performance optimization
This section describes the factors that you should consider to optimize the performance of your workloads when you use this reference architecture to create and run a federated learning platform on Google Cloud. This guidance applies to both of the architectures that this document describes.
GKE supports several features to automatically and manually right-size and scale your GKE environment to meet the demands of your workloads, and help you avoid over-provisioning resources. For example, you can use Recommender to generate insights and recommendations to optimize your GKE resource usage.
When thinking about how to scale your GKE environment, we recommend that you design short-, medium-, and long-term plans for how you intend to scale your environments and workloads. For example, how do you intend to grow your GKE footprint in a few weeks, months, and years? Having a plan ready helps you take full advantage of the scalability features that GKE provides, optimize your GKE environments, and reduce costs. For more information about planning for cluster and workload scalability, see About GKE Scalability.
To increase the performance of your ML workloads, you can adopt Cloud Tensor Processing Units (Cloud TPUs), Google-designed AI accelerators that are optimized for training and inference of large AI models.
Deployment
To deploy the cross-silo and cross-device reference architectures that this document describes, see the Federated Learning on Google Cloud GitHub repository.
What's next
- Explore how you can implement your federated learning algorithms on the TensorFlow Federated platform.
- Learn about Advances and Open Problems in Federated Learning.
- Read about federated learning on the Google AI Blog.
- Watch how Google keeps privacy intact when using federated learning with de-identified, aggregated information to improve ML models.
- Read Towards Federated Learning at Scale.
- Explore how you can implement an MLOps pipeline to manage the lifecycle of the machine learning models.
- For an overview of architectural principles and recommendations that are specific to AI and ML workloads in Google Cloud, see the AI and ML perspective in the Well-Architected Framework.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.
Contributors
Authors:
- Grace Mollison | Solutions Lead
- Marco Ferrari | Cloud Solutions Architect
Other contributors:
- Chloé Kiddon | Staff Software Engineer and Manager
- Laurent Grangeau | Solutions Architect
- Lilian Felix | Cloud Engineer
- Christiane Peters | Cloud Security Architect
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-06-03 UTC.