Remote attestation of disaggregated machines
This content was last updated in May 2024, and represents the status quo as of the time it was written. Google's security policies and systems may change going forward, as we continually improve protection for our customers.
This document describes the Google approach to data center machine attestation. The architecture described in this document is designed to be integrated with open standards such as Trusted Platform Module (TPM), Security Protocol and Data Model (SPDM), and Redfish. For new standards or reference implementations that are proposed by Google and related to data center machine attestation, see our Platform Integrity (PINT) project in GitHub. This document is intended for security executives, security architects, and auditors.
Overview
Increasingly, Google designs and deploys disaggregated data center machines. Instead of a single root of trust, many machines contain separate roots of trust, including roots of trust for measurement (RTM), storage, update, and recovery. Each RTM serves a subsection of the entire machine. For example, a machine might have one RTM that measures and attests to what was booted on the main CPU, and another RTM that measures and attests to what was booted on a SmartNIC that is plugged into a PCIe slot. The following diagram shows an example machine.
The complexity of multiple RTMs in a machine adds to the enormous scale and expectations of data center machines, and to the many complications that can occur because of human, hardware, or software faults. In summary, ensuring firmware integrity of our fleet is a non-trivial endeavor.
The system described in this document is designed to make the problem of remote attestation for disaggregated machines more manageable. This attestation infrastructure is extensible, letting it adapt to serve ever-more-complex machines as they appear in the data center.
By sharing this work, we aim to provide our perspective on how disaggregated machine attestation can be done at scale. Through collaboration with industry partners and contributions to standards bodies such as Distributed Management Task Force (DMTF), Trusted Computing Group (TCG), and Open Compute Project (OCP), we intend to continue supporting security innovation in this space.
Recommended RTM properties
This section introduces some properties that we recommend for RTMs.
RTM hardware integration
When a processor is paired with an RTM, the RTM should capture measurements over the first mutable code that runs on that processor. Subsequent mutable code should have its measurements captured and reported to a root of trust before the code runs. This arrangement produces a measured boot chain that allows for robust attestation of the security-critical state of the processor.
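As a minimal sketch of a measured boot chain, the following example folds the digest of each boot stage into a running register before that stage would run. The hash-extend construction (similar in spirit to a TPM PCR extend), the SHA-256 choice, and the names `extend` and `measure_boot_chain` are assumptions made for illustration; the source doesn't specify how Google's RTMs accumulate measurements.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def extend(register: bytes, stage_image: bytes) -> bytes:
    """Fold the digest of one boot stage into the running register."""
    stage_digest = hashlib.sha256(stage_image).digest()
    return hashlib.sha256(register + stage_digest).digest()


def measure_boot_chain(stages: list[bytes]) -> bytes:
    """Measure every stage, in order, before it would be allowed to run."""
    register = bytes(DIGEST_SIZE)  # assumed reset value: all zeros
    for image in stages:
        register = extend(register, image)  # measure first...
        # ...and only then hand control to `image` (not modeled here).
    return register


if __name__ == "__main__":
    good = measure_boot_chain([b"bootloader-v1", b"kernel-v1"])
    tampered = measure_boot_chain([b"bootloader-v1", b"kernel-modified"])
    print("intended stack :", good.hex())
    print("tampered stack :", tampered.hex())
    print("distinguishable:", good != tampered)
```

Because each step hashes the previous register value, the final value depends on both the content and the order of every stage, so a verifier can compare a single value against its expectation.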
RTM hardware and firmware identity attestation
Each RTM should have a signing key pair that is used to emit attestations for external validation. The certificate chain for this key pair should include cryptographic evidence of the RTM's unique hardware identity and the firmware identity for any mutable code that runs within the RTM. The certificate chain should be rooted in the RTM manufacturer. This approach lets machines recover from critical RTM firmware vulnerabilities.
The Device Identifier Composition Engine (DICE) specification is a formalization of the pattern that is used in our attestation solution. The RTM manufacturer certifies a unique device key pair, which certifies an alias key pair that is specific to the RTM's hardware identity and firmware image. The alias key certificate chain contains a measurement of the RTM firmware and the RTM's serial number. Verifiers can be confident that any data signed by a given alias key was emitted from an RTM that is described by the cryptographic hardware and firmware identity measurements that are embedded within that alias key's certificate chain.
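The layering described above can be sketched as a pair of one-way derivations. This is a simplified illustration, not the DICE specification: real implementations derive asymmetric key pairs and issue certificates, whereas this sketch only derives symmetric seeds with HMAC-SHA256. The names `derive`, `device_seed`, and `alias_seed`, the labels, and the sample secret are all hypothetical.

```python
import hashlib
import hmac


def derive(secret: bytes, label: bytes) -> bytes:
    """One-way derivation step; HMAC-SHA256 stands in for the KDF."""
    return hmac.new(secret, label, hashlib.sha256).digest()


def device_seed(unique_device_secret: bytes) -> bytes:
    """Seed for the device key pair: depends only on the hardware secret."""
    return derive(unique_device_secret, b"device-key")


def alias_seed(unique_device_secret: bytes, rtm_firmware: bytes) -> bytes:
    """Seed for the alias key pair: depends on hardware secret and firmware."""
    firmware_digest = hashlib.sha256(rtm_firmware).digest()
    compound_id = derive(unique_device_secret, firmware_digest)
    return derive(compound_id, b"alias-key")


if __name__ == "__main__":
    uds = bytes.fromhex("11" * 32)  # hypothetical per-device secret
    print("device seed      :", device_seed(uds).hex())
    print("alias seed, fw 1 :", alias_seed(uds, b"rtm-fw-1.0").hex())
    print("alias seed, fw 2 :", alias_seed(uds, b"rtm-fw-1.1").hex())
```

In this sketch, a firmware update changes the alias seed while the device seed stays the same, which mirrors why a patched RTM can be distinguished from a vulnerable one without replacing its hardware identity.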
Remote attestation operations
The attestation scheme is designed to ensure that user data and jobs are only issued to machines that are running their intended boot stack, while still allowing fleet maintenance automation to occur at scale to remediate issues. The job scheduler service, hosted in our internal cloud, can challenge the collection of RTMs within the machine, and compare the resulting attested measurements to a policy that is unique to that machine. The scheduler only issues jobs and user data to machines if the attested measurements conform to the machine's policy.
Remote attestation includes the following two operations:
- Attestation policy generation, which occurs whenever a machine's intended hardware or software is changed.
- Attestation verification, which occurs at defined points in our machine management flows. One of these points is just before work is scheduled on a machine. The machine only gains access to jobs and user data after attestation verification passes.
The attestation policy
Google uses a signed machine-readable document, referred to as a policy, to describe the hardware and software that is expected to be running within a machine. This policy can be attested by the machine's collection of RTMs. The following details for each RTM are represented in the policy:
- The trusted identity root certificate that can validate attestations that are emitted by the RTM.
- The globally unique hardware identity that uniquely identifies the RTM.
- The firmware identity that describes the expected version that the RTM should be running.
- The measurement expectations for each boot stage that is reported by the RTM.
- An identifier for the RTM, analogous to a Redfish resource name.
- An identifier that links the RTM to its physical location within a machine. This identifier is analogous to a Redfish resource name, and is used by automated machine repair systems.
The policy also contains a globally unique revocation serial number that helps prevent unauthorized policy rollback. The following diagram shows a policy.
The diagram shows the following items in the policy:
- The signature provides policy authentication.
- The revocation serial number provides policy freshness to help prevent rollback.
- The RTM expectations enumerate details for each RTM in the machine.
The following sections describe these items in more detail.
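To make the shape of such a policy concrete, the following sketch models the per-RTM expectations and the revocation serial number as Python dataclasses and serializes them deterministically so that they could be signed. Every field name, the JSON encoding, and the sample values are assumptions for illustration; the source doesn't describe the actual document format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RtmExpectation:
    name: str                  # identifier for the RTM
    location: str              # physical location, for repair systems
    identity_root_cert: str    # trusted root that validates attestations
    hardware_id: str           # globally unique hardware identity
    firmware_id: str           # expected RTM firmware version
    boot_stage_measurements: dict[str, str]  # stage name -> expected digest


@dataclass(frozen=True)
class AttestationPolicy:
    revocation_serial: int
    rtm_expectations: tuple[RtmExpectation, ...]

    def canonical_bytes(self) -> bytes:
        """Deterministic encoding that a signature could cover."""
        return json.dumps(
            asdict(self), sort_keys=True, separators=(",", ":")
        ).encode()


if __name__ == "__main__":
    policy = AttestationPolicy(
        revocation_serial=1042,
        rtm_expectations=(
            RtmExpectation(
                name="cpu0-rtm",
                location="/chassis/board0/rtm0",
                identity_root_cert="example-vendor-root",
                hardware_id="rtm-serial-0001",
                firmware_id="rtm-fw-1.0",
                boot_stage_measurements={"bootloader": "ab12", "kernel": "cd34"},
            ),
        ),
    )
    print(policy.canonical_bytes().decode())
```

Sorting keys and fixing separators keeps the byte encoding stable, so the same policy content always produces the same bytes to sign and verify.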
Policy assembly
When a machine's hardware is assembled or repaired, a hardware model is created that defines the expected RTMs on that machine. Our control plane helps ensure that this information remains current across events such as repairs that involve part swaps or hardware upgrades.
In addition, the control plane maintains a set of expectations about the software that is intended to be installed on a machine, along with expectations about which RTMs should measure which software. The control plane uses these expectations, along with the hardware model, to generate a signed and revocable attestation policy that describes the expected state of the machine.
The signed policy is then written to persistent storage on the machine that it describes. This approach helps reduce the number of network and service dependencies that are needed by the remote verifier when attesting a machine. Rather than query a database for the policy, the verifier can fetch the policy from the machine itself. This approach is an important design feature, as the job schedulers have strict SLO requirements and must remain highly available. Reducing the network dependencies of these machines on other services helps to reduce the risk of outages. The following diagram shows this flow of events.
The diagram describes the following steps that the control plane completes in the policy assembly process:
- Derives the attestation policy from the software package assignment and machine hardware model.
- Signs the policy.
- Stores the policy on the data center machine.
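The three steps above can be sketched as follows. This is an illustrative sketch only: the dictionary layouts, function names, and file name are hypothetical, and HMAC-SHA256 is used as a stand-in for the signature because the Python standard library has no asymmetric signing. A real control plane would sign with a private key whose public half is distributed to verifiers.

```python
import hashlib
import hmac
import json
import tempfile
from pathlib import Path


def derive_policy(hardware_model: dict[str, str],
                  software_assignment: dict[str, str],
                  revocation_serial: int) -> dict:
    """Step 1: combine the hardware model with the software intent.

    hardware_model maps RTM name -> hardware identity;
    software_assignment maps RTM name -> expected measurement (hex).
    """
    rtms = {}
    for rtm_name, hardware_id in hardware_model.items():
        if rtm_name not in software_assignment:
            raise ValueError(f"no software assigned to RTM {rtm_name!r}")
        rtms[rtm_name] = {
            "hardware_id": hardware_id,
            "expected_measurement": software_assignment[rtm_name],
        }
    return {"revocation_serial": revocation_serial, "rtms": rtms}


def sign_policy(policy: dict, signing_key: bytes) -> dict:
    """Step 2: sign the policy (HMAC stand-in for a real signature)."""
    payload = json.dumps(policy, sort_keys=True).encode()
    tag = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return {"policy": policy, "signature": tag}


def store_policy(envelope: dict, path: Path) -> None:
    """Step 3: write the signed policy to the machine's persistent storage."""
    path.write_text(json.dumps(envelope, sort_keys=True))


if __name__ == "__main__":
    policy = derive_policy(
        {"cpu-rtm": "hw-0001", "nic-rtm": "hw-0002"},
        {"cpu-rtm": "ab12", "nic-rtm": "cd34"},
        revocation_serial=1042,
    )
    envelope = sign_policy(policy, signing_key=b"demo-key-not-secret")
    with tempfile.TemporaryDirectory() as machine_disk:
        target = Path(machine_disk) / "attestation_policy.json"
        store_policy(envelope, target)
        print(target.read_text())
```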
Policy revocation
The hardware and software intent for a given machine changes over time. When the intent changes, old policies must be revoked. Each signed attestation policy includes a unique revocation serial number. Verifiers obtain the appropriate public key for authenticating a signed policy, and the appropriate certificate revocation list (CRL) for ensuring that the policy is still valid.
Interactively querying a key server or revocation database affects the job schedulers' availability. Instead, Google uses an asynchronous model. The set of public keys that are used to authenticate signed attestation policies is pushed as part of each machine's base operating system image. The CRL is pushed asynchronously using the same centralized revocation deployment system that Google uses for other credential types. This system is already engineered for reliable operation during normal conditions, with the ability to perform rapid emergency pushes during incident response conditions.
By using verification public keys and CRL files that are stored locally on the verifier's machine, verifiers can validate attestation statements from remote machines without having any external services in the critical path.
Retrieving attestation policies and validating measurements
The process of remotely attesting a machine consists of the following stages:
- Retrieving and validating the attestation policy.
- Obtaining attested measurements from the machine's RTMs.
- Evaluating the attested measurements against the policy.
The following diagram and sections describe these stages further.
Retrieving and validating the attestation policy
The remote verifier retrieves the signed attestation policy for the machine. As mentioned in Policy assembly, for availability reasons, the policy is stored as a signed document on the target machine.
To verify that the returned policy is authentic, the remote verifier checks the policy's signature against the verifier's local set of public keys and consults the verifier's local copy of the relevant CRL. These checks help ensure that the retrieved policy was cryptographically signed by a trusted entity and that the policy wasn't revoked.
Obtaining attested measurements
The remote verifier challenges the machine, requesting measurements from each RTM. The verifier ensures freshness by including cryptographic nonces in these requests. An on-machine entity, such as a baseboard management controller (BMC), routes each request to its respective RTM, gathers the signed responses, and sends them back to the remote verifier. This on-machine entity is unprivileged from an attestation perspective, as it serves only as a transport for the RTM's signed measurements.
Google uses internal APIs for attesting to measurements. We also contribute enhancements to Redfish to enable off-machine verifiers to challenge a BMC for an RTM's measurements by using SPDM. Internal machine routing is done over implementation-specific protocols and channels, including the following:
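A minimal sketch of this challenge flow follows. It is not Google's internal API or an SPDM implementation: the `Rtm` class, `bmc_route`, and `verify_response` are hypothetical names, and HMAC-SHA256 with a shared key stands in for the RTM's asymmetric signature. The point it illustrates is that the relay never holds a key, so it can forward responses but can't forge or replay them.

```python
import hashlib
import hmac
import secrets

NONCE_SIZE = 32  # fixed length, so nonce + measurement is unambiguous


class Rtm:
    """Holds a signing key and the measurement it recorded at boot."""

    def __init__(self, key: bytes, measurement: bytes):
        self._key = key
        self._measurement = measurement

    def attest(self, nonce: bytes) -> dict:
        signature = hmac.new(
            self._key, nonce + self._measurement, hashlib.sha256
        ).digest()
        return {"nonce": nonce, "measurement": self._measurement,
                "signature": signature}


def bmc_route(rtms: dict[str, Rtm], rtm_name: str, nonce: bytes) -> dict:
    """Unprivileged transport: forwards the request, returns the response."""
    return rtms[rtm_name].attest(nonce)


def verify_response(response: dict, key: bytes, expected_nonce: bytes) -> bool:
    """Check that the response is both fresh and correctly signed."""
    expected_signature = hmac.new(
        key, response["nonce"] + response["measurement"], hashlib.sha256
    ).digest()
    fresh = hmac.compare_digest(response["nonce"], expected_nonce)
    signed = hmac.compare_digest(response["signature"], expected_signature)
    return fresh and signed


if __name__ == "__main__":
    key = secrets.token_bytes(32)
    rtms = {"cpu-rtm": Rtm(key, b"measurement-of-boot-stack")}

    nonce = secrets.token_bytes(NONCE_SIZE)
    response = bmc_route(rtms, "cpu-rtm", nonce)
    print("fresh response accepted:", verify_response(response, key, nonce))

    later_nonce = secrets.token_bytes(NONCE_SIZE)
    print("replayed response accepted:",
          verify_response(response, key, later_nonce))
```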
- Redfish over subnet
- Intelligent Platform Management Interface (IPMI)
- Management Component Transport Protocol (MCTP) over i2c/i3c
- PCIe
- Serial Peripheral Interface (SPI)
- USB
Evaluating attested measurements
Google's remote verifier validates the signatures that are emitted by each RTM, ensuring that they root back to the RTM's identity that is included in the machine's attestation policy. Hardware and firmware identities that are present in the RTM's certificate chain are validated against the policy, ensuring that each RTM is the correct instance and runs the correct firmware. To ensure freshness, the signed cryptographic nonce is checked. Finally, the attested measurements are evaluated to ensure that they correspond with the policy's expectations for that device.
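The sequence of checks in this paragraph can be sketched as a single evaluation function that returns every reason an RTM's evidence fails its policy. The `Expectation` and `Evidence` types and all field names are hypothetical. Cryptographic signature verification is out of scope for this sketch and is represented by the `signature_ok` flag, which a real verifier would compute with a cryptography library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """What the policy says about one RTM."""
    identity_root: str
    hardware_id: str
    firmware_id: str
    measurements: dict[str, str]  # boot stage -> expected digest


@dataclass(frozen=True)
class Evidence:
    """What one RTM reported, after parsing its certificate chain."""
    chain_root: str
    signature_ok: bool  # assumed result of real signature verification
    hardware_id: str
    firmware_id: str
    nonce: bytes
    measurements: dict[str, str]


def evaluate(evidence: Evidence, expected: Expectation,
             challenge_nonce: bytes) -> list[str]:
    """Return the list of failed checks; an empty list means success."""
    failures = []
    if not evidence.signature_ok or evidence.chain_root != expected.identity_root:
        failures.append("signature does not chain to the policy's identity root")
    if evidence.hardware_id != expected.hardware_id:
        failures.append("unexpected RTM hardware identity")
    if evidence.firmware_id != expected.firmware_id:
        failures.append("unexpected RTM firmware identity")
    if evidence.nonce != challenge_nonce:
        failures.append("stale or mismatched nonce")
    if evidence.measurements != expected.measurements:
        failures.append("boot measurements do not match the policy")
    return failures


if __name__ == "__main__":
    expected = Expectation("vendor-root", "hw-0001", "rtm-fw-1.0",
                           {"bootloader": "ab12", "kernel": "cd34"})
    evidence = Evidence("vendor-root", True, "hw-0001", "rtm-fw-1.0",
                        b"nonce-1", {"bootloader": "ab12", "kernel": "ffff"})
    for reason in evaluate(evidence, expected, b"nonce-1"):
        print("FAIL:", reason)
```

Collecting all failures instead of stopping at the first one is a design choice of this sketch; it gives repair automation more to work with when deciding how to remediate a machine.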
Reacting to remote attestation results
After attestation is complete, the results must be used to determine the fate of the machine being attested. As shown in the diagram, there are two possible results: the attestation is successful and the machine is issued task credentials and user data, or the attestation fails and alerts are sent to the repairs infrastructure.
The following sections provide more information about these processes.
Failed attestation
If attestation of a machine isn't successful, Google doesn't use the machine to serve customer jobs. Instead, an alert is sent to the repairs infrastructure, which attempts to automatically reimage the machine. Although attestation failures might be due to malicious intent, most attestation failures are due to bugs in software rollouts. Therefore, rollouts with rising attestation failures are stopped automatically to help prevent more machines from failing attestation. When this event occurs, an alert is sent to SREs. For machines that aren't fixed by automated reimaging, the rollout is rolled back, or there is a rollout of fixed software. Until a machine undergoes successful remote attestation again, it isn't used to serve customer jobs.
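One way to picture the automatic rollout stop is a monitor that tracks attestation outcomes for machines touched by a rollout and halts once the failure rate crosses a threshold. This is a sketch under stated assumptions: the source doesn't describe the actual signal or thresholds, and the `RolloutMonitor` name, the 5% rate, and the 20-sample minimum are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RolloutMonitor:
    max_failure_rate: float = 0.05  # hypothetical threshold
    min_samples: int = 20           # avoid halting on a tiny sample
    attempts: int = 0
    failures: int = 0
    halted: bool = False

    def record(self, attestation_passed: bool) -> bool:
        """Record one attestation result; return True once the rollout is halted."""
        self.attempts += 1
        if not attestation_passed:
            self.failures += 1
        enough_data = self.attempts >= self.min_samples
        if enough_data and self.failures / self.attempts > self.max_failure_rate:
            self.halted = True  # stays halted until operators intervene
        return self.halted


if __name__ == "__main__":
    monitor = RolloutMonitor(max_failure_rate=0.25, min_samples=4)
    for passed in [True, True, True, False, False]:
        halted = monitor.record(passed)
        print(f"passed={passed} attempts={monitor.attempts} "
              f"failures={monitor.failures} halted={halted}")
```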
Successful attestation
If remote attestation of a machine is successful, Google uses the machine to serve production jobs such as VMs for Google Cloud customers or image processing for Google Photos. Google requires meaningful job actions that involve networked services to be gated behind short-lived LOAS task credentials. These credentials are granted over a secure connection after a successful attestation challenge, and provide privileges required by the job. For more information about these credentials, see Application Layer Transport Security.
Software attestation is only as good as the infrastructure that builds that software. To help ensure that resulting artifacts are an accurate reflection of our intent, we have invested significantly in the integrity of our build pipeline. For more information about a standard that was proposed by Google to address software supply chain integrity and authenticity, see Software Supply Chain Integrity.
What's next
Learn how BeyondProd helps Google's data center machines establish secure connections.