docs: Add initdata specification

Signed-off-by: Xynnn007 <[email protected]>
confidential-containers · Mar 8, 2024 · e6163a7
1 parent 9b8ef6c
commit e6163a7
Showing 1 changed file with 281 additions and 0 deletions.
diff --git a/kbs/docs/initdata.md b/kbs/docs/initdata.md
@@ -0,0 +1,281 @@
+# Initdata Specification
+
+The Initdata Specification defines the key data structures and algorithms
+used to inject arbitrary data from an untrusted host into a TEE. To guarantee the
+integrity of that data, it relies on either the host-data field of the TEE
+evidence or the dynamic measurement capability of a (v)TPM.
+
+## Introduction
+
+A TEE gives users an isolated execution environment that prevents untrusted
+hosts and external software stacks from eavesdropping on or tampering with user
+data in use within the TEE.
+
+Remote attestation verifies whether the footprint of the software
+running in the TEE meets expectations. The measured software is
+often provided by the hardware vendor (for example, the firmware and TCB security version
+of the TEE hardware) or by a software vendor (for example, the guest kernel for VMs). These
+components are relatively static, which means they may be identical across
+multiple deployments.
+
+In some scenarios, users want to inject additional information, such as
+[policy files for kata](https://github.com/kata-containers/kata-containers/blob/main/docs/how-to/how-to-use-the-kata-agent-policy.md),
+[configuration files for components running in guest](https://github.com/confidential-containers/guest-components/tree/main/confidential-data-hub#configuration-file),
+or [identity files to specify the identity of TEE](https://github.com/keylime/rust-keylime/blob/master/keylime-agent.conf),
+into the TEE guest at launch.
+
+Compared with the static software running in a TEE (like the guest firmware for a TDX VM
+or the libos for an SGX enclave), this information changes between
+deployments and usually consists of configuration.
+We call this information _Initdata_. The initdata mechanism
+provides a way to protect its integrity through remote attestation. Note
+that the initdata mechanism does not protect confidentiality, because
+the untrusted host can see the plaintext of the data.
+
+To achieve this goal, this spec defines the following:
+- A data structure named **Initdata Metadata**. The user provides this structure,
+which holds the initdata in key-value format, to the untrusted host, which injects it into
+the TEE at launch. The encoding of this structure is not restricted;
+JSON and TOML are both acceptable choices. See [Initdata Metadata](#initdata-metadata).
+- A data integrity binding mechanism. It tells the untrusted host how to bind the
+digest of the `data` part of the Initdata Metadata to the TEE-specific hardware
+field in the evidence. The verifier checks this field during remote
+attestation. See
+[Integrity Binding for Different TEEs](#integrity-binding-for-different-tees).
+- A data serialization method. It serializes the Initdata Metadata
+into a canonical form, so that different representations of the same Initdata Metadata
+yield the same cryptographic hash. See
+[Data Canonicalization Algorithm](#data-canonicalization-algorithm).
+
+This spec does not define how the Initdata Metadata is delivered into the TEE.
+Each project will have its own way to do this. In Confidential Containers,
+kata-runtime and kata-agent cooperate to deliver it.
+
+## Terminology
+
+This section introduces the terminology used in this spec to avoid ambiguity.
+
+- `Initdata`: Data that needs to be injected when the TEE is started. This data
+requires integrity protection but does not require confidentiality protection.
+- `Initdata Metadata`: A data structure that includes the initdata and the additional information
+needed to calculate the initdata digest.
+- `Initdata digest`: The digest of the initdata. It is used as the value of the
+TEE HOSTDATA field.
+- `HOSTDATA`: A field that can be bound to a specific TEE instance and that
+is included in the TEE-signed remote attestation report. Typical examples are
+Intel TDX's `mr_config_id`, AMD SNP's `hostdata` and Arm CCA's `CCA_REALM_PERSONALIZATION_VALUE`.
+To avoid confusion with AMD SNP's own `hostdata` field: unless a specific
+platform such as SNP is named, HOSTDATA refers to the corresponding field of
+whichever TEE platform is in use.
+
+
+## Specifications
+
+### Initdata Metadata
+
+Initdata Metadata defines a standardized format for initdata. It
+does not prescribe a specific encoding, but the encoding
+must be able to express key-value pairs. Typical encodings that meet
+this requirement include JSON, TOML and YAML.
+
+An Initdata Metadata SHOULD have the following fields:
+- `version`: The format version of the initdata metadata. The version number provides
+extensibility. This spec defines version `0.1.0`.
+- `algorithm`: The hash algorithm used to calculate the value to set as `HOSTDATA`. Typical
+algorithms are `sha-256`, `sha-384` and `sha-512`. The name follows
+[IANA Hash Function Textual Names](https://www.iana.org/assignments/hash-function-text-names/hash-function-text-names.xhtml).
+- `data`: A key-value map holding the concrete content of the initdata. Nested maps[^1] are
+also supported.
+- `digest`: The digest of the canonicalized `data` field, computed with the hash algorithm specified by
+`algorithm`. Note that the `digest` itself is not integrity protected in the current
+specification, so it may be tampered with. This field is therefore only
+a reference. It is **NOT** recommended that the software stack inside a TEE rely on it.
+
+[^1]: A nested map looks like this:
+```json
+{
+  "data": {
+    "key1": "value1",
+    "object": {
+      "key2": "value2"
+    }
+  }
+}
+```
+
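As a minimal sketch of the field list above, the decoded document can be validated and held in a small structure. The names `InitdataMetadata` and `parse_metadata` are hypothetical, and treating `digest` as optional is an assumption based on its advisory role, not something this spec states.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Optional

SUPPORTED_VERSION = "0.1.0"


@dataclass(frozen=True)
class InitdataMetadata:
    """Hypothetical in-memory form of an Initdata Metadata document."""
    version: str
    algorithm: str
    data: Mapping
    digest: Optional[str] = None  # advisory only; not integrity protected


def parse_metadata(doc: Mapping) -> InitdataMetadata:
    """Validate a decoded key-value document (from JSON, TOML, ...)."""
    for field in ("version", "algorithm", "data"):
        if field not in doc:
            raise ValueError(f"missing field: {field}")
    if doc["version"] != SUPPORTED_VERSION:
        raise ValueError(f"unsupported version: {doc['version']}")
    if not isinstance(doc["data"], Mapping):
        raise ValueError("`data` must be a key-value map")
    return InitdataMetadata(
        doc["version"], doc["algorithm"], doc["data"], doc.get("digest")
    )


# Nested maps inside `data` are accepted as-is.
metadata = parse_metadata({
    "version": "0.1.0",
    "algorithm": "sha384",
    "data": {"key1": "value1", "object": {"key2": "value2"}},
})
```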
+#### Examples for Different Encodings
+
+Suppose there is an initdata metadata with the following values:
+- `version`: `0.1.0`
+- `algorithm`: `sha384`
+- `data`: two entries. The first has the key `attestation-agent.json` and a value
+that is the text of a JSON file. The second has the key `policy.rego` and a value that is the text of a rego file.
+- `digest`: the digest of the canonicalized `data`.
+
+##### JSON version
+
+The JSON version of the initdata metadata looks like the following:
+```json
+{
+  "algorithm": "sha384",
+  "version": "0.1.0",
+  "data": {
+    "attestation-agent.json": "{\"aa_kbc_params\": \"cc_kbc::http://127.0.0.1:8080\"}",
+    "policy.rego": "package agent_policy\nimport future.keywords.in\nimport future.keywords.every\nimport input\n\n# Default values, returned by OPA when rules cannot be evaluated to true.\ndefault CopyFileRequest := false\ndefault CreateContainerRequest := false\ndefault CreateSandboxRequest := true\ndefault DestroySandboxRequest := true\ndefault ExecProcessRequest := false\ndefault GetOOMEventRequest := true\ndefault GuestDetailsRequest := true\ndefault OnlineCPUMemRequest := true\ndefault PullImageRequest := true\ndefault ReadStreamRequest := false\ndefault RemoveContainerRequest := true\ndefault RemoveStaleVirtiofsShareMountsRequest := true\ndefault SignalProcessRequest := true\ndefault StartContainerRequest := true\ndefault StatsContainerRequest := true\ndefault TtyWinResizeRequest := true\ndefault UpdateEphemeralMountsRequest := true\ndefault UpdateInterfaceRequest := true\ndefault UpdateRoutesRequest := true\ndefault WaitProcessRequest := true\ndefault WriteStreamRequest := false"
+  },
+  "digest": "e4744fddbb00dd4201326d16ee5a647debf86562a77f1aca176d23017c1cf88821af5c05c03b1ac9b46afa93b9a1d368"
+}
+```
+
+As shown, embedding file contents in JSON requires many escape characters. JSON is
+better suited to simple key-value pairs, like the following:
+```json
+{
+  "algorithm": "sha384",
+  "version": "0.1.0",
+  "data": {
+    "key1": "value1",
+    "key2": "value2"
+  },
+  "digest": "0d25829cc97872f0086fbb7eec3a08c0b76899986dc135e9545434ae9eba7d48196efa7323519f00384881b659c02087"
+}
+```
+
+##### TOML version
+
+If you want to avoid escape characters and use the plain text of ASCII file contents as initdata,
+TOML is a better fit. A TOML version of the initdata metadata looks like the following:
+```toml
+algorithm = "sha384"
+version = "0.1.0"
+digest = "e4744fddbb00dd4201326d16ee5a647debf86562a77f1aca176d23017c1cf88821af5c05c03b1ac9b46afa93b9a1d368"
+
+[data]
+"attestation-agent.json" = '''
+{
+"kbs_addr": "http://172.18.0.1:8080"
+}
+'''
+
+"policy.rego" = '''
+package agent_policy
+
+import future.keywords.in
+import future.keywords.every
+
+import input
+
+# Default values, returned by OPA when rules cannot be evaluated to true.
+default CopyFileRequest := false
+default CreateContainerRequest := false
+default CreateSandboxRequest := true
+default DestroySandboxRequest := true
+default ExecProcessRequest := false
+default GetOOMEventRequest := true
+default GuestDetailsRequest := true
+default OnlineCPUMemRequest := true
+default PullImageRequest := true
+default ReadStreamRequest := false
+default RemoveContainerRequest := true
+default RemoveStaleVirtiofsShareMountsRequest := true
+default SignalProcessRequest := true
+default StartContainerRequest := true
+default StatsContainerRequest := true
+default TtyWinResizeRequest := true
+default UpdateEphemeralMountsRequest := true
+default UpdateInterfaceRequest := true
+default UpdateRoutesRequest := true
+default WaitProcessRequest := true
+default WriteStreamRequest := false'''
+```
+
+### Integrity Binding for Different TEEs
+
+There are multiple ways to bind the integrity of initdata to the TEE evidence.
+Many TEE platforms support a HOSTDATA field. The untrusted host can set this field
+when launching the TEE, and the field is included in the
+TEE evidence for remote attestation.
+
+Platforms and the corresponding evidence fields:
+- Intel TDX: `mr_config_id`, 48 bytes. `mr_owner` and `mr_owner_config` have similar
+properties, but only `mr_config_id` is used for this purpose.
+- AMD SNP: `hostdata`, 32 bytes.
+- Arm CCA: `CCA_REALM_PERSONALIZATION_VALUE`, 64 bytes.
+- Intel SGX: `CONFIGID`, 64 bytes.
+
+When users want to deploy a TEE, they need to prepare an initdata metadata. The host
+(probably untrusted) SHOULD start the TEE instance with the metadata's `digest` as HOSTDATA,
+or calculate the digest itself from `algorithm` and `data`.
+
+The software outside the TEE delivers the initdata metadata into the TEE in
+a way that depends on the concrete architecture. This spec does not define the
+delivery mechanism. However, before using the received
+initdata, the software stack inside the TEE MUST check that the digest of the received
+initdata matches the one in the evidence.
+
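The in-TEE check described above could be sketched as follows. The field sizes come from the platform list; the helper names `to_hostdata` and `initdata_matches_evidence` are hypothetical. How a digest shorter than the field is fitted into it is not defined by this spec, so the zero padding on the right used here is purely an assumption, as is rejecting digests that are too long.

```python
import hashlib
import hmac

# Field sizes in bytes, taken from the platform list above.
HOSTDATA_SIZE = {
    "tdx": 48,  # mr_config_id
    "snp": 32,  # hostdata
    "cca": 64,  # CCA_REALM_PERSONALIZATION_VALUE
    "sgx": 64,  # CONFIGID
}


def to_hostdata(digest: bytes, platform: str) -> bytes:
    """Fit a digest into the platform field.

    ASSUMPTION: shorter digests are right-padded with zero bytes and
    longer digests are rejected. The spec does not define this.
    """
    size = HOSTDATA_SIZE[platform]
    if len(digest) > size:
        raise ValueError(f"{len(digest)}-byte digest does not fit in {size} bytes")
    return digest.ljust(size, b"\x00")


def initdata_matches_evidence(canonical_data: bytes, algorithm: str,
                              platform: str, evidence_field: bytes) -> bool:
    """Recompute the digest inside the TEE and compare it with the evidence."""
    digest = hashlib.new(algorithm, canonical_data).digest()
    expected = to_hostdata(digest, platform)
    # Constant-time comparison of the two byte strings.
    return hmac.compare_digest(expected, evidence_field)


# Example: the host sets the field at launch, the guest verifies it later.
canonical = b'{"key1":"value1","key2":"value2"}'
launch_field = to_hostdata(hashlib.sha384(canonical).digest(), "tdx")
ok = initdata_matches_evidence(canonical, "sha384", "tdx", launch_field)
```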
+Other platforms, such as (v)TPM-based platforms, can record the initdata digest
+by extending a PCR with it before the initdata is used. This also binds the initdata's integrity
+to the TEE evidence.
+
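For the (v)TPM path, a PCR extend replaces the register value with the hash of the old value concatenated with the new measurement. The sketch below only models that arithmetic in software; it does not talk to a TPM, and the choice of the SHA-384 bank and the helper name `pcr_extend` are assumptions for illustration.

```python
import hashlib


def pcr_extend(pcr_value: bytes, measurement: bytes, bank: str = "sha384") -> bytes:
    """Return the new PCR value: H(old_value || measurement)."""
    return hashlib.new(bank, pcr_value + measurement).digest()


# Model a PCR that is still at its all-zero reset value (48 bytes for SHA-384).
pcr = bytes(48)

# Digest of some canonicalized initdata (example bytes only).
initdata_digest = hashlib.sha384(b'{"key1":"value1","key2":"value2"}').digest()

# Record the digest before the initdata is used.
pcr = pcr_extend(pcr, initdata_digest)
```

Because the extend operation is one-way and order-sensitive, a verifier that knows the expected initdata digest can replay the extend and compare the result with the quoted PCR value.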
+### Data Canonicalization Algorithm
+
+When JSON or TOML is used to represent an initdata metadata, characters such as
+line breaks do not affect the semantics but do change the bytes of the
+document. This makes it difficult to calculate a
+deterministic digest from different forms of the same initdata, so a canonicalization
+algorithm is needed. With this algorithm, all forms of the same initdata produce
+the same digest. The forms can differ in at least the following ways:
+
+1. The encoding of the initdata metadata, e.g. JSON and TOML.
+2. Initdata metadata with different expressions but the same semantics, e.g. multi-line
+JSON versus single-line JSON, like the following:
+```json
+{
+  "algorithm": "sha384",
+  "version": "0.1.0",
+  "data": {
+    "key1": "value1",
+    "key2": "value2"
+  },
+  "digest": "0d25829cc97872f0086fbb7eec3a08c0b76899986dc135e9545434ae9eba7d48196efa7323519f00384881b659c02087"
+}
+```
+and
+```json
+{"algorithm":"sha384","version":"0.1.0","data":{"key1":"value1","key2":"value2"},"digest":"0d25829cc97872f0086fbb7eec3a08c0b76899986dc135e9545434ae9eba7d48196efa7323519f00384881b659c02087"}
+```
+
+The algorithm has two steps:
+
+1. Transform the `data` field of the original initdata metadata into JSON. This step converts
+the different encodings of initdata metadata into JSON without changing the semantics. This is practical
+because JSON, TOML and YAML all support key-value pairs, strings, and nesting.
+2. Canonicalize the JSON produced in step 1 according to [RFC 8785](https://www.rfc-editor.org/rfc/rfc8785). This includes removing insignificant whitespace and sorting object members.
+
+The resulting string is then hashed with the algorithm named in `algorithm` to obtain the `digest`.
+
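The two steps can be sketched with the Python standard library. This is only an approximation of RFC 8785: `json.dumps` with sorted keys and compact separators matches the canonical form for maps whose keys and values are ordinary strings, but it does not implement the RFC's number serialization or its UTF-16 based key ordering. The function names are hypothetical, and accepting both `sha384` and `sha-384` spellings is an assumption.

```python
import hashlib
import json


def canonicalize(data: dict) -> bytes:
    """Approximate RFC 8785 for maps whose leaves are strings."""
    text = json.dumps(data, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def initdata_digest(metadata: dict) -> str:
    """Hash the canonicalized `data` field with the named algorithm."""
    algorithm = metadata["algorithm"].replace("-", "")  # "sha-384" -> "sha384"
    return hashlib.new(algorithm, canonicalize(metadata["data"])).hexdigest()


# Two forms of the same metadata: different layout and key order.
multi_line = json.loads("""
{
  "algorithm": "sha384",
  "version": "0.1.0",
  "data": {
    "key2": "value2",
    "key1": "value1"
  }
}
""")
one_line = json.loads(
    '{"algorithm":"sha384","version":"0.1.0",'
    '"data":{"key1":"value1","key2":"value2"}}'
)

# Both forms canonicalize to the same bytes, so the digests agree.
same_bytes = canonicalize(multi_line["data"]) == canonicalize(one_line["data"])
same_digest = initdata_digest(multi_line) == initdata_digest(one_line)
```

A production implementation would more likely use a dedicated RFC 8785 library so that numbers and non-BMP keys are handled as the RFC requires.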
+## Use cases
+
+### Confidential Containers
+
+In Confidential Containers, initdata is needed to deliver the following configurations into the
+guest:
+- KBS public key and address
+- Kata agent policy
+- Configuration files for Attestation Agent, Confidential Data Hub and, probably in the future, kata-agent.
+
+The initdata metadata is injected into the guest by kata-runtime via kata-agent's API. After receiving
+the initdata metadata, kata-agent calculates the digest itself from the metadata and calls Attestation
+Agent's `CheckInitData()` API to check that the received initdata matches the digest
+in the evidence. Next, it uses each key-value pair in `data` as a file name and
+file content, and places the files in the `/run/confidential-containers/initdata` directory.
+
+The encoding format is TOML, which makes it easy to view and modify specific content.
+
+#### Fields of Data
+
+To prevent a malicious host from injecting arbitrary files, especially binaries, into the TEE, only defined `key`s in
+`data` are accepted. The defined `key`s are the following:
+- `attestation-agent.json`: The configuration file of Attestation Agent.
+- `confidential-data-hub.toml`: The configuration file of Confidential Data Hub.
+- `policy.rego`: The rego policy file used by kata.
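
The key filtering and file placement described in this section might look like the sketch below. It is not the kata-agent implementation: the function name `write_initdata` is hypothetical, and rejecting the whole map when any unknown key is present (rather than skipping that key) is an assumption.

```python
from pathlib import Path

# Keys defined in the list above; anything else is refused.
ALLOWED_KEYS = {
    "attestation-agent.json",
    "confidential-data-hub.toml",
    "policy.rego",
}
INITDATA_DIR = Path("/run/confidential-containers/initdata")


def write_initdata(data: dict, target_dir: Path = INITDATA_DIR) -> list:
    """Write each allowed key as a file named after the key.

    ASSUMPTION: any unknown key causes the whole map to be rejected.
    """
    unknown = set(data) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected initdata keys: {sorted(unknown)}")
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in data.items():
        path = target_dir / name
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written
```

Because the accepted names are a fixed set of plain file names, a key such as `../bin/sh` never reaches the file system; it is refused before any path is built.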