From afaeb74df7ec7bbc63b1e7b2660a87b390af7673 Mon Sep 17 00:00:00 2001
From: Xynnn007 <xynnn@linux.alibaba.com>
Date: Thu, 7 Mar 2024 16:52:16 +0800
Subject: [PATCH] docs: Add initdata specification

Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
---
 kbs/docs/initdata.md | 247 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 247 insertions(+)
 create mode 100644 kbs/docs/initdata.md

diff --git a/kbs/docs/initdata.md b/kbs/docs/initdata.md
new file mode 100644
index 0000000000..28af2177bc
--- /dev/null
+++ b/kbs/docs/initdata.md
@@ -0,0 +1,247 @@
+# Initdata Specification
+
+The Initdata Specification defines the key data structures and algorithms
+for injecting arbitrary data from an untrusted host into a TEE. To guarantee the
+integrity of that data, it leverages either the hostdata field of the TEE
+evidence or the dynamic measurement capability of a (v)TPM.
+
+## Introduction
+
+A TEE gives users an isolated execution environment that prevents untrusted
+hosts and external software stacks from eavesdropping on or tampering with user
+data in use within the TEE.
+
+Remote attestation verifies whether the footprint of the software
+running in the TEE meets expectations. The measured software is
+often provided by the hardware vendor (like the firmware and TCB security version
+of the TEE hardware) or a software vendor (like the guest kernel for VMs). These
+components are relatively static, which means they may be the same across
+multiple deployments.
+
+In some scenarios, users also want to inject other information, such as
+[policy files for kata](https://github.com/kata-containers/kata-containers/blob/main/docs/how-to/how-to-use-the-kata-agent-policy.md),
+[configuration files for components running in guest](https://github.com/confidential-containers/guest-components/tree/main/confidential-data-hub#configuration-file),
+[identity files to specify the identity of TEE](https://github.com/keylime/rust-keylime/blob/master/keylime-agent.conf)
+into the TEE guest when launching.
+
+Compared with the static software running in a TEE (like the guest firmware for a TDX VM
+or the libOS for an SGX enclave), this information changes between
+deployments and usually consists of configuration.
+We call this information _Initdata_. The initdata mechanism
+provides a way to protect its integrity through remote attestation. Note
+that the initdata mechanism does not protect confidentiality, because
+the untrusted host can see the plaintext of the data.
+
+To achieve this goal, we define the following:
+- A data structure named **Initdata**. The user provides this structure, which
+contains arbitrary data in key-value format, to the untrusted host for injection into
+the TEE at launch. We do not restrict the encoding of this data structure, so
+JSON, TOML and YAML are all acceptable. This is introduced in [Initdata](#initdata).
+- A data integrity binding mechanism. It guides the untrusted host to bind the
+digest of the `data` part (initdata data) of the Initdata to the hardware TEE-specific
+field in the evidence. The verifier checks this field during remote
+attestation. This is introduced in
+[Integrity Binding for Different TEEs](#integrity-binding-for-different-tees).
+- A data serialization method. This method serializes the initdata data
+into a canonical form, which yields a consistent cryptographic hash for
+different forms of the same initdata. This is introduced in
+[Data Canonicalization Algorithm](#data-canonicalization-algorithm).
+
+This spec does not define how the initdata is delivered into the TEE.
+Each project will have its own way to do this. For Confidential Containers,
+kata-runtime and kata-agent will collaborate to provide this function.
+
+## Terminology
+
+This section will introduce the terminology used in this spec to avoid ambiguity.
+
+- `Initdata`: A data structure that includes initdata data and other information
+that will help to calculate the initdata digest. The whole data structure will
+be delivered into the guest.
+- `Initdata Metadata`: Metadata fields of initdata, such as `algorithm` and `version`.
+They are used to calculate the initdata digest.
+- `Initdata data`: Data that needs to be injected when the TEE is started. This data
+requires integrity protection but does not require confidentiality protection. In
+initdata, this will be included inside the `data` section.
+- `Initdata digest`: Digest of the initdata data calculated following this spec.
+It will be used as the value of the TEE hostdata/initdata field.
+- `TEE initdata/hostdata`: Fields that can be bound to a specific TEE instance. This field
+is included in the TEE-signed remote attestation report. Typical examples are
+Intel TDX's `mr_config_id`, AMD SNP's `hostdata` and Arm CCA's `CCA_REALM_PERSONALIZATION_VALUE`.
+To avoid confusion with the `hostdata` field of AMD SNP: unless we
+name a specific platform such as SNP, this term refers to the corresponding fields of
+the various TEE platforms.
+
+## Specifications
+
+### Initdata
+
+Initdata defines a standardized structure format. Please note that it
+does not indicate the specific encoding format, but requires that the encoding format
+must support the expression of key-value data pairs. Typical encodings that meet
+this requirement include JSON, TOML and YAML, etc.
+
+An initdata SHOULD have the following fields:
+- `version`: The format version of the initdata metadata. The version number provides
+extensibility. The version defined in this spec is `0.1.0`.
+- `algorithm`: The hash algorithm used to calculate the value set as TEE initdata/hostdata. Typical
+algorithms are `sha-256`, `sha-384` and `sha-512`. The name follows
+[IANA Hash Function Textual Names](https://www.iana.org/assignments/hash-function-text-names/hash-function-text-names.xhtml).
+- `data`: A key-value map from string to string that holds the concrete content of the initdata.
+- `digest`: The digest of the canonicalized `data` field, computed with the hash algorithm specified by
+`algorithm`. Note that the `digest` itself is not integrity protected in the current
+specification, so it may be tampered with. This field can therefore only be used as
+a reference. It **CANNOT** be trusted if it was sent from an untrusted host.
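+
+As an illustration of the fields above, the following is a minimal Python sketch of an
+in-memory initdata, independent of the encoding it was parsed from. The class name
+`Initdata`, the helper `from_mapping` and the `SUPPORTED_ALGORITHMS` set are hypothetical
+and not defined by this spec. Accepting both `sha384` and `sha-384` spellings is also an
+assumption made for the example.
+
+```python
+from dataclasses import dataclass, field
+
+# Illustrative set only; the spec lists these as typical, not exhaustive.
+SUPPORTED_ALGORITHMS = {"sha256", "sha384", "sha512"}
+
+
+@dataclass
+class Initdata:
+    """Hypothetical in-memory form of an initdata."""
+
+    version: str
+    algorithm: str
+    data: dict[str, str] = field(default_factory=dict)
+    digest: str = ""  # reference only, never trusted
+
+    @classmethod
+    def from_mapping(cls, raw: dict) -> "Initdata":
+        """Build an Initdata from an already parsed JSON/TOML/YAML document."""
+        # Assumption: treat "sha-384" and "sha384" as the same algorithm.
+        algorithm = raw["algorithm"].replace("-", "").lower()
+        if algorithm not in SUPPORTED_ALGORITHMS:
+            raise ValueError(f"unsupported algorithm: {raw['algorithm']}")
+        data = raw.get("data", {})
+        # `data` is a map from string to string.
+        if not all(isinstance(k, str) and isinstance(v, str) for k, v in data.items()):
+            raise TypeError("`data` must map strings to strings")
+        return cls(raw["version"], algorithm, dict(data), raw.get("digest", ""))
+```
+
+Rejecting non-string values early keeps the later canonicalization step simple, because
+only strings ever need to be serialized.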
+
+#### Examples for Different Encodings
+
+Suppose there is an initdata with the following values:
+- `version`: `0.1.0`
+- `algorithm`: `sha384`
+- `data`: two entries. The first has the key `attestation-agent.json` and a value
+that is a JSON string. The second has the key `policy.rego` and a value that is the text of a Rego file.
+- `digest`: the digest of the canonicalized `data`.
+
+##### JSON version
+
+The JSON version of the initdata looks like the following:
+```json
+{
+  "algorithm": "sha384",
+  "version": "0.1.0",
+  "data": {
+    "attestation-agent.json": "{\"aa_kbc_params\": \"cc_kbc::http://127.0.0.1:8080\"}",
+    "policy.rego": "package agent_policy\nimport future.keywords.in\nimport future.keywords.every\nimport input\n\n# Default values, returned by OPA when rules cannot be evaluated to true.\ndefault CopyFileRequest := false\ndefault CreateContainerRequest := false\ndefault CreateSandboxRequest := true\ndefault DestroySandboxRequest := true\ndefault ExecProcessRequest := false\ndefault GetOOMEventRequest := true\ndefault GuestDetailsRequest := true\ndefault OnlineCPUMemRequest := true\ndefault PullImageRequest := true\ndefault ReadStreamRequest := false\ndefault RemoveContainerRequest := true\ndefault RemoveStaleVirtiofsShareMountsRequest := true\ndefault SignalProcessRequest := true\ndefault StartContainerRequest := true\ndefault StatsContainerRequest := true\ndefault TtyWinResizeRequest := true\ndefault UpdateEphemeralMountsRequest := true\ndefault UpdateInterfaceRequest := true\ndefault UpdateRoutesRequest := true\ndefault WaitProcessRequest := true\ndefault WriteStreamRequest := false"
+  },
+  "digest": "e4744fddbb00dd4201326d16ee5a647debf86562a77f1aca176d23017c1cf88821af5c05c03b1ac9b46afa93b9a1d368"
+}
+```
+
+This encoding involves many escape characters. JSON is better suited to simple
+key-value pairs, like the following:
+```json
+{
+  "algorithm": "sha384",
+  "version": "0.1.0",
+  "data": {
+    "key1": "value1",
+    "key2": "value2"
+  },
+  "digest": "0d25829cc97872f0086fbb7eec3a08c0b76899986dc135e9545434ae9eba7d48196efa7323519f00384881b659c02087"
+}
+```
+
+##### TOML version
+
+If you want to avoid escape characters and use the plain text of ASCII files as initdata,
+TOML is a better fit. A TOML version of the initdata looks like the following:
+```toml
+algorithm = "sha384"
+version = "0.1.0"
+digest = "e4744fddbb00dd4201326d16ee5a647debf86562a77f1aca176d23017c1cf88821af5c05c03b1ac9b46afa93b9a1d368"
+
+[data]
+"attestation-agent.json" = '''
+{
+"kbs_addr": "http://172.18.0.1:8080"
+}
+'''
+
+"policy.rego" = '''
+package agent_policy
+
+import future.keywords.in
+import future.keywords.every
+
+import input
+
+# Default values, returned by OPA when rules cannot be evaluated to true.
+default CopyFileRequest := false
+default CreateContainerRequest := false
+default CreateSandboxRequest := true
+default DestroySandboxRequest := true
+default ExecProcessRequest := false
+default GetOOMEventRequest := true
+default GuestDetailsRequest := true
+default OnlineCPUMemRequest := true
+default PullImageRequest := true
+default ReadStreamRequest := false
+default RemoveContainerRequest := true
+default RemoveStaleVirtiofsShareMountsRequest := true
+default SignalProcessRequest := true
+default StartContainerRequest := true
+default StatsContainerRequest := true
+default TtyWinResizeRequest := true
+default UpdateEphemeralMountsRequest := true
+default UpdateInterfaceRequest := true
+default UpdateRoutesRequest := true
+default WaitProcessRequest := true
+default WriteStreamRequest := false'''
+```
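+
+To show that the encoding does not change the content, the sketch below parses the
+simple key-value initdata once from JSON and once from TOML and compares the results.
+It assumes Python 3.11 or later, where `tomllib` is part of the standard library. The
+`digest` field is left out for brevity, and the variable names are illustrative.
+
+```python
+import json
+import tomllib  # standard library since Python 3.11
+
+json_form = '{"algorithm": "sha384", "version": "0.1.0", "data": {"key1": "value1", "key2": "value2"}}'
+
+toml_form = """
+algorithm = "sha384"
+version = "0.1.0"
+
+[data]
+key1 = "value1"
+key2 = "value2"
+"""
+
+# Both encodings parse into the same key-value structure.
+from_json = json.loads(json_form)
+from_toml = tomllib.loads(toml_form)
+same = from_json == from_toml
+
+# In a TOML multi-line literal string, the newline right after the opening
+# ''' is dropped, while every other newline is part of the value.
+multiline = tomllib.loads("v = '''\nline1\nline2'''")["v"]
+```
+
+The second part matters for file contents such as `policy.rego`: a newline placed before
+the closing `'''` becomes part of the value and therefore part of the digest input.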
+
+### Integrity Binding for Different TEEs
+
+There are multiple ways to bind the integrity of initdata data to the TEE evidence.
+Many TEE platforms support a TEE initdata field. The untrusted host can set this field
+when launching the TEE, and the field is then included in the
+TEE evidence for remote attestation.
+
+Platforms and the corresponding evidence fields:
+- Intel TDX: `mr_config_id`, 48 bytes. `mr_owner` and `mr_owner_config` have similar
+properties, but we select only `mr_config_id` for this use.
+- AMD SNP: `hostdata`, 32 bytes.
+- Arm CCA: `CCA_REALM_PERSONALIZATION_VALUE`, 64 bytes.
+- Intel SGX: `CONFIGID`, 64 bytes.
+
+When users want to deploy a TEE, they need to prepare an initdata. The host
+(probably untrusted) SHOULD start the TEE instance with the initdata's `digest` as the TEE initdata
+(or calculate the digest itself from `algorithm` and `data`).
+
+The software outside the TEE delivers the initdata into the TEE in
+a way that depends on the concrete architecture. This spec does not define the
+delivery mechanism. However, before using the received
+initdata, the software stack inside the TEE **MUST** check whether the digest (recalculated
+from `data`, not read from the `digest` field, which the host could have tampered with) of the received
+initdata matches the one inside the evidence.
+
+Other platforms, such as (v)TPM based platforms, can record the initdata digest
+by extending a PCR before using the initdata. This also binds the integrity of the initdata
+to the TEE evidence.
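+
+The check performed inside the TEE can be sketched as follows. This is a minimal
+illustration, not a reference implementation: the function name `verify_initdata` is
+hypothetical, `evidence_field` is assumed to hold exactly the raw digest bytes, and any
+padding or truncation needed to fit a platform's field size is out of scope because this
+spec does not define it. The `json.dumps` call is a simplified stand-in for the
+canonicalization described in
+[Data Canonicalization Algorithm](#data-canonicalization-algorithm).
+
+```python
+import hashlib
+import hmac
+import json
+
+
+def verify_initdata(data: dict[str, str], algorithm: str, evidence_field: bytes) -> bool:
+    """Recompute the digest of `data` and compare it with the evidence value.
+
+    The `digest` field shipped with the initdata is deliberately not an input:
+    the host could have replaced it.
+    """
+    # Simplified canonical form: sorted keys, no insignificant whitespace.
+    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
+    recomputed = hashlib.new(algorithm, canonical.encode("utf-8")).digest()
+    # Constant-time comparison of the two byte strings.
+    return hmac.compare_digest(recomputed, evidence_field)
+```
+
+A caller would refuse to use the initdata whenever this function returns `False`.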
+
+### Data Canonicalization Algorithm
+
+When we use JSON or TOML to represent an initdata, characters such as
+line breaks do not affect the semantics but do change the bytes of the encoded
+initdata. This makes it difficult to calculate a
+deterministic digest from different forms of the same initdata data. We need a canonicalization
+algorithm, so that all forms of the same initdata produce
+the same initdata digest. The differences in form include at least the following:
+
+1. The encoding of the initdata, e.g. JSON and TOML.
+2. Initdata with different textual representations but the same semantics, e.g. multi-line
+JSON vs. single-line JSON, like the following
+```json
+{
+  "algorithm": "sha384",
+  "version": "0.1.0",
+  "data": {
+    "key1": "value1",
+    "key2": "value2"
+  },
+  "digest": "0d25829cc97872f0086fbb7eec3a08c0b76899986dc135e9545434ae9eba7d48196efa7323519f00384881b659c02087"
+}
+```
+and
+```json
+{"algorithm":"sha384","version":"0.1.0","data":{"key1":"value1","key2":"value2"},"digest":"0d25829cc97872f0086fbb7eec3a08c0b76899986dc135e9545434ae9eba7d48196efa7323519f00384881b659c02087"}
+```
+
+The algorithm has two steps:
+
+1. Transform the original initdata data, i.e. the `data` field, into JSON. This step requires converting
+different encodings of the initdata data into JSON without changing the semantics. This is practical
+because JSON, TOML and YAML all support key-value pairs, strings, and nesting.
+2. Canonicalize the JSON resulting from step 1. The canonicalization follows [RFC 8785](https://www.rfc-editor.org/rfc/rfc8785).
+This includes removing extra whitespace characters and ordering object members.
+
+After performing the algorithm, hash the resulting string with `algorithm` to obtain the `digest`.
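+
+The two steps and the final hash can be sketched in Python as below. The function names
+`canonicalize` and `initdata_digest` are hypothetical. The `json.dumps` call only
+approximates RFC 8785: it should produce the same output for a string-to-string map whose
+keys stay within the Basic Multilingual Plane, but RFC 8785 sorts keys by UTF-16 code
+units, so a dedicated implementation is preferable in practice. Accepting both `sha384`
+and `sha-384` is an assumption of this example.
+
+```python
+import hashlib
+import json
+
+
+def canonicalize(data: dict[str, str]) -> bytes:
+    """Approximate the RFC 8785 form of a string-to-string map."""
+    # Sorted keys, no insignificant whitespace, non-ASCII kept as UTF-8.
+    text = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
+    return text.encode("utf-8")
+
+
+def initdata_digest(data: dict[str, str], algorithm: str) -> str:
+    """Hash the canonical form of `data` and return the hex digest."""
+    # Assumption: map "sha-384" and "sha384" to the same hashlib name.
+    name = algorithm.replace("-", "").lower()
+    return hashlib.new(name, canonicalize(data)).hexdigest()
+
+
+# Two textual forms of the same `data`: different whitespace and key order.
+pretty = json.loads('{\n  "key2": "value2",\n  "key1": "value1"\n}')
+compact = json.loads('{"key1":"value1","key2":"value2"}')
+
+digest_pretty = initdata_digest(pretty, "sha384")
+digest_compact = initdata_digest(compact, "sha-384")
+```
+
+Both inputs canonicalize to the same bytes, so `digest_pretty` and `digest_compact` are
+equal even though the original texts differ.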