---
title: Batched Plugin Queries
weight: 9
slug: 0009
extra:
rfd: 9
---

# Batched Plugin Queries

Now that we've made substantial progress in converting legacy Hipcheck analysis
implementations to plugins, we've identified the need for a way to "batch"
queries for performance reasons. For example, the entropy and churn analyses
rely on a "linguist" functionality that evaluates whether a given file in a repo
contains source code. The plugin design philosophy would dictate that we
separate the `linguist` functionality into its own plugin, with an endpoint
`is_likely_source_file(&Path) -> bool`, and have churn and entropy each query
that. But making a gRPC request from the `entropy` plugin for each file in a
potentially large repository is likely to incur substantial runtime overhead.

Plugins can expose query endpoints that take a `Vec<_>` of objects and return a
`Vec<_>` of responses, but because `salsa` memo-ization operates on the entire
query object, it would cache the entire `Vec<_>` as a key. Any future
`Vec`-based queries would therefore not benefit from the memo-ization unless
they contained the exact same elements in the same order.
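
To illustrate with a toy example (a hypothetical sketch using a plain
`HashMap`, not actual Hipcheck or `salsa` code):

```rust
use std::collections::HashMap;

// Toy cache keyed on the whole Vec, mimicking how a Vec-based query key
// would behave under memo-ization.
fn main() {
    let mut cache: HashMap<Vec<String>, Vec<bool>> = HashMap::new();
    let batch: Vec<String> = vec!["a.rs".into(), "b.rs".into()];
    cache.insert(batch.clone(), vec![true, true]);

    // Byte-for-byte identical Vec: cache hit.
    assert!(cache.contains_key(&batch));

    // Same elements, different order: cache miss, despite overlapping work.
    let reordered: Vec<String> = vec!["b.rs".into(), "a.rs".into()];
    assert!(!cache.contains_key(&reordered));

    // Subset of the elements: also a cache miss.
    let subset: Vec<String> = vec!["a.rs".into()];
    assert!(!cache.contains_key(&subset));
}
```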

## Proposed Protocol Change

The current definition of a `Query` gRPC message in the Hipcheck v1 protocol
includes the following fields:

```protobuf
message Query {
  ...
  // incremental computation system will use to cache the response.
  string key = 6;
  // The response for the query, as a JSON object. This will be cached by
  // Hipcheck for future queries matching the publisher name, plugin name,
  // query name, and key.
  string output = 7;
  // Concern chunking is the same as other fields.
  repeated string concern = 8;
}
```

We propose changing the `key` and `output` fields into `repeated` fields. From
a gRPC perspective, this means that multiple `key` and `output` fields can
appear in each message. Compiling this `protobuf` definition into a Rust
language `struct` will have the effect of replacing `key: String` with
`key: Vec<String>`, and likewise for `output`.
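
A sketch of the amended message, keeping the field numbers from the current
definition:

```protobuf
message Query {
  ...
  // Now repeated: one entry per key in the batch.
  repeated string key = 6;
  // Now repeated: output[i] is the response for key[i].
  repeated string output = 7;
  // Concern chunking is the same as other fields.
  repeated string concern = 8;
}
```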

According to the `proto3` [language guide][proto3]:

> "For string, bytes, and message fields, singular is compatible with repeated.
> Given serialized data of a repeated field as input, clients that expect this
> field to be singular will take the last input value if it’s a primitive type
> field or merge all input elements if it’s a message type field. Note that this
> is not generally safe for numeric types, including bools and enums. Repeated
> fields of numeric types are serialized in the packed format by default, which
> will not be parsed correctly when a singular field is expected."

Therefore, making this change should not break compatibility with any plugins
compiled using the existing protocol definition.

## Protocol Changes to Support Chunking Algorithm

Since gRPC has a maximum message size, in Hipcheck core and the SDK we use a
chunking algorithm to break messages estimated to be too large into a series of
acceptably sized "chunks" sent serially.

Without the above proposed changes, `key` and `output` are `String` fields. It
is easy to compose the `key` fields of two chunked messages: you simply append
the content of the second message's `key` field to that of the first. But when
`key` becomes a `Vec`, combining two `key` fields becomes ambiguous. To
maximize space usage, we may "split" an element of the `Vec`, sending half in
the first message and half in the second. If we naively `extend` the `Vec`s,
we will fail to re-assemble the split element. To signal that the first element
of `key` in a subsequent message should be combined with the last element in
the `key` aggregator, we need to add a field to the `Query` message definition
in protobuf:

```protobuf
bool split = 9;
```

Whenever we split a vec element, we will set this field to `true` in the same
message. The recipient will note the "latest" `Vec` field to have content
(in descending order: `key`, `output`, `concern`), and will append the first
element of that field in the next chunk to the last element of that field in
the current message. For example:

```
{
  key: vec!["abc", "def"],
  output: vec!["hi"],
  concern: vec![],
  split: true,
}
```

Upon receiving the above message, the recipient will expect to append the first
element of the next message's `output` vector to `"hi"`.
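
A hypothetical sketch of this reassembly logic on the receiving side; the type
and function names here are illustrative, not the actual Hipcheck core or SDK
implementation:

```rust
// Illustrative stand-in for the chunked Query message.
struct QueryChunk {
    key: Vec<String>,
    output: Vec<String>,
    concern: Vec<String>,
    split: bool,
}

/// Which field a `split: true` refers to: the latest field with content,
/// checked in descending order `key`, `output`, `concern`.
fn split_field(c: &QueryChunk) -> Option<usize> {
    if !c.concern.is_empty() {
        Some(2)
    } else if !c.output.is_empty() {
        Some(1)
    } else if !c.key.is_empty() {
        Some(0)
    } else {
        None
    }
}

/// Merge the next chunk into the aggregator. `pending_split` is the field
/// index the *previous* chunk split, if its `split` flag was set.
fn merge(agg: &mut QueryChunk, mut next: QueryChunk, pending_split: Option<usize>) {
    if let Some(idx) = pending_split {
        let (agg_f, next_f) = match idx {
            0 => (&mut agg.key, &mut next.key),
            1 => (&mut agg.output, &mut next.output),
            _ => (&mut agg.concern, &mut next.concern),
        };
        // Glue the two halves of the split element back together.
        if let (Some(last), false) = (agg_f.last_mut(), next_f.is_empty()) {
            last.push_str(&next_f.remove(0));
        }
    }
    agg.key.extend(next.key);
    agg.output.extend(next.output);
    agg.concern.extend(next.concern);
    agg.split = next.split;
}
```

In the worked example above, `split_field` would return the index of `output`,
so the receiver glues `"hi"` to the first element of the next chunk's `output`
vector before extending the remaining fields.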

## Associated Change to Querying Plugins from Rust SDK

With the above protocol changes, under the hood the `key` field of a query
object will be a `Vec<String>` instead of a singular `String`.

We propose to leave the current `PluginEngine::query(key: Value)` function API
intact. Under the hood, we will simply wrap `key` in a single-element vector.
We will add a second function, `PluginEngine::batch_query(keys: Vec<Value>)`,
which will do no additional wrapping of the `keys` field and will insert it
as-is into the protobuf `Query` struct once each key has been JSON-serialized.
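
A minimal sketch of the proposed API surface, assuming a stand-in
`PluginEngine` type with transport and error handling elided:

```rust
use serde_json::Value;

// Stand-in for the real SDK type; transport details are omitted.
struct PluginEngine;

impl PluginEngine {
    /// Existing single-key API, left intact for callers. Internally the
    /// key is simply wrapped into a one-element batch.
    fn query(&mut self, key: Value) -> Value {
        self.batch_query(vec![key]).remove(0)
    }

    /// Proposed batched API: each key is JSON-serialized and placed into
    /// the protobuf `Query` message's repeated `key` field as-is.
    fn batch_query(&mut self, keys: Vec<Value>) -> Vec<Value> {
        // ... serialize each key, send the Query message, await the
        // matching response, and deserialize each output ...
        keys.into_iter().map(|_| Value::Null).collect() // placeholder
    }
}
```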

## Associated Change to Hipcheck Core

Since we want to make full use of `salsa` memo-ization, when the Hipcheck core
query engine receives a request from a plugin to another plugin that contains a
`key: Vec<String>` with length greater than one, it will split out each element
and make a separate `salsa` request for it.

As a result, **no plugin should expect to receive a query where the `key`
field has more than one element**; receiving one should be treated as an error.

Once the plugin has been queried on each element of `key`, Hipcheck core will
package the results into an array of equal length, where the result at each
index corresponds to the key value at the same index of `key`, and return the
generated `QueryResponse` object to the requesting plugin.
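
A hypothetical sketch of this fan-out in Hipcheck core; the function names are
illustrative only:

```rust
/// Fan a batched request out into one memoized query per key, so per-key
/// caching works regardless of how callers batch their keys.
fn handle_batched_request(keys: Vec<String>) -> Vec<String> {
    keys.into_iter()
        .map(query_single) // each element becomes its own salsa query
        .collect() // results[i] corresponds to keys[i]
}

/// Stand-in for a single-key `salsa` query: return the memoized result or
/// forward a one-element Query message to the target plugin.
fn query_single(key: String) -> String {
    key // placeholder result
}
```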

## Protocol Change to Support Submit Chunking

As an additional feature proposed by this RFD, we will update the gRPC protocol
to support chunking of `Submit` (outbound) queries. Currently, the query state
`ReplyInProgress` indicates that a query response is one chunk in a series of
fragmented response messages that should be combined on the receiving side into
a single message. The `ReplyComplete` state indicates that the current reply
message will not be followed by any additional chunks.

When initially designing the protocol, we did not expect to need an analogous
system on the outbound side, but running Hipcheck against the Linux kernel
surfaced exactly that need. This RFD proposes to rename the existing `Submit`
query state to `SubmitComplete` and to add another variant to the `QueryState`
enum called `SubmitInProgress`. Because protobuf identifies enum values by
number rather than by name, renaming a variant does not break backwards
compatibility.

`Submit` query chunking will use the exact same chunking algorithm as `Reply`,
but will look for different `QueryState` variants.
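
A sketch of the amended enum; the variant names follow the RFD, but the exact
spellings and numeric values here are assumptions, not the actual Hipcheck
protocol definition:

```protobuf
enum QueryState {
  QUERY_STATE_UNSPECIFIED = 0;
  // Renamed from the existing `Submit` variant; the numeric value is
  // unchanged, so the rename is wire-compatible.
  QUERY_STATE_SUBMIT_COMPLETE = 1;
  QUERY_STATE_REPLY_IN_PROGRESS = 2;
  QUERY_STATE_REPLY_COMPLETE = 3;
  // New variant: this Submit message is one chunk in a series, with more
  // to follow.
  QUERY_STATE_SUBMIT_IN_PROGRESS = 4;
}
```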

[proto3]: https://protobuf.dev/programming-guides/proto3/
