From 0d38bac1e351e312f1d59aa1d80820d1be09901d Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Fri, 18 Oct 2024 16:18:56 -0700
Subject: [PATCH 1/7] Batching and error transmission RFC draft

---
 docs/rfcs/batching-process-design.md | 134 +++++++++++++++++++++++++++
 1 file changed, 134 insertions(+)
 create mode 100644 docs/rfcs/batching-process-design.md

diff --git a/docs/rfcs/batching-process-design.md b/docs/rfcs/batching-process-design.md
new file mode 100644
index 00000000000..4a03e8400c3
--- /dev/null
+++ b/docs/rfcs/batching-process-design.md
@@ -0,0 +1,134 @@
# Error transmission through a batching processor with concurrency

Establish normative guidelines for components that batch telemetry to follow so that the batchprocessor and exporterhelper-based batch_sender behave in similar ways with good defaults.

## Motivation

We are motivated, first, to have consistent behavior across two forms of batching process: (1) `batchprocessor`, and (2) `exporterhelper/internal/batch_sender`. Today, these two core components exhibit different behaviors.

Second, to establish conditions and requirements for error transmission, tracing instrumentation, and concurrency from these components.

Third, to establish how batching processors should handle incoming context deadlines.

## Explanation

Here, "batching process" refers to specifically the batch processor
(i.e., a processor component) or the exporterhelper's batch_sender
(i.e., part of an exporter component).
Both are core components, and it is an explicit goal that these two
options for batching cover a wide range of requirements.
We prefer to improve the core batching-process components instead
of introduce new components, in order to achieve desired batching
behavior.

We use the term "request" as a signal-independent descriptor of the
concrete data types used in Collector APIs (e.g., a `plog.Logs`,
`ptrace.Traces`, `pmetric.Metrics`).

We use the term "export" generically to refer to the next action taken
by a component, which is to call either `Consume()` or `Send()` on the
next component or sender in the pipeline.

The primary goal of a batching process is to combine and split
requests. Small requests and large requests go into a batching
process, and moderately-sized requests come out. Batching processes
can be used to amortize overhead costs associated with small requests,
and batching processes can be used to prevent requests from exceeding
request-size limits later in a pipeline.

## Detailed design

### Batching process configuration

A batching process is usually configured with a timeout, which limits
individual requests from waiting too long for the arrival of a
sufficiently-large batch. At a high level, the typical configuration
parameters of a batching process are:

1. Minimum acceptable size (required)
2. Maximum acceptable size (optional)
3. Timeout (optional)

Today, both batching processes restrict size configuration to an item
count, [however the desire to use request count and byte count (in
some encoding) is well recognized](https://github.com/open-telemetry/opentelemetry-collector/issues/9462).

### Sequence of Events

A batching process breaks the normal flow of Context through a
Collector pipeline. Here are the events that take place as a request
makes its way through a batching process:

A. The request arrives when the caller calls a `Consume()` or `Send()`
   method on this component with Context and data.
B. The request is added to the currently-pending batch.
C. The batching process calls export one or more times containing data from the original request.
D. The batching process receives the response from the export(s), possibly with an error.
E. The `Consume()` or `Send()` call returns control to the caller, possibly with an error.

The two batch processors execute these steps with different sequences.

+ +In the batch processor, we observe two independent sequences. One is +A ⮕ B ⮕ E: the request arrives, then is placed in a batch, then +returns success. The other is B ⮕ C ⮕ D: once in a batch, the request +is exported, then errors (if any) are logged. + +In the batch_sender, we observe a single sequence, A ⮕ B ⮕ C ⮕ D ⮕ E: +the request arrives, is placed in a batch, the batch is sent, the +response is received, and the caller returns the error (if any). + +To resolve the inconsistency, this document proposes to modify the +batch processor to use a single sequence, i.e., A ⮕ B ⮕ C ⮕ D ⮕ E. + +### Request handling + +There are a number of open questions related to the arriving request +and its context. + +- The arriving request has no items of telemetry. Does the batching process return success immediately? + +Consider the arriving deadline: + +- The arriving Context deadline has already expired. Does the batching process fail the request immediately? +- The arriving Context deadline expires while waiting for the export(s). What happens? + +Considering the arriving Context's trace context: + +- An export contains data from multiple requests, is a new root span instrumented? +- An export contains data from a single request, is a child span instrumented? + +The batching process may be configured to use client metadata as a +batching identifier +([batchprocessor](https://github.com/open-telemetry/opentelemetry-collector/issues/4544) +is complete, +[batch_sender](https://github.com/open-telemetry/opentelemetry-collector/issues/10825) +is incomplete). Considering the arriving Context's client metadata: + +- An export contains data from a single request, are there circumstances when the request's client metadata passes through? + +### Error handling + +A batching process determines what happens when some or all of a +request fails to be processed. Consider when an incoming request has +partially or completely failed: + +- Does the caller receive an error? 
+- Does the remaining portion of the request still export? +- Under what conditions is the error returned by a batching process retryable? + +### Concurrency handling + +Here are some questions about concurrency in the batching process. +Consider what happens when there is more than one batch of data +available to send: + +- Does the batching process wait for one batch to complete before sending another? +- Does the batching process use a caller's goroutine to export, or can it create its own? +- Is there any limit on the number of concurrent exports? + +## Proposed Requirements + +The questions posed above are meant to help us identify areas where +the two batching processes are either inconsisent with each other or +inconsistent with the goals of the project. + From fbe4f995cbe585efb13a990935abdf13359dcc57 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Mon, 16 Dec 2024 15:03:40 -0800 Subject: [PATCH 2/7] Complete proposal. --- docs/rfcs/batching-process-design.md | 274 +++++++++++++++++++-------- 1 file changed, 193 insertions(+), 81 deletions(-) diff --git a/docs/rfcs/batching-process-design.md b/docs/rfcs/batching-process-design.md index 4a03e8400c3..f1d8613b816 100644 --- a/docs/rfcs/batching-process-design.md +++ b/docs/rfcs/batching-process-design.md @@ -1,4 +1,4 @@ -# Error transmission through a batching processor with concurrency +# Error transmission through a batching processor with concurrency Establish normative guidelines for components that batch telemetry to follow so that the batchprocessor and exporterhelper-based batch_sender behave in similar ways with good defaults. @@ -12,14 +12,11 @@ Third, to establish how batching processors should handle incoming context deadl ## Explanation -Here, "batching process" refers to specifically the batch processor +Here, "batching process" refers to specifically the batch processor (i.e., a processor component) or the exporterhelper's batch_sender -(i.e., part of an exporter component). 
-Both are core components, and it is an explicit goal that these two +(i.e., part of an exporter component). +Both are core componentry, and it is an explicit goal that these two options for batching cover a wide range of requirements. -We prefer to improve the core batching-process components instead -of introduce new components, in order to achieve desired batching -behavior. We use the term "request" as a signal-independent descriptor of the concrete data types used in Collector APIs (e.g., a `plog.Logs`, @@ -53,82 +50,197 @@ Today, both batching processes restrict size configuration to an item count, [however the desire to use request count and byte count (in some encoding) is well recognized](https://github.com/open-telemetry/opentelemetry-collector/issues/9462). -### Sequence of Events +### Batch processor logic: existing + +The batch processor operates over Pipeline data for both its input +and output. It takes advantage of the top-level `repeated` portion +of the OpenTelemetry data model to combine many requests into one +request. Here is the logical sequence of events that +takes place as a request makes its way through a batch processor: + +A. The request arrives when the preceding component calls `Consume()` + on this component with Context and data. +B. The request is placed into a channel. +C. The request is removed from a channel by a background thread and + entered into the pending state. +D. The batching process calls export one or more times containing data + from the original request, receiving responses possibly with errors + which are logged. +E. The `Consume()` call returns control to the caller. + +In the batch processor, the request producer performs step A and B, +and then it skips to F. Since the processor returns before the export +completes, it always returns success. We refer to this behavior as +"error suppression". The background thread, independently, observes steps +C, D, and E, after which errors (if any) are logged. 
+ +The batch procsesor performs steps D and E multiple times in sequence, with +never more than one export at a time. Effective concurrency is limited to +1 within the component, however this is usually alleviated by the use of +a "queue_sender" later in the pipeline. When the exporterhelper's queue sender +is enabled and the queue has space, it immediately returns success (a +form of error suppression), which allows the batch processor to issue multiple +batches at a time. + +The batch processor does not consult the incoming request's Context deadline +or allow request context cancellation to interrupt step B. Step D is executed +without a context deadline. + +Trace context is interrupted. By default, incoming gRPC metadata is not propagated. +To address the loss of metadata, the batch processor has been extended with a +`metadata_keys` configuration; with this option, independent batch processes +are constructed for each distinct combination of metadata key values. Per-group +metadata values are placed into the outgoing context. + +### Batch sender logic: existing + +The batch sender is a feature of the exporterhelper; it is an optional +sub-component situated after the queue sender, and it is used to compute +batches in the intended encoding used by the exporter. It follows a different +sequence of events compared to the processor: + +A. Check if there is a pending batch. +B. If there is no pending batch, it creates a new one and starts a timer. +C. Add the incoming data to the pending batch. +D. Send the batch to the exporter. +E. Wait for the batch-error. +F. Each caller returns the batch-error. + +Unlike the batch processor, errors are propagated, not suppressed. + +Trace context is interrupted. Outgoing requests have empty `client.Metadata`. + +The context deadline of the caller is not considered in step E. In step D, the +export is made without a context deadline; a subsequent timeout sender typically +configures a timeout for the export. 
+ +The pending batch is managed through a Golang interface, making it possible +to accumulate protocol-specific intermediate data. There are two specific +interfaces an exporter component provides: + +- Merge(): when request batches have no upper bound. In this case, the interface + produces single outputs. +- MergeSplit(): when there is a maximum size imposed. In this case, the interface + produces potentially more than one output request. + +Concurrency behavior varies. In the case where `MergeSplit()` is used, there is +a potential for multiple requests to emit from a single request. In this case, +steps D through F are executed repeatedly while there are more requests, meaning: + +1. Exports are synchronous and sequential. +2. An error causes aborting of subsequent parts of the request. + +### Queue sender logic: existing + +The queue sender provides key functionality that determines the overall behavior +of both batching components. When enabled, the queue sender will return success +to the caller as soon as the request is enqueued. In the background, it concurrently +exports requests in the queue using a configurable number of threads. + +It is worth evaluating the behavior of the queue sender with a persistent queue +and with an in-memory queue: + +- Persistent queue: In this case, the queue stores the request before returning + success. There is not a chance of data loss. +- In-memory queue: In this case, the queue acts as a form of error suppression. + Callers do not wait for the export to return, so there is a chance of data loss + in this configuration. + +The queue sender does not consider the caller's context deadline when it attempts +to enqueue the request. If the queue is full, the queue sender returns a queue-full +error immediately. + +### Feature matrix + +| Support area | Batch processor | Batch sender | Explanation | +| -----| -- | -- | -- | +| Merges requests | Yes | Yes | Does it merge smaller into larger batches? 
| +| Splits requests | Yes | Yes, however sequential | Does it split larger into smaller batches? | +| Cancellation | No | No | Does it respect caller cancellation? | +| Deadline | No | No | Does it set an outgoing deadline? | +| Metadata | Yes | No | Can it batch by metadata key value(s)? | +| Tracing | No | No | Instrumented for tracing? | +| Error transmission | No | Yes | Are export errors returned to callers? | +| Concurrency allowed | No | Yes | Does the component limit concurrency? | +| Independence | Yes | No | Are all data exported independently? | + +### Change proposal + +#### Batch processor: required + +The batch processor MUST be modified to achieve the following +outcomes: + +- Allow concurrent exports. When the processor has a batch of + data available to send, it will send the data immediately. +- Transmit errors back to callers. Callers will be blocked + while one or more requests are issued and wait on responses. +- Respect context cancellation. Callers will return control + to the pipeline when their context is cancelled. + +#### Batch sender: required + +The batch sender MUST be modified to achieve the following +outcomes: + +- Allow concurrent splitting. When multiple full-size requests + are produced from an input, they are exported concurrently and + independently. +- Recognize partial splitting. When a request is split, leaving + part of a request that is not full, it remains in the active + request (i.e., still pending). +- Respect key metadata. Implement the `metadata_keys` feature + supported by the batch processor. +- Respect context cancellation. Callers will return control + to the pipeline when their context is cancelled. -A batching process breaks the normal flow of Context through a -Collector pipeline. Here are the events that take place as a request -makes its way through a batching process: +### Open questions + +#### Batch trace context -A. 
The request arrives when the caller calls a `Consume()` or `Send()` - method on this component with Context and data. -B. The request is added to the currently-pending batch. -C. The batching process calls export one or more times containing data from the original request. -D. The batching process receives the response from the export(s), possibly with an error. -E. The `Consume()` or `Send()` call returns control to the caller, possibly with an error. +Should outgoing requests be instrumented by a trace Span linked to the incoming trace contexts? This document proposes yes, in +one of two ways: + +1. When an export corresponds with data for a single incoming + request, the request's original context is used as the parent. +2. When an export corresponds with data from multiple incoming + requests, the incoming trace contexts are linked with the new + root span. -The two batch processors execute these steps with different sequences. +#### Empty request handling + +How should a batching process handle requests that contain no +concrete items of data? [These requests may be seen as empty +containers](https://github.com/open-telemetry/opentelemetry-proto/issues/598), +for example, tracing requests with no spans, metric requests with +no metric data points, and logs requests with no log records. + +For a batching process, these requests can be problematic. If +request size is measured in item count, these "empty" requests +leave batch size unchanged, and could cause unbounded memory +growth. + +This document proposes a consistent treatment for empty requests: +batching processes should return immediate success, which is the +behavior of the batching processor currently. + +#### Outgoing context deadline -In the batch processor, we observe two independent sequences. One is -A ⮕ B ⮕ E: the request arrives, then is placed in a batch, then -returns success. The other is B ⮕ C ⮕ D: once in a batch, the request -is exported, then errors (if any) are logged. 
- -In the batch_sender, we observe a single sequence, A ⮕ B ⮕ C ⮕ D ⮕ E: -the request arrives, is placed in a batch, the batch is sent, the -response is received, and the caller returns the error (if any). - -To resolve the inconsistency, this document proposes to modify the -batch processor to use a single sequence, i.e., A ⮕ B ⮕ C ⮕ D ⮕ E. - -### Request handling - -There are a number of open questions related to the arriving request -and its context. - -- The arriving request has no items of telemetry. Does the batching process return success immediately? - -Consider the arriving deadline: - -- The arriving Context deadline has already expired. Does the batching process fail the request immediately? -- The arriving Context deadline expires while waiting for the export(s). What happens? - -Considering the arriving Context's trace context: - -- An export contains data from multiple requests, is a new root span instrumented? -- An export contains data from a single request, is a child span instrumented? - -The batching process may be configured to use client metadata as a -batching identifier -([batchprocessor](https://github.com/open-telemetry/opentelemetry-collector/issues/4544) -is complete, -[batch_sender](https://github.com/open-telemetry/opentelemetry-collector/issues/10825) -is incomplete). Considering the arriving Context's client metadata: - -- An export contains data from a single request, are there circumstances when the request's client metadata passes through? - -### Error handling - -A batching process determines what happens when some or all of a -request fails to be processed. Consider when an incoming request has -partially or completely failed: - -- Does the caller receive an error? -- Does the remaining portion of the request still export? -- Under what conditions is the error returned by a batching process retryable? - -### Concurrency handling - -Here are some questions about concurrency in the batching process. 
-Consider what happens when there is more than one batch of data -available to send: - -- Does the batching process wait for one batch to complete before sending another? -- Does the batching process use a caller's goroutine to export, or can it create its own? -- Is there any limit on the number of concurrent exports? - -## Proposed Requirements - -The questions posed above are meant to help us identify areas where -the two batching processes are either inconsisent with each other or -inconsistent with the goals of the project. +Should the batching process set an outgoing context deadline +to convey the maximum amount of time to consider processing +the request? +Neither existing component uses an outgoing context deadline. +This could lead to resource exhaustion, in some cases, by +allowing requests to remain pending indefinitely. + +On the one hand, this support may not be necessary, since in +most cases the batching process is followed by an exporter, which +includes a timeout sender option, capable of ensuring a default +timeout. + +On the other hand, the batching process knows the callers' +actual deadlines, and it could even use this information to +form batches. + +This proposal makes no specific recommendation. From cb0e1056c0cbedd0ba31cbad6ad2de826c1e56f7 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Mon, 16 Dec 2024 16:03:58 -0800 Subject: [PATCH 3/7] More edits. 
--- docs/rfcs/batching-process-design.md | 87 ++++++++++++++++++++-------- 1 file changed, 64 insertions(+), 23 deletions(-) diff --git a/docs/rfcs/batching-process-design.md b/docs/rfcs/batching-process-design.md index f1d8613b816..91fbd87f1d2 100644 --- a/docs/rfcs/batching-process-design.md +++ b/docs/rfcs/batching-process-design.md @@ -1,6 +1,8 @@ # Error transmission through a batching processor with concurrency -Establish normative guidelines for components that batch telemetry to follow so that the batchprocessor and exporterhelper-based batch_sender behave in similar ways with good defaults. +Establish normative guidelines for components that batch telemetry to follow +so that the batchprocessor and exporterhelper-based batch_sender behave in +similar ways with good defaults. ## Motivation @@ -161,11 +163,32 @@ error immediately. | Metadata | Yes | No | Can it batch by metadata key value(s)? | | Tracing | No | No | Instrumented for tracing? | | Error transmission | No | Yes | Are export errors returned to callers? | -| Concurrency allowed | No | Yes | Does the component limit concurrency? | +| Concurrency enabled | No | Merge: Yes; MergeSplit: No | Does the component limit concurrency? | | Independence | Yes | No | Are all data exported independently? | ### Change proposal +#### Queue sender defaults + +A central aspect of this presentation focuses on the queue sender, +which along with one of the batching processors defines the behavior +of most OpenTelemetry pipelines. Therefore, the default queue sender +behavior is critical. + +This proposal argues in favor of the user, who does not want default +behavior that leads to data loss. This requires one or more changes +in the exporterhelper: + +1. Disable the queue sender by default. In this configuration, requests + become synchronous through the batch processor, and responses are delayed + until whole batches covering the caller's input have been processed. 
+ Assuming the other requirements in this proposal are also carried out, + this means that pipelines will block each caller awaiting end-to-end + success, by default. +2. Prevent start-up without a persistent queue configured; users would + have to opt-in to the in-memory queue if they want to return success + with no persistent store and not await end-to-end success. + #### Batch processor: required The batch processor MUST be modified to achieve the following @@ -198,14 +221,15 @@ outcomes: #### Batch trace context -Should outgoing requests be instrumented by a trace Span linked to the incoming trace contexts? This document proposes yes, in -one of two ways: +Should outgoing requests be instrumented by a trace Span linked +to the incoming trace contexts? This document proposes yes, as +follows. -1. When an export corresponds with data for a single incoming - request, the request's original context is used as the parent. -2. When an export corresponds with data from multiple incoming - requests, the incoming trace contexts are linked with the new - root span. +1. Create a new root span at the moment each batch starts. +2. For each new request incorporated into the batch, call + `AddLink()` on the pair of spans. +3. Use the associated root span as the context for each + export call. #### Empty request handling @@ -217,12 +241,12 @@ no metric data points, and logs requests with no log records. For a batching process, these requests can be problematic. If request size is measured in item count, these "empty" requests -leave batch size unchanged, and could cause unbounded memory -growth. +leave batch size unchanged, therefore they can cause unbounded +memory growth. This document proposes a consistent treatment for empty requests: batching processes should return immediate success, which is the -behavior of the batching processor currently. +current behavior of the batch processor. 
#### Outgoing context deadline @@ -230,17 +254,34 @@ Should the batching process set an outgoing context deadline to convey the maximum amount of time to consider processing the request? -Neither existing component uses an outgoing context deadline. -This could lead to resource exhaustion, in some cases, by -allowing requests to remain pending indefinitely. +This and several related questions are broken out into a +companion RFC. + +#### Prototypes + +##### Concurrent batch processor + +The OpenTelemetry Protocol with Apache Arrow project's ]`concurrentbatch` +processor](https://github.com/open-telemetry/otel-arrow/blob/main/collector/processor/concurrentbatchprocessor/README.md) +is derived from the core batch processor. It has added solutions for +the problems outlined above, including error propagation, trace +propagation, and concurrency. + +This code can be contributed back to the core with a series of minor +changes, some having an associated feature gate. + +A. Add tracing support, as described above. +B. Make "early return" a new behavior, feature gate from on (current behavior) to off (desired behavior); otherwise, wait for the response and return the error. +C. Make "concurrency_limit" a new setting measuring concurrency added by this component, feature gate from 1 (current behavior) to limited (e.g., 10, 100) + +Note that "concurrency_limit" is defined in terms that do not +count the incoming concurrency, as it is compulsory. A limit of -On the one hand, this support may not be necessary, since in -most cases the batching process is followed by an exporter, which -includes a timeout sender option, capable of ensuring a default -timeout. +##### Batch sender -On the other hand, the batching process knows the callers' -actual deadlines, and it could even use this information to -form batches. +This has not been prototyped. The exporterhelper code can be modified, +for the batch sender to conform with this proposal. 
-This proposal makes no specific recommendation. +A. Add tracing support, as described above. +B. Make "concurrency_limit" a new setting measuring concurrency added by this component, feature gate from 0 (current behavior) to limited (e.g., 10, 100) +C. Add metadata keys support, identical to the batch processor. From e582ef5fa8a31dccbdb54c5a8227f6db508d2b86 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Mon, 16 Dec 2024 16:39:49 -0800 Subject: [PATCH 4/7] Lint --- docs/rfcs/batching-process-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/batching-process-design.md b/docs/rfcs/batching-process-design.md index 91fbd87f1d2..1ec6be411c3 100644 --- a/docs/rfcs/batching-process-design.md +++ b/docs/rfcs/batching-process-design.md @@ -261,7 +261,7 @@ companion RFC. ##### Concurrent batch processor -The OpenTelemetry Protocol with Apache Arrow project's ]`concurrentbatch` +The OpenTelemetry Protocol with Apache Arrow project's [`concurrentbatch` processor](https://github.com/open-telemetry/otel-arrow/blob/main/collector/processor/concurrentbatchprocessor/README.md) is derived from the core batch processor. It has added solutions for the problems outlined above, including error propagation, trace From 8f7c8285a9de1f34eac520bdde0cd2a6197772cd Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 19 Dec 2024 09:51:37 -0800 Subject: [PATCH 5/7] Edits, update companion RFC PR num. 
--- docs/rfcs/batching-process-design.md | 71 +++++++++++++--------------- 1 file changed, 33 insertions(+), 38 deletions(-) diff --git a/docs/rfcs/batching-process-design.md b/docs/rfcs/batching-process-design.md index 1ec6be411c3..0df108be185 100644 --- a/docs/rfcs/batching-process-design.md +++ b/docs/rfcs/batching-process-design.md @@ -1,8 +1,7 @@ # Error transmission through a batching processor with concurrency Establish normative guidelines for components that batch telemetry to follow -so that the batchprocessor and exporterhelper-based batch_sender behave in -similar ways with good defaults. +so that the batchprocessor and exporterhelper-based batch_sender have similar behavior. ## Motivation @@ -60,32 +59,31 @@ of the OpenTelemetry data model to combine many requests into one request. Here is the logical sequence of events that takes place as a request makes its way through a batch processor: -A. The request arrives when the preceding component calls `Consume()` +1. The request arrives when the preceding component calls `Consume()` on this component with Context and data. -B. The request is placed into a channel. -C. The request is removed from a channel by a background thread and +1. The request is placed into a channel. +1. The request is removed from a channel by a background thread and entered into the pending state. -D. The batching process calls export one or more times containing data +1. The batching process calls export one or more times containing data from the original request, receiving responses possibly with errors which are logged. -E. The `Consume()` call returns control to the caller. +1. The `Consume()` call returns control to the caller. -In the batch processor, the request producer performs step A and B, -and then it skips to F. Since the processor returns before the export -completes, it always returns success. 
We refer to this behavior as +In the batch processor, the request producer performs step 1 and 2, +and then it skips to 5, returning success before the export completes. We refer to this behavior as "error suppression". The background thread, independently, observes steps -C, D, and E, after which errors (if any) are logged. +3 and 4, after which any errors are logged. -The batch procsesor performs steps D and E multiple times in sequence, with +The batch procsesor performs steps 4 multiple times in sequence, with never more than one export at a time. Effective concurrency is limited to 1 within the component, however this is usually alleviated by the use of -a "queue_sender" later in the pipeline. When the exporterhelper's queue sender +a Queue sender, later in the pipeline. When the exporterhelper's Queue sender is enabled and the queue has space, it immediately returns success (a form of error suppression), which allows the batch processor to issue multiple -batches at a time. +batches at a time into the queue. The batch processor does not consult the incoming request's Context deadline -or allow request context cancellation to interrupt step B. Step D is executed +or allow request context cancellation to interrupt step 2. Step 4 is executed without a context deadline. Trace context is interrupted. By default, incoming gRPC metadata is not propagated. @@ -101,18 +99,18 @@ sub-component situated after the queue sender, and it is used to compute batches in the intended encoding used by the exporter. It follows a different sequence of events compared to the processor: -A. Check if there is a pending batch. -B. If there is no pending batch, it creates a new one and starts a timer. -C. Add the incoming data to the pending batch. -D. Send the batch to the exporter. -E. Wait for the batch-error. -F. Each caller returns the batch-error. +1. Check if there is a pending batch. +1. If there is no pending batch, it creates a new one and starts a timer. +1. 
Add the incoming data to the pending batch. +1. Send the batch to the exporter. +1. Wait for the batch error. +1. Each caller returns the batch error. -Unlike the batch processor, errors are propagated, not suppressed. +Unlike the batch processor, errors are transmitted back to the callers, not suppressed. Trace context is interrupted. Outgoing requests have empty `client.Metadata`. -The context deadline of the caller is not considered in step E. In step D, the +The context deadline of the caller is not considered in step 5. In step 4, the export is made without a context deadline; a subsequent timeout sender typically configures a timeout for the export. @@ -127,7 +125,7 @@ interfaces an exporter component provides: Concurrency behavior varies. In the case where `MergeSplit()` is used, there is a potential for multiple requests to be emitted from a single request. In this case, -steps D through F are executed repeatedly while there are more requests, meaning: +steps 4 and 5 are executed repeatedly while there are more requests, meaning: 1. Exports are synchronous and sequential. 2. An error causes aborting of subsequent parts of the request. @@ -137,13 +135,13 @@ steps D through F are executed repeatedly while there are more requests, meaning The queue sender provides key functionality that determines the overall behavior of both batching components. When enabled, the queue sender will return success to the caller as soon as the request is enqueued. In the background, it concurrently -exports requests in the queue using a configurable number of threads. +exports requests in the queue using a configurable number of execution threads. It is worth evaluating the behavior of the queue sender with a persistent queue and with an in-memory queue: - Persistent queue: In this case, the queue stores the request before returning - success. There is not a chance of data loss. + success. This component is not directly responsible for data loss. 
- In-memory queue: In this case, the queue acts as a form of error suppression. Callers do not wait for the export to return, so there is a chance of data loss in this configuration. @@ -163,7 +161,7 @@ error immediately. | Metadata | Yes | No | Can it batch by metadata key value(s)? | | Tracing | No | No | Instrumented for tracing? | | Error transmission | No | Yes | Are export errors returned to callers? | -| Concurrency enabled | No | Merge: Yes; MergeSplit: No | Does the component limit concurrency? | +| Concurrency enabled | No | Merge: Yes;
MergeSplit: No | Does the component allow concurrent export? | | Independence | Yes | No | Are all data exported independently? | ### Change proposal @@ -194,7 +192,7 @@ in the exporterhelper: The batch processor MUST be modified to achieve the following outcomes: -- Allow concurrent exports. When the processor has a batch of +- Allow concurrent exports. When the processor has a complete batch of data available to send, it will send the data immediately. - Transmit errors back to callers. Callers will be blocked while one or more requests are issued and wait on responses. @@ -255,7 +253,7 @@ to convey the maximum amount of time to consider processing the request? This and several related questions are broken out into a -companion RFC. +[companion RFC](https://github.com/open-telemetry/opentelemetry-collector/pull/11948). #### Prototypes @@ -270,18 +268,15 @@ propagation, and concurrency. This code can be contributed back to the core with a series of minor changes, some having an associated feature gate. -A. Add tracing support, as described above. -B. Make "early return" a new behavior, feature gate from on (current behavior) to off (desired behavior); otherwise, wait for the response and return the error. -C. Make "concurrency_limit" a new setting measuring concurrency added by this component, feature gate from 1 (current behavior) to limited (e.g., 10, 100) - -Note that "concurrency_limit" is defined in terms that do not -count the incoming concurrency, as it is compulsory. A limit of +1. Add tracing support, as described above. +1. Make "early return" a new feature gate, transitioning from on (current behavior) to off (desired behavior); when early return is enabled, suppress errors and return immediately; otherwise, wait for the response and return the error. +1. Make "concurrency_limit" a new setting measuring the concurrency added by this component, with a feature gate transitioning from 1 (current behavior) to a higher limit (e.g., 10, 100). ##### Batch sender This has not been prototyped. 
The exporterhelper code can be modified so that the batch sender conforms with this proposal. -A. Add tracing support, as described above. -B. Make "concurrency_limit" a new setting measuring concurrency added by this component, feature gate from 0 (current behavior) to limited (e.g., 10, 100) -C. Add metadata keys support, identical to the batch processor. +1. Add tracing support, as described above. +1. Make "concurrency_limit" a new setting measuring the concurrency added by this component, with a feature gate transitioning from 0 (current behavior) to a higher limit (e.g., 10, 100). +1. Add metadata keys support, identical to the batch processor. From 5cb7d682a2a5d362d05d6d2758c919d8d642c781 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 19 Dec 2024 10:21:26 -0800 Subject: [PATCH 6/7] Chlog --- .chloggen/rfc-error-transmission.yaml | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 .chloggen/rfc-error-transmission.yaml diff --git a/.chloggen/rfc-error-transmission.yaml b/.chloggen/rfc-error-transmission.yaml new file mode 100644 index 00000000000..b9874146bc3 --- /dev/null +++ b/.chloggen/rfc-error-transmission.yaml @@ -0,0 +1,25 @@ +# Use this changelog template to create an entry for release notes. + +# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix' +change_type: enhancement + +# The name of the component, or a single word describing the area of concern, (e.g. otlpreceiver) +component: batchprocessor, batch_sender + +# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`). +note: Add an RFC on consistent and ideal behavior for batchers. + +# One or more tracking issues or pull requests related to the change +issues: [11308] + +# (Optional) One or more lines of additional information to render under the primary note. +# These lines will be padded with 2 spaces and then inserted directly into the document. +# Use pipe (|) for multiline entries. 
+subtext: + +# Optional: The change log or logs in which this entry should be included. +# e.g. '[user]' or '[user, api]' +# Include 'user' if the change is relevant to end users. +# Include 'api' if there is a change to a library API. +# Default: '[user]' +change_logs: [user] \ No newline at end of file From 29f493f20be19e74fbffd3a0b0983821b247c067 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 19 Dec 2024 10:27:23 -0800 Subject: [PATCH 7/7] Typo --- docs/rfcs/batching-process-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/batching-process-design.md b/docs/rfcs/batching-process-design.md index 0df108be185..9830373e8c7 100644 --- a/docs/rfcs/batching-process-design.md +++ b/docs/rfcs/batching-process-design.md @@ -5,7 +5,7 @@ so that the batchprocessor and exporterhelper-based batch_sender have similar be ## Motivation -We are motivated, first, to have consistent behavior across two forms of batching process: (1) `batchprocessor`, and (2) `exporterhelper/internal/batch_sender`. Today, these two core components exhibit diferent behaviors. +We are motivated, first, to have consistent behavior across two forms of batching process: (1) `batchprocessor`, and (2) `exporterhelper/internal/batch_sender`. Today, these two core components exhibit different behaviors. Second, to establish conditions and requirements for error transmission, tracing instrumentation, and concurrency from these components.