From a96fa000f2f667bd279f44954e1e4f407c328ad9 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Thu, 28 Nov 2024 15:50:19 +0100
Subject: [PATCH 01/10] First draft of component observability requirements

---
 docs/component-stability.md | 68 +++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index bf1cbfbd05e..f8a4f473dff 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -66,6 +66,74 @@ Stable components MUST be compatible between minor versions unless critical secu
 component owner MUST provide a migration path and a reasonable time frame for users to upgrade. The same rules from beta
 components apply to stable when it comes to configuration changes.
 
+#### Observability requirements
+
+Stable components should emit enough internal telemetry to let users detect errors, as well as data
+loss and performance issues inside the component, and to help diagnose them if possible.
+
+The internal telemetry of a stable component should allow observing the following:
+
+1. How much data the component receives.
+
+   For receivers, this could be a metric counting requests, received bytes, scraping attempts, etc.
+
+   For other components, this would typically be the number of items (log records, metric points,
+   spans) received through the `Consumer` API.
+
+2. How much data the component outputs.
+
+   For exporters, this could be a metric counting requests, sent bytes, etc.
+
+   For other components, this would typically be the number of items forwarded through the `Consumer`
+   API.
+
+3. How much data is dropped because of errors.
+
+   For receivers, this could include a metric counting payloads that could not be parsed in.
+
+   For receivers and exporters, this could include a metric counting requests that failed because
+   of network errors.
+
+   The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
+   this should either:
+   - only include errors internal to the component, or;
+   - allow distinguishing said errors from ones originating in an external service, or propagated
+     from downstream Collector components.
+
+4. Details for error conditions.
+
+   This could be in the form of logs or spans detailing the reason for an error. As much detail as
+   necessary should be provided to ease debugging. Processed signal data should not be included for
+   security and privacy reasons.
+
+5. Other discrepancies between input and output. This may include:
+
+   - How much data is dropped as part of normal operation (eg. filtered out).
+
+   - How much data is created by the component.
+
+   - How much data is currently held by the component (eg. an UpDownCounter keeping track of the
+     size of an internal queue).
+
+6. Processing performance.
+
+   This could be a histogram of end-to-end component latency, measured as the time between external
+   requests or `Consumer` API calls.
+
+When measuring amounts of data, counting data items (spans, log records, metric points) is
+recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
+reliable indicator of the absence of data. In any case, the type of all metrics should be properly
+documented (not "1").
+
+If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
+scraping, validation, processing, etc.), it is recommended to define additional attributes to help
+diagnose the specific source of the discrepancy, or to define different signals for each.
+
+Note that some of this internal telemetry may already be provided by pipeline auto-instrumentation,
+or helpers modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
+`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
+implemented manually.
+
 ### Deprecated
 
 The component is planned to be removed in a future version and no further support will be provided. Note that new issues will likely not be worked on. When a component enters "deprecated" mode, it is expected to exist for at least two minor releases. See the component's readme file for more details on when a component will cease to exist.

From e1a161859c70b97f06d27afafc1e550ed0e6ee16 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:10:19 +0100
Subject: [PATCH 02/10] Wording: "type" should be "unit"

Co-authored-by: Pablo Baeyens
---
 docs/component-stability.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index f8a4f473dff..6441b5ee0fe 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -122,8 +122,7 @@ The internal telemetry of a stable component should allow observing the followin
 
 When measuring amounts of data, counting data items (spans, log records, metric points) is
 recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
-reliable indicator of the absence of data. In any case, the type of all metrics should be properly
-documented (not "1").
+reliable indicator of the absence of data. In any case, all metrics should have a defined unit (not "1").
 
 If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help

From 63fdd48815bf8d7d1c52d0920339166df0887d9f Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:44:23 +0100
Subject: [PATCH 03/10] Fixed formatting, added note about normativity, added spans as an option for measuring performance.
---
 docs/component-stability.md | 77 ++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 30 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 6441b5ee0fe..8b4a34318ff 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -71,54 +71,76 @@ components apply to stable when it comes to configuration changes.
 Stable components should emit enough internal telemetry to let users detect errors, as well as data
 loss and performance issues inside the component, and to help diagnose them if possible.
 
-The internal telemetry of a stable component should allow observing the following:
+This section defines the categories of values that should be observable through internal telemetry
+for all stable pipeline components. (Extensions are not covered.)
+
+**Note:** The following categories MUST all be covered, unless justification is given as to why
+one may not be applicable. However, for each category, many reasonable implementations are possible
+as long as the relevant information can be derived from the emitted telemetry; everything after the
+basic category description is a recommendation, and is not normative.
+
+**Note:** Some of this internal telemetry may already be provided by pipeline auto-instrumentation
+or helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
+`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
+implemented manually.
+
+**Definition:** In the following, an "item" refers generically to a single log record, metric event,
+or span.
+
+The internal telemetry of a stable pipeline component should allow observing the following:
 
 1. How much data the component receives.
 
-   For receivers, this could be a metric counting requests, received bytes, scraping attempts, etc.
+    For receivers, this could be a metric counting requests, received bytes, scraping attempts, etc.
 
-   For other components, this would typically be the number of items (log records, metric points,
-   spans) received through the `Consumer` API.
+    For other components, this would typically be the number of items received through the
+    `Consumer` API.
 
 2. How much data the component outputs.
 
-   For exporters, this could be a metric counting requests, sent bytes, etc.
+    For exporters, this could be a metric counting requests, sent bytes, etc.
 
-   For other components, this would typically be the number of items forwarded through the `Consumer`
-   API.
+    For other components, this would typically be the number of items forwarded to the next
+    component through the `Consumer` API.
 
 3. How much data is dropped because of errors.
 
-   For receivers, this could include a metric counting payloads that could not be parsed in.
-
-   For receivers and exporters, this could include a metric counting requests that failed because
-   of network errors.
+    For receivers, this could include a metric counting payloads that could not be parsed in.
+
+    For receivers and exporters that make use of the network, this could include a metric counting
+    requests that failed because of network errors.
 
-   The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
-   this should either:
-   - only include errors internal to the component, or;
-   - allow distinguishing said errors from ones originating in an external service, or propagated
-     from downstream Collector components.
+    The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
+    this should either:
+    - only include errors internal to the component, or;
+    - allow distinguishing said errors from ones originating in an external service, or propagated
+      from downstream Collector components.
 
 4. Details for error conditions.
 
-   This could be in the form of logs or spans detailing the reason for an error. As much detail as
-   necessary should be provided to ease debugging. Processed signal data should not be included for
-   security and privacy reasons.
+    This could be in the form of logs or spans detailing the reason for an error. As much detail as
+    necessary should be provided to ease debugging. Processed signal data should not be included for
+    security and privacy reasons.
 
 5. Other discrepancies between input and output. This may include:
 
-   - How much data is dropped as part of normal operation (eg. filtered out).
+    - How much data is dropped as part of normal operation (eg. filtered out).
 
-   - How much data is created by the component.
+    - How much data is created by the component.
 
-   - How much data is currently held by the component (eg. an UpDownCounter keeping track of the
-     size of an internal queue).
+    - How much data is currently held by the component (eg. an UpDownCounter keeping track of the
+      size of an internal queue).
 
 6. Processing performance.
 
-   This could be a histogram of end-to-end component latency, measured as the time between external
-   requests or `Consumer` API calls.
+    This could be spans for each operation of the component, or a histogram of end-to-end component
+    latency.
+
+    The goal is to be able to easily pinpoint the source of latency in the Collector pipeline, so
+    this should either:
+    - only include time spent processing inside the component, or;
+    - allow distinguishing this latency from that caused by an external service, or from time spent
+      in downstream Collector components.
 
 When measuring amounts of data, counting data items (spans, log records, metric points) is
 recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
@@ -128,11 +150,6 @@ If data can be dropped/created/held at multiple distinct points in a component's
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help
 diagnose the specific source of the discrepancy, or to define different signals for each.
 
-Note that some of this internal telemetry may already be provided by pipeline auto-instrumentation,
-or helpers modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
-`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
-implemented manually.
-
 ### Deprecated
 
 The component is planned to be removed in a future version and no further support will be provided. Note that new issues will likely not be worked on. When a component enters "deprecated" mode, it is expected to exist for at least two minor releases. See the component's readme file for more details on when a component will cease to exist.

From b22da342ef2d4a75e820cb201594982ea24b977a Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:49:40 +0100
Subject: [PATCH 04/10] Explicitly allow telemetry not in the list

---
 docs/component-stability.md | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 8b4a34318ff..1ae67bcab08 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -74,15 +74,19 @@ loss and performance issues inside the component, and to help diagnose them if p
 This section defines the categories of values that should be observable through internal telemetry
 for all stable pipeline components. (Extensions are not covered.)
 
-**Note:** The following categories MUST all be covered, unless justification is given as to why
-one may not be applicable. However, for each category, many reasonable implementations are possible
-as long as the relevant information can be derived from the emitted telemetry; everything after the
-basic category description is a recommendation, and is not normative.
-
-**Note:** Some of this internal telemetry may already be provided by pipeline auto-instrumentation
-or helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
-`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
-implemented manually.
+**Notes:**
+- The following categories MUST all be covered, unless justification is given as to why
+one may not be applicable.
+
+- However, for each category, many reasonable implementations are possible, as long as the relevant
+information can be derived from the emitted telemetry; everything after the basic category
+description is a recommendation, and is not normative.
+
+- Of course, a component may define additional internal telemetry which is not in this list.
+
+- Some of this internal telemetry may already be provided by pipeline auto-instrumentation or
+helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or `exporterhelper`).
+Please check the documentation to verify which parts, if any, need to be implemented manually.
 
 **Definition:** In the following, an "item" refers generically to a single log record, metric event,
 or span.

From d726b6053be7cfc6fdabf085f4718c848e026d96 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:54:56 +0100
Subject: [PATCH 05/10] Minor rewording

---
 docs/component-stability.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 1ae67bcab08..0ac4faba67e 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -126,7 +126,7 @@ The internal telemetry of a stable pipeline component should allow observing the
     necessary should be provided to ease debugging. Processed signal data should not be included for
     security and privacy reasons.
 
-5. Other discrepancies between input and output. This may include:
+5. Other possible discrepancies between input and output, if any. This may include:
 
     - How much data is dropped as part of normal operation (eg. filtered out).
 
@@ -137,8 +137,8 @@ The internal telemetry of a stable pipeline component should allow observing the
 
 6. Processing performance.
 
-    This could be spans for each operation of the component, or a histogram of end-to-end component
-    latency.
+    This could include spans for each operation of the component, or a histogram of end-to-end
+    component latency.
 
     The goal is to be able to easily pinpoint the source of latency in the Collector pipeline, so
     this should either:
@@ -146,9 +146,9 @@ The internal telemetry of a stable pipeline component should allow observing the
     - allow distinguishing this latency from that caused by an external service, or from time spent
       in downstream Collector components.
 
-When measuring amounts of data, counting data items (spans, log records, metric points) is
-recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
-reliable indicator of the absence of data. In any case, all metrics should have a defined unit (not "1").
+When measuring amounts of data, counting items is recommended. Where this can't easily be done, any
+relevant unit may be used, as long as zero is a reliable indicator of the absence of data. In any
+case, all metrics should have a defined unit (not "1").
 
 If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help

From 07bb8d00fa072b9a43540ae929a5c43b1b88d401 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Mon, 2 Dec 2024 16:15:59 +0100
Subject: [PATCH 06/10] Elaborate on extensions, minor rewording

---
 docs/component-stability.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 0ac4faba67e..39f1331f36b 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -71,8 +71,13 @@ components apply to stable when it comes to configuration changes.
 Stable components should emit enough internal telemetry to let users detect errors, as well as data
 loss and performance issues inside the component, and to help diagnose them if possible.
 
-This section defines the categories of values that should be observable through internal telemetry
-for all stable pipeline components. (Extensions are not covered.)
+For extension components, this means some way to monitor errors (for example through logs or span
+events), and some way to monitor performance (for example through spans or histograms). Because
+extensions can be so diverse, the details will be up to the component authors, and no further
+constraints are set out in this document.
+
+For pipeline components however, this section details the kinds of values that should be observable
+via internal telemetry for all stable components.
 
 **Notes:**
 - The following categories MUST all be covered, unless justification is given as to why
 one may not be applicable.
@@ -111,8 +116,8 @@ The internal telemetry of a stable pipeline component should allow observing the
 
     For receivers, this could include a metric counting payloads that could not be parsed in.
 
-    For receivers and exporters that make use of the network, this could include a metric counting
-    requests that failed because of network errors.
+    For receivers and exporters that interact with an external service, this could include a metric
+    counting requests that failed because of network errors.
 
     The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
     this should either:

From b59f5e679a904b6b9b49a9cc586eeccd34937f7d Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Tue, 3 Dec 2024 11:00:45 +0100
Subject: [PATCH 07/10] Change "metric event" back to "metric point".
---
 docs/component-stability.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 39f1331f36b..59732ff1e72 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -93,7 +93,7 @@ description is a recommendation, and is not normative.
 helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or `exporterhelper`).
 Please check the documentation to verify which parts, if any, need to be implemented manually.
 
-**Definition:** In the following, an "item" refers generically to a single log record, metric event,
+**Definition:** In the following, an "item" refers generically to a single log record, metric point,
 or span.
 
 The internal telemetry of a stable pipeline component should allow observing the following:

From fb0379fa7a0aafec5b4f00ae1e2cb98664ff7628 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Wed, 4 Dec 2024 13:41:15 +0100
Subject: [PATCH 08/10] Add queue capacity and span correlation to requirements

---
 docs/component-stability.md | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 59732ff1e72..7465a3d079b 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -137,8 +137,11 @@ The internal telemetry of a stable pipeline component should allow observing the
 
     - How much data is created by the component.
 
-    - How much data is currently held by the component (eg. an UpDownCounter keeping track of the
-      size of an internal queue).
+    - How much data is currently held by the component, and how much can be held if there is a fixed
+      capacity.
+
+      This would typically be an UpDownCounter keeping track of the size of an internal queue, along
+      with a gauge exposing the queue's capacity.
 
 6. Processing performance.
 
@@ -150,10 +153,18 @@ The internal telemetry of a stable pipeline component should allow observing the
     - only include time spent processing inside the component, or;
     - allow distinguishing this latency from that caused by an external service, or from time spent
       in downstream Collector components.
+
+    As an application of this, components which hold items in a queue should allow differentiating
+    between time spent processing a batch of data and time where the batch is simply waiting in the
+    queue.
+
+    If multiple spans are emitted for a given batch (before and after a queue for example), they
+    should either belong to the same trace, or have span links between them, so that they can be
+    correlated.
 
-When measuring amounts of data, counting items is recommended. Where this can't easily be done, any
-relevant unit may be used, as long as zero is a reliable indicator of the absence of data. In any
-case, all metrics should have a defined unit (not "1").
+When measuring amounts of data, it is recommended to use "items" as your unit of measure. Where this
+can't easily be done, any relevant unit may be used, as long as zero is a reliable indicator of the
+absence of data. In any case, all metrics should have a defined unit (not "1").
 
 If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help

From 98779c13c55052d6ecca54ea78cb72598a137f5b Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Wed, 11 Dec 2024 13:33:40 +0100
Subject: [PATCH 09/10] Use Github note block

---
 docs/component-stability.md | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 7465a3d079b..85825e7e60a 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -79,19 +79,17 @@ constraints are set out in this document.
 For pipeline components however, this section details the kinds of values that should be observable
 via internal telemetry for all stable components.
 
-**Notes:**
-- The following categories MUST all be covered, unless justification is given as to why
-one may not be applicable.
-
-- However, for each category, many reasonable implementations are possible, as long as the relevant
-information can be derived from the emitted telemetry; everything after the basic category
-description is a recommendation, and is not normative.
-
-- Of course, a component may define additional internal telemetry which is not in this list.
-
-- Some of this internal telemetry may already be provided by pipeline auto-instrumentation or
-helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or `exporterhelper`).
-Please check the documentation to verify which parts, if any, need to be implemented manually.
+> [!NOTE]
+> - The following categories MUST all be covered, unless justification is given as to why one may
+>   not be applicable.
+> - However, for each category, many reasonable implementations are possible, as long as the
+>   relevant information can be derived from the emitted telemetry; everything after the basic
+>   category description is a recommendation, and is not normative.
+> - Of course, a component may define additional internal telemetry which is not in this list.
+> - Some of this internal telemetry may already be provided by pipeline auto-instrumentation or
+>   helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
+>   `exporterhelper`). Please check the documentation to verify which parts, if any, need to be
+>   implemented manually.
 
 **Definition:** In the following, an "item" refers generically to a single log record, metric point,
 or span.

From c95af160b33452ac8a7d6348381cdfb1e2827eeb Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Wed, 11 Dec 2024 18:24:08 +0100
Subject: [PATCH 10/10] Harmonize with pipeline universal telemetry RFC

---
 docs/component-stability.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index a3c51c98f24..b8a477f19b4 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -117,6 +117,9 @@ The internal telemetry of a stable pipeline component should allow observing the
     For receivers and exporters that interact with an external service, this could include a metric
     counting requests that failed because of network errors.
 
+    For processors, this could be an `outcome` (`success` or `failure`) attribute on a "received
+    items" metric defined for point 1.
+
     The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
     this should either:
     - only include errors internal to the component, or;
@@ -164,6 +167,10 @@ When measuring amounts of data, it is recommended to use "items" as your unit of
 can't easily be done, any relevant unit may be used, as long as zero is a reliable indicator of the
 absence of data. In any case, all metrics should have a defined unit (not "1").
 
+All internal telemetry emitted by a component should have attributes identifying the specific
+component instance that it originates from. This should follow the same conventions as the
+[pipeline universal telemetry](rfcs/component-universal-telemetry.md).
+
 If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help
 diagnose the specific source of the discrepancy, or to define different signals for each.
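
The accounting described in points 1 and 3 of the patched section (a "received items" metric carrying an `outcome` attribute, plus an attribute identifying the component instance) could be modelled roughly as follows. This is only a standard-library sketch for illustration: `ItemCounter`, `process`, the metric name `received_items`, the unit string `{items}`, the `component` attribute key, and the validity check are all hypothetical and are not taken from the Collector or from any OpenTelemetry SDK.

```python
from collections import Counter


class ItemCounter:
    """Toy stand-in for a monotonic counter metric with attributes (hypothetical)."""

    def __init__(self, name, unit):
        self.name = name
        self.unit = unit  # an explicit unit, rather than the generic "1"
        self._values = Counter()

    def add(self, amount, **attributes):
        if amount < 0:
            raise ValueError("a monotonic counter cannot decrease")
        # Each distinct attribute set gets its own running total.
        self._values[tuple(sorted(attributes.items()))] += amount

    def value(self, **attributes):
        return self._values[tuple(sorted(attributes.items()))]

    def total(self):
        return sum(self._values.values())


# Assumed metric name and unit, for illustration only.
received = ItemCounter("received_items", unit="{items}")


def process(batch, component_id="processor/example"):
    """Forward valid items and record every received item with an outcome."""
    forwarded = []
    for item in batch:
        ok = isinstance(item, dict) and "body" in item  # hypothetical validity rule
        received.add(1, component=component_id, outcome="success" if ok else "failure")
        if ok:
            forwarded.append(item)
    return forwarded


out = process([{"body": "a"}, {"oops": 1}, {"body": "b"}])
dropped = received.value(component="processor/example", outcome="failure")
```

With this shape, the number of items dropped because of errors is read from the `failure` series of the same metric that counts input, so no separate "dropped" instrument is needed, and the `component` attribute keeps the loss attributable to one instance.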