
feat/re-allow multiple workers #36134

Open
wants to merge 2 commits into main

Conversation

@bmiguel-teixeira (Contributor) commented Nov 1, 2024

Description

This PR does the following:

  • Re-adds the ability to use multiple workers in this exporter, for two reasons:
  1. Out-of-order ingestion is no longer an issue now that it is fully supported in Prometheus. Nonetheless, I am setting the default number of workers to 1 to avoid out-of-order errors with vanilla Prometheus settings.

  2. With a single worker, a collector under heavy load becomes "blocking". Example: imagine a collector scraping many targets; with a slow Prometheus or an unstable network, a single worker can easily bottleneck the off-shipping if retries are enabled.

Link to tracking issue

N/A

Testing

Documentation

Docs auto-updated. README.md is now correct in its explanation of `num_consumers`, since it is no longer hard-coded to 1. Additional docs added.

@dashpole (Contributor) commented Nov 1, 2024

cc @jmichalek132 @ArthurSens

@dashpole added the "enhancement" (New feature or request) label on Nov 1, 2024
@dashpole (Contributor) commented Nov 1, 2024

I'll be OOO for the next few weeks, so I won't be able to review until then.

@ArthurSens (Member) left a comment

Thanks for the PR @bmiguel-teixeira! The contributions sound solid, but I'm a bit concerned that we're doing two separate things in a single PR here. Generally that's not good practice: if we have to revert one thing, we end up reverting more than what's needed, not to mention that it makes the PR bigger and a bit harder to review.

Could you choose only one new functionality for this PR and open another one for the other?


Regarding the changes, could you add tests for the telemetry added? Send PRW requests to a mock server and assert that the metric you've added increments as you expect.
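As a sketch of that test shape (hypothetical test name; the real exporter test scaffolding may differ), a mock PRW endpoint can simply count the requests it receives:

```go
package prometheusremotewriteexporter

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"testing"
)

func TestPushTelemetry(t *testing.T) {
	var received atomic.Int64
	// Mock PRW server: count every incoming remote-write request.
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		received.Add(1)
		w.WriteHeader(http.StatusNoContent)
	}))
	defer server.Close()

	// Configure the exporter with server.URL as the endpoint, push a
	// batch of metrics, then assert that the new telemetry counter
	// equals received.Load(). (Exporter setup omitted for brevity.)
}
```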

For allowing multiple workers, it would be nice if we added extra documentation making it clear that out-of-order ingestion needs to be enabled in Prometheus for this to work :)

@bmiguel-teixeira (Contributor Author) commented:

Sure. I will open a dedicated PR with the additional telemetry and keep the queue changes in this one, which already has context.

@bmiguel-teixeira (Contributor Author) commented:

hi @ArthurSens

Removed the additional telemetry, to be added in a secondary PR. Also added a bit of docs to explain the toggle and its use case. Please take a look.

Cheers

@bmiguel-teixeira changed the title from "feat/add outbound metrics and re-allow multiple workers" to "feat/re-allow multiple workers" on Nov 13, 2024
@ArthurSens (Member) left a comment

Well, the code is simple, so it does LGTM, but I'm struggling to test this.

Do you have any examples of apps/configs that will generate out-of-order datapoints? All the attempts I've tried deliver things in order, so I can't be sure this is working as expected 😅

exporter/prometheusremotewriteexporter/README.md (review thread, outdated, resolved)
@ArthurSens (Member) commented:

(there's a linting failure)

@bmiguel-teixeira (Contributor Author) commented:

Hi @ArthurSens

Just submitted your recommendation to fix the spelling issues.

Regarding testing and simulating the out-of-order issues locally, here is my setup.

Prometheus Config

```yaml
global:
  scrape_interval: 1s
  evaluation_interval: 1s

#storage:
#  tsdb:
#    out_of_order_time_window: 10m
```

Otel Config

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'node-1'
        scrape_interval: 1s
        static_configs:
          - targets: ['127.0.0.1:8081']

exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      num_consumers: XXX
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 5s
      max_elapsed_time: 30s

service:
  telemetry:
    logs:
      level: "DEBUG"

  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [prometheusremotewrite]
```

Scenario 1 (Vanilla)

  • Prometheus with NO out-of-order window.
  • PrometheusRemoteWrite with 1 consumer (or queue disabled)
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors
  4. Restart Prometheus (ensure you still have the old data)

Outcome: Metrics are reingested, in order, with the single worker. ALL GOOD.

Scenario 2 (PrometheusRemoteWrite 5 Consumers + Vanilla Prometheus)

  • Prometheus with NO out-of-order window.
  • PrometheusRemoteWrite with 5 consumers
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have the old data)

Outcome: OutOfOrder errors in Prometheus after boot-up, since samples will be retried in no specific order.

Scenario 3 (PrometheusRemoteWrite 5 Consumers + Prometheus with OutOfOrder Window)

  • Prometheus WITH out-of-order window set to 10m
  • PrometheusRemoteWrite with 5 consumers
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have the old data)

Outcome: No OutOfOrder issues; even though we have multiple workers, Prometheus can accept mixed/old samples and update the TSDB up to 10 minutes "ago".
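For Scenario 3, this just means uncommenting the `storage` block from the Prometheus config above:

```yaml
storage:
  tsdb:
    out_of_order_time_window: 10m
```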

Let me know if you need any more info.

Cheers.

@ArthurSens (Member) left a comment

Perfect; thank you for the instructions! I just tested it, and it works perfectly. There's just one CI failure that we need to fix before approving here.

Could you run `make generate` and commit the changes?

@bmiguel-teixeira (Contributor Author) commented:

@ArthurSens All done!

@@ -87,6 +90,10 @@ var _ component.Config = (*Config)(nil)

```go
// Validate checks if the exporter configuration is valid
func (cfg *Config) Validate() error {
	if cfg.MaxBatchRequestParallelism <= 0 {
```
Contributor:

Seems like 0 should also not be valid?

Contributor Author:

yeah. nice catch

Contributor Author:

done

```go
if enableMultipleWorkersFeatureGate.IsEnabled() {
	concurrency = cfg.MaxBatchRequestParallelism
}
```

Contributor:

It would be nice to always use cfg.MaxBatchRequestParallelism if it is set by the user. That way, existing users can migrate from RemoteWriteQueue.NumConsumers to cfg.MaxBatchRequestParallelism right away. If the feature gate is disabled and a user has set NumConsumers to a non-default value, it would be nice to emit a warning instructing them to migrate. WDYT?
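A sketch of what that migration warning might look like in the factory (hypothetical wiring: `defaultNumConsumers` and the message text are made up, and `set.Logger` assumes the collector's usual exporter settings):

```go
if !enableMultipleWorkersFeatureGate.IsEnabled() &&
	cfg.MaxBatchRequestParallelism == nil &&
	cfg.RemoteWriteQueue.NumConsumers != defaultNumConsumers {
	// Nudge existing users toward the new setting before the gate flips.
	set.Logger.Warn("remote_write_queue.num_consumers is set; consider migrating to " +
		"max_batch_request_parallelism before the " +
		"exporter.prometheusremotewritexporter.EnableMultipleWorkers feature gate is enabled by default")
}
```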

Contributor Author:

done. please check if you agree

@ArthurSens (Member) commented:

@edma2 found a bug when batching time series with multiple workers. Here is a PR trying to fix it: #36524

Should we solve the bug before re-allowing multiple workers?

This exporter has a feature gate: `+exporter.prometheusremotewritexporter.EnableMultipleWorkers`.

When this feature gate is enabled, `num_consumers` will be used as the worker count for handling batches from the queue, and `max_batch_request_parallelism` will be used for parallelism on a single batch bigger than `max_batch_size_bytes`.
Enabling this feature gate with `num_consumers` higher than 1 requires the target destination to support ingestion of OutOfOrder samples. See [Multiple Consumers and OutOfOrder](#multiple-consumers-and-outoforder) for more info.
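For illustration (not part of the quoted README), a config exercising both settings might look like the sketch below; the exact placement of `max_batch_request_parallelism` is assumed from this discussion, not taken from the final docs:

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    max_batch_request_parallelism: 5  # splits an oversized batch into parallel requests
    remote_write_queue:
      enabled: true
      num_consumers: 5  # queue workers; >1 needs out-of-order support on the target
```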
Member:

I think this value needs to default to 1 if this feature gate is enabled. When the feature gate is ultimately removed, I don't think it's appropriate to have a default value that is known not to work with any number of potential receivers, and that would not work with the default configuration of even those receivers known to support it in some configurations.

Contributor Author:

I agree that this will help with the transition, and that the exporter should have "sane" defaults at all times.
done. please check

```go
	return fmt.Errorf("max_batch_request_parallelism can't be set to below 1")
}
if enableMultipleWorkersFeatureGate.IsEnabled() && cfg.MaxBatchRequestParallelism == nil {
	return fmt.Errorf("enabling featuregate `+exporter.prometheusremotewritexporter.EnableMultipleWorkers` requires setting `max_batch_request_parallelism` in the configuration")
}
```
Contributor:

We don't need to require people to set this.

Comment on lines +123 to +126

```go
concurrency := cfg.RemoteWriteQueue.NumConsumers
if enableMultipleWorkersFeatureGate.IsEnabled() || cfg.MaxBatchRequestParallelism != nil {
	concurrency = *cfg.MaxBatchRequestParallelism
}
```
Contributor:

Suggested change:

```diff
-concurrency := cfg.RemoteWriteQueue.NumConsumers
-if enableMultipleWorkersFeatureGate.IsEnabled() || cfg.MaxBatchRequestParallelism != nil {
-	concurrency = *cfg.MaxBatchRequestParallelism
-}
+concurrency := 5
+if !enableMultipleWorkersFeatureGate.IsEnabled() {
+	concurrency = cfg.RemoteWriteQueue.NumConsumers
+}
+if cfg.MaxBatchRequestParallelism != nil {
+	concurrency = *cfg.MaxBatchRequestParallelism
+}
```

We want:

  • A default of 5
  • To always use MaxBatchRequestParallelism if it is set
  • To only use NumConsumers if the feature gate is disabled.

This suggestion accomplishes that.
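For clarity, here is the precedence this suggestion encodes, pulled out as a standalone helper (a sketch for illustration, not the PR's actual code):

```go
// resolveConcurrency mirrors the suggested precedence: an explicit
// MaxBatchRequestParallelism always wins; NumConsumers applies only
// while the feature gate is disabled; otherwise the default is 5.
func resolveConcurrency(gateEnabled bool, numConsumers int, maxBatchRequestParallelism *int) int {
	concurrency := 5
	if !gateEnabled {
		concurrency = numConsumers
	}
	if maxBatchRequestParallelism != nil {
		concurrency = *maxBatchRequestParallelism
	}
	return concurrency
}
```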

Contributor Author:

This way it will be a "change in behavior". There may be some users who set cfg.RemoteWriteQueue.NumConsumers to increase the concurrency; with this suggestion, it will always be set to 5 unless they explicitly set *cfg.MaxBatchRequestParallelism.

Hence why I checked for the feature gate above: to make sure *cfg.MaxBatchRequestParallelism was explicitly set by the user, and if not, default to cfg.RemoteWriteQueue.NumConsumers.

I may have misunderstood, but I thought the plan was to keep full retro-compatibility. Are we okay with defaulting to 5, so that users who want increased parallelism need to set *cfg.MaxBatchRequestParallelism?

Contributor:

It will only be a change in behavior once the feature gate is enabled. It shouldn't be a breaking change in this PR (unless I'm mistaken). The reason we have a feature gate at all is to make the migration through the breaking change easier.

Contributor Author:

I missed the gate difference. I'll update this over the weekend.

exporter/prometheusremotewriteexporter/factory.go (review thread, outdated, resolved)
@ArthurSens (Member) commented:

Hey @bmiguel-teixeira, once #36601 is merged, I think we should be safe to proceed with multiple workers again :)

Could you rebase your PR on top of main once that happens?
