feat/re-allow multiple workers #36134
base: main
Conversation
I'll be OOO for the next few weeks, so I won't be able to review until then.
Thanks for the PR @bmiguel-teixeira! The contributions sound solid, but I'm a bit concerned that we're doing two separate things in one single PR here. Generally that's not a good practice because in case we have to revert one thing, we end up reverting more than what's needed, not to mention that it makes the PR bigger and a bit harder to review.
Could you choose only one new functionality for this PR and open another one for the other?
Regarding the changes, could you add tests for the telemetry added? Send PRW requests to a mock server and assert that the metric you've added increments as you expected.
For allowing multiple workers, it would be nice if we add extra documentation making it clear that Out-of-order needs to be enabled in Prometheus for it to work :)
Sure. I will open a dedicated PR with the additional telemetry and keep the queue changes in this one, which already has context.
Hi @ArthurSens, removed the additional telemetry; it will be added in a secondary PR. Also added a bit of docs to explain the toggle and use case. Please take a look. Cheers
Well, the code is simple so it does LGTM, but I'm struggling to test this.
Do you have any examples of apps/configs that will generate out of order datapoints? All attempts I've tried provide things in order so I can't be sure if this is working as expected 😅
(there's a linting failure)
Hi @ArthurSens, just submitted your recommendation to fix the spelling issues. In regards to testing and simulating the out-of-order issues locally (a sketch of the kind of configs used follows this comment):
Prometheus Config
Otel Config
Scenario 1 (Vanilla)
Outcome: Metrics are re-ingested, in order, with a single worker. ALL GOOD.
Scenario 2 (PrometheusRemoteWrite 5 Consumers + Vanilla Prometheus)
Outcome: OutOfOrder errors in Prometheus after boot-up, since samples will be retried in no specific order.
Scenario 3 (PrometheusRemoteWrite 5 Consumers + Prometheus with OutOfOrder Window)
Outcome: No OutOfOrder issues, because even though we have multiple workers, Prometheus can accept mixed/old samples and update the TSDB until 10 minutes "ago".
Let me know if you need any more info. Cheers.
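The comment's original configs are not reproduced in this page; the following is a minimal, hypothetical sketch of the kind of setup Scenario 3 describes: a Prometheus instance with an out-of-order ingestion window and a collector whose prometheusremotewrite exporter runs multiple queue consumers. The 10m window, endpoint, and queue values are illustrative assumptions, not the author's actual configuration.

```yaml
# prometheus.yml — out-of-order ingestion enabled (Scenario 3)
# Prometheus is assumed to be started with --web.enable-remote-write-receiver
storage:
  tsdb:
    out_of_order_time_window: 10m   # accept samples up to 10 minutes old

# otel-collector.yaml — remote-write exporter with multiple queue consumers
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      queue_size: 10000
      num_consumers: 5   # multiple workers; relies on the out-of-order window above
```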
Perfect; thank you for the instructions! I just tested it, and it works perfectly. There's just one CI failure that we need to fix before approving here.
Could you run `make generate` and commit the changes?
@ArthurSens All done!
@@ -87,6 +90,10 @@ var _ component.Config = (*Config)(nil)

// Validate checks if the exporter configuration is valid
func (cfg *Config) Validate() error {
	if cfg.MaxBatchRequestParallelism <= 0 {
Seems like 0 should also not be valid?
yeah. nice catch
done
if enableMultipleWorkersFeatureGate.IsEnabled() {
	concurrency = cfg.MaxBatchRequestParallelism
}
It would be nice to always use `cfg.MaxBatchRequestParallelism`, if it is set by the user. That way, existing users can migrate from `RemoteWriteQueue.NumConsumers` to `cfg.MaxBatchRequestParallelism` right away. If the feature gate is disabled, and a user has set `NumConsumers` to a non-default value, it would be nice to emit a warning instructing them to migrate. WDYT?
done. please check if you agree
This exporter has feature gate: `+exporter.prometheusremotewritexporter.EnableMultipleWorkers`.

When this feature gate is enabled, `num_consumers` will be used as the worker count for handling batches from the queue, and `max_batch_request_parallelism` will be used for parallelism on a single batch bigger than `max_batch_size_bytes`.
Enabling this feature gate with `num_consumers` higher than 1 requires the target destination to support ingestion of OutOfOrder samples. See [Multiple Consumers and OutOfOrder](#multiple-consumers-and-outoforder) for more info.
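As a hypothetical illustration of the documented behavior (not taken from the PR itself), a collector running with the feature gate enabled would distinguish the two settings roughly as follows; the `--feature-gates` flag syntax, endpoint, and numeric values are assumptions:

```yaml
# Assumed invocation:
#   otelcol --config otel-collector.yaml \
#     --feature-gates=exporter.prometheusremotewritexporter.EnableMultipleWorkers
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    max_batch_size_bytes: 3000000        # batches above this size are split
    max_batch_request_parallelism: 5     # parallel requests for one oversized batch
    remote_write_queue:
      num_consumers: 5                   # workers pulling batches off the queue
```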
I think this value needs to default to 1 if this feature gate is enabled. When the feature gate is ultimately removed, I don't think it's appropriate to have a default value that is known to not work with any number of potential receivers, and would not work with the default configuration of those receivers that are known to be able to support it in some configurations.
I agree that will help with the transition, and that the exporter should have "sane" defaults at all times.
Done, please check.
return fmt.Errorf("max_batch_request_parallelism can't be set to below 1") | ||
} | ||
if enableMultipleWorkersFeatureGate.IsEnabled() && cfg.MaxBatchRequestParallelism == nil { | ||
return fmt.Errorf("enabling featuregate `+exporter.prometheusremotewritexporter.EnableMultipleWorkers` requires setting `max_batch_request_parallelism` in the configuration") |
We don't need to require people to set this.
concurrency := cfg.RemoteWriteQueue.NumConsumers
if enableMultipleWorkersFeatureGate.IsEnabled() || cfg.MaxBatchRequestParallelism != nil {
	concurrency = *cfg.MaxBatchRequestParallelism
}
Suggested change:
- concurrency := cfg.RemoteWriteQueue.NumConsumers
- if enableMultipleWorkersFeatureGate.IsEnabled() || cfg.MaxBatchRequestParallelism != nil {
- 	concurrency = *cfg.MaxBatchRequestParallelism
- }
+ concurrency := 5
+ if !enableMultipleWorkersFeatureGate.IsEnabled() {
+ 	concurrency = cfg.RemoteWriteQueue.NumConsumers
+ }
+ if cfg.MaxBatchRequestParallelism != nil {
+ 	concurrency = *cfg.MaxBatchRequestParallelism
+ }
We want:
- A default of 5
- To always use MaxBatchRequestParallelism if it is set
- To only use NumConsumers if the feature gate is disabled.
This suggestion accomplishes that.
This way it will be a "change in behavior". There may be some users that set `cfg.RemoteWriteQueue.NumConsumers` to increase the concurrency; with this suggestion, it will always be set to 5 unless they explicitly set `*cfg.MaxBatchRequestParallelism`.
Hence why I checked for the feature gate above: to make sure `*cfg.MaxBatchRequestParallelism` was explicitly set by the user, and if not, default to `cfg.RemoteWriteQueue.NumConsumers`.
I may have misunderstood, but I thought the plan was to keep full retro-compatibility. Are we okay with setting a default of 5, and if users want increased parallelism they need to set `*cfg.MaxBatchRequestParallelism`?
It will only be a change in behavior once the feature gate is enabled. It shouldn't be a breaking change in this PR (unless I'm mistaken). The reason we have a feature gate at all is to make the migration from the breaking change easier.
I missed the gate difference. I'll update this over the weekend.
Co-authored-by: David Ashpole <[email protected]>
Hey @bmiguel-teixeira, once #36601 is merged, I think we should be safe to proceed with multiple workers again :) Could you rebase your PR on top of main once that happens?
Description
This PR does the following:
Out-of-order is no longer an issue now that it is fully supported in Prometheus. Nonetheless, I am setting the default worker count to 1 to avoid OoO errors with vanilla Prometheus settings.
With a single worker, and for a collector under a large load, this becomes "blocking". Example: imagine a scenario in which a collector is scraping lots of targets; with a slow Prometheus or an unstable network, a single worker can easily bottleneck the off-shipping if retries are enabled.
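As a rough illustration of the bottleneck described above (the endpoint and values are hypothetical, not taken from the PR): with retries enabled and a single queue consumer, a retried batch occupies the only worker, so everything behind it in the queue waits; raising the consumer count keeps other batches flowing, provided the target can ingest out-of-order samples.

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    retry_on_failure:
      enabled: true          # a retried batch ties up the single worker
    remote_write_queue:
      enabled: true
      queue_size: 10000
      num_consumers: 1       # default: safe for vanilla Prometheus, but can bottleneck
      # num_consumers: 5     # only with an out-of-order window on the target
```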
Link to tracking issue
N/A
Testing
Documentation
Docs auto-updated. README.md is now correct in its explanation of `num_consumers` since it is no longer hard-coded at 1. Additional docs added.