Commit
feat: Implement RandomQueue scheduler strategy (#1914)
This PR implements a new Scheduler Strategy based on a _Concurrent Random Queue_. It is based on @erezrokah's Priority Queue Scheduler Strategy.

## How does it work

This is hopefully a much simpler scheduling strategy. It doesn't have any semaphores; it just uses the existing concurrency setting.

Table resolvers (and their relations) get `Push`ed into a work queue, and `concurrency` workers `Pull` from this queue. Each `Pull` takes a random element from the queue rather than the oldest one.

## Why it should work better

**The key benefit of this strategy is this:**

- Assumption 1: most slow syncs are actually slow because of rate limits, not because of I/O limits or too much data.
- Assumption 2: the meaty part of the sync is syncing relations, because each child table has a resolver per parent.
- Benefit: because the likelihood of picking up a child resolver of a given table is uniformly distributed across the `int32` range, the relation API calls for any one table are spread evenly throughout the sync instead of arriving in a burst, which minimises rate limiting.

## Does it work better?

Still working on results. Notably AWS & Azure yield mixed results; still have to look into why.

### GCP

**Before**

```
$ cli sync .
Loading spec(s) from .
Starting sync for: gcp (grpc@localhost:7777) -> [postgresql (cloudquery/[email protected])]
Sync completed successfully. Resources: 25799, Errors: 0, Warnings: 0, Time: 2m23s
```

UPDATE: GCP is moving to the Round Robin strategy, and the sync is much faster with Round Robin:

```
$ cli sync .
Loading spec(s) from .
Starting sync for: gcp (grpc@localhost:7777) -> [postgresql (cloudquery/[email protected])]
Sync completed successfully. Resources: 26355, Errors: 0, Warnings: 0, Time: 40s
```

**After**

```
$ cli sync .
Loading spec(s) from .
Starting sync for: gcp (grpc@localhost:7777) -> [postgresql (cloudquery/[email protected])]
Sync completed successfully. Resources: 26186, Errors: 0, Warnings: 0, Time: 34s
```

**Result: 76.22% reduction in time (2m23s to 34s), or a 4.21x speedup.**

**Result against Round Robin: 15% reduction in time (40s to 34s), or a 1.18x speedup (probably within margin of error)**

### BigQuery

**Before**

```
$ cli sync bigquery_to_postgresql.yaml
Loading spec(s) from bigquery_to_postgresql.yaml
Starting sync for: bigquery (cloudquery/[email protected]) -> [postgresql (cloudquery/[email protected])]
Sync completed successfully. Resources: 26139, Errors: 0, Warnings: 0, Time: 2m7s
```

**After**

```
$ cli sync bigquery_to_postgresql.yaml
Loading spec(s) from bigquery_to_postgresql.yaml
Starting sync for: bigquery (cloudquery/[email protected]) -> [postgresql (cloudquery/[email protected])]
Sync completed successfully. Resources: 26139, Errors: 0, Warnings: 0, Time: 1m26s
```

**Result: 32.28% reduction in time (2m7s to 1m26s), or a 1.48x speedup**

### SentinelOne

**Before** (it was already quite fast due to a previously merged improvement)

```
$ cli sync .
Loading spec(s) from .
Starting sync for: sentinelone (grpc@localhost:7777) -> [postgresql (cloudquery/[email protected])]
Sync completed successfully. Resources: 1295, Errors: 0, Warnings: 0, Time: 15s
```

**After**

```
$ cli sync .
Loading spec(s) from .
Starting sync for: sentinelone (grpc@localhost:7777) -> [postgresql (cloudquery/[email protected])]
Sync completed successfully. Resources: 1295, Errors: 0, Warnings: 0, Time: 8s
```

**Result: 46.67% reduction in time (15s to 8s), or a 1.875x speedup**

## How to test

- Add a `go.mod` replace for the SDK: `replace github.com/cloudquery/plugin-sdk/v4 => github.com/cloudquery/plugin-sdk/v4 v4.63.1-0.20241002131015-243705c940c6` (check the last commit on this PR)
- Run the source plugin via gRPC locally; make sure to set the scheduler strategy to `scheduler.StrategyRandomQueue`.

## How scary is it to merge

- This scheduler strategy is not used by any plugin by default, so in principle this should be safe to merge.
---------

Co-authored-by: erezrokah <[email protected]>