-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate BatchOver in favor of in_batches(use_ranges: true) #136
base: master
Are you sure you want to change the base?
Conversation
92b4694
to
af7d561
Compare
BenchmarkScenarioThe benchmark iterates over a table having 30M records, in batches of 10k records. The table has an auto-incrementing index starting from 1 and 65 columns (so returning full records is at least a little costly). MeasurementsThe benchmark is executed on a single machine, with a round-trip-time < 1ms. Network (I/O) stats obtained with: docker stats --no-stream postgres --format 'table {{.NetIO}}' Postgres I/O stats obtained with: SELECT
heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
FROM
pg_statio_user_tables
WHERE
relname = '<the-table>'; The metrics below only account for the time spent generating the scopes, not using them. For example, should we have used the scopes generated with Results
(*) Additional optimizationsIn In |
Did you restart pg between each run ? it's weird to see that we read data from disk only the first run ("BatchOver") |
forget what i wrote, i understood my mistake ;) |
|
||
backfill_batch_size = SafePgMigrations.config.backfill_batch_size | ||
|
||
if ActiveRecord.version >= Gem::Version.new('7.1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
until the patch on rails is not provided/approved/merged/released, i think is to early to switch towards rails implementation. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's keep this PR in draft and keep the new small optim from #138.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just pushed a proposal:
- Avoidable resource consumption when using "in_batches(use_ranges: true)" rails/rails#51242
- [Fix #51242] Rework in_batches(use_ranges: true) to be more efficient rails/rails#51243
Let's see if it gets some traction 🤞.
Starting from ActiveRecord 7.1, there's a built-in helper equivalent to what BatchOver does, let's use it instead of maintaining our own implementation forever. We keep BatchOver for compatibility with ActiveRecord < 7.1.
af7d561
to
8d89fba
Compare
Starting from ActiveRecord 7.1, there's a built-in helper equivalent to what
BatchOver
does, let's use it instead of maintaining our own implementation forever.We need to keep
BatchOver
as long as compatibility with ActiveRecord < 7.1 is maintained though.See:
If using ActiveRecord 7.1 or later, we would use the recommended built-in method
in_batches
with theuse_ranges: true
option, e.g.Otherwise, we would still use
BatchOver
as a fallback:Note that although both helpers are almost equivalent, there are small differences in the queries generated.
With the example code above, and assuming the
users
tables contains 250 records, we would have withBatchOver
:Whereas
in_batches(of: 100, use_ranges: true)
would give:It would work exactly the same if passing any
ActiveRecord::Relation
object in place of the modelUser
, e.g.User.where(condition: 'something')
, with the additional condition appearing in the WHERE clause of each query.For the selection of a range of record ids (id min / id max),
BatchOver
will generate two queries per batch, returns full records (and not only ids), but retrieves only 2 records. Conversely,in_batches(use_ranges: true)
will generate a single query per batch, return only ids (and not full records), but would return the full list of ids instead of the min/max only.I believe that trade-off is acceptable for our purpose, but that is debatable.