Slimmer #includes behaviour #1

ardecvz · 2023-03-13T19:02:41Z

The goal of the commit is to make #includes slimmer - join only the tables needed to satisfy the referenced tables and preload all other associations with separate queries.

Motivation / Background

The Active Record #includes method behaves differently depending on the other parts of the query:
it may execute multiple queries (one per included association) or build a single JOIN query to load everything at once.

Before the commit, it was "all or nothing" behind an eager load flag - one referenced table was enough to turn the whole query into one big JOIN.

However, it's not always necessary to join any tables other than the ones with conditions.
Normally, we can avoid joining everything (which can be expensive) and execute a few queries instead.

The following two statements demonstrate the difference:

> Author.includes(:post, books: :essay).to_a

  Author Load (0.2ms)  SELECT "authors".* FROM "authors"
  Post Load (0.3ms)  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3]]
  Book Load (0.3ms)  SELECT "books".* FROM "books" WHERE "books"."author_id" IN (?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3]]
  Essay Load (0.1ms)  SELECT "essays".* FROM "essays" WHERE "essays"."book_id" IN (?, ?, ?, ?)  [["book_id", 1], ["book_id", 3], ["book_id", 2], ["book_id", 4]]

> Author.includes(:post, books: :essay).where(books: { language: 'EN' }).to_a

  SQL (0.3ms)  SELECT "authors"."id" AS t0_r0, "authors"."name" AS t0_r1, "authors"."author_address_id" AS t0_r2, "authors"."author_address_extra_id" AS t0_r3, "authors"."organization_id" AS t0_r4, "authors"."owned_essay_id" AS t0_r5, "posts"."id" AS t1_r0, "posts"."author_id" AS t1_r1, "posts"."title" AS t1_r2, "posts"."body" AS t1_r3, "posts"."type" AS t1_r4, "posts"."legacy_comments_count" AS t1_r5, "posts"."taggings_with_delete_all_count" AS t1_r6, "posts"."taggings_with_destroy_count" AS t1_r7, "posts"."tags_count" AS t1_r8, "posts"."indestructible_tags_count" AS t1_r9, "posts"."tags_with_destroy_count" AS t1_r10, "posts"."tags_with_nullify_count" AS t1_r11, "books"."id" AS t2_r0, "books"."author_id" AS t2_r1, "books"."format" AS t2_r2, "books"."format_record_id" AS t2_r3, "books"."format_record_type" AS t2_r4, "books"."name" AS t2_r5, "books"."status" AS t2_r6, "books"."last_read" AS t2_r7, "books"."nullable_status" AS t2_r8, "books"."language" AS t2_r9, "books"."author_visibility" AS t2_r10, "books"."illustrator_visibility" AS t2_r11, "books"."font_size" AS t2_r12, "books"."difficulty" AS t2_r13, "books"."cover" AS t2_r14, "books"."isbn" AS t2_r15, "books"."external_id" AS t2_r16, "books"."original_name" AS t2_r17, "books"."published_on" AS t2_r18, "books"."boolean_status" AS t2_r19, "books"."tags_count" AS t2_r20, "books"."created_at" AS t2_r21, "books"."updated_at" AS t2_r22, "books"."updated_on" AS t2_r23, "essays"."id" AS t3_r0, "essays"."type" AS t3_r1, "essays"."name" AS t3_r2, "essays"."writer_id" AS t3_r3, "essays"."writer_type" AS t3_r4, "essays"."category_id" AS t3_r5, "essays"."author_id" AS t3_r6, "essays"."book_id" AS t3_r7 FROM "authors" LEFT OUTER JOIN "posts" ON "posts"."author_id" = "authors"."id" LEFT OUTER JOIN "books" ON "books"."author_id" = "authors"."id" LEFT OUTER JOIN "essays" ON "essays"."book_id" = "books"."id" WHERE "books"."language" IS NULL

Detail

The main idea is to remove non-referenced tables from the final JOIN query
(referenced tables are those named by #references, #where, #order, or copied by #and, #or, #merge).
So we LEFT OUTER JOIN only the included tables that the query has conditions on. At the same time,
we have to bring the non-referenced tables back as separate preload queries:

> Author.includes(:post, books: :essay).where(books: { language: 'EN' }).to_a

  SQL (0.2ms)  SELECT "authors"."id" AS t0_r0, "authors"."name" AS t0_r1, "authors"."author_address_id" AS t0_r2, "authors"."author_address_extra_id" AS t0_r3, "authors"."organization_id" AS t0_r4, "authors"."owned_essay_id" AS t0_r5, "books"."id" AS t1_r0, "books"."author_id" AS t1_r1, "books"."format" AS t1_r2, "books"."format_record_id" AS t1_r3, "books"."format_record_type" AS t1_r4, "books"."name" AS t1_r5, "books"."status" AS t1_r6, "books"."last_read" AS t1_r7, "books"."nullable_status" AS t1_r8, "books"."language" AS t1_r9, "books"."author_visibility" AS t1_r10, "books"."illustrator_visibility" AS t1_r11, "books"."font_size" AS t1_r12, "books"."difficulty" AS t1_r13, "books"."cover" AS t1_r14, "books"."isbn" AS t1_r15, "books"."external_id" AS t1_r16, "books"."original_name" AS t1_r17, "books"."published_on" AS t1_r18, "books"."boolean_status" AS t1_r19, "books"."tags_count" AS t1_r20, "books"."created_at" AS t1_r21, "books"."updated_at" AS t1_r22, "books"."updated_on" AS t1_r23 FROM "authors" LEFT OUTER JOIN "books" ON "books"."author_id" = "authors"."id" WHERE "books"."language" IS NULL
  Post Load (0.2ms)  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3]]
  Essay Load (0.1ms)  SELECT "essays".* FROM "essays" WHERE "essays"."book_id" = ?  [["book_id", 1]]
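
The split shown above can be sketched in plain Ruby. This is only an illustration of the idea, not the code from includes_tracker.rb: `normalize`, `split_tree`, `split_includes` and the `TABLES` mapping (a stand-in for reflections) are hypothetical names.

```ruby
# Hypothetical association => table mapping; real code would ask reflections.
TABLES = { post: "posts", books: "books", essay: "essays" }.freeze

# Turn an includes spec (Symbol, Array, Hash) into a nested hash tree.
def normalize(spec)
  case spec
  when Symbol then { spec => {} }
  when Array  then spec.map { |item| normalize(item) }.reduce({}, :merge)
  when Hash   then spec.to_h { |name, nested| [name, normalize(nested)] }
  else {}
  end
end

# Returns [joined, preloaded]: a node stays in the JOIN if its table is
# referenced or if any node below it is; everything else is preloaded.
def split_tree(tree, referenced)
  joined = {}
  preloaded = {}
  tree.each do |name, subtree|
    sub_joined, sub_preloaded = split_tree(subtree, referenced)
    if referenced.include?(TABLES.fetch(name)) || !sub_joined.empty?
      joined[name] = sub_joined
      preloaded[name] = sub_preloaded unless sub_preloaded.empty?
    else
      preloaded[name] = subtree
    end
  end
  [joined, preloaded]
end

def split_includes(spec, referenced)
  split_tree(normalize(spec), referenced)
end

joined, preloaded = split_includes([:post, { books: :essay }], ["books"])
p joined     # only books stays in the JOIN
p preloaded  # post and books -> essay go to separate preload queries
```

With "books" as the only referenced table, the sketch keeps books in the JOIN and hands post and essay to preloading, which corresponds to the three queries in the log above.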

Backward compatibility

Includes and table names don't always match so we use reflections.

We try to match every possible table name from includes against the references, including those from inner and left outer joins.
If a match is unreliable (for example, an association has no reflection), we fall back to JOINing the associations so that user applications don't break.
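
A minimal sketch of that fallback rule, assuming a simple association => table lookup in place of real reflections. `REFLECTIONS` and `loading_strategy` are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for reflections: association name => table name.
REFLECTIONS = { company: "companies", books: "books" }.freeze

# Decide how to load one included association.
def loading_strategy(association, referenced_tables)
  table = REFLECTIONS[association]
  return :join if table.nil? # unreliable match: keep the old JOIN behaviour

  referenced_tables.include?(table) ? :join : :preload
end

loading_strategy(:books,   ["books"]) # => :join
loading_strategy(:company, ["books"]) # => :preload
loading_strategy(:unknown, ["books"]) # => :join (no reflection, play it safe)
```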

All current tests are green without changes inside them (except for the expected number of queries).

The commit certainly breaks user code that uses a plain string in where or select without references:

Author.includes(:company, :books).where(company: { title: "The Office" })
                                 .where("books.title = ?", "Design Patterns")
                                 .select("books.title")

We do not want to introduce any flags or scopes to support this behaviour because the docs already state that string clauses must be accompanied by references:
https://api.rubyonrails.org/v7.0.4.2/classes/ActiveRecord/QueryMethods.html#method-i-includes
https://api.rubyonrails.org/v7.0.4.2/classes/ActiveRecord/QueryMethods.html#method-i-references

People who aren't aware of #references will have to start using it.
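
To illustrate why the string clause is the problem, here is a toy model (not Rails code) of how referenced tables could be collected: hash conditions name their table, while a plain SQL string is opaque. `referenced_tables` is a hypothetical helper. In Rails itself the documented remedy is to add `.references(:books)` to the relation.

```ruby
# Toy model: collect the table names that a list of where-conditions exposes.
# Hash conditions name their table; a plain SQL string tells us nothing.
def referenced_tables(conditions, explicit_references = [])
  from_hashes = conditions.grep(Hash).flat_map do |condition|
    condition.select { |_, value| value.is_a?(Hash) }.keys.map(&:to_s)
  end
  (from_hashes + explicit_references.map(&:to_s)).uniq
end

conditions = [{ company: { title: "The Office" } }, "books.title = ?"]

referenced_tables(conditions)           # => ["company"]
referenced_tables(conditions, [:books]) # => ["company", "books"]
```

Without the explicit reference, books never shows up as referenced, so it would be preloaded and the string condition on "books.title" would hit a query that no longer joins that table.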

Known limitations

Includes may be ambiguous - a table only needs to be referenced once to remove the slim optimization from all includes trees.
Look at this example:

Author.includes(company: :authors, publishers: :authors).where(company: { authors: { name: "J. R. R. Tolkien" } })

Instead of the expected separate preload for the publishers, they are included in the single JOIN. This is because we currently cannot tell the nesting level of a reference from the plain references_values.

Moreover, using references to numbered table aliases (with the digits at the end) disables all slim optimizations completely.
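
A small sketch of the numbered-alias check, assuming that "numbered alias" means any reference name ending in digits (for example "taggings_tags_2"). The regex and the method name are assumptions for illustration, not taken from the PR.

```ruby
# Assumption: a numbered alias is any reference name that ends in digits.
NUMBERED_ALIAS = /\d+\z/

# The slim optimization is only attempted when no reference looks like
# a numbered alias.
def slim_optimization_possible?(references)
  references.none? { |name| name.to_s.match?(NUMBERED_ALIAS) }
end

slim_optimization_possible?(["books"])                    # => true
slim_optimization_possible?(["books", "taggings_tags_2"]) # => false
```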

Checklist

Before submitting the PR make sure the following are checked:

This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
Tests are added or updated if you fix a bug or add a feature.
CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

Future improvements

1) Includes are not always plain - they may contain several levels inside a hash. In the commit, we handle several includes as separate trees starting from the top-level root. If there's any reference to a possible includes table, we mark the whole tree as needing a JOIN. It's a safe bet which could be improved in future with a more fine-grained reference marking algorithm.
2) The commit's tree traversal has mediocre performance at best. Several intermediate structures and tree walks could be removed - we prefer a simple algorithm at the moment to show the idea.

palkan

Thanks!

Looks more sophisticated than I expected, but it seems to work. I left a few questions, let's discuss them and see how we can improve the solution.

activerecord/lib/active_record/relation/calculations.rb

activerecord/lib/active_record/relation/includes_tracker.rb

palkan · 2023-03-15T16:07:43Z

activerecord/test/cases/associations/nested_through_associations_test.rb

@@ -539,7 +539,7 @@ def test_nested_has_many_through_with_conditions_on_through_associations
  def test_nested_has_many_through_with_conditions_on_through_associations_preload
    assert_empty Author.where("tags.id" => 100).joins(:misc_post_first_blue_tags)

-    author = assert_queries(2) { Author.includes(:misc_post_first_blue_tags).third }
+    author = assert_queries(3) { Author.includes(:misc_post_first_blue_tags).third }


Hm, this change looks suspicious; ideally, no existing tests should have changed. What's the source of the additional query?

I'm almost sure that it's a natural change - we're moving JOINs to separate preloads after all.
Before:

FROM "posts" LEFT OUTER JOIN "taggings" ON "taggings"."taggable_type" = ? AND "taggings"."comment" = ? AND "taggings"."taggable_id" = "posts"."id" LEFT OUTER JOIN "tags" ON "tags"."name" = ? AND "tags"."id" = "taggings"."tag_id" LEFT OUTER JOIN "taggings" "taggings_tags_2" ON "taggings_tags_2"."tag_id" = "tags"."id"

After:

-- The first query
FROM "posts" LEFT OUTER JOIN "taggings" ON "taggings"."taggable_type" = ? AND "taggings"."comment" = ? AND "taggings"."taggable_id" = "posts"."id" LEFT OUTER JOIN "tags" ON "tags"."name" = ? AND "tags"."id" = "taggings"."tag_id"
-- The second query
FROM "taggings" LEFT OUTER JOIN "tags" ON "tags"."id" = "taggings"."tag_id" LEFT OUTER JOIN "taggings" "taggings_tags" ON "taggings_tags"."tag_id" = "tags"."id"

So we're now preloading the second taggings instead of eager loading it.
The test didn't need to change before because, without the new tree traversal, we chose a JOIN query more often (but we also weren't able to optimize advanced cases like this).
By the way, we don't see many failures like this in the current Rails tests because few tests combine includes, no further where usage, and a query count check.

However, it's nice that you've spotlighted this - it's strange that there's a JOIN taggings to taggings in the second query.
It's included in the long associations chain as the last includes:

class Author < ActiveRecord::Base
  has_many :misc_post_first_blue_tags, through: :misc_posts, source: :first_blue_tags
  has_many :misc_posts, -> { where(posts: { title: ["misc post by bob", "misc post by mary"] }) }, class_name: "Post"
end

class Post < ActiveRecord::Base
  has_many :first_blue_tags, -> { where tags: { name: "Blue" } }, through: :first_taggings, source: :tag
  has_many :first_taggings, -> { where taggings: { comment: "first" } }, as: :taggable, class_name: "Tagging"
end

class Tagging < ActiveRecord::Base
  belongs_to :tag, -> { includes(:tagging) }
end

but it's still strange - I'm investigating. Maybe it's because of the known limitations with ambiguous table names in includes and references.

Small update - if we remove includes from the last belongs_to, the problem goes away.
It's super strange, as we don't change how includes works - we only filter the current includes_values.
Right now, my plan is to reproduce this behaviour without our optimization. That would show it's a separate minor Rails issue.

ardecvz force-pushed the feat/slim-eager-load branch from 5642b37 to 96fd8d4 on March 13, 2023 19:29

ardecvz force-pushed the feat/slim-eager-load branch from 96fd8d4 to 44372c2 on March 13, 2023 19:45

palkan reviewed Mar 13, 2023

ardecvz added 3 commits March 14, 2023 01:35

Rewrite the tree traversal to support an example from README

c5c1f43

Fix specs

9331045

Fix PG type mismatch in function

95ad72b

skryukov reviewed Mar 14, 2023

activerecord/lib/active_record/relation/includes_tracker.rb (outdated, resolved)

ardecvz added 3 commits March 14, 2023 20:45

Explain and cover one more case

e53e3e9

Use a really nice sounding aliases

3678b28

Improve README

d8585ae

palkan reviewed Mar 15, 2023


Comment timeline: palkan, Mar 15, 2023; ardecvz, Mar 16, 2023; ardecvz, Mar 16, 2023 (edited); ardecvz, Mar 20, 2023.
