Slimmer #includes behaviour #1

ardecvz
Owner

@ardecvz ardecvz commented Mar 13, 2023

The goal of the commit is to make #includes slimmer - join only the tables required to satisfy the references and preload all other associations independently.

Motivation / Background

The Active Record #includes method results in different behaviour depending on the other parts of the query:
it may execute multiple queries (one per included association) or build a single JOIN query to load everything at once.

Before the commit, it was "all or nothing" behind an eager load flag - one referenced table was enough to transform the whole query into one big JOIN.

However, it's not always necessary to join tables other than the ones with conditions.
Normally, we can avoid joining everything (which can be expensive) and execute a few extra queries instead.

The following two statements demonstrate the difference:

> Author.includes(:post, books: :essay).to_a

  Author Load (0.2ms)  SELECT "authors".* FROM "authors"
  Post Load (0.3ms)  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3]]
  Book Load (0.3ms)  SELECT "books".* FROM "books" WHERE "books"."author_id" IN (?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3]]
  Essay Load (0.1ms)  SELECT "essays".* FROM "essays" WHERE "essays"."book_id" IN (?, ?, ?, ?)  [["book_id", 1], ["book_id", 3], ["book_id", 2], ["book_id", 4]]

> Author.includes(:post, books: :essay).where(books: { language: 'EN' }).to_a

  SQL (0.3ms)  SELECT "authors"."id" AS t0_r0, "authors"."name" AS t0_r1, "authors"."author_address_id" AS t0_r2, "authors"."author_address_extra_id" AS t0_r3, "authors"."organization_id" AS t0_r4, "authors"."owned_essay_id" AS t0_r5, "posts"."id" AS t1_r0, "posts"."author_id" AS t1_r1, "posts"."title" AS t1_r2, "posts"."body" AS t1_r3, "posts"."type" AS t1_r4, "posts"."legacy_comments_count" AS t1_r5, "posts"."taggings_with_delete_all_count" AS t1_r6, "posts"."taggings_with_destroy_count" AS t1_r7, "posts"."tags_count" AS t1_r8, "posts"."indestructible_tags_count" AS t1_r9, "posts"."tags_with_destroy_count" AS t1_r10, "posts"."tags_with_nullify_count" AS t1_r11, "books"."id" AS t2_r0, "books"."author_id" AS t2_r1, "books"."format" AS t2_r2, "books"."format_record_id" AS t2_r3, "books"."format_record_type" AS t2_r4, "books"."name" AS t2_r5, "books"."status" AS t2_r6, "books"."last_read" AS t2_r7, "books"."nullable_status" AS t2_r8, "books"."language" AS t2_r9, "books"."author_visibility" AS t2_r10, "books"."illustrator_visibility" AS t2_r11, "books"."font_size" AS t2_r12, "books"."difficulty" AS t2_r13, "books"."cover" AS t2_r14, "books"."isbn" AS t2_r15, "books"."external_id" AS t2_r16, "books"."original_name" AS t2_r17, "books"."published_on" AS t2_r18, "books"."boolean_status" AS t2_r19, "books"."tags_count" AS t2_r20, "books"."created_at" AS t2_r21, "books"."updated_at" AS t2_r22, "books"."updated_on" AS t2_r23, "essays"."id" AS t3_r0, "essays"."type" AS t3_r1, "essays"."name" AS t3_r2, "essays"."writer_id" AS t3_r3, "essays"."writer_type" AS t3_r4, "essays"."category_id" AS t3_r5, "essays"."author_id" AS t3_r6, "essays"."book_id" AS t3_r7 FROM "authors" LEFT OUTER JOIN "posts" ON "posts"."author_id" = "authors"."id" LEFT OUTER JOIN "books" ON "books"."author_id" = "authors"."id" LEFT OUTER JOIN "essays" ON "essays"."book_id" = "books"."id" WHERE "books"."language" IS NULL
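
A rough way to see why the single JOIN can get heavy is a row-count model. The sketch below is an illustration only: it assumes two sibling has_many associations, and the per-author counts are made up (they are not taken from the logs above, and the real test models may be shaped differently).

```ruby
# Hypothetical data: how many posts and books each author has.
AUTHORS = [
  { name: "A", posts: 10, books: 20 },
  { name: "B", posts: 5,  books: 8 },
  { name: "C", posts: 0,  books: 1 }
].freeze

# One LEFT OUTER JOIN over two sibling has_many associations returns
# posts x books rows per author (an empty side still yields one row).
def joined_rows(authors)
  authors.sum { |a| [a[:posts], 1].max * [a[:books], 1].max }
end

# Separate preloads fetch every record exactly once, one query per table.
def preloaded_rows(authors)
  authors.size + authors.sum { |a| a[:posts] + a[:books] }
end

puts "rows from a single JOIN:  #{joined_rows(AUTHORS)}"
puts "rows from three preloads: #{preloaded_rows(AUTHORS)}"
```

Under these assumptions the JOIN returns several times more rows than the three preload queries combined, which is the cost the slimmer behaviour tries to avoid.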

Detail

The main idea is to remove non-referenced tables
(those not mentioned via #references, #where, #order, or carried over by #and, #or, #merge) from the final JOIN query.
So we LEFT OUTER JOIN only the included tables that have conditions in the query. At the same time,
we have to move the non-referenced tables back to separate preload queries:

> Author.includes(:post, books: :essay).where(books: { language: 'EN' }).to_a

  SQL (0.2ms)  SELECT "authors"."id" AS t0_r0, "authors"."name" AS t0_r1, "authors"."author_address_id" AS t0_r2, "authors"."author_address_extra_id" AS t0_r3, "authors"."organization_id" AS t0_r4, "authors"."owned_essay_id" AS t0_r5, "books"."id" AS t1_r0, "books"."author_id" AS t1_r1, "books"."format" AS t1_r2, "books"."format_record_id" AS t1_r3, "books"."format_record_type" AS t1_r4, "books"."name" AS t1_r5, "books"."status" AS t1_r6, "books"."last_read" AS t1_r7, "books"."nullable_status" AS t1_r8, "books"."language" AS t1_r9, "books"."author_visibility" AS t1_r10, "books"."illustrator_visibility" AS t1_r11, "books"."font_size" AS t1_r12, "books"."difficulty" AS t1_r13, "books"."cover" AS t1_r14, "books"."isbn" AS t1_r15, "books"."external_id" AS t1_r16, "books"."original_name" AS t1_r17, "books"."published_on" AS t1_r18, "books"."boolean_status" AS t1_r19, "books"."tags_count" AS t1_r20, "books"."created_at" AS t1_r21, "books"."updated_at" AS t1_r22, "books"."updated_on" AS t1_r23 FROM "authors" LEFT OUTER JOIN "books" ON "books"."author_id" = "authors"."id" WHERE "books"."language" IS NULL
  Post Load (0.2ms)  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3]]
  Essay Load (0.1ms)  SELECT "essays".* FROM "essays" WHERE "essays"."book_id" = ?  [["book_id", 1]]
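
The split shown in the log above can be modelled roughly as follows. This is a minimal plain-Ruby sketch, not the code in this PR: `split_includes` and `needs_join?` are hypothetical names, the includes are assumed to be normalized into nested hashes, and association names are assumed to equal the referenced names.

```ruby
# A node must be JOINed if it is referenced itself, or if any descendant is
# (the JOIN has to pass through the parent to reach the referenced table).
def needs_join?(name, children, referenced)
  referenced.include?(name.to_s) ||
    children.any? { |child, nested| needs_join?(child, nested, referenced) }
end

# Walk the includes tree and return [joined_paths, preloaded_paths].
# A preloaded path stands for its whole subtree.
def split_includes(tree, referenced, path = [], joined = [], preloaded = [])
  tree.each do |name, children|
    current = path + [name]
    if needs_join?(name, children, referenced)
      joined << current
      split_includes(children, referenced, current, joined, preloaded)
    else
      preloaded << current
    end
  end
  [joined, preloaded]
end

# Normalized form of includes(:post, books: :essay)
includes = { post: {}, books: { essay: {} } }

joined, preloaded = split_includes(includes, ["books"])
p joined     # paths kept in the JOIN query
p preloaded  # paths loaded with separate queries
```

With `books` as the only reference, the sketch keeps `books` in the JOIN and hands `post` and `books -> essay` to preloads, mirroring the three queries in the log.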

Backward compatibility

  1. Includes and table names don't always match so we use reflections.

We try to match all possible includes table names against references including inner and left outer joins.
If a match is unreliable (for example, there is no reflection for the association), we fall back to JOINing the associations so as not to break user applications.

All current tests are green without changes to them (apart from some query-count assertions).
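
The conservative fallback described in this item can be pictured like this. It is only a sketch: `TABLE_FOR` is a hypothetical stand-in for association reflections, its entries are invented for illustration, and `strategy_for` is not a method from the PR.

```ruby
# Hypothetical stand-in for reflections: association name => table name.
TABLE_FOR = {
  company: "organizations",
  books: "books",
  misc_posts: "posts"
}.freeze

# Decide how one included association is loaded, given the referenced names.
def strategy_for(association, referenced)
  table = TABLE_FOR[association]
  return :join if table.nil? # unreliable match: keep the old JOIN behaviour

  # A reference may use either the association name or the table name.
  candidates = [association.to_s, table]
  (candidates & referenced).empty? ? :preload : :join
end

p strategy_for(:misc_posts, ["posts"])  # association and table names differ
p strategy_for(:books, ["posts"])       # not referenced
p strategy_for(:tags, [])               # no reflection known
```

The point of the design is that a missing or doubtful match never makes the query slimmer; it only ever falls back to the previous behaviour.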

  2. The commit certainly breaks user code that uses a plain string in where or select:
Author.includes(:company, :books).where(company: { title: "The Office" })
                                 .where("books.title = ?", "Design Patterns")
                                 .select("books.title")

We do not want to introduce any flags or scopes to support this behaviour because the docs already state that string clauses must not be used without references:
https://api.rubyonrails.org/v7.0.4.2/classes/ActiveRecord/QueryMethods.html#method-i-includes
https://api.rubyonrails.org/v7.0.4.2/classes/ActiveRecord/QueryMethods.html#method-i-references

People who aren't aware of #references will have to start using it.
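
To illustrate why the string clause goes unnoticed, here is a simplified model of how referenced names become known. It is an assumption-laden sketch in plain Ruby, not Active Record's implementation: `referenced_tables` is a hypothetical helper that reads names from hash conditions only and treats SQL strings as opaque.

```ruby
# Hash conditions with nested hashes name their table; raw SQL strings do not.
# Explicit references (the role #references plays) are added on top.
def referenced_tables(conditions, explicit_references = [])
  from_hashes = conditions.grep(Hash).flat_map do |condition|
    condition.select { |_key, value| value.is_a?(Hash) }.keys.map(&:to_s)
  end
  (from_hashes + explicit_references.map(&:to_s)).uniq
end

conditions = [{ company: { title: "The Office" } }, "books.title = ?"]

p referenced_tables(conditions)            # the string clause adds nothing
p referenced_tables(conditions, [:books])  # books becomes visible
```

In terms of the example above, the fix on the user's side would be to append `.references(:books)` to the chain so that `books` stays in the JOIN.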

Known limitations

  1. Includes may be ambiguous - referencing a table just once is enough to disable the slim optimization for every includes tree that contains it.
    Look at this example:
Author.includes(company: :authors, publishers: :authors).where(company: { authors: { name: "J. R. R. Tolkien" } })

Instead of the expected separate preload for the publishers, they are included in a single JOIN. This is because we cannot currently distinguish the nesting level in the plain references_values.

Moreover, using references to numbered table aliases (with the digits at the end) disables all slim optimizations completely.
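
The ambiguity can be reproduced with a small plain-Ruby sketch. The helper names are hypothetical, the includes are assumed to be normalized into nested hashes, and the path-aware variant only shows what extra information would be needed; it is not something references_values provides.

```ruby
# Normalized form of includes(company: :authors, publishers: :authors)
INCLUDES = { company: { authors: {} }, publishers: { authors: {} } }.freeze

def mentions?(children, name)
  children.any? { |child, nested| child.to_s == name || mentions?(nested, name) }
end

# Flat references: every root whose tree mentions a referenced name matches.
def roots_touched_by(includes, references)
  includes.select do |root, nested|
    references.any? { |ref| root.to_s == ref || mentions?(nested, ref) }
  end.keys
end

# Hypothetical path-aware references: only the root of each path matches.
def roots_touched_by_paths(includes, paths)
  includes.keys.select { |root| paths.any? { |path| path.first == root } }
end

p roots_touched_by(INCLUDES, %w[company authors])
p roots_touched_by_paths(INCLUDES, [%i[company authors]])
```

With flat names both trees are marked for the JOIN, because `authors` appears under `company` and under `publishers`; with paths only `company` would be.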

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

@ardecvz ardecvz force-pushed the feat/slim-eager-load branch from 5642b37 to 96fd8d4 Compare March 13, 2023 19:29
The goal of the commit is to make #includes slimmer -
join only the tables required to satisfy the references and
preload all other associations independently.

Context

The Active Record #includes method results in different behaviour
depending on the other parts of the query:
it may execute multiple queries (one per included association) or
build a single JOIN query to load everything at once.

Before the commit, it was "all or nothing" behind an eager load flag -
one referenced table was enough to transform the whole query into one big JOIN.

However, it's not always necessary to join tables
other than the ones with conditions.
Normally, we can avoid joining everything (which can be expensive)
and execute a few extra queries instead.

Idea

The main idea is to remove non-referenced tables
(those not mentioned via #references, #where, #order,
or carried over by #and, #or, #merge)
from the final JOIN query.
So we LEFT OUTER JOIN only the included tables with conditions in the query.
At the same time,
we have to move the non-referenced tables back to separate preload queries.

Backward compatibility

1) Includes and table names don't always match so we use reflections.

We try to match all possible includes table names against references
including inner and left outer joins.
If a match is unreliable (for example, no reflection for the association),
we fall back to JOINing the associations so as not to break user applications.

All current tests are green without changes to them.

2) The commit certainly breaks user code with the plain string where or select:
```
Author.includes(:company, :books).where(company: { title: "The Office" })
                                 .where("books.title = ?", "Design Patterns")
                                 .select("books.title")
```
We do not want to introduce any flags or scopes to support this behaviour
because the docs already state that string clauses must not be used without references:
https://api.rubyonrails.org/v7.0.4.2/classes/ActiveRecord/QueryMethods.html#method-i-includes
https://api.rubyonrails.org/v7.0.4.2/classes/ActiveRecord/QueryMethods.html#method-i-references

People who aren't aware of #references will have to start using it.

Future improvements

1) Includes are not always plain - they may contain several levels inside a hash.

In the commit, we handle several includes
as separate trees starting from the top-level root.
If there's any reference to a possible includes table,
we mark the whole tree as needing to be JOINed.

It's a safe bet which could be improved in the future
with a more fine-grained reference marking algorithm.

2) The commit's tree traversal has mediocre performance at best.
Several intermediate structures and tree walks could be removed -
we prefer a simple algorithm for now to show the idea.
@ardecvz ardecvz force-pushed the feat/slim-eager-load branch from 96fd8d4 to 44372c2 Compare March 13, 2023 19:45

@palkan palkan left a comment

Thanks!

Looks more sophisticated than I expected, but it seems to work. I left a few questions; let's discuss them and see how we can improve the solution.

@@ -539,7 +539,7 @@ def test_nested_has_many_through_with_conditions_on_through_associations
   def test_nested_has_many_through_with_conditions_on_through_associations_preload
     assert_empty Author.where("tags.id" => 100).joins(:misc_post_first_blue_tags)

-    author = assert_queries(2) { Author.includes(:misc_post_first_blue_tags).third }
+    author = assert_queries(3) { Author.includes(:misc_post_first_blue_tags).third }

Hm, this change looks suspicious; ideally, no existing tests should have changed. What's the source of the additional query?

Owner Author

I'm almost sure that it's a natural change - we're moving JOINs to separate preloads after all.
Before:

  FROM
  "posts"
  LEFT OUTER JOIN "taggings" ON "taggings"."taggable_type" = ?
  AND "taggings"."comment" = ?
  AND "taggings"."taggable_id" = "posts"."id"
  LEFT OUTER JOIN "tags" ON "tags"."name" = ?
  AND "tags"."id" = "taggings"."tag_id"
  LEFT OUTER JOIN "taggings" "taggings_tags_2" ON "taggings_tags_2"."tag_id" = "tags"."id"

After:

  -- The first query
  FROM
  "posts"
  LEFT OUTER JOIN "taggings" ON "taggings"."taggable_type" = ?
  AND "taggings"."comment" = ?
  AND "taggings"."taggable_id" = "posts"."id"
  LEFT OUTER JOIN "tags" ON "tags"."name" = ?
  AND "tags"."id" = "taggings"."tag_id"

  -- The second query
  FROM
  "taggings"
  LEFT OUTER JOIN "tags" ON "tags"."id" = "taggings"."tag_id"
  LEFT OUTER JOIN "taggings" "taggings_tags" ON "taggings_tags"."tag_id" = "tags"."id"

So we're now preloading the second taggings join with a separate query instead of eager loading it.
The test didn't need this change before because, without the new tree traversal, we chose to JOIN more often (but we also weren't able to optimize advanced cases like this one).
By the way, we don't see many failures like this in the current Rails tests because few tests combine includes, no further where usage, and a query-count check.

Owner Author

@ardecvz ardecvz Mar 16, 2023

However, it's good that you've highlighted this - it's strange that the second query JOINs taggings to taggings.
It comes from the last includes in a long chain of associations:

class Author < ActiveRecord::Base
  has_many :misc_post_first_blue_tags, through: :misc_posts, source: :first_blue_tags
  has_many :misc_posts, -> { where(posts: { title: ["misc post by bob", "misc post by mary"] }) }, class_name: "Post"
  # ...
end
class Post < ActiveRecord::Base
  has_many :first_blue_tags, -> { where tags: { name: "Blue" } }, through: :first_taggings, source: :tag
  has_many :first_taggings, -> { where taggings: { comment: "first" } }, as: :taggable, class_name: "Tagging"
  # ...
end
class Tagging < ActiveRecord::Base
  belongs_to :tag, -> { includes(:tagging) }
  # ...
end

but it's still strange - I'm investigating. It may be caused by the known limitations with ambiguous table names in includes and references.

Owner Author

Small update - if we remove includes from the last belongs_to, the problem goes away.
It's very strange, as we don't change how includes works - we only filter the current includes_values.
Right now, my plan is to reproduce this behaviour without our optimization. That would show it is a separate minor Rails issue.
