move node-level diff query to its own query and limit it #5528

ajtmccarty · 2025-01-20T23:22:44Z

IFC-1141

this is the first phase of a refactoring of the diff calculation query. the goals are to improve performance and to allow limiting the query so that it won't crash the database.

I did some local testing that updated an existing diff on a branch with 3700 new nodes. the query for calculating the changes on the branch went from ~5 minutes to 20 seconds, which was somewhat unexpected. the query to calculate the changes on the base branch still takes ~4 minutes on both the old and new versions, but I hope that once I complete the rest of the changes for these queries, we will see a similar performance improvement there.

also, removes a bunch of dead code

codspeed-hq · 2025-01-20T23:29:54Z

CodSpeed Performance Report

Merging #5528 will not alter performance

Comparing ajtm-01202025-limited-node-diff-query (ced94ac) with stable (5bf43e2)

Summary

✅ 10 untouched benchmarks

ajtmccarty

some comments to help with PR review

ajtmccarty · 2025-01-21T02:18:01Z

backend/infrahub/core/diff/calculator.py

+            has_more_data = False
+            if last_result:
+                has_more_data = last_result.get_as_type("has_more_data", bool)
+            offset += node_limit


some changes to use a limit and offset on this part of the diff calculation logic. the next step is to split the DiffAllPathsQuery up into 2 more queries and do the same kind of limiting/filtering on those too.

I know that these 2 additions repeat a lot of the same code, but I am going to clean that up in the next 2 PRs b/c I will be breaking the remaining DiffAllPathsQuery call up into 2 more queries that will need to be called in the same way
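
for reviewers, the limit/offset loop in the hunk above can be sketched in isolation roughly like this. it is a minimal sketch, not the code in calculator.py: `fetch_all_pages` and `make_fake_query` are hypothetical names, and the fake query stands in for the real database call. the only parts taken from the diff are the `has_more_data` flag read from the last result and the `offset += limit` step.

```python
from typing import Callable

Row = dict[str, object]


def fetch_all_pages(
    run_query: Callable[[int, int], list[Row]],
    limit: int,
) -> list[Row]:
    """Call run_query(limit, offset) until a page reports that no more data is left."""
    all_rows: list[Row] = []
    offset = 0
    has_more_data = True
    while has_more_data:
        rows = run_query(limit, offset)
        all_rows.extend(rows)
        # every row in a page carries the same flag, so the last row is enough;
        # an empty page has no row to read the flag from and ends the loop
        has_more_data = bool(rows[-1]["has_more_data"]) if rows else False
        offset += limit
    return all_rows


def make_fake_query(total: int) -> Callable[[int, int], list[Row]]:
    """Hypothetical stand-in for the database: serves `total` numbered rows."""

    def run_query(limit: int, offset: int) -> list[Row]:
        page = list(range(total))[offset : offset + limit]
        return [{"path": i, "has_more_data": len(page) == limit} for i in page]

    return run_query


paths = [row["path"] for row in fetch_all_pages(make_fake_query(25), limit=10)]
```

with 25 rows and a limit of 10, the loop runs three queries and stops after the short third page.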

ajtmccarty · 2025-01-21T02:18:35Z

backend/infrahub/core/query/diff.py

-            and result.get("r").type == prop_type
-        ]
-
-        return sort_results_by_time(results, rel_label="r")


all of the above deleted code is dead. leftovers from the old diff logic

ajtmccarty · 2025-01-21T02:19:52Z

backend/infrahub/core/query/diff.py

@@ -580,82 +156,6 @@ async def query_init(self, db: InfrahubDatabase, **kwargs: Any) -> None:
    RETURN uuids AS node_ids_list, field_names AS field_names_list
 }
 CALL {
-    WITH node_field_specifiers_list, node_ids_list, from_time


I moved this part of the diff calculation query over into DiffNodePathsQuery and made some adjustments to it. basically, this is the part of the diff calculation query that looks at added/deleted nodes.

ajtmccarty · 2025-01-21T02:21:55Z

backend/infrahub/core/diff/calculator.py

@@ -36,6 +37,31 @@ async def calculate_diff(
            to_time=to_time,
            previous_node_field_specifiers=previous_node_specifiers,
        )
+        node_limit = int(config.SETTINGS.database.query_size_limit / 10)


I divide the limit by 10 as kind of a guess b/c the below query will return multiple paths for each added/deleted node and we don't really know how many paths will be returned for each node
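
to make the guess concrete, here is a small arithmetic sketch. the value 5000 for `query_size_limit` and the paths-per-node figures are made up for illustration, and `node_limit_for` / `worst_case_rows` are hypothetical helpers, not functions in the codebase.

```python
def node_limit_for(query_size_limit: int, divisor: int = 10) -> int:
    # mirrors int(config.SETTINGS.database.query_size_limit / 10) from the hunk above
    return int(query_size_limit / divisor)


def worst_case_rows(node_limit: int, paths_per_node: int) -> int:
    # rows returned by one page if every node in the page yields this many paths
    return node_limit * paths_per_node


query_size_limit = 5000  # hypothetical setting value
node_limit = node_limit_for(query_size_limit)

rows_if_8_paths = worst_case_rows(node_limit, 8)
rows_if_25_paths = worst_case_rows(node_limit, 25)
```

under these made-up numbers a page of 500 nodes stays below the size limit at 8 paths per node but exceeds it at 25, which is why the divisor is only a heuristic.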

ajtmccarty · 2025-01-21T02:23:03Z

backend/infrahub/core/query/diff.py

@@ -929,3 +429,243 @@ async def query_init(self, db: InfrahubDatabase, **kwargs: Any) -> None:
        """ % {"id_func": db.get_id_function_name()}
        self.add_to_query(query)
        self.return_labels = ["DISTINCT diff_path AS diff_path"]
+
+
+class DiffCalculationQuery(DiffQuery):


new base class that the new DiffNodePathsQuery uses. I am going to split the DiffAllPathsQuery up into 2 separate queries that will also use this same base class

ajtmccarty · 2025-01-21T02:24:12Z

backend/infrahub/core/query/diff.py

+    RETURN latest_base_path
+}
+    """
+    relationship_peer_side_query = """


these two queries are copied almost exactly from the current DiffAllPathsQuery
the has_more_data variable is the only addition and that is used to indicate if the query needs to be run again after incrementing the offset

ajtmccarty · 2025-01-21T02:27:09Z

backend/infrahub/core/query/diff.py

+        ($from_time <= diff_rel.from < $to_time)
+        OR ($from_time <= diff_rel.to < $to_time)
+    )
+)


this is a different method for querying for nodes that we care about that seems to be faster.
the old approach was to basically run the same query twice: once for nodes already in the diff and once for nodes added in this diff. this was required b/c we use a different from_time for existing vs new nodes.

this new approach is to run the query once, but use this filter at the beginning to determine which from_time to use and pass it along with the node
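
the single-pass idea can be sketched in Python as follows. this only illustrates the shape of the approach, not the Cypher in the PR: `assign_from_times` is a hypothetical helper, and which time applies to new vs existing nodes (`branch_from_time` vs `from_time` here) is an assumption made for the example.

```python
from datetime import datetime, timezone


def assign_from_times(
    node_ids: list[str],
    new_node_ids: set[str],
    from_time: datetime,
    branch_from_time: datetime,
) -> list[tuple[str, datetime]]:
    """Pair each node with the start of the time window to use for it."""
    return [
        # assumption: nodes new to this diff use branch_from_time, existing ones use from_time
        (node_id, branch_from_time if node_id in new_node_ids else from_time)
        for node_id in node_ids
    ]


branch_from_time = datetime(2025, 1, 1, tzinfo=timezone.utc)
from_time = datetime(2025, 1, 15, tzinfo=timezone.utc)
rows = assign_from_times(["a", "b", "c"], {"b"}, from_time, branch_from_time)
```

each node is visited once and carries its own start time forward, instead of the whole query being run once per group of nodes.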

ajtmccarty · 2025-01-21T02:27:55Z

backend/infrahub/core/query/diff.py

+WITH collect([p, q, diff_rel, row_from_time]) AS limited_results
+WITH limited_results, size(limited_results) = $limit AS has_more_data
+UNWIND limited_results AS one_result
+WITH one_result[0] AS p, one_result[1] AS q, one_result[2] AS diff_rel, one_result[3] AS row_from_time, has_more_data


this piece is new to check if there is more data beyond our offset + limit. I don't know if there is a better way to do this, but it seems reasonable
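
a Python analogue of the check, as a sketch only. `page_with_flag` mimics `size(limited_results) = $limit` on an in-memory list. `page_with_lookahead` is a common alternative that is not what this PR does: it fetches one extra row to see whether anything follows. both function names are hypothetical.

```python
def page_with_flag(rows: list[int], offset: int, limit: int) -> tuple[list[int], bool]:
    """Report more data whenever the page is full, as the query in this PR does."""
    page = rows[offset : offset + limit]
    return page, len(page) == limit


def page_with_lookahead(rows: list[int], offset: int, limit: int) -> tuple[list[int], bool]:
    """Alternative: read limit + 1 rows and report more data only if the extra row exists."""
    window = rows[offset : offset + limit + 1]
    return window[:limit], len(window) > limit


data = list(range(20))
full_last_page = page_with_flag(data, offset=10, limit=10)
lookahead_last_page = page_with_lookahead(data, offset=10, limit=10)
empty_page = page_with_flag(data, offset=20, limit=10)
```

the difference only shows when the total is an exact multiple of the limit: the full-page check reports more data on the last page and costs one extra query that returns nothing, while the lookahead variant stops one query earlier.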

ajtmccarty · 2025-01-21T02:28:30Z

backend/infrahub/core/query/diff.py

+WITH limited_results, size(limited_results) = $limit AS has_more_data
+UNWIND limited_results AS one_result
+WITH one_result[0] AS p, one_result[1] AS q, one_result[2] AS diff_rel, one_result[3] AS row_from_time, has_more_data
+// -------------------------------------


everything below here is taken almost exactly from the DiffAllPathsQuery

ajtmccarty · 2025-01-21T02:28:45Z

backend/tests/unit/core/test_query_diff.py

these were tests for the dead code deleted above

move node-level diff query to its own query and limit it

8265c16

github-actions bot added the group/backend label (Issue related to the backend: API Server, Git Agent) Jan 20, 2025

ajtmccarty added 2 commits January 20, 2025 16:22

fix variable name

d256198

remove dead code

ced94ac

ajtmccarty marked this pull request as ready for review January 21, 2025 02:28

ajtmccarty requested a review from a team January 21, 2025 02:29

ajtmccarty commented Jan 21, 2025


dgarros approved these changes Jan 21, 2025


ajtmccarty merged commit 2f19473 into stable Jan 21, 2025
36 checks passed

ajtmccarty deleted the ajtm-01202025-limited-node-diff-query branch January 21, 2025 17:24
