Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move node-level diff query to its own query and limit it #5528

Merged
merged 3 commits into from
Jan 21, 2025

Conversation

ajtmccarty
Copy link
Contributor

@ajtmccarty ajtmccarty commented Jan 20, 2025

IFC-1141

this is the first phase of some refactoring that I am working on for the diff calculation query to both improve performance and allow limiting the query so that it won't crash the database.

I did some local testing to update an existing diff on a branch with 3700 new nodes and found that the query for calculating the changes on the branch went from taking ~5 minutes to 20 seconds, which was somewhat unexpected. the query to calculate the changes on the base branch, still takes ~4 minutes on the old and new version, but I hope that once I complete the rest of the changes for these queries, we will see a similar performance improvement.

also, removes a bunch of dead code

@github-actions github-actions bot added the group/backend Issue related to the backend (API Server, Git Agent) label Jan 20, 2025
Copy link

codspeed-hq bot commented Jan 20, 2025

CodSpeed Performance Report

Merging #5528 will not alter performance

Comparing ajtm-01202025-limited-node-diff-query (ced94ac) with stable (5bf43e2)

Summary

✅ 10 untouched benchmarks

@ajtmccarty ajtmccarty marked this pull request as ready for review January 21, 2025 02:28
@ajtmccarty ajtmccarty requested a review from a team January 21, 2025 02:29
Copy link
Contributor Author

@ajtmccarty ajtmccarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments to help with PR review

has_more_data = False
if last_result:
has_more_data = last_result.get_as_type("has_more_data", bool)
offset += node_limit
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some changes to use a limit and offset on this part of the diff calculation logic. the next step is to split the DiffAllPathsQuery up into 2 more queries and do the same kind of limiting/filtering on those too.

I know that these 2 additions repeat a lot of the same code, but I am going the clean that up in the next 2 PRs b/c I will be breaking the remaining DiffAllPathsQuery call up into 2 more queries that will need to be called in the same way

and result.get("r").type == prop_type
]

return sort_results_by_time(results, rel_label="r")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of the above deleted code is dead. leftovers from the old diff logic

@@ -580,82 +156,6 @@ async def query_init(self, db: InfrahubDatabase, **kwargs: Any) -> None:
RETURN uuids AS node_ids_list, field_names AS field_names_list
}
CALL {
WITH node_field_specifiers_list, node_ids_list, from_time
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this part of the diff calculation query over into DiffNodePathsQuery and made some adjustments to it. basically, this is the part of the diff calculation query that looks at added/deleted nodes.

@@ -36,6 +37,31 @@ async def calculate_diff(
to_time=to_time,
previous_node_field_specifiers=previous_node_specifiers,
)
node_limit = int(config.SETTINGS.database.query_size_limit / 10)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I divide the limit by 10 as kind of a guess b/c the below query will return multiple paths for each added/deleted node and we don't really know how many paths will be returned for each node

@@ -929,3 +429,243 @@ async def query_init(self, db: InfrahubDatabase, **kwargs: Any) -> None:
""" % {"id_func": db.get_id_function_name()}
self.add_to_query(query)
self.return_labels = ["DISTINCT diff_path AS diff_path"]


class DiffCalculationQuery(DiffQuery):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new base class that the new DiffNodePathsQuery uses. I am going to split the DiffAllPathsQuery up into 2 separate queries that will also use this same base class

RETURN latest_base_path
}
"""
relationship_peer_side_query = """
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two queries are copied almost exactly from the current DiffAllPathsQuery
the has_more_data variable is the only addition and that is used to indicate if the query needs to be run again after incrementing the offset

($from_time <= diff_rel.from < $to_time)
OR ($from_time <= diff_rel.to < $to_time)
)
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a different method for querying for nodes that we care about that seems to be faster.
the old approach was to basically run the same query twice: once for nodes already in the diff and once for nodes added in this diff. this was required b/c we use a different from_time for existing vs new nodes.

this new approach is to run the query once, but use this filter at the beginning to determine which from_time to use and pass it along with the node

WITH collect([p, q, diff_rel, row_from_time]) AS limited_results
WITH limited_results, size(limited_results) = $limit AS has_more_data
UNWIND limited_results AS one_result
WITH one_result[0] AS p, one_result[1] AS q, one_result[2] AS diff_rel, one_result[3] AS row_from_time, has_more_data
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this piece is new to check if there is more data beyond our offset + limit. I don't know if there is a better way to do this, but it seems reasonable

WITH limited_results, size(limited_results) = $limit AS has_more_data
UNWIND limited_results AS one_result
WITH one_result[0] AS p, one_result[1] AS q, one_result[2] AS diff_rel, one_result[3] AS row_from_time, has_more_data
// -------------------------------------
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything below here is taken almost exactly from the DiffAllPathsQuery

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these were tests for the dead code deleted above

@ajtmccarty ajtmccarty merged commit 2f19473 into stable Jan 21, 2025
36 checks passed
@ajtmccarty ajtmccarty deleted the ajtm-01202025-limited-node-diff-query branch January 21, 2025 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
group/backend Issue related to the backend (API Server, Git Agent)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants