-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move node-level diff query to its own query and limit it #5528
Conversation
CodSpeed Performance ReportMerging #5528 will not alter performanceComparing Summary
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments to help with PR review
has_more_data = False | ||
if last_result: | ||
has_more_data = last_result.get_as_type("has_more_data", bool) | ||
offset += node_limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some changes to use a limit and offset on this part of the diff calculation logic. the next step is to split the DiffAllPathsQuery
up into 2 more queries and do the same kind of limiting/filtering on those too.
I know that these 2 additions repeat a lot of the same code, but I am going the clean that up in the next 2 PRs b/c I will be breaking the remaining DiffAllPathsQuery
call up into 2 more queries that will need to be called in the same way
and result.get("r").type == prop_type | ||
] | ||
|
||
return sort_results_by_time(results, rel_label="r") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of the above deleted code is dead. leftovers from the old diff logic
@@ -580,82 +156,6 @@ async def query_init(self, db: InfrahubDatabase, **kwargs: Any) -> None: | |||
RETURN uuids AS node_ids_list, field_names AS field_names_list | |||
} | |||
CALL { | |||
WITH node_field_specifiers_list, node_ids_list, from_time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this part of the diff calculation query over into DiffNodePathsQuery
and made some adjustments to it. basically, this is the part of the diff calculation query that looks at added/deleted nodes.
@@ -36,6 +37,31 @@ async def calculate_diff( | |||
to_time=to_time, | |||
previous_node_field_specifiers=previous_node_specifiers, | |||
) | |||
node_limit = int(config.SETTINGS.database.query_size_limit / 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I divide the limit by 10 as kind of a guess b/c the below query will return multiple paths for each added/deleted node and we don't really know how many paths will be returned for each node
@@ -929,3 +429,243 @@ async def query_init(self, db: InfrahubDatabase, **kwargs: Any) -> None: | |||
""" % {"id_func": db.get_id_function_name()} | |||
self.add_to_query(query) | |||
self.return_labels = ["DISTINCT diff_path AS diff_path"] | |||
|
|||
|
|||
class DiffCalculationQuery(DiffQuery): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new base class that the new DiffNodePathsQuery
uses. I am going to split the DiffAllPathsQuery
up into 2 separate queries that will also use this same base class
RETURN latest_base_path | ||
} | ||
""" | ||
relationship_peer_side_query = """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two queries are copied almost exactly from the current DiffAllPathsQuery
the has_more_data
variable is the only addition and that is used to indicate if the query needs to be run again after incrementing the offset
($from_time <= diff_rel.from < $to_time) | ||
OR ($from_time <= diff_rel.to < $to_time) | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a different method for querying for nodes that we care about that seems to be faster.
the old approach was to basically run the same query twice: once for nodes already in the diff and once for nodes added in this diff. this was required b/c we use a different from_time
for existing vs new nodes.
this new approach is to run the query once, but use this filter at the beginning to determine which from_time
to use and pass it along with the node
WITH collect([p, q, diff_rel, row_from_time]) AS limited_results | ||
WITH limited_results, size(limited_results) = $limit AS has_more_data | ||
UNWIND limited_results AS one_result | ||
WITH one_result[0] AS p, one_result[1] AS q, one_result[2] AS diff_rel, one_result[3] AS row_from_time, has_more_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this piece is new to check if there is more data beyond our offset + limit. I don't know if there is a better way to do this, but it seems reasonable
WITH limited_results, size(limited_results) = $limit AS has_more_data | ||
UNWIND limited_results AS one_result | ||
WITH one_result[0] AS p, one_result[1] AS q, one_result[2] AS diff_rel, one_result[3] AS row_from_time, has_more_data | ||
// ------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything below here is taken almost exactly from the DiffAllPathsQuery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these were tests for the dead code deleted above
IFC-1141
this is the first phase of some refactoring that I am working on for the diff calculation query to both improve performance and allow limiting the query so that it won't crash the database.
I did some local testing to update an existing diff on a branch with 3700 new nodes and found that the query for calculating the changes on the branch went from taking ~5 minutes to 20 seconds, which was somewhat unexpected. the query to calculate the changes on the base branch, still takes ~4 minutes on the old and new version, but I hope that once I complete the rest of the changes for these queries, we will see a similar performance improvement.
also, removes a bunch of dead code