Add sanity tests for parallel instance loading #6126
Just for my understanding: the goal of this test is to check whether the included backends are capable of loading model instances concurrently?
The goal is to check that nothing crashes and burns internally when trying to load instances in parallel for the supported backends. All it does is load N instances of various kinds, ensure that exactly N instances are loaded (via sequence batch slot logic), and complete a successful inference (no exception is raised on the N inferences).
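For context, a minimal sketch of that kind of check using the Python HTTP client is below. The model name, input tensor, and sequence-batcher configuration are assumptions for illustration, not the PR's actual test code:

```python
# Sketch only: assumes a sequence-batching model named "identity_model" with a
# single FP32 input "INPUT0" of shape [1] and N instances (one sequence slot each).
import concurrent.futures

import numpy as np
import tritonclient.http as httpclient

N = 5  # expected instance count (assumption for this sketch)
client = httpclient.InferenceServerClient(url="localhost:8000")
assert client.is_model_ready("identity_model")

def run_sequence(seq_id):
    # Each live sequence occupies one sequence-batcher slot, i.e. one instance.
    inp = httpclient.InferInput("INPUT0", [1], "FP32")
    inp.set_data_from_numpy(np.array([seq_id], dtype=np.float32))
    return client.infer(
        "identity_model",
        inputs=[inp],
        sequence_id=seq_id,
        sequence_start=True,
        sequence_end=True,
    )

# Fire N sequences concurrently; all of them should complete without raising
# if N instances were actually loaded and are able to serve inference.
with concurrent.futures.ThreadPoolExecutor(max_workers=N) as pool:
    results = list(pool.map(run_sequence, range(1, N + 1)))
assert len(results) == N
```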
Open to ideas if you think there's a better way to test
I think the test case does not belong here, because `L0_lifecycle` is for testing the correctness of the Triton core, while the correctness of the backend implementations belongs in each backend's own tests. Since the test is similar for all backends, maybe a new `L0_backend_thread_safety` (or something similar) can be added for this test?
@GuanLuo @tanmayv25, correct me if you think otherwise.
@kthui do you have any feedback on the test itself other than how it is named? Or do you think it is sufficient for now?
Also @Tabrizian @tanmayv25 any feedback?
I was wondering how this is testing the parallel instance feature. I see that it tests whether all the instances are ready or not, but I don't see anything in this specific case that tests parallel instance loading. I believe this test case would have passed before this feature too. Please correct me if I'm missing anything.
@Tabrizian you are correct that this subtest doesn't explicitly verify the parallelism in terms of speedup; I did that in the other subtest, as it is easy to do for a Python model, similar to the existing parallel model loading tests.
This subtest was written as more of a sanity check that, when loading a model with a significant instance count (N=5, though I can probably increase this), (1) nothing crashes and burns internally, since higher instance counts would be more likely to reveal a race condition, segfault, hang, etc. if there were a bug, and (2) exactly N instances (not more, not fewer, due to any possible race condition) actually get loaded and can handle inference.
Upon reflection though, I can probably modify this test to do a serial load (via env var) and then a parallel load, and just check that the parallel load is faster at all. Initially I was afraid this check would introduce flakiness, but ideally it should consistently be faster on the types of CI systems we have with higher core counts.
Let me know if that makes sense, or if you have other ideas.
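For illustration, a rough sketch of that serial-vs-parallel timing comparison follows. The environment variable name, model repository path, and `delay_model` are placeholders, not actual knobs or models from this PR:

```python
# Sketch only: compares load time with parallel instance loading hypothetically
# disabled (via a placeholder env var) against the default parallel path.
# "PARALLEL_INSTANCE_LOADING" and "delay_model" are placeholders, not real knobs.
import os
import subprocess
import time

import tritonclient.http as httpclient

def timed_load(parallel_enabled):
    env = dict(os.environ, PARALLEL_INSTANCE_LOADING=str(int(parallel_enabled)))
    server = subprocess.Popen(
        ["tritonserver", "--model-repository=/models",
         "--model-control-mode=explicit"],
        env=env,
    )
    try:
        client = httpclient.InferenceServerClient(url="localhost:8000")
        # Wait for the server endpoint to come up before timing the load.
        for _ in range(120):
            try:
                if client.is_server_live():
                    break
            except Exception:
                pass
            time.sleep(0.5)
        start = time.time()
        client.load_model("delay_model")  # instances sleep during initialization
        return time.time() - start
    finally:
        server.terminate()
        server.wait()

serial_time = timed_load(parallel_enabled=False)
parallel_time = timed_load(parallel_enabled=True)
# With instances that sleep while initializing, the parallel load should be
# measurably faster than the serial load on a multi-core CI machine.
assert parallel_time < serial_time
```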
Does the server become not ready when a model is unloaded? I was wondering whether we should make `unload_model` blocking. I think it is really confusing that it is async (cc @kthui @GuanLuo). Also, even if it stays async, we should probably update this example to better demonstrate the usage: https://github.com/triton-inference-server/client/blob/f7c45d382639f31a5162666034fae7852681f082/src/python/examples/simple_grpc_model_control.py#L98-L101
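For reference, one possible pattern for that example would be to poll readiness after the async call; this is only a sketch of that idea, not an official recommendation from the client library:

```python
# Sketch only: poll model readiness after the async unload_model call so the
# caller has some signal of when the model stops being served. Note that
# readiness can flip to false before server-side cleanup is fully complete,
# which is part of the confusion discussed in this thread.
import time

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.unload_model("simple")  # returns before the unload has completed

deadline = time.time() + 30
while client.is_model_ready("simple"):
    if time.time() > deadline:
        raise TimeoutError("model 'simple' did not become unready in time")
    time.sleep(0.5)
```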
Server ready is true when all models are ready for inferencing, so while a model is in the unloading state, the server is considered not ready.
I have been thinking about this recently after seeing Jacky's model update tests requiring polling, Kris's BLS test unload issues, and now this test as well. I second that it's definitely confusing and unintuitive.
I do believe we document in most places that it returns without waiting for the model to finish unloading. However, I still think it is unintuitive and not symmetric with `load_model`, which is blocking.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: I understand there's good reason for an async unload, to allow in-flight requests to finish without a long blocking wait, but it would be nice to have an alternative or an optional blocking call to support both.
Couldn't the client also invoke the blocking call asynchronously if it needs to do other work while waiting?
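To illustrate the point, a small sketch of how a client could wrap a (hypothetically) blocking `unload_model` in an executor and still do other work while it runs:

```python
# Sketch only: if unload_model were blocking (hypothetical), a client could
# still run it in a worker thread and keep doing other work in the meantime.
from concurrent.futures import ThreadPoolExecutor

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(client.unload_model, "simple")
    # ... the client is free to do other work here while the unload runs ...
    future.result()  # join once the (hypothetically blocking) unload returns
```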
My reasoning is that when `unload` returns, the model is no longer accessible from the client. Does it matter from the client's perspective whether the model is fully cleaned up on the server side when `unload` returns?