-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added more detailed error messages for KNN model training #2378
Conversation
a9c0dbd
to
ef0e7e5
Compare
@@ -134,6 +138,30 @@ protected void getTrainingIndexSizeInKB(TrainingModelRequest trainingModelReques | |||
trainingVectors = trainingModelRequest.getMaximumVectorCount(); | |||
} | |||
|
|||
long minTrainingVectorCount = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make the 1000 value a class level constant?
|
||
if (trainingVectors < minTrainingVectorCount) { | ||
ValidationException exception = new ValidationException(); | ||
exception.addValidationError("Number of training points should be greater than " + minTrainingVectorCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use String.format for concatenation
if (knnMethodContext.getMethodComponentContext().getParameters().containsKey(ENCODER_PARAMETER_PQ_M) | ||
&& knnMethodConfigContext.getDimension() % (Integer) knnMethodContext.getMethodComponentContext() | ||
.getParameters() | ||
.get(ENCODER_PARAMETER_PQ_M) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if that parameter is always present or not, but if it's optional then this line can generate the runtime exception in case parameter is not present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe Java has short-circuit evaluation, so if containsKey(ENCODER_PARAMETER_PQ_M) returns false then the second expression will not be evaluated. So a runtime exception shouldn't be thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @anntians. I think this is a good first step. What we should do next is move all of the specific checks around the parameters behind the engine method abstraction. See https://github.com/opensearch-project/k-NN/tree/main/src/main/java/org/opensearch/knn/index/engine.
Here is my idea for it: add in a method in KNNLibraryIndexingContext called something like getTrainingConfigValidationSetup() that returns a function that takes as input the number of training vectors (or a more general object) and performs some kind of validation.
Then, we can use this to hide the method specific validations inside the engine abstraction, which will be clean and maintainable. For instance, we can implement checks for IVFPQ in https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/FaissIVFPQEncoder.java, etc.
@@ -283,6 +285,15 @@ public ActionRequestValidationException validate() { | |||
exception.addValidationError("Description exceeds limit of " + KNNConstants.MAX_MODEL_DESCRIPTION_LENGTH + " characters"); | |||
} | |||
|
|||
// Check if ENCODER_PARAMETER_PQ_M is divisible by vector dimension | |||
if (knnMethodContext.getMethodComponentContext().getParameters().containsKey(ENCODER_PARAMETER_PQ_M) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove these checks here now, correct?
@@ -52,4 +57,33 @@ public PerDimensionValidator getPerDimensionValidator() { | |||
public PerDimensionProcessor getPerDimensionProcessor() { | |||
return perDimensionProcessor; | |||
} | |||
|
|||
@Override | |||
public BiFunction<Long, KNNMethodContext, TrainingConfigValidationOutput> getTrainingConfigValidationSetup() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the right direction. However, the function that is returned should be passed in via the builder and be KNN method specific. See https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/AbstractFaissMethod.java#L68-L90 as an example.
The main purpose of this is to not expose parameters outside of the method class unless explicitly necessary.
e61b5c7
to
1da10cf
Compare
Signed-off-by: AnnTian Shao <[email protected]>
…tion Signed-off-by: AnnTian Shao <[email protected]>
1da10cf
to
bee8983
Compare
bee8983
to
cd1b553
Compare
Signed-off-by: AnnTian Shao <[email protected]>
cd1b553
to
3b87975
Compare
@@ -108,6 +114,55 @@ protected PerDimensionProcessor doGetPerDimensionProcessor( | |||
return PerDimensionProcessor.NOOP_PROCESSOR; | |||
} | |||
|
|||
protected Function<TrainingConfigValidationInput, TrainingConfigValidationOutput> doGetTrainingConfigValidationSetup() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in the right direction, but ideally we want this to be handled per algo type. For instance, pq validation should happen in https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/AbstractFaissPQEncoder.java. IVF validation should happen in https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/FaissIVFMethod.java, etc.
If you think this change wouldnt be possible to make in time for code freeze (Monday), I think we shouldnt block this PR. However, this last refactor should be taken up as a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, will merge this PR and create a separate PR for followup. Thanks
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-main main
# Navigate to the new working tree
cd .worktrees/backport-main
# Create a new branch
git switch --create backport/backport-2378-to-main
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 4058a53c53d6d4ec397b8bd8b6c079f1b584dfc0
# Push it to GitHub
git push --set-upstream origin backport/backport-2378-to-main
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-main Then, create a pull request where the |
…-project#2378) * Add more detailed error messages for KNN model training Signed-off-by: AnnTian Shao <[email protected]> * Add validation check for training parameters in engine method abstraction Signed-off-by: AnnTian Shao <[email protected]> * Fixes for bwc and IT tests Signed-off-by: AnnTian Shao <[email protected]> --------- Signed-off-by: AnnTian Shao <[email protected]> Co-authored-by: AnnTian Shao <[email protected]> (cherry picked from commit 4058a53)
…-project#2378) * Add more detailed error messages for KNN model training Signed-off-by: AnnTian Shao <[email protected]> * Add validation check for training parameters in engine method abstraction Signed-off-by: AnnTian Shao <[email protected]> * Fixes for bwc and IT tests Signed-off-by: AnnTian Shao <[email protected]> --------- Signed-off-by: AnnTian Shao <[email protected]> Co-authored-by: AnnTian Shao <[email protected]> (cherry picked from commit 4058a53)
Description
Previously, a consistent feedback we get around PQ and IVF is that there is limited visibility into the failure cases. Part of this is because the errors are thrown on the Faiss side and we don't return stack traces in Rest response. So, this makes it difficult to use PQ and IVF. Thus, this PR provides improved error messages by adding explicit checks for the most common errors:
[ ] For PQ, explicitly check in OpenSearch an invalid configuration where m does not divide dimension
[ ] For PQ/IVF, check the number of training points matches the minimum clustering criteria defined in faiss
[ ] If there is not enough memory, explicitly say that there is not enough memory.
Adding these 3 checks will cover 90% of the training failures that occur.
Related Issues
Resolves #2268
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.