rm np_config.enable_numpy_behavior() #1093
base: main
Conversation
@@ -15,7 +12,6 @@
 from hls4ml.optimization.dsp_aware_pruning.keras.reduction import reduce_model
 from hls4ml.optimization.dsp_aware_pruning.scheduler import OptimizationScheduler

-np_config.enable_numpy_behavior()
Can you provide more details as to why this breaks the flow on some setups?
Some indexing and other behaviors change, with some undocumented side effects. For example, tf.reduce_all(a > b) is fine in the default mode but raises once this flag is set. This also causes some qkeras quantizers to return dicts, for reasons I have not tracked down to a direct cause.
Failed tests:
- qkeras po2 quantizer: fails on a condition using the x >= y pattern
- hgq activation: failure in jitted code, not yet located
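For illustration, a minimal sketch (not code from this PR) of why this call is a process-wide side effect. It uses the public tf.experimental.numpy API, which is assumed here to be equivalent to the np_config call the PR removes; once enabled, every tf.Tensor in the process, including tensors used by unrelated code such as the qkeras/hgq converters, follows the NumPy-style methods, indexing and dtype-promotion rules.

```python
import tensorflow as tf

x = tf.constant([[1, 2], [3, 4]])
try:
    x.reshape(-1)                      # not available under default TF behavior
except AttributeError:
    print('NumPy-style methods are absent before the switch')

# Global switch: affects all tf.Tensor objects in this process from now on.
tf.experimental.numpy.experimental_enable_numpy_behavior()
print(tf.constant([[1, 2], [3, 4]]).reshape(-1))  # NumPy-style method now works
```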
Where does this break occur? Is it with a newer version of NumPy or TensorFlow? The CI/CD on GitHub doesn't show this, so I am just wondering under what environment it occurs.
The GitLab CI tests are not affected because this line is not executed before those (qkeras po2 and hgq) tests: only the dsp-aware-pruning tests import the optimization API and trigger this line, and those don't run before the broken tests.
So what setup would cause this error? Importing both dsp_aware_pruning and hgq / po2 in the same script?
If some tests are run with numpy behavior enabled. Full failure pattern: https://gitlab.cern.ch/fastmachinelearning/hls4ml/-/pipelines/8406093 (numpy behavior enabled at hls4ml import)
I see. I have two comments on this, but in general I am okay with merging this PR as long as the loss is actually printed with tf.print. First, the error scenario seems a bit artificial: is it ever likely that a user is going to import both dsp_aware_pruning and hgq / po2 in the same module? Secondly, looking at the test logs, this seems like a TensorFlow issue introduced in some recent update? Looking at the docs for this function (https://www.tensorflow.org/api_docs/python/tf/experimental/numpy/experimental_enable_numpy_behavior), maybe the parameter dtype_conversion_mode can help solve this issue without changing the print statements.
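A hedged sketch of that suggestion follows. The 'legacy' mode string is an assumption for illustration only; check the linked TensorFlow docs for the accepted values and their exact semantics before relying on this.

```python
import tensorflow as tf

# Keep NumPy behavior enabled but choose the dtype conversion mode explicitly
# ('legacy' is an assumed example value; see the linked docs).
tf.experimental.numpy.experimental_enable_numpy_behavior(dtype_conversion_mode='legacy')
```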
Response to your comments:
- I don't think the errors are artificial -- if one runs the tests on a single node, this is what to expect. I spent at least half an hour debugging why some unrelated tests broke after some editing, only to find out that the test execution order had changed and things broke because of that.
- The issue is that global behavior is silently overridden just by importing a submodule of a package, and this should be avoided to the greatest extent possible. In this specific case it only breaks hgq/po2 conversion in hls4ml, but it could introduce undefined behavior elsewhere. I would suggest changing the print statements to use tf.print to avoid affecting anything outside (see the sketch below), or wrapping the flag in a context manager that restores the original state after exiting the module.

Regarding whether the losses are actually printed: while I believe they should be, I don't have the setup to validate it on my side.
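For reference, a minimal sketch (not from the PR) of the difference the tf.print suggestion hinges on: inside a tf.function, a plain Python print fires only while the function is being traced, whereas tf.print is embedded in the graph and runs on every call.

```python
import tensorflow as tf

@tf.function
def train_step(loss):
    # Plain Python print executes only while the function is being traced.
    print('python print: tracing')
    # tf.print is part of the graph and executes on every call.
    tf.print('tf.print loss:', loss)
    return loss

for step in range(3):
    train_step(tf.constant(float(step)))
# The Python print appears once; the tf.print line appears three times.
```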
@@ -121,7 +117,7 @@ def optimize_model(
     model.compile(optimizer, loss_fn, metrics=[validation_metric])
     baseline_performance = model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
     if verbose:
-        print(f'Baseline performance on validation set: {baseline_performance}')
+        tf.print(f'Baseline performance on validation set: {baseline_performance}')
Will this actually print the statement, or will it "maybe print" depending on some external case?
In a tf execution environment (e.g., tf.function with jit) it is supposed to print, and it has worked for me in other cases. I did not have the setup locally to run this part specifically, so it is not tested here.
The NumPy flag is only required for printing the loss tensors during training. Everything else should work fine with normal Python prints. However, there are no tests that run the full dsp-optimization flow and train a model with sparsity, so we need to make sure that the behaviour still stays the same and that the loss tensors are printed.
I just ran this code - removing the NumPy behaviour doesn't break prints, it breaks the line above.
So if we remove this import we need to find a way to print the loss during training.
Use …
That's a likely solution, but in the code we don't actually call …
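As one generic option for surfacing the loss during training without the NumPy flag, here is a hedged sketch using a Keras callback. It assumes a standard model.fit() loop, which may not match how the dsp-aware-pruning code actually trains the model, and the names here are illustrative only.

```python
import tensorflow as tf

class LossPrinter(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # logs['loss'] is a plain Python float at this point, so no NumPy
        # behavior flag is needed to format it.
        print(f"epoch {epoch}: loss = {logs.get('loss')}")

# Illustrative usage:
# model.fit(train_dataset, epochs=5, callbacks=[LossPrinter()], verbose=0)
```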
Description
Fixes #1092, while trying to keep the original printing behavior.