rm np_config.enable_numpy_behavior() #1093

Open
wants to merge 1 commit into base: main

Conversation

Contributor

@calad0i calad0i commented Oct 26, 2024

Description

Fixes #1092, while trying to keep the original printing behavior.

@calad0i calad0i requested review from vloncar and bo3z October 26, 2024 02:34
@calad0i calad0i added the please test Trigger testing by creating local PR branch label Oct 26, 2024
@@ -15,7 +12,6 @@
 from hls4ml.optimization.dsp_aware_pruning.keras.reduction import reduce_model
 from hls4ml.optimization.dsp_aware_pruning.scheduler import OptimizationScheduler

-np_config.enable_numpy_behavior()
Contributor

Can you provide more details as to why this breaks the flow on some setups?

Contributor Author

@calad0i calad0i Oct 26, 2024

Some indexing and other behaviors change, with undocumented side effects: e.g., in default mode tf.reduce_all(a > b) is fine, but it raises once this flag is set (sketch below). This also causes some qkeras quantizers to return dicts for an unknown reason, and I did not track down the direct cause.
Failed tests:
- qkeras po2 quantizer: fails on an x >= y condition
- hgq activation: failure in jitted code, not located yet
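
A minimal sketch of the global side effect, assuming a recent TF 2.x (illustrative only, not taken from the failing tests themselves):

```python
import tensorflow as tf
from tensorflow.python.ops.numpy_ops import np_config

a = tf.constant([1.0, 2.0])

# Before the switch, tensors are plain EagerTensors: numpy-style methods
# such as .astype are not available.
# a.astype('float64')  # AttributeError

np_config.enable_numpy_behavior()  # process-wide, affects every later TF call

# Now numpy-style methods and indexing work on tensors ...
b = a.astype('float64')
print(b.dtype)  # float64

# ... but indexing and dtype-promotion semantics also change globally, which is
# what surfaces as the qkeras po2 / HGQ failures described above.
```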

Contributor

Where does this break occur? Is it with a newer version of NumPy or TensorFlow? The CI/CD on GitHub doesn't show this, so I am just wondering under what environment it occurs.

Contributor Author

The GitLab CI tests are not affected because this line is not executed before the broken (qkeras po2 and hgq) tests: only the dsp-aware-pruning tests import the optimization API and trigger it, and those do not run before the broken tests.

Contributor

So what set-up would cause this error? Importing both the dsp_aware_pruning and hgq / po2 in the same script?

Contributor Author

@calad0i calad0i Oct 28, 2024

Any setup where some tests are run with numpy behavior enabled. Full failure pattern: https://gitlab.cern.ch/fastmachinelearning/hls4ml/-/pipelines/8406093 (numpy behavior enabled at hls4ml import).

Contributor

I see. I have two comments on this, but in general I am okay with merging this PR as long as the loss is actually printed with tf.print. First, the error scenario seems a bit artificial: is a user ever likely to import both dsp_aware_pruning and hgq / po2 in the same module? Secondly, looking at the test logs, this seems to be a TensorFlow issue introduced in some recent update? Looking at the docs for this function, https://www.tensorflow.org/api_docs/python/tf/experimental/numpy/experimental_enable_numpy_behavior, maybe the dtype_conversion_mode parameter can solve this issue without changing the print statements.
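
For reference, a sketch of what tuning that parameter could look like, assuming a TF version recent enough to expose dtype_conversion_mode; untested against the failing cases in this thread:

```python
from tensorflow.python.ops.numpy_ops import np_config

# 'legacy' keeps the pre-existing TF dtype-promotion rules while still enabling
# the numpy-style methods; whether this avoids the qkeras po2 / HGQ breakage
# discussed above has not been checked here.
np_config.enable_numpy_behavior(dtype_conversion_mode='legacy')
```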

Contributor Author

@calad0i calad0i Oct 29, 2024

Response to your comments:

  1. I don't think the errors are artificial -- if one runs the tests on a single node, this is what to expect. I spent at least half an hour debugging why some unrelated tests broke after an edit, only to find out that the test execution order had changed and things broke because of it.
  2. The issue is that global behavior is silently overridden just by importing a submodule of the package, and this should be avoided to the greatest extent possible. In this specific case it only breaks hgq/po2 conversion in hls4ml, but it could introduce undefined behavior elsewhere. I would suggest changing the print statements to tf.print to avoid affecting anything outside (see the sketch after this comment), or packing the flag into a context manager that restores the original state when leaving the module.

Regarding whether losses are actually printed: I believe they will be, but I don't have the setup to validate it on my side.
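
A small sketch of the print vs. tf.print difference inside a tf.function, as an illustration of the suggestion above:

```python
import tensorflow as tf

@tf.function
def train_step(x):
    loss = tf.reduce_mean(tf.square(x))
    # The Python print runs once, at tracing time, and only shows a symbolic Tensor.
    print('traced:', loss)
    # tf.print is a graph op: it runs on every call and prints the concrete value.
    tf.print('runtime loss:', loss)
    return loss

train_step(tf.constant([1.0, 2.0, 3.0]))
```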

@@ -121,7 +117,7 @@ def optimize_model(
     model.compile(optimizer, loss_fn, metrics=[validation_metric])
     baseline_performance = model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
     if verbose:
-        print(f'Baseline performance on validation set: {baseline_performance}')
+        tf.print(f'Baseline performance on validation set: {baseline_performance}')
Contributor

Will this actually print the statement, or will it "maybe print" depending on some external case?

Contributor Author

In a TF execution environment (e.g., a tf.function with jit), it is supposed to print, and it has worked for me in other cases. I did not have the setup locally to run this part specifically, so it is not tested here.

Contributor

The NumPy flag is only required for printing the loss tensors during training. Everything else should work fine with normal Python prints. However, there are no tests that run the full dsp-optimization flow and train a model with sparsity, so we need to make sure that the behaviour still stays the same and that the loss tensors are printed.

@bo3z
Contributor

bo3z commented Oct 29, 2024

I just ran this code - removing the NumPy behaviour doesn't break the prints; it breaks the line above, avg_loss = round(epoch_loss_avg.result(), 3), in the file dsp_aware_pruning/keras/__init__.py (line 234). So this has to do with querying EagerTensors (which are not always evaluated until necessary). The error I get is:

AttributeError: EagerTensor object has no attribute 'astype'. 
        If you are looking for numpy-related methods, please run the following:
        from tensorflow.python.ops.numpy_ops import np_config
        np_config.enable_numpy_behavior()

So if we remove this import we need to find a way to print the loss during training.

@calad0i
Contributor Author

calad0i commented Oct 29, 2024


Using tf.cast(x) instead of x.astype would solve that.

@bo3z
Contributor

bo3z commented Oct 30, 2024

> Using tf.cast(x) instead of x.astype would solve that.

That's a likely solution, but in the code we don't actually call x.astype. It is called somewhere internally in TF when trying to get the value of the loss tensor (epoch_loss_avg.result()), so we need to see what the alternative is for getting the loss value into this function. Also, do we know if this buggy behaviour was introduced by an update to NumPy / TF / QKeras?
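
One possible direction, sketched here rather than tested against the dsp_aware_pruning flow: pull the metric result out as a plain Python scalar before rounding, so nothing ever reaches the EagerTensor numpy-methods path.

```python
import tensorflow as tf

epoch_loss_avg = tf.keras.metrics.Mean()
epoch_loss_avg.update_state([0.1234, 0.5678])

# round() on the raw EagerTensor is what reportedly ends up needing .astype
# internally; converting to a Python float first sidesteps enable_numpy_behavior().
avg_loss = round(float(epoch_loss_avg.result()), 3)
# equivalently: round(epoch_loss_avg.result().numpy().item(), 3)

tf.print('Average loss:', avg_loss)
```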

Successfully merging this pull request may close these issues:

hls4ml optimization/dsp_aware_pruning changing tensorflow global behavior