rm np_config.enable_numpy_behavior() #1093

Open
wants to merge 1 commit into base: main

Conversation

Contributor

@calad0i calad0i commented Oct 26, 2024

Description

Fixes #1092, while trying to keep the original printing behavior.

@calad0i calad0i requested review from vloncar and bo3z October 26, 2024 02:34
@calad0i calad0i added the please test Trigger testing by creating local PR branch label Oct 26, 2024
@@ -15,7 +12,6 @@
 from hls4ml.optimization.dsp_aware_pruning.keras.reduction import reduce_model
 from hls4ml.optimization.dsp_aware_pruning.scheduler import OptimizationScheduler

-np_config.enable_numpy_behavior()
Contributor

Can you provide more details as to why this breaks the flow on some setups?

Contributor Author

@calad0i calad0i Oct 26, 2024

Some indexing and other behaviors change, with undocumented side effects: e.g., in default mode tf.reduce_all(a > b) is fine, but it raises once this flag is set (sketch below). This also causes some qkeras quantizers to return dicts for an unknown reason, and I did not track down the direct cause.
Failed tests:
- qkeras po2 quantizer: fails on an x >= y condition
- hgq activation: failure in jitted code, not located yet
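
A minimal sketch of the global side effect, assuming a recent TF 2.x (illustrative only, not taken from the failing tests themselves):

```python
import tensorflow as tf
from tensorflow.python.ops.numpy_ops import np_config

a = tf.constant([1.0, 2.0])

# Before the switch, tensors are plain EagerTensors: numpy-style methods
# such as .astype are not available.
# a.astype('float64')  # AttributeError

np_config.enable_numpy_behavior()  # process-wide, affects every later TF call

# Now numpy-style methods and indexing work on tensors ...
b = a.astype('float64')
print(b.dtype)  # float64

# ... but indexing and dtype-promotion semantics also change globally, which is
# what surfaces as the qkeras po2 / HGQ failures described above.
```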

Contributor

Where does this break occur? Is it with a newer version of NumPy or TensorFlow? The CI/CD on GitHub doesn't show this, so I am just wondering under what environment it occurs.

Contributor Author

The GitLab CI tests are not affected because this line is not executed before the broken (qkeras po2 and hgq) tests: only the dsp-aware-pruning tests import the optimization API and trigger it, and those do not run before the broken tests.

Contributor

So what set-up would cause this error? Importing both the dsp_aware_pruning and hgq / po2 in the same script?

Contributor Author

@calad0i calad0i Oct 28, 2024

Any setup where some tests are run with numpy behavior enabled. Full failure pattern: https://gitlab.cern.ch/fastmachinelearning/hls4ml/-/pipelines/8406093 (numpy behavior enabled at hls4ml import).

Contributor

I see. I have two comments on this, but in general I am okay with merging this PR as long as the loss is actually printed with tf.print. First, the error scenario seems a bit artificial: is a user ever likely to import both dsp_aware_pruning and hgq / po2 in the same module? Secondly, looking at the test logs, this seems to be a TensorFlow issue introduced in some recent update? Looking at the docs for this function, https://www.tensorflow.org/api_docs/python/tf/experimental/numpy/experimental_enable_numpy_behavior, maybe the dtype_conversion_mode parameter can solve this issue without changing the print statements.
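
For reference, a sketch of what tuning that parameter could look like, assuming a TF version recent enough to expose dtype_conversion_mode; untested against the failing cases in this thread:

```python
from tensorflow.python.ops.numpy_ops import np_config

# 'legacy' keeps the pre-existing TF dtype-promotion rules while still enabling
# the numpy-style methods; whether this avoids the qkeras po2 / HGQ breakage
# discussed above has not been checked here.
np_config.enable_numpy_behavior(dtype_conversion_mode='legacy')
```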

Contributor Author

@calad0i calad0i Oct 29, 2024

Response to your comments:

  1. I don't think the errors are artificial -- if one runs the tests on a single node, this is what to expect. I spent at least half an hour debugging why some unrelated tests broke after an edit, only to find out that the test execution order had changed and things broke because of it.
  2. The issue is that global behavior is silently overridden just by importing a submodule of the package, and this should be avoided to the greatest extent possible. In this specific case it only breaks hgq/po2 conversion in hls4ml, but it could introduce undefined behavior elsewhere. I would suggest changing the print statements to tf.print to avoid affecting anything outside (see the sketch after this comment), or packing the flag into a context manager that restores the original state when leaving the module.

Regarding whether losses are actually printed: I believe they will be, but I don't have the setup to validate it on my side.
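
A small sketch of the print vs. tf.print difference inside a tf.function, as an illustration of the suggestion above:

```python
import tensorflow as tf

@tf.function
def train_step(x):
    loss = tf.reduce_mean(tf.square(x))
    # The Python print runs once, at tracing time, and only shows a symbolic Tensor.
    print('traced:', loss)
    # tf.print is a graph op: it runs on every call and prints the concrete value.
    tf.print('runtime loss:', loss)
    return loss

train_step(tf.constant([1.0, 2.0, 3.0]))
```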

@@ -121,7 +117,7 @@ def optimize_model(
     model.compile(optimizer, loss_fn, metrics=[validation_metric])
     baseline_performance = model.evaluate(validation_dataset, verbose=0, return_dict=False)[-1]
     if verbose:
-        print(f'Baseline performance on validation set: {baseline_performance}')
+        tf.print(f'Baseline performance on validation set: {baseline_performance}')
Contributor

Will this actually print the statement, or will it "maybe print" depending on some external case?

Contributor Author

In a TF execution environment (e.g., a tf.function with jit), it is supposed to print, and it has worked for me in other cases. I did not have the setup locally to run this part specifically, so it is not tested here.

Contributor

The NumPy flag is only required for printing the loss tensors during training. Everything else should work fine with normal Python prints. However, there are no tests that run the full dsp-optimization flow and train a model with sparsity, so we need to make sure that the behaviour still stays the same and that the loss tensors are printed.

@bo3z
Contributor

bo3z commented Oct 29, 2024

I just ran this code - removing the NumPy behaviour doesn't break the prints; it breaks the line above, avg_loss = round(epoch_loss_avg.result(), 3), in the file dsp_aware_pruning/keras/__init__.py (line 234). So this has to do with querying EagerTensors (which are not always evaluated until necessary). The error I get is:

AttributeError: EagerTensor object has no attribute 'astype'. 
        If you are looking for numpy-related methods, please run the following:
        from tensorflow.python.ops.numpy_ops import np_config
        np_config.enable_numpy_behavior()

So if we remove this import we need to find a way to print the loss during training.

@calad0i
Contributor Author

calad0i commented Oct 29, 2024


Using tf.cast(x) instead of x.astype would solve that.

@bo3z
Contributor

bo3z commented Oct 30, 2024

> Using tf.cast(x) instead of x.astype would solve that.

That's a likely solution, but in the code we don't actually call x.astype. It is called somewhere internally in TF when trying to get the value of the loss tensor (epoch_loss_avg.result()), so we need to see what the alternative is for getting the loss value into this function. Also, do we know if this buggy behaviour was introduced by an update to NumPy / TF / QKeras?
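
One possible direction, sketched here rather than tested against the dsp_aware_pruning flow: pull the metric result out as a plain Python scalar before rounding, so nothing ever reaches the EagerTensor numpy-methods path.

```python
import tensorflow as tf

epoch_loss_avg = tf.keras.metrics.Mean()
epoch_loss_avg.update_state([0.1234, 0.5678])

# round() on the raw EagerTensor is what reportedly ends up needing .astype
# internally; converting to a Python float first sidesteps enable_numpy_behavior().
avg_loss = round(float(epoch_loss_avg.result()), 3)
# equivalently: round(epoch_loss_avg.result().numpy().item(), 3)

tf.print('Average loss:', avg_loss)
```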

Successfully merging this pull request may close these issues:

hls4ml optimization/dsp_aware_pruning changing tensorflow global behavior