
Depthwise 1D and 2D Resource strategy for io_stream #1079

Merged

Conversation

@steltze (Contributor) commented Oct 11, 2024

Description

Similar to the dense resource implementation, there are three separate implementations, selected by the relationship between the reuse factor and the number of channels.

The common functionality is similar to the dense_resource layer:

  • calculate the block_factor

  • initialize the accumulator array with the biases

  • in_index: data-weight index, always incremented by the reuse factor

  • out_index: output index

  1. rf < n_chan

the out_index follows the in_index; there is just a bounds check on out_index

  2. rf >= n_chan and rf % n_chan == 0

the out_index does not change in the block_factor loop; it is only incremented in the rufactor loop. This should be the least resource-demanding implementation

  3. rf > n_chan and rf % n_chan != 0

the out_index is incremented by the remainder rf % n_chan, with a bounds check on out_index
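The shared index arithmetic of the three cases can be sketched in plain C++. This is a hedged illustration using the names from the description above (rf, n_chan, block_factor), not the actual hls4ml HLS code; the commented multiply-accumulate stands in for the real work:

```cpp
#include <vector>

// Illustrative sketch: returns the sequence of out_index values visited.
// in_index always advances by the reuse factor; out_index advances by
// rf % n_chan, which is 0 in case 2, so a single wrap check suffices.
std::vector<int> out_index_sequence(int rf, int n_chan, int block_factor) {
    std::vector<int> seq;
    for (int ir = 0; ir < rf; ir++) {
        int in_index = ir;            // data/weight index for this pass
        int out_index = ir % n_chan;  // accumulator (output channel) index
        for (int im = 0; im < block_factor; im++) {
            // acc[out_index] += data[in_index] * weights[in_index];
            seq.push_back(out_index);
            in_index += rf;           // always incremented by the reuse factor
            out_index += rf % n_chan; // step is 0 when rf % n_chan == 0 (case 2)
            if (out_index >= n_chan)  // bounds check / wrap (cases 1 and 3)
                out_index -= n_chan;
        }
    }
    return seq;
}
```

Running this with rf = 4, n_chan = 2 shows the case-2 behavior: out_index stays fixed within each block_factor loop and only changes between rufactor iterations.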

Type of change

  • New feature (non-breaking change which adds functionality)

Tests

  • Apart from the numerical correctness checks in the pytests comparing Keras to hls4ml, the C synthesis was executed successfully
  • Checked the numerical correctness and C synthesis of a 10k-parameter UNet that has 20 separable 2D convolutions

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have installed and run pre-commit on the files I edited or added.
  • I have added tests that prove my fix is effective or that my feature works.

@steltze (Contributor, Author) commented Oct 11, 2024

Some results from vivado synthesis

@jmitrevs (Contributor) commented:

Thanks. I should mention that there is also an effort to develop depthwise (and separable, which gets split) for oneAPI, which will eventually replace the Quartus backend. It's in a PR to the oneAPI PR.

@jmitrevs jmitrevs added the please test Trigger testing by creating local PR branch label Oct 11, 2024
# ('Vitis', 'io_stream', 'latency'),
('Vivado', 'io_stream', 'resource'),
('Vitis', 'io_stream', 'resource'),
# ('Catapult', 'io_stream', 'latency'),
Contributor:

This is a debugging setup that got committed. We should uncomment the useful checks.

Contributor Author:

@jmitrevs because of the addition of the reuse factor and input size options, the number of tests grows multiplicatively. I could include those options only in the resource strategy tests to reduce the total number, but that won't be easy to maintain. Any ideas?

Contributor:

Do we need all these reuse factor and input size options, i.e. how likely is it that the code would break in different ways for the different cases? I think we will probably need to live with testing only 1 or 2 of these combinations every time, otherwise this would get quite out of hand.

Contributor Author:

Reduced the number of tests to the minimum needed to cover all the cases of the resource implementation. Tests probably won't break unless something changes in the rf validation logic. When I first started the implementation, the sepconv layer had a single rf, which caused problems with dense-resource; now that sepconv is split, it works fine.

There was also an idea to use only rfs that trigger the 2nd case (rf % n_chan == 0), instead of the 3rd one, for better QoR.

# ('Vitis', 'io_stream', 'latency'),
('Vivado', 'io_stream', 'resource'),
('Vitis', 'io_stream', 'resource'),
# ('Catapult', 'io_stream', 'latency'),
Contributor:

Same issue here. The others should be uncommented.

@jmitrevs (Contributor) commented:
Updated to the latest main branch since I think it fixes the issue that caused the test failure.

@jmitrevs jmitrevs added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Oct 16, 2024
@steltze steltze changed the title Depthwise 1D and 2D Resource strategy for io_stream Draft: Depthwise 1D and 2D Resource strategy for io_stream Oct 17, 2024
@steltze steltze changed the title Draft: Depthwise 1D and 2D Resource strategy for io_stream Depthwise 1D and 2D Resource strategy for io_stream Oct 17, 2024
@steltze (Contributor, Author) commented Oct 17, 2024

Pipelining over the kernel size might produce better QoR. Testing this hypothesis is pending.

@steltze steltze marked this pull request as draft October 17, 2024 20:39
@nghielme (Contributor) commented:

What is blocking this PR from being merged?

@vloncar (Contributor) commented Nov 21, 2024

@nghielme It is still marked as a draft, so no one took the time to thoroughly review it 😉. Also, it will come after 1.0.0

@xtreme8000 commented:

With this patch my Vivado HLS 2019.2 fails with the following output, regardless of resource or latency mode:

ERROR: [HLS 200-70] '#pragma HLS ARRAY_PARTITION variable=&acc type=complete' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_RESHAPE variable=&weights type=block factor=block_factor' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_RESHAPE variable=&data type=block factor=block_factor' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_PARTITION variable=&acc type=complete' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_PARTITION variable=&acc type=complete' is not a valid pragma.

The issue can be solved by removing type= from these pragmas as done elsewhere in the code base.
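The fix described here would look roughly as follows: the `type=` keyword is dropped so the pragmas use the positional form accepted by older toolchains. This is a sketch based on the variable names in the error log, not the exact file contents:

```cpp
// Vivado HLS 2019.2 rejects the type= keyword form of these pragmas;
// the positional form below is accepted by both older and newer tools.
#pragma HLS ARRAY_PARTITION variable=acc complete
#pragma HLS ARRAY_RESHAPE variable=weights block factor=block_factor
#pragma HLS ARRAY_RESHAPE variable=data block factor=block_factor
```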

@steltze steltze marked this pull request as ready for review January 9, 2025 15:07
@steltze (Contributor, Author) commented Jan 9, 2025

Thanks @xtreme8000! Fixed.

@JanFSchulte JanFSchulte added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Jan 10, 2025
@vloncar (Contributor) left a comment:

Looks very good so far; needs minor changes.

@@ -86,7 +86,7 @@ void separable_conv_1d_cl(hls::stream<data_T> &data, hls::stream<res_T> &res,
#pragma HLS DATAFLOW

hls::stream<dw_res_T> depthwise_res;
unsigned res_depth = CONFIG_T::depthwise_config::out_width;
const unsigned res_depth = CONFIG_T::depthwise_config::out_width;
Contributor:

Does using const here bring any benefit? And since this is a newer compiler, why not constexpr?
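A minimal sketch of the constexpr form being suggested, with a stand-in Config type in place of the real CONFIG_T (the names here are illustrative, not the hls4ml templates):

```cpp
// Stand-ins for the real CONFIG_T::depthwise_config used in the PR.
struct DepthwiseConfig { static const unsigned out_width = 8; };
struct Config { using depthwise_config = DepthwiseConfig; };

// constexpr states the compile-time requirement explicitly: if the
// initializer were not a constant expression, compilation would fail
// here rather than silently producing a runtime value.
template <class CONFIG_T> unsigned stream_depth() {
    constexpr unsigned res_depth = CONFIG_T::depthwise_config::out_width;
    static_assert(res_depth > 0, "stream depth must be positive");
    return res_depth;
}
```

With plain const the value may still be compile-time known, but constexpr guarantees it and makes it usable in static_assert and array bounds.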

const int multiplier_limit = DIV_ROUNDUP(nin, multfactor);
const int block_factor = DIV_ROUNDUP(nin, rufactor);

assert((multiplier_limit == block_factor) && "This function is correct only for RF <= N_IN");
Contributor:

It would be great if this message made it clearer what N_IN is. Same for N_OUT in the others. We should update all functions to make this clearer, not just the resource ones.
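A hypothetical version of the assertion with a clearer message, along the lines of what is being requested; the N_IN expansion in the string and the numeric values in the test are illustrative assumptions, not taken from the PR:

```cpp
#include <algorithm>
#include <cassert>

// DIV_ROUNDUP as commonly defined in nnet_utils: ceiling division.
constexpr int div_roundup(int a, int b) { return (a + b - 1) / b; }

// Sketch of the check with a message that spells out what N_IN means.
bool check_rf(int rufactor, int nin) {
    const int multfactor = std::min(nin, rufactor);
    const int multiplier_limit = div_roundup(nin, multfactor);
    const int block_factor = div_roundup(nin, rufactor);
    assert((multiplier_limit == block_factor) &&
           "This function is correct only for RF <= N_IN "
           "(N_IN = kernel_size * n_chan, the flattened input size)");
    return multiplier_limit == block_factor;
}
```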

Contributor Author:

Should I change the non-resource functions in this PR or open a separate one?

assert((rufactor > nout) && "This function is correct only for RF > N_IN");

#pragma HLS function_instantiate variable=weights,biases
//#pragma HLS RESOURCE variable=weights core=RAM_2P_BRAM // Commenting out the designation; HLS seems to choose correctly
Contributor:

We should get rid of these comments across the file.

void depthwise_product(data_T data[CONFIG_T::kernel_size * CONFIG_T::n_chan], res_T res[CONFIG_T::n_chan],
typename CONFIG_T::weight_t weights[CONFIG_T::kernel_size * CONFIG_T::n_chan],
typename CONFIG_T::bias_t biases[CONFIG_T::n_chan]) {
void depthwise_product_resource_rf_lt_nchan(data_T data[CONFIG_T::kernel_size * CONFIG_T::n_chan],
Contributor:

For the sake of making this file clearer, I propose we move all depthwise product functions into a separate file, nnet_depthwise_product.h. Don't forget that you'll then have to add this file to the sepconv include list in convolution_templates.py.

@steltze (Contributor, Author) commented Feb 10, 2025:

Done. I think changing convolution_templates.py is not needed? Tests seem to work. Is there another reason I am missing?

@@ -13,6 +13,8 @@
strides_options = [(1), (2)]
kernel_options = [(2), (3)]
bias_options = [False]
rf_options = [1, 16, 24] # each rf corresponds to one of the three cases of depthwise resource for io_stream
input_size_options = [16]
Contributor:

Why 16? This only makes the test run longer. Can we choose a smaller number?

Contributor Author:

Reduced to 3 and 4 respectively in order to make sure all 3 cases of the implementation are triggered
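The case coverage being discussed can be sketched as follows. This is a hypothetical illustration, not the actual test file: n_chan = 3 is an assumed value, and the classifier function is introduced here purely to show why three reuse factors suffice to hit all three implementations:

```python
# Illustrative n_chan; the PR reduces the input sizes to small values (3 and 4)
# so that a handful of reuse factors covers every depthwise resource case.
n_chan = 3
rf_options = [1, 3, 7]  # rf < n_chan; rf % n_chan == 0; rf > n_chan with remainder

def resource_case(rf, n_chan):
    """Classify which of the three depthwise resource implementations fires."""
    if rf < n_chan:
        return "rf_lt_nchan"
    if rf % n_chan == 0:
        return "rf_gte_nchan_rem0"
    return "rf_gt_nchan"

# Each reuse factor triggers a distinct implementation, so three tests suffice.
cases = {resource_case(rf, n_chan) for rf in rf_options}
assert cases == {"rf_lt_nchan", "rf_gte_nchan_rem0", "rf_gt_nchan"}
```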

@@ -8,12 +8,13 @@

test_root_path = Path(__file__).parent

padds_options = ['same', 'valid']
padds_options = ['same']
Contributor:

Why remove valid?

@steltze steltze requested a review from vloncar February 10, 2025 16:18
@vloncar (Contributor) commented Feb 24, 2025

The only thing I see lacking here is the use of function pointers for dense instead of the current if (strategy == latency) do_latency_dense() else do_resource_dense() approach. We moved the rest of the codebase toward that model, as it allows the dense function to be generated and implemented in special ways (HGQ and resource_unrolled use this). I've made the changes needed and put them in vloncar/stelios_depthwise_resource.

After extensive testing of both approaches, I've made a few observations:

  • In some rare instances the function pointer adds a clock cycle to the main loop of the depthwise function. I didn't manage to figure out why; the schedule viewer shows both approaches resulting in identical sequences of operations, but with the function pointer one having an operation after the depthwise_product function scheduled in the next cycle for no apparent reason. I'm thinking of following this up with AMD.
  • Sometimes the 3rd case (rf > n_chan and rf % n_chan != 0) fails to meet timing. This happens regardless of the approach used. We could consider enforcing an RF that avoids it.
  • Depending on the size of the layer/input, the pragma HLS INLINE recursive just prior to calling depthwise_resource may have a detrimental effect (ranging from added cycles to missed timing). Again, regardless of the approach. FWIW, this also affects regular Conv1D/2D.

Since the issues are specific to particular use cases, I'm thinking of merging the function pointer changes into this and then merging it to main (there's also a trivial change to the tests to avoid creating a directory within a directory). If the issue appears, it is trivial to manually tweak around it to avoid the use of the function pointer and/or tweak the inline pragma. The 3rd case can also be avoided by the user if they observe that issue. If no one objects, I'll do this tomorrow.
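The function-pointer dispatch pattern described above can be sketched as follows. The names, signatures, and stand-in computations are illustrative assumptions, not the actual hls4ml templates; the point is only the shape of the dispatch:

```cpp
// Two implementations with identical signatures; in hls4ml these would be
// the latency and resource dense kernels (stand-in math here).
template <class data_T, class res_T, int N>
void dense_latency(const data_T in[N], res_T out[N]) {
    for (int i = 0; i < N; i++) out[i] = in[i] * 2;
}

template <class data_T, class res_T, int N>
void dense_resource(const data_T in[N], res_T out[N]) {
    for (int i = 0; i < N; i++) out[i] = in[i] * 2; // same math, different schedule
}

// The config selects the implementation once, at compile time, instead of an
// if (strategy == latency) branch; generated/special dense functions (HGQ,
// resource_unrolled) can be plugged in the same way.
struct ConfigLatency {
    static constexpr auto dense = dense_latency<int, int, 4>;
};

template <class CONFIG_T>
void run_dense(const int in[4], int out[4]) {
    CONFIG_T::dense(in, out); // dispatch through the stored function pointer
}
```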

@vloncar vloncar added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Feb 26, 2025
@vloncar (Contributor) commented Feb 26, 2025

This is now ready. Let's just wait to see if I messed up the tests before we merge.

@vloncar vloncar added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Feb 26, 2025
@vloncar vloncar dismissed their stale review February 26, 2025 20:37

Code updated

@vloncar vloncar merged commit 898d53d into fastmachinelearning:main Feb 26, 2025
8 of 10 checks passed
Labels
please test Trigger testing by creating local PR branch
6 participants