bench_sanger script #3

Open · jimmy-adams opened this issue Mar 28, 2024 · 10 comments
@jimmy-adams commented Mar 28, 2024

Hello,
In this repo you provide a simulation script to estimate the FLOPs of Sanger's processing of BERT:
# ==============================================================
def bert_base_gflops(seq_len):
    HIDDEN_SIZE = 768
    # Q, K, V projections: three (seq_len x 768) @ (768 x 768) matmuls
    linear_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2 * 3
    # attention scores: Q @ K^T
    qk_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    # weighted sum of values: P @ V
    pv_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    # output projection
    out_proj_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2
    stage1_flops = linear_flops
    stage2_flops = qk_flops + pv_flops + out_proj_flops
    stage1_gflops = stage1_flops / 1e9
    stage2_gflops = stage2_flops / 1e9
    print("The stage1 FLOPs: %.3f" % (stage1_gflops * 1e9))
    print("The stage2 FLOPs: %.3f" % (stage2_gflops * 1e9))
    return stage1_gflops, stage2_gflops
# ==============================================================
I want to ask whether these FLOPs cover all 12 hidden layers, or just a single layer of the BERT encoder.

Best Regards

@jimmy-adams (Author)

Another related question: when I set the input sequence length to 128, the calculated GFLOPs is about 0.65. Assuming there are 12 hidden layers in BERT-base, the total is less than 12 GFLOPs, which is not compatible with profiler test results of about 20 GFLOPs.
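
For reference, a quick sanity check using the bert_base_gflops function above (my own arithmetic, assuming seq_len = 128):

stage1, stage2 = bert_base_gflops(128)
print(stage1 + stage2)          # ≈ 0.654 GFLOPs for one layer's attention
print(12 * (stage1 + stage2))   # ≈ 7.9 GFLOPs for all 12 layers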

@hatsu3 (Owner) commented Mar 29, 2024

The FLOPs number corresponds to a single layer of BERT. Can you provide more information about the profiler?

@jimmy-adams (Author)

Hello,
https://github.com/cli99/flops-profiler
autoliuweijie/FastBERT#11

These two posts report results that differ somewhat from each other, but both are around 20 GFLOPs.

@hatsu3 (Owner) commented Mar 30, 2024

Our provided simulation script only calculates the FLOPs of a single multi-head attention (MHA) module. However, an encoder layer of BERT also includes a fully-connected feed-forward network (FFN) following the MHA. The thop profiler used by FastBERT calculates the total FLOPs of all modules in a BERT model, which includes MHA, FFN, and potentially other modules not included in our calculation. Therefore, it should produce a larger FLOPs count than ours.
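
As a rough sketch of the gap, the FFN can be estimated in the same style as the script above (this assumes BERT-base's FFN intermediate size of 3072 and, like the script, ignores bias, Softmax, LayerNorm, and activation FLOPs):

def bert_base_ffn_gflops(seq_len):
    HIDDEN_SIZE, FFN_SIZE = 768, 3072
    fc1_flops = seq_len * HIDDEN_SIZE * FFN_SIZE * 2   # first FFN Linear: 768 -> 3072
    fc2_flops = seq_len * FFN_SIZE * HIDDEN_SIZE * 2   # second FFN Linear: 3072 -> 768
    return (fc1_flops + fc2_flops) / 1e9

mha_gflops = sum(bert_base_gflops(128))   # ≈ 0.654 GFLOPs per layer (MHA only)
ffn_gflops = bert_base_ffn_gflops(128)    # ≈ 1.208 GFLOPs per layer (FFN only)
print(12 * (mha_gflops + ffn_gflops))     # ≈ 22.3 GFLOPs for 12 encoder layers

That lands in the same range as the ~20 GFLOPs profiler numbers; the remaining difference likely comes from embeddings, biases, LayerNorm, Softmax, activations, and differences in counting conventions.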

@jimmy-adams (Author)

Hello,
Listed below is one hidden layer of BERT:
(attention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=768, out_features=768, bias=True)
    (key): Linear(in_features=768, out_features=768, bias=True)
    (value): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
(intermediate): BertIntermediate(
  (dense): Linear(in_features=768, out_features=3072, bias=True)
  (intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
  (dense): Linear(in_features=3072, out_features=768, bias=True)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
Do you mean the FLOPs in Sanger only cover the attention submodule of the listed encoder layer?
How can I calculate the other two parts based on your calculation method?

Best Regards

@hatsu3 (Owner) commented Mar 30, 2024

(1) Yes, and BertIntermediate and BertOutput above correspond to the FFN submodule. Besides, we do not include the FLOPs of the LayerNorm and Softmax operations in our calculation.
(2) Calculating the FLOPs of the fully connected layers (or Linear in PyTorch terms) in the FFN submodule should be almost identical to our script's method.

Please refer to the implementation of thop (https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/profile.py and https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/vision/calc_func.py) for the formulas used.
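
If you want to cross-check against thop directly on a Hugging Face BERT, the usage is roughly as follows (a sketch I have not run end-to-end; also note that thop may report multiply-accumulate counts rather than 2-FLOPs-per-MAC counts, so check the convention before comparing numbers):

import torch
from thop import profile
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
input_ids = torch.randint(0, model.config.vocab_size, (1, 128))  # batch 1, seq_len 128

# profile() hooks each module, runs a forward pass, and sums per-module op counts
total_ops, total_params = profile(model, inputs=(input_ids,))
print(total_ops / 1e9, total_params / 1e6)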

@jimmy-adams (Author)

Hello,

Does that mean that these two modules contain no matmul ops, only Linear ops?

@hatsu3 (Owner) commented Mar 30, 2024

From what I understand, fully connected layers, or Linear modules, are essentially affine transformations (i.e., a matmul plus an element-wise addition of a broadcast bias vector). Besides, BertIntermediate and BertOutput contain not just Linear modules, but also LayerNorm operations and element-wise activation functions. Depending on how the thop library calculates FLOPs, you may also need to include the FLOPs of these operations in the final result if you want to replicate thop's estimate.
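
As a small illustration of that counting convention (my own sketch; the idea of assigning a fixed per-element cost to GELU or LayerNorm is an assumption, not necessarily what thop does):

def linear_flops(num_tokens, d_in, d_out, count_bias=True):
    flops = 2 * num_tokens * d_in * d_out   # matmul: one multiply + one add per MAC
    if count_bias:
        flops += num_tokens * d_out         # broadcast bias addition
    return flops

def elementwise_flops(num_elements, flops_per_element):
    # e.g. GELU or LayerNorm, if you decide to charge some cost per element
    return num_elements * flops_per_element

# Example: the first FFN Linear (768 -> 3072) at seq_len = 128
# linear_flops(128, 768, 3072) ≈ 0.604e9 FLOPs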

@jimmy-adams (Author)

Dear author,
Thanks a lot for your kind reply.
One further question: can Sanger process LayerNorm or element-wise activation functions efficiently?

Best Regards

@hatsu3 (Owner) commented Apr 3, 2024

Our accelerator design is primarily focused on the core attention mechanism, which does not contain LayerNorm or activation functions. Therefore, these operations are not taken into account in our work.
