bench_sanger script #3

Open · jimmy-adams opened this issue Mar 28, 2024 · 10 comments
@jimmy-adams commented Mar 28, 2024

Hello,
In this repo you provide a simulation script to estimate the FLOPs of Sanger's processing of BERT:
# ==============================================================
def bert_base_gflops(seq_len):
    HIDDEN_SIZE = 768
    # Q, K, V projections: three (seq_len x 768) @ (768 x 768) matmuls
    linear_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2 * 3
    # attention scores: Q @ K^T
    qk_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    # weighted sum of values: P @ V
    pv_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    # output projection
    out_proj_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2
    stage1_flops = linear_flops
    stage2_flops = qk_flops + pv_flops + out_proj_flops
    stage1_gflops = stage1_flops / 1e9
    stage2_gflops = stage2_flops / 1e9
    print("The stage1 FLOPs: %.3f" % (stage1_gflops * 1e9))
    print("The stage2 FLOPs: %.3f" % (stage2_gflops * 1e9))
    return stage1_gflops, stage2_gflops
# ==============================================================
I want to ask whether these FLOPs cover all 12 hidden layers, or just a single layer of the BERT encoder.

Best Regards

@jimmy-adams (Author)

Another related question: when I set the input sequence length to 128, the calculated GFLOPs is about 0.65. Assuming there are 12 hidden layers in BERT-base, the total is less than 12 GFLOPs, which is not compatible with profiler test results of about 20 GFLOPs.
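
For reference, a quick sanity check using the bert_base_gflops function above (my own arithmetic, assuming seq_len = 128):

stage1, stage2 = bert_base_gflops(128)
print(stage1 + stage2)          # ≈ 0.654 GFLOPs for one layer's attention
print(12 * (stage1 + stage2))   # ≈ 7.9 GFLOPs for all 12 layers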

@hatsu3 (Owner) commented Mar 29, 2024

The FLOPs number corresponds to a single layer of BERT. Can you provide more information about the profiler?

@jimmy-adams (Author)

Hello,
https://github.com/cli99/flops-profiler
autoliuweijie/FastBERT#11

These two posts report results that differ somewhat from each other, but both are around 20 GFLOPs.

@hatsu3 (Owner) commented Mar 30, 2024

Our provided simulation script only calculates the FLOPs of a single multi-head attention (MHA) module. However, an encoder layer of BERT also includes a fully-connected feed-forward network (FFN) following the MHA. The thop profiler used by FastBERT calculates the total FLOPs of all modules in a BERT model, which includes MHA, FFN, and potentially other modules not included in our calculation. Therefore, it should produce a larger FLOPs count than ours.
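
As a rough sketch of the gap, the FFN can be estimated in the same style as the script above (this assumes BERT-base's FFN intermediate size of 3072 and, like the script, ignores bias, Softmax, LayerNorm, and activation FLOPs):

def bert_base_ffn_gflops(seq_len):
    HIDDEN_SIZE, FFN_SIZE = 768, 3072
    fc1_flops = seq_len * HIDDEN_SIZE * FFN_SIZE * 2   # first FFN Linear: 768 -> 3072
    fc2_flops = seq_len * FFN_SIZE * HIDDEN_SIZE * 2   # second FFN Linear: 3072 -> 768
    return (fc1_flops + fc2_flops) / 1e9

mha_gflops = sum(bert_base_gflops(128))   # ≈ 0.654 GFLOPs per layer (MHA only)
ffn_gflops = bert_base_ffn_gflops(128)    # ≈ 1.208 GFLOPs per layer (FFN only)
print(12 * (mha_gflops + ffn_gflops))     # ≈ 22.3 GFLOPs for 12 encoder layers

That lands in the same range as the ~20 GFLOPs profiler numbers; the remaining difference likely comes from embeddings, biases, LayerNorm, Softmax, activations, and differences in counting conventions.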

@jimmy-adams (Author)

Hello,
Listed below is one hidden layer of BERT:
(attention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=768, out_features=768, bias=True)
    (key): Linear(in_features=768, out_features=768, bias=True)
    (value): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
(intermediate): BertIntermediate(
  (dense): Linear(in_features=768, out_features=3072, bias=True)
  (intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
  (dense): Linear(in_features=3072, out_features=768, bias=True)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
Do you mean the FLOPs in Sanger only cover the attention submodule of the listed encoder layer?
How can I calculate the other two parts based on your calculation method?

Best Regards

@hatsu3 (Owner) commented Mar 30, 2024

(1) Yes, and BertIntermediate and BertOutput above correspond to the FFN submodule. Besides, we do not include the FLOPs of the LayerNorm and Softmax operations in our calculation.
(2) Calculating the FLOPs of the fully connected layers (or Linear in PyTorch terms) in the FFN submodule should be almost identical to our script's method.

Please refer to the implementation of thop (https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/profile.py and https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/vision/calc_func.py) for the formulas used.
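
If you want to cross-check against thop directly on a Hugging Face BERT, the usage is roughly as follows (a sketch I have not run end-to-end; also note that thop may report multiply-accumulate counts rather than 2-FLOPs-per-MAC counts, so check the convention before comparing numbers):

import torch
from thop import profile
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
input_ids = torch.randint(0, model.config.vocab_size, (1, 128))  # batch 1, seq_len 128

# profile() hooks each module, runs a forward pass, and sums per-module op counts
total_ops, total_params = profile(model, inputs=(input_ids,))
print(total_ops / 1e9, total_params / 1e6)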

@jimmy-adams (Author)

Hello,

Does that mean that these two modules contain no matmul ops, only Linear ops?

@hatsu3 (Owner) commented Mar 30, 2024

From what I understand, fully connected layers, or Linear modules, are essentially affine transformations (i.e., a matmul plus an element-wise addition of a broadcast bias vector). Besides, BertIntermediate and BertOutput contain not just Linear modules, but also LayerNorm operations and element-wise activation functions. Depending on how the thop library calculates FLOPs, you may also need to include the FLOPs of these operations in the final result if you want to replicate thop's estimate.
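
As a small illustration of that counting convention (my own sketch; the idea of assigning a fixed per-element cost to GELU or LayerNorm is an assumption, not necessarily what thop does):

def linear_flops(num_tokens, d_in, d_out, count_bias=True):
    flops = 2 * num_tokens * d_in * d_out   # matmul: one multiply + one add per MAC
    if count_bias:
        flops += num_tokens * d_out         # broadcast bias addition
    return flops

def elementwise_flops(num_elements, flops_per_element):
    # e.g. GELU or LayerNorm, if you decide to charge some cost per element
    return num_elements * flops_per_element

# Example: the first FFN Linear (768 -> 3072) at seq_len = 128
# linear_flops(128, 768, 3072) ≈ 0.604e9 FLOPs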

@jimmy-adams (Author)

Dear author,
Thanks a lot for your kind reply.
One further question: can Sanger process LayerNorm or element-wise activation functions efficiently?

Best Regards

@hatsu3 (Owner) commented Apr 3, 2024

Our accelerator design is primarily focused on the core attention mechanism, which does not contain LayerNorm or activation functions. Therefore, these operations are not taken into account in our work.
