bench_sanger script #3
Also, another related question: when I set the input sequence length to 128, the calculated GFLOPs is about 0.65. Assuming there are 12 hidden layers in BERT-base, the total comes to less than 12 GFLOPs, which is not consistent with the profiler test result of about 20 GFLOPs.
The FLOPs number corresponds to a single layer of BERT. Can you provide more information about the profiler?
Here are two posts that mention their results; they differ somewhat, but both are around 20 GFLOPs.
Our provided simulation script only calculates the FLOPs of a single multi-head attention (MHA) module. However, an encoder layer of BERT also includes a fully-connected feed-forward network (FFN) following the MHA. The thop profiler used by FastBERT calculates the total FLOPs of all modules in a BERT model, which includes MHA, FFN, and potentially other modules not included in our calculation. Therefore, it should produce a larger FLOPs count than ours.
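For a rough back-of-the-envelope check (my own sketch, assuming the standard BERT-base FFN intermediate size of 3072), adding the FFN cost on top of the script's MHA cost largely closes the gap with the profiler numbers quoted above:
//==============================================================
# Rough per-layer estimate at seq_len = 128, hidden = 768 (BERT-base defaults).
seq_len, hidden, inter = 128, 768, 3072

mha_flops = (seq_len * hidden * hidden * 2 * 3      # Q/K/V projections
             + seq_len * seq_len * hidden * 2 * 2   # QK^T and PV
             + seq_len * hidden * hidden * 2)       # output projection
ffn_flops = seq_len * hidden * inter * 2 * 2        # two FFN linear layers (768 -> 3072 -> 768)

per_layer_gflops = (mha_flops + ffn_flops) / 1e9
print("per layer: %.2f GFLOPs, 12 layers: %.1f GFLOPs" % (per_layer_gflops, 12 * per_layer_gflops))
# ~0.65 (MHA) + ~1.21 (FFN) = ~1.86 GFLOPs per layer, i.e. ~22 GFLOPs for 12 layers,
# which is in the same ballpark as the ~20 GFLOPs reported by the profiler.
//===============================================================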
Hello, Best Regards
(1) Yes, and BertIntermediate and BertOutput above correspond to the FFN submodule. In addition, we do not include the FLOPs of LayerNorm and Softmax operations in our calculation. Please refer to the implementation of thop (https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/profile.py and https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/vision/calc_func.py) for the formulas used.
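For example, a minimal sketch of profiling the full model with thop (assuming thop and transformers are installed; this is not part of our provided script, and thop may warn about modules it has no counter for):
//==============================================================
import torch
from transformers import BertModel
from thop import profile

model = BertModel.from_pretrained("bert-base-uncased")
seq_len = 128
input_ids = torch.randint(0, model.config.vocab_size, (1, seq_len))

# thop returns a MAC-style count over all hooked modules (Linear, etc.);
# doubling it is a common convention for converting MACs to FLOPs.
macs, params = profile(model, inputs=(input_ids,))
print("~%.1f GFLOPs for the full 12-layer model (MHA + FFN + other modules)" % (2 * macs / 1e9))
//===============================================================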
Hello, does that mean these two modules contain no matmul ops, only Linear ops?
From what I understand, fully connected layers, or Linear layers, are themselves just matrix multiplications (plus a bias add), so counting their Linear ops is the same as counting matmul ops.
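A quick way to see this (a minimal check, not from the repo):
//==============================================================
import torch
import torch.nn as nn

x = torch.randn(128, 768)             # (seq_len, hidden_size)
fc = nn.Linear(768, 3072)             # e.g. the first FFN projection in BERT-base

manual = x @ fc.weight.t() + fc.bias  # an explicit matmul plus a bias add
print(torch.allclose(fc(x), manual, atol=1e-5))  # True: the Linear op is a matmul
//===============================================================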
Dear author, Best Regards |
Our accelerator design is primarily focused on the core attention mechanism, which does not contain LayerNorm or activation functions. Therefore, these operations are not taken into account in our work. |
Hello,
In this repo you provide a simulation script to calculate the FLOPs for Sanger's processing of BERT.
//==============================================================
def bert_base_gflops(seq_len):
    HIDDEN_SIZE = 768
    # Stage 1: Q/K/V linear projections (three seq_len x 768 x 768 matmuls, 2 FLOPs per MAC).
    linear_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2 * 3
    # Stage 2: attention scores (QK^T), attention-weighted values (PV), and the output projection.
    qk_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    pv_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    out_proj_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2
    stage1_flops = linear_flops
    stage2_flops = qk_flops + pv_flops + out_proj_flops
    stage1_gflops = stage1_flops / 1e9
    stage2_gflops = stage2_flops / 1e9
    print("The stage1_gflops: %.3f GFLOPs" % stage1_gflops)
    print("The stage2_gflops: %.3f GFLOPs" % stage2_gflops)
    return stage1_gflops, stage2_gflops
//===============================================================
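For reference, running the function with seq_len = 128 should print roughly the following, matching the ~0.65 GFLOPs figure discussed above (values rounded):
//==============================================================
stage1, stage2 = bert_base_gflops(128)
# The stage1_gflops: 0.453 GFLOPs
# The stage2_gflops: 0.201 GFLOPs
print("total MHA GFLOPs per layer: %.3f" % (stage1 + stage2))  # ~0.654
//===============================================================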
I want to ask whether these FLOPs cover all 12 hidden layers, or just a single layer of the BERT encoder.
Best Regards