Optimize the performance of Transformer-Big on 1 V100 GPU #148

Open
Xreki opened this issue Jul 23, 2019 · 10 comments
@Xreki
Collaborator

Xreki commented Jul 23, 2019

Owner

@wangchaochaohu

Baseline performance

  • Test date: June 20, 2019
  • Paddle commit:
  • models commit:
  • Test script: run.sh
base_batch_size=4096
python -u train.py \
    --src_vocab_fpath data/vocab.bpe.32000 \
    --trg_vocab_fpath data/vocab.bpe.32000 \
    --special_token '<s>' '<e>' '<unk>' \
    --train_file_pattern data/train.tok.clean.bpe.32000.en-de \
    --batch_size ${base_batch_size} \
    --use_token_batch True \
    --sort_type pool \
    --pool_size 200000 \
    --shuffle True \
    --shuffle_batch True \
    --use_py_reader True \
    --use_mem_opt True \
    --enable_ce False \
    --fetch_steps 100 \
    learning_rate 2.0 \
    warmup_steps 8000 \
    beta2 0.997 \
    d_model 1024 \
    d_inner_hid 4096 \
    n_head 16 \
    prepostprocess_dropout 0.3 \
    attention_dropout 0.1 \
    relu_dropout 0.1 \
    weight_sharing True \
    pass_num 100 \
    max_length 256
              Paddle 1.5.0   TensorFlow 1.12.0   Ratio
1 GPU         1.82           1.968               -7.6%
8 GPUs (SP)   13.12          7.072               +86%
@Xreki
Collaborator Author

Xreki commented Jul 25, 2019

Profile and timeline analysis results

Event                                  Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
GpuMemcpyAsync:CPU->GPU                480         7787.05     7389.604243 (0.948961)  397.441730 (0.051039)   0.013041    420.263     16.223      0.133052
BufferedReader:MemoryCopy              20          7303.63     7278.173796 (0.996514)  25.457269 (0.003486)    25.4573     434.426     365.182     0.124793
elementwise_pow                        20          7166.14     6825.366504 (0.952446)  340.778477 (0.047554)   9.12253     423.546     358.307     0.122443
mul_grad                               1920        5687.55     1445.043506 (0.254071)  4242.507587 (0.745929)  1.29767     9.8521      2.96227     0.0971796
reduce_sum                             40          4140.06     3864.660628 (0.933479)  275.402385 (0.066521)   0.060799    430.98      103.502     0.0707387
dropout                                1240        3978.58     2765.469102 (0.695089)  1213.113105 (0.304911)  0.491289    410.938     3.20853     0.0679795
mul                                    1920        3901.35     1805.701624 (0.462840)  2095.648077 (0.537160)  0.63364     324.938     2.03195     0.0666599
elementwise_add                        1480        3288.72     3150.309974 (0.957914)  138.409242 (0.042086)   0.070056    410.482     2.22211     0.0561923
lookup_table                           80          3028.77     3014.187865 (0.995187)  14.577378 (0.004813)    0.173184    405.138     37.8596     0.0517506
label_smooth                           20          1937.18     1932.993586 (0.997837)  4.189719 (0.002163)     0.232803    412.821     96.8592     0.0330994
scale                                  120         1554.73     1551.029336 (0.997621)  3.699388 (0.002379)     0.034238    398.374     12.9561     0.0265647
one_hot                                20          1476.73     1122.240401 (0.759950)  354.488098 (0.240050)   0.183862    381.2       73.8364     0.0252319
layer_norm                             640         1374.52     1310.665698 (0.953541)  63.858665 (0.046459)    0.114613    409.522     2.14769     0.0234856
softmax_grad                           360         720.891     136.584903 (0.189467)   584.306443 (0.810533)   0.590407    11.6805     2.00248     0.0123174
matmul_grad                            740         689.291     270.524317 (0.392467)   418.767109 (0.607533)   0.319167    20.8852     0.931475    0.0117775
sum                                    880         546.1       473.475160 (0.867011)   72.625192 (0.132989)    0.081672    231.909     0.620569    0.00933087
transpose2_grad                        1440        388.157     302.213362 (0.778586)   85.943228 (0.221414)    0.080706    4.70972     0.269553    0.00663219
elementwise_add_grad                   1480        382.432     234.041129 (0.611982)   148.390533 (0.388018)   0.068687    6.66967     0.2584      0.00653437
TensorCopy:CPU->GPU                    40          375.823     371.409839 (0.988258)   4.412854 (0.011742)     0.069173    156.41      9.39557     0.00642145
GpuMemcpySync:CPU->GPU                 40          375.611     371.122581 (0.988051)   4.488224 (0.011949)     0.066045    156.406     9.39027     0.00641783
layer_norm_grad                        640         360.088     121.749323 (0.338110)   238.339039 (0.661890)   0.404998    2.93524     0.562638    0.0061526
softmax                                360         292.556     47.513194 (0.162407)    245.042412 (0.837593)   0.30691     4.71959     0.812654    0.00499871
elementwise_mul                        80          285.141     284.907918 (0.999183)   0.233098 (0.000817)     0.023708    171.18      3.56426     0.00487202
dropout_grad                           1240        276.793     160.729982 (0.580687)   116.062616 (0.419313)   0.071996    4.83908     0.22322     0.00472938
matmul                                 740         244.321     41.341208 (0.169208)    202.980160 (0.830792)   0.138019    8.78548     0.330164    0.00417457
Fetch                                  2           242.76      7.987677 (0.032904)     234.772810 (0.967096)   7.98768     234.773     121.38      0.0041479
transpose2                             1440        175.377     90.358935 (0.515228)    85.017579 (0.484772)    0.080037    13.3819     0.121789    0.00299655
adam                                   20          127.312     10.754483 (0.084473)    116.557663 (0.915527)   6.20271     6.5953      6.36561     0.0021753
relu_grad                              240         92.1757     34.169213 (0.370697)    58.006495 (0.629303)    0.257737    2.36124     0.384065    0.00157495
fill_constant                          80          82.3621     82.169197 (0.997658)    0.192885 (0.002342)     0.027829    18.5124     1.02953     0.00140727
relu                                   240         48.2929     6.332414 (0.131125)     41.960468 (0.868875)    0.189964    0.237106    0.20122     0.00082515
GpuMemcpyAsync:GPU->CPU                2           44.0849     7.877569 (0.178691)     36.207331 (0.821309)    7.88781     36.1971     22.0424     0.000753251
lookup_table_grad                      40          38.4372     27.945842 (0.727051)    10.491405 (0.272949)    0.315866    9.55536     0.960931    0.000656753
reshape2                               1460        26.5977     26.570679 (0.998983)    0.027058 (0.001017)     0.013066    0.084964    0.0182176   0.000454459
read                                   40          25.4974     25.382462 (0.995493)    0.114929 (0.004507)     0.030177    9.74623     0.637435    0.000435658
reshape2_grad                          1460        22.3558     22.341702 (0.999371)    0.014061 (0.000629)     0.010398    0.06125     0.0153122   0.000381979
softmax_with_cross_entropy             20          14.5057     2.132466 (0.147008)     12.373270 (0.852992)    0.71054     0.755835    0.725287    0.00024785
softmax_with_cross_entropy_grad        20          6.94796     1.685615 (0.242606)     5.262341 (0.757394)     0.333883    0.362152    0.347398    0.000118715
GpuMemcpyAsync(same_gpu):GPU->GPU      20          4.51302     0.809991 (0.179479)     3.703031 (0.820521)     0.215234    0.239234    0.225651    7.71112e-05
elementwise_min                        20          2.78114     2.529154 (0.909395)     0.251985 (0.090605)     0.068388    0.276339    0.139057    4.75196e-05
TensorCopy:GPU->GPU                    1460        2.02861     2.027637 (0.999523)     0.000968 (0.000477)     0.000845    0.035362    0.00138946  3.46615e-05
reduce_sum_grad                        20          1.83731     1.652398 (0.899356)     0.184915 (0.100644)     0.076729    0.112897    0.0918656   3.1393e-05
Scale LossGrad                         20          1.46823     1.323094 (0.901147)     0.145140 (0.098853)     0.028795    0.14514     0.0734117   2.50868e-05
cast                                   20          0.997326    0.970855 (0.973458)     0.026471 (0.026542)     0.023909    0.111488    0.0498663   1.70407e-05
elementwise_div                        20          0.977814    0.858257 (0.877730)     0.119557 (0.122270)     0.042825    0.066951    0.0488907   1.67073e-05
elementwise_mul_grad                   20          0.731742    0.634836 (0.867568)     0.096906 (0.132432)     0.030531    0.050781    0.0365871   1.25028e-05
elementwise_div_grad                   20          0.690907    0.626869 (0.907313)     0.064038 (0.092687)     0.031416    0.046577    0.0345453   1.18051e-05
FastThreadedSSAGraphExecutorPrepare    20          0.415665    0.338768 (0.815002)     0.076897 (0.184998)     0.015645    0.076897    0.0207832   7.10221e-06
increment                              20          0.375824    0.363365 (0.966849)     0.012459 (0.033151)     0.012459    0.05304     0.0187912   6.42147e-06
InitLocalVars                          1           0.249576    0.000000 (0.000000)     0.249576 (1.000000)     0.249576    0.249576    0.249576    4.26435e-06
create_double_buffer_reader            20          0.185831    0.178659 (0.961406)     0.007172 (0.038594)     0.006572    0.016392    0.00929155  3.17518e-06
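
The table above is sorted by total time, which is dominated by CPU time for many events. To rank events by GPU time instead, a short parser is enough. This is a minimal sketch: `ProfileRow`, `parse_row`, and the handful of rows copied from the table are for illustration only and are not part of the Paddle profiler.

```python
from typing import NamedTuple


class ProfileRow(NamedTuple):
    event: str
    calls: int
    total: float
    cpu_time: float
    gpu_time: float


def parse_row(line: str) -> ProfileRow:
    # Split from the right so event names containing spaces
    # (e.g. "Scale LossGrad") stay in one piece.
    parts = line.rsplit(maxsplit=10)
    event, calls, total, cpu, _cpu_ratio, gpu, _gpu_ratio = parts[:7]
    return ProfileRow(event.strip(), int(calls), float(total), float(cpu), float(gpu))


# A few rows copied verbatim from the profile output above.
ROWS = [
    "GpuMemcpyAsync:CPU->GPU                480         7787.05     7389.604243 (0.948961)  397.441730 (0.051039)   0.013041    420.263     16.223      0.133052",
    "mul_grad                               1920        5687.55     1445.043506 (0.254071)  4242.507587 (0.745929)  1.29767     9.8521      2.96227     0.0971796",
    "dropout                                1240        3978.58     2765.469102 (0.695089)  1213.113105 (0.304911)  0.491289    410.938     3.20853     0.0679795",
    "mul                                    1920        3901.35     1805.701624 (0.462840)  2095.648077 (0.537160)  0.63364     324.938     2.03195     0.0666599",
    "Scale LossGrad                         20          1.46823     1.323094 (0.901147)     0.145140 (0.098853)     0.028795    0.14514     0.0734117   2.50868e-05",
]

ranked = sorted((parse_row(r) for r in ROWS), key=lambda r: r.gpu_time, reverse=True)
for row in ranked:
    print(f"{row.event:<28} calls={row.calls:<5} gpu_time={row.gpu_time:.3f}")
```

On these sample rows the GPU-time ranking puts mul_grad, mul, and dropout ahead of the host-to-device copy, even though the copy has the largest total time.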

image

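GPU utilization is close to full, but there are some gaps. Judging from the timeline, the GPU is waiting for CPU -> GPU data transfers during those gaps.

image

In terms of GPU usage, dropout and mul take the most GPU time:

  • Adjacent dropout and mul ops take roughly the same GPU time. Analyze the amount of computation to see whether dropout can be optimized.
  • The timeline contains many mul ops. Look at the size of each mul and the relationships between them, and consider whether they can be fused into one large matmul.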
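
Whether the mul ops in this model can actually be fused depends on their shapes and data dependencies, which still has to be read off the timeline. As a minimal sketch of the idea only, assume three mul ops share one input (as the Q/K/V projections in self-attention typically do; that is an assumption here, not a finding). Concatenating the three weight matrices column-wise then gives the same results from a single larger product. The helper names `matmul` and `hconcat` are hypothetical.

```python
def matmul(a, b):
    """Naive matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(*mats):
    """Concatenate matrices column-wise (all must have the same row count)."""
    return [sum((list(r) for r in rows), []) for rows in zip(*mats)]


x = [[1.0, 2.0], [3.0, 4.0]]        # shared input, shape [2, 2]
w_q = [[1.0, 0.0], [0.0, 1.0]]      # three separate weights, each [2, 2]
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]

# Three separate products, one per mul op.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# One fused product against the concatenated weight, shape [2, 6].
fused = matmul(x, hconcat(w_q, w_k, w_v))

# Slice the fused output back into the three original results.
q2 = [row[0:2] for row in fused]
k2 = [row[2:4] for row in fused]
v2 = [row[4:6] for row in fused]

assert (q, k, v) == (q2, k2, v2)
```

The fused version launches one kernel instead of three, at the cost of an extra slice (or a strided view) on the output.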

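Data loading takes too long

Use the tiny dataset. The tiny dataset consists of the first 400,000 examples taken from the head of the full dataset, so test results differ from those measured on the full dataset.

Optimization plan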

@wangchaochaohu
Contributor

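Judging from the log, the CPU --> GPU transfers come from the data-reading part. However, trying the multi-process data reading approach used in YOLOv3 did not improve performance.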

@Xreki
Collaborator Author

Xreki commented Jul 25, 2019

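Question: the test script sets --fetch_steps 100. Does that mean a fetch happens only once every 100 steps? If every step fetched, would the speed be affected? How do competing frameworks fetch?

Answer from @guoshengCS: setting --fetch_steps 100 has a large effect on 8-GPU training speed, but --fetch_steps 5 and --fetch_steps 100 give about the same result. The effect on a single GPU is small; this still needs to be confirmed.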
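
The behaviour being asked about can be sketched as follows. This is an assumption about what --fetch_steps means (fetch results back to the host only on every N-th step), not a copy of train.py; `train` and `fake_run_step` are hypothetical stand-ins for the real training loop and executor call.

```python
def train(num_steps, fetch_steps, run_step):
    """Run num_steps steps, fetching results only every fetch_steps steps."""
    fetched = []
    for step in range(num_steps):
        want_fetch = (step % fetch_steps == 0)
        # Skipping the fetch avoids a GPU -> CPU copy and the sync it implies.
        result = run_step(fetch=want_fetch)
        if want_fetch:
            fetched.append((step, result))
    return fetched


def fake_run_step(fetch):
    # Stand-in for an executor call: returns a loss only when asked to fetch.
    return 0.0 if fetch else None


print(len(train(200, 100, fake_run_step)))  # fetches at steps 0 and 100
print(len(train(200, 5, fake_run_step)))    # fetches at steps 0, 5, ..., 195
```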

@wangchaochaohu
Contributor

wangchaochaohu commented Jul 25, 2019

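Judging from the log, the CPU --> GPU transfers come from the data-reading part. However, trying the multi-process data reading approach used in YOLOv3 did not improve performance.

On my machine (CUDA 10.0):

  • With the original code, export FLAGS_reader_queue_speed_test_mode=True gives only a small improvement, roughly from 1.86 to 1.92.
  • With the YOLOv3-style multi-process reader:
    • with export FLAGS_reader_queue_speed_test_mode=True, speed improves from about 1.86 to about 2.19;
    • with export FLAGS_reader_queue_speed_test_mode=False, there is no improvement.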

@Xreki
Collaborator Author

Xreki commented Jul 25, 2019

CPU -> GPU data copy analysis

Method

  • Add logging in fluid/memory/memcpy.cc
  • Set export GLOG_v=4 at run time
  • Set exec_strategy.num_threads=1
  • Result:
I0719 10:14:50.539557 70931 operator.cc:169] CUDAPlace(0) Op(increment), inputs:{X[@LR_DECAY_COUNTER@:int64_t[1]({})]}, outputs:{Out[@LR_DECAY_COUNTER@:int64_t[1]({})]}.
I0719 10:14:50.539577 70931 operator.cc:1011] expected_kernel_key:data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0719 10:14:50.539613 70931 operator.cc:190] CUDAPlace(0) Op(increment), inputs:{X[@LR_DECAY_COUNTER@:int64_t[1]({})]}, outputs:{Out[@LR_DECAY_COUNTER@:int64_t[1]({})]}.
I0719 10:14:50.539629 70931 operator.cc:169] CUDAPlace(0) Op(cast), inputs:{X[@LR_DECAY_COUNTER@:int64_t[1]({})]}, outputs:{Out[cast_0.tmp_0:[-1]({{}})]}.
I0719 10:14:50.539638 70931 operator.cc:1011] expected_kernel_key:data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0719 10:14:50.539671 70931 operator.cc:190] CUDAPlace(0) Op(cast), inputs:{X[@LR_DECAY_COUNTER@:int64_t[1]({})]}, outputs:{Out[cast_0.tmp_0:float[1]({})]}.
I0719 10:14:50.539690 70931 operator.cc:169] CUDAPlace(0) Op(elementwise_pow), inputs:{X[cast_0.tmp_0:float[1]({})], Y[tmp_52:float[1]({})]}, outputs:{Out[tmp_53:[-1]({{}})]}.
I0719 10:14:50.539710 70931 operator.cc:1011] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:14:50.539719 70931 operator.cc:1109] Transform Variable cast_0.tmp_0 from data_type[float]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:14:50.539731 70931 scope.cc:164] Create variable cast_0.tmp_0
I0719 10:14:50.539741 70931 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I0719 10:14:50.539767 70931 tensor_util.cu:120] TensorCopySync 1 from CPUPlace to CUDAPlace(0)
I0719 10:14:50.539816 70931 memcpy.cc:79] GpuMemcpyAsync:CPU->GPU
I0719 10:14:50.539881 70931 operator.cc:190] CUDAPlace(0) Op(elementwise_pow), inputs:{X[cast_0.tmp_0:float[1]({})], Y[tmp_52:float[1]({})]}, outputs:{Out[tmp_53:float[1]({})]}.

increment and cast both run on the CPU. elementwise_pow takes the output of cast as its input, which triggers a CPU -> GPU data transform.

I0719 10:14:50.539970 70931 operator.cc:169] CUDAPlace(0) Op(elementwise_mul), inputs:{X[cast_0.tmp_0:float[1]({})], Y[tmp_52:float[1]({})]}, outputs:{Out[tmp_55:[-1]({{}})]}.
I0719 10:14:50.539979 70931 operator.cc:1011] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:14:50.539988 70931 operator.cc:1109] Transform Variable cast_0.tmp_0 from data_type[float]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:14:50.539994 70931 scope.cc:164] Create variable cast_0.tmp_0
I0719 10:14:50.540000 70931 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I0719 10:14:50.540010 70931 tensor_util.cu:120] TensorCopySync 1 from CPUPlace to CUDAPlace(0)
I0719 10:14:50.540036 70931 memcpy.cc:79] GpuMemcpyAsync:CPU->GPU
I0719 10:14:50.540077 70931 operator.cc:190] CUDAPlace(0) Op(elementwise_mul), inputs:{X[cast_0.tmp_0:float[1]({})], Y[tmp_52:float[1]({})]}, outputs:{Out[tmp_55:float[1]({})]}.

elementwise_mul also takes the output of cast as its input, so it triggers a CPU -> GPU data transform as well.

I0719 10:15:11.189215 70931 operator.cc:169] CUDAPlace(0) Op(sum), inputs:{X[fc_93.tmp_0:float[32, 125, 1024]({}), fc_95.tmp_1:float[32, 125, 1024]({}), transpose_67.tmp_0:float[32, 125, 1024]({})]}, outputs:{Out[dropout_59.tmp_0:float[32, 125, 1024]({})]}.
I0719 10:15:11.189225 70931 operator.cc:1011] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:15:11.189285 70931 memcpy.cc:71] stream GpuMemcpyAsync:CPU->GPU
I0719 10:15:11.189311 70931 operator.cc:190] CUDAPlace(0) Op(sum), inputs:{X[fc_93.tmp_0:float[32, 125, 1024]({}), fc_95.tmp_1:float[32, 125, 1024]({}), transpose_67.tmp_0:float[32, 125, 1024]({})]}, outputs:{Out[dropout_59.tmp_0:float[32, 125, 1024]({})]}.

sum adds three LoDTensors and has to pass the addresses of its input tensors to the GPU. This happens 13 times in total.

I0719 10:15:11.145866 70930 operator.cc:169] CUDAPlace(0) Op(lookup_table), inputs:{Ids[read_file_0.tmp_0:int64_t[32, 126, 1]({})], W[src_word_emb_table:float[4579, 1024]({})]}, outputs:{Out[embedding_0.tmp_0:[-1]({{}})]}.
I0719 10:15:11.145892 70930 operator.cc:1011] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:15:11.146122 70930 operator.cc:190] CUDAPlace(0) Op(lookup_table), inputs:{Ids[read_file_0.tmp_0:int64_t[32, 126, 1]({})], W[src_word_emb_table:float[4579, 1024]({})]}, outputs:{Out[embedding_0.tmp_0:float[32, 126, 1024]({})]}.
I0719 10:15:11.146148 70930 operator.cc:169] CUDAPlace(0) Op(scale), inputs:{X[embedding_0.tmp_0:float[32, 126, 1024]({})]}, outputs:{Out[scale_0.tmp_0:[-1]({{}})]}.
I0719 10:15:11.146173 70930 operator.cc:1011] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:15:11.146220 70930 operator.cc:190] CUDAPlace(0) Op(scale), inputs:{X[embedding_0.tmp_0:float[32, 126, 1024]({})]}, outputs:{Out[scale_0.tmp_0:float[32, 126, 1024]({})]}.
I0719 10:15:11.146247 70930 operator.cc:169] CUDAPlace(0) Op(lookup_table), inputs:{Ids[read_file_0.tmp_1:int64_t[32, 126, 1]({})], W[src_pos_enc_table:float[257, 1024]({})]}, outputs:{Out[embedding_0.tmp_0:float[32, 126, 1024]({})]}.
I0719 10:15:11.146257 70930 operator.cc:1011] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0719 10:15:11.146288 70930 operator.cc:190] CUDAPlace(0) Op(lookup_table), inputs:{Ids[read_file_0.tmp_1:int64_t[32, 126, 1]({})], W[src_pos_enc_table:float[257, 1024]({})]}, outputs:{Out[embedding_0.tmp_0:float[32, 126, 1024]({})]}.

lookup_table takes the data that was read in directly as its input.
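
The attribution done by hand above (which op each GpuMemcpyAsync:CPU->GPU line belongs to) can be automated with a small script. This is a minimal sketch under stated assumptions: the start and end markers `operator.cc:169` and `operator.cc:190` are taken from this particular log and will differ in other builds, the memcpy line is the one added by hand in memcpy.cc, and the function name `count_h2d_copies` is hypothetical. The sample lines are abbreviated from the log above, with the `inputs:{...}` payload elided.

```python
import re
from collections import Counter

# Markers as they appear in this log; the line numbers are build-specific.
OP_START = re.compile(r"operator\.cc:169\] \S+ Op\((\w+)\)")
OP_END = re.compile(r"operator\.cc:190\]")
MEMCPY = "GpuMemcpyAsync:CPU->GPU"


def count_h2d_copies(lines):
    """Count host-to-device copies logged between each op's start and end line."""
    counts = Counter()
    current_op = None
    for line in lines:
        match = OP_START.search(line)
        if match:
            current_op = match.group(1)
        elif OP_END.search(line):
            current_op = None
        elif MEMCPY in line and current_op is not None:
            counts[current_op] += 1
    return counts


# Abbreviated from the log above; "{...}" replaces the original payload.
LOG = """\
I0719 10:14:50.539629 70931 operator.cc:169] CUDAPlace(0) Op(cast), inputs:{...}
I0719 10:14:50.539671 70931 operator.cc:190] CUDAPlace(0) Op(cast), inputs:{...}
I0719 10:14:50.539690 70931 operator.cc:169] CUDAPlace(0) Op(elementwise_pow), inputs:{...}
I0719 10:14:50.539816 70931 memcpy.cc:79] GpuMemcpyAsync:CPU->GPU
I0719 10:14:50.539881 70931 operator.cc:190] CUDAPlace(0) Op(elementwise_pow), inputs:{...}
I0719 10:15:11.189215 70931 operator.cc:169] CUDAPlace(0) Op(sum), inputs:{...}
I0719 10:15:11.189285 70931 memcpy.cc:71] stream GpuMemcpyAsync:CPU->GPU
I0719 10:15:11.189311 70931 operator.cc:190] CUDAPlace(0) Op(sum), inputs:{...}
""".splitlines()

copies = count_h2d_copies(LOG)
for op, n in copies.most_common():
    print(op, n)
```

The simple start/end tracking only works when log lines from different ops do not interleave, which is presumably why the method sets num_threads=1.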

@chengduoZH
Contributor

image

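This sync is caused by the CPU -> GPU data transfer. Inside an op, if a tensor lives on the CPU but the op runs on the GPU, the data has to be copied from CPU to GPU, and that copy invokes a sync.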

@Xreki changed the title from "Optimize the performance ofTransformer-Big on 1 V100 GPU" to "Optimize the performance of Transformer-Big on 1 V100 GPU" on Jul 25, 2019
@wangchaochaohu
Contributor

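Judging from the log, the CPU --> GPU transfers come from the data-reading part. However, trying the multi-process data reading approach used in YOLOv3 did not improve performance.

On my machine (CUDA 10.0):

  • With the original code, export FLAGS_reader_queue_speed_test_mode=True gives only a small improvement, roughly from 1.86 to 1.92.

  • With the YOLOv3-style multi-process reader:

    • with export FLAGS_reader_queue_speed_test_mode=True, speed improves from about 1.86 to about 2.19;
    • with export FLAGS_reader_queue_speed_test_mode=False, there is no improvement.

The multi-process implementation needs a review from @邓凯鹏 to confirm that the test results are correct.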

@wangchaochaohu
Contributor

wangchaochaohu commented Aug 1, 2019

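Optimizing the dropout implementation

1. Implement dropout_cudnn_op on top of the dropout API provided by cuDNN, PaddlePaddle/Paddle#18954
  • Problems encountered:
    • Mask shape mismatch: to save GPU memory, cuDNN stores the mask as bits.
    • Cache problem: our op tests do not isolate the forward tests, so creating a cache var with the same name causes one var to be shared.
  • Speedup on the transformer-big model: about 10%
    • Environment: V100 + CUDA 10.0
    • Single-GPU training speed: 1.852 step/s -> 2.040 step/s
  • Op speedup (observed in the profile data): 3.81724 ms -> 2.32318 ms
  • Fixing randomness on multiple GPUs
    • Problem: on a single GPU, setting a seed removes the randomness, but on multiple GPUs randomness remains.
    • Test method: set a seed and check whether repeated multi-GPU runs produce identical output.

Multi-GPU randomness test results:
Test method: set enable_ce=True and run transformer-big several times with the cuDNN dropout implementation. The results differ between runs (visible from whether the loss values match).
Result: the multi-GPU randomness problem exists.
Current options:
(1) Re-initialize the cuDNN dropout descriptors on every iteration. This slows training down considerably, to below the speed of the CUDA implementation.
(2) Investigate why the cache is inconsistent across GPUs; in progress.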
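
The mask shape mismatch mentioned above comes from the difference between a mask that holds one value per input element and a mask packed one bit per element. The sketch below illustrates that difference in plain Python. It is not cuDNN's actual reserve-space layout, which is opaque to the caller; `make_mask`, `pack_bits`, and `unpack_bits` are hypothetical helpers.

```python
import random


def make_mask(n, dropout_prob, seed):
    """One 0/1 entry per element; a fixed seed gives a reproducible mask."""
    rng = random.Random(seed)
    return [1 if rng.random() >= dropout_prob else 0 for _ in range(n)]


def pack_bits(mask):
    """Pack a 0/1 mask into bytes, eight elements per byte."""
    packed = bytearray((len(mask) + 7) // 8)
    for i, keep in enumerate(mask):
        if keep:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, n):
    """Recover the first n mask entries from the packed bytes."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n)]


mask = make_mask(1000, dropout_prob=0.3, seed=42)
packed = pack_bits(mask)

print(len(mask), "entries unpacked vs", len(packed), "bytes packed")
assert unpack_bits(packed, len(mask)) == mask
# Same seed, same mask: the single-process analogue of fixing the seed.
assert make_mask(1000, 0.3, seed=42) == mask
```

A test that compares the op's mask output against a reference mask of the input's shape cannot be used unchanged with the packed form, because the two have different sizes.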

2. Optimize the CUDA implementation of dropout: PaddlePaddle/Paddle#19136

Average time of the dropout op in transformer-big (enable_ce), measured with the PaddlePaddle profiler: 1.16155 ms -> 0.344537 ms
Environment: V100 + CUDA 10.0

Model              Before   After   Speedup
transformer-big    1.852    2.047   10%
transformer-base   5.503    6.240   12%

Verified: with the modified CUDA implementation there is no multi-GPU randomness problem.

@wangchaochaohu
Contributor

Label smooth optimization: PaddlePaddle/Paddle#19175
Tested on the transformer-big model: no performance improvement.
Average time of the single op in transformer-big, measured with the PaddlePaddle profiler: 3.51607 ms -> 2.39707 ms

@wangchaochaohu
Contributor

wangchaochaohu commented Aug 22, 2019

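The cast and increment ops pick the CPU kernel because, for these two ops, our code chooses between the CPU and GPU kernel according to whether the input data is on the CPU or on the GPU.
After changing the code so that both ops use the GPU kernel type, the transformer-big training speed changed as follows:
1.852 -> 1.844 (step/s)
In essence, the data transform cannot be avoided; the only question is at which op it happens.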
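
The point that the transform only moves from one op to another can be illustrated with a toy placement model for a single step. Everything here is a simplification and an assumption, not Paddle's kernel-selection code: the counter is assumed to start on the CPU, tmp_52 is assumed to be on the GPU, a transformed input is assumed to be a temporary copy (so the original variable keeps its place), and `run`, `follow_input`, and `always_gpu` are hypothetical names.

```python
def run(ops, initial_places, kernel_place_of):
    """Return the names of ops at which an input has to change device."""
    places = dict(initial_places)
    transforms = []
    for name, inputs, output in ops:
        kernel = kernel_place_of(name, [places[v] for v in inputs])
        # An input living elsewhere is copied to the kernel's device first.
        if any(places[v] != kernel for v in inputs):
            transforms.append(name)
        places[output] = kernel
    return transforms


def follow_input(name, input_places):
    # Modeled default: increment and cast run wherever their input lives.
    if name in ("increment", "cast"):
        return input_places[0]
    return "GPU"


def always_gpu(name, input_places):
    # Modeled change: force the GPU kernel for every op.
    return "GPU"


# Op chain and variable names taken from the log earlier in this thread.
OPS = [
    ("increment", ["@LR_DECAY_COUNTER@"], "@LR_DECAY_COUNTER@"),
    ("cast", ["@LR_DECAY_COUNTER@"], "cast_0.tmp_0"),
    ("elementwise_pow", ["cast_0.tmp_0", "tmp_52"], "tmp_53"),
    ("elementwise_mul", ["cast_0.tmp_0", "tmp_52"], "tmp_55"),
]
INITIAL = {"@LR_DECAY_COUNTER@": "CPU", "tmp_52": "GPU"}  # assumed placement

print(run(OPS, INITIAL, follow_input))
print(run(OPS, INITIAL, always_gpu))
```

In this model the default policy transforms at elementwise_pow and elementwise_mul, matching the log, while forcing GPU kernels moves the transform up to increment. The copies involve a single scalar either way, which is consistent with the measured speed being essentially unchanged.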
