The following results were obtained on openSUSE with the kernel-desktop kernel:
~/cuda-workspace/hybrid_sort $ cd Release/
~/cuda-workspace/hybrid_sort/Release $ ./hybrid_sort
8
67108864 1023
Using device 0: GeForce GTX 765M (PTX version 300, SM300, 4 SMs, 2017 free / 2047 total MB physmem, ECC off)
cache size selected: 32768
selected block size on cpu: 2048
0.803887s wall, 0.830000s user + 0.000000s system = 0.830000s CPU (103.2%)
cache size selected: 32768
cache factor is: 1block length is: 8192 average time for merge: 0.916913
cache size selected: 32768
cache factor is: 2block length is: 4096 average time for merge: 0.825981
cache size selected: 32768
cache factor is: 4block length is: 2048 average time for merge: 0.743128
cache size selected: 32768
cache factor is: 8block length is: 1024 average time for merge: 0.654696
~/cuda-workspace/hybrid_sort/Release $ ./hybrid_sort
8
67108864 1023
Using device 0: GeForce GTX 765M (PTX version 300, SM300, 4 SMs, 2017 free / 2047 total MB physmem, ECC off)
cache size selected: 32768
selected block size on cpu: 2048
0.805790s wall, 0.850000s user + 0.000000s system = 0.850000s CPU (105.5%)
cache size selected: 32768
cache factor is: 1block length is: 8192 average time for merge: 0.920401
cache size selected: 32768
cache factor is: 2block length is: 4096 average time for merge: 0.829736
cache size selected: 32768
cache factor is: 4block length is: 2048 average time for merge: 0.741284
cache size selected: 32768
cache factor is: 8block length is: 1024 average time for merge: 0.656434
cache size selected: 32768
cache factor is: 16block length is: 512 average time for merge: 0.570493
cache size selected: 32768
cache factor is: 32block length is: 256 average time for merge: 0.482025
~/cuda-workspace/hybrid_sort/Release $
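The results above come from running the merge block algorithm serially, varying only the block size. They show that the main factor affecting the merge algorithm is the block size, not the number of blocks. So, provided the performance of the other parts of the algorithm is preserved, a smaller block size should be used, and after the multiway algorithm a block size smaller than the initial one should be used. Using different block sizes in these two places should improve the algorithm's performance.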
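The relation between cache factor and block length in the log can be reproduced with a short sketch. It assumes 4-byte elements, which is inferred from the log (cache size 32768 and cache factor 1 give a block length of 8192) and is not stated in the source; the function name `block_length` is hypothetical.

```python
def block_length(cache_size_bytes, cache_factor, element_bytes=4):
    """Elements per block for a given cache factor.

    element_bytes=4 is an assumption inferred from the log
    (32768 bytes / 8192 elements), not confirmed by the source.
    """
    return cache_size_bytes // (element_bytes * cache_factor)


# Block lengths for the cache factors that appear in the runs.
factors = [1, 2, 4, 8, 16, 32, 64]
lengths = [block_length(32768, f) for f in factors]
for f, n in zip(factors, lengths):
    print(f"cache factor {f}: block length {n}")
```

Doubling the cache factor halves the block length, which is why the serial merge time falls steadily from one line of the log to the next.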
8
67108864 1023
Using device 0: GeForce GTX 765M (PTX version 300, SM300, 4 SMs, 2017 free / 2047 total MB physmem, ECC off)
cache size selected: 32768
selected block size on cpu: 2048
0.801105s wall, 0.850000s user + 0.000000s system = 0.850000s CPU (106.1%)
cache size selected: 32768
cache factor is: 1block length is: 8192 average time for merge: 2.08887
cache size selected: 32768
cache factor is: 2block length is: 4096 average time for merge: 2.08289
cache size selected: 32768
cache factor is: 4block length is: 2048 average time for merge: 2.08202
cache size selected: 32768
cache factor is: 8block length is: 1024 average time for merge: 2.09634
cache size selected: 32768
cache factor is: 16block length is: 512 average time for merge: 2.13182
cache size selected: 32768
cache factor is: 32block length is: 256 average time for merge: 2.15656
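The results above cover two parts together: the merge sort plus the merging between blocks. Subtracting the merge sort times gives the run times of the inter-block merge alone: 1.168469, 1.253154, 1.340736, 1.439906, 1.561327, 1.674535. As expected, the more blocks there are, the longer the merge between blocks takes. Judging from these serial runs, a cache factor of 4 gives the best performance as a trade-off between the two.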
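The subtraction can be checked directly against the logged numbers. The sketch below uses the second serial run for the sort-only times, since that is the run that covers cache factors 1 through 32; the variable names are illustrative only.

```python
# Cache factors and the matching times copied from the logs above.
factors = [1, 2, 4, 8, 16, 32]
sort_only = [0.920401, 0.829736, 0.741284, 0.656434, 0.570493, 0.482025]
sort_plus_merge = [2.08887, 2.08289, 2.08202, 2.09634, 2.13182, 2.15656]

# Inter-block merge time = combined time minus in-cache sort time.
merge_only = [round(total - sort, 6)
              for sort, total in zip(sort_only, sort_plus_merge)]

# The cache factor with the lowest combined time.
best_factor = factors[sort_plus_merge.index(min(sort_plus_merge))]

print(merge_only)
print("best cache factor:", best_factor)
```

The sort time drops by roughly 0.09 s per doubling of the cache factor while the merge time rises by a similar or larger amount, which is why the combined time is nearly flat.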
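Below are the results of the in-cache merge sort for different chunk (group) sizes and different block sizes. The results show that the effect of block size on in-cache sorting is weaker when running in parallel, but the trend still holds: a smaller block size lowers the computation time. While the chunk size is so large that the number of threads stays below the CPU's maximum thread count, the chunk size has a very large effect on sorting time. Once the chunks are small enough, and the number of chunks is an integer multiple of the CPU's maximum thread count, the highest performance is reached and the chunk size has little further effect on sorting time. This suggests that a relatively small chunk size can be used for sorting.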
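A rough way to see why the chunk size stops mattering is to count chunks against threads. The sketch below assumes one chunk is sorted per thread at a time and that the machine runs 8 threads (presumably the `8` printed at the start of each run); both are assumptions, and `chunk_plan` is a hypothetical helper, not part of hybrid_sort.

```python
def chunk_plan(data_length, chunk_size, max_threads):
    """Return (number of chunks, busy threads, scheduling rounds).

    Assumes each thread sorts one whole chunk at a time.
    """
    chunks = -(-data_length // chunk_size)   # ceiling division
    busy = min(chunks, max_threads)          # threads that have work
    rounds = -(-chunks // max_threads)       # passes needed over all chunks
    return chunks, busy, rounds


for chunk_size in (67108864, 33554432, 16777216, 8388608, 4194304):
    print(chunk_size, chunk_plan(67108864, chunk_size, 8))
```

Under these assumptions, chunk sizes of 8388608 and 4194304 both keep all 8 threads busy on the full data set, which matches the nearly identical times in the log for those two sizes.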
8
67108864 1023
Using device 0: GeForce GTX 765M (PTX version 300, SM300, 4 SMs, 2017 free / 2047 total MB physmem, ECC off)
cache size selected: 32768
selected block size on cpu: 2048
0.795491s wall, 0.830000s user + 0.000000s system = 0.830000s CPU (104.3%)
cache size selected: 32768
cache factor is: 1 block length is: 8192
chunk size is: 67108864 data length is: 67108864 average time for merge: 0.91912
chunk size is: 33554432 data length is: 67108864 average time for merge: 0.486693
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.24775
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.256977
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.174309
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.186918
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.203507
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.227964
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.165601
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.170621
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.172789
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.180197
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.199743
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.210536
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.219386
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.232904
cache size selected: 32768
cache factor is: 2 block length is: 4096
chunk size is: 67108864 data length is: 67108864 average time for merge: 0.83019
chunk size is: 33554432 data length is: 67108864 average time for merge: 0.442707
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.225554
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.234255
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.15889
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.17038
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.187008
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.211858
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.148872
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.153968
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.157522
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.164075
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.182631
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.191836
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.202545
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.219357
cache size selected: 32768
cache factor is: 4 block length is: 2048
chunk size is: 67108864 data length is: 67108864 average time for merge: 0.747548
chunk size is: 33554432 data length is: 67108864 average time for merge: 0.398258
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.205362
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.212345
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.141225
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.154646
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.167641
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.18799
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.13287
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.139645
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.143338
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.143889
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.162715
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.171688
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.178705
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.189666
cache size selected: 32768
cache factor is: 8 block length is: 1024
chunk size is: 67108864 data length is: 67108864 average time for merge: 0.656252
chunk size is: 33554432 data length is: 67108864 average time for merge: 0.358028
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.184249
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.189341
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.125875
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.135041
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.147421
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.169079
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.117914
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.121668
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.123751
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.129352
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.145142
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.149592
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.160814
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.166354
cache size selected: 32768
cache factor is: 16 block length is: 512
chunk size is: 67108864 data length is: 67108864 average time for merge: 0.570701
chunk size is: 33554432 data length is: 67108864 average time for merge: 0.310276
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.160818
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.166627
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.110389
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.119363
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.128531
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.14439
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.103421
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.108355
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.109315
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.113116
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.123825
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.130646
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.138185
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.144975
cache size selected: 32768
cache factor is: 32 block length is: 256
chunk size is: 67108864 data length is: 67108864 average time for merge: 0.487251
chunk size is: 33554432 data length is: 67108864 average time for merge: 0.265322
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.1358
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.147134
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.0945607
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.10151
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.111193
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.120799
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.0875762
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.0912467
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.0945104
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.0953401
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.104437
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.111081
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.115701
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.123957
cache size selected: 32768
cache factor is: 64 block length is: 128
chunk size is: 67108864 data length is: 67108864 average time for merge: 0.387165
chunk size is: 33554432 data length is: 67108864 average time for merge: 0.212687
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.109799
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.11868
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.0766825
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.0822685
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.0897224
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.0964838
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.0707252
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.0751364
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.0758591
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.0774116
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.0851278
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.0898697
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.0948851
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.0999488
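Below is in-cache sorting plus the merge step. The chunk size also affects merging, but the effect of the thread count on merging is not very regular: there is a middle range with no large change, although the overall trend is that smaller chunks reduce the time. The block size still has the opposite effect on merging: the smaller the block, the longer the merge takes. Surprisingly, the total time changes little, and the best performance occurs with the largest block size and the smallest chunk size. Perhaps only a further reduction of the chunk size can improve the combined performance of these two parts.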
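To compare configurations without reading the log by eye, the lines can be parsed and the fastest full-length run picked out. This is a sketch only: the regular expressions are written against the line format shown in this file, and `parse_log` and `best_full_run` are hypothetical helper names.

```python
import re

FACTOR_RE = re.compile(r"cache factor is:\s*(\d+)\s*block length is:\s*(\d+)")
CHUNK_RE = re.compile(
    r"chunk size is:\s*(\d+)\s+data length is:\s*(\d+)\s+"
    r"average time for merge:\s*([0-9.]+)"
)


def parse_log(text):
    """Return (block length, chunk size, data length, seconds) tuples."""
    rows = []
    block = None
    for line in text.splitlines():
        m = FACTOR_RE.search(line)
        if m:
            block = int(m.group(2))  # applies to the chunk lines that follow
            continue
        m = CHUNK_RE.search(line)
        if m and block is not None:
            rows.append((block, int(m.group(1)), int(m.group(2)),
                         float(m.group(3))))
    return rows


def best_full_run(rows, total_length):
    """Fastest row among those that processed the whole data set."""
    full = [r for r in rows if r[2] == total_length]
    return min(full, key=lambda r: r[3])


# A few lines copied from the run below, used as sample input.
sample = """\
cache factor is: 1 block length is: 8192
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.631338
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.580517
cache factor is: 64 block length is: 128
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.712807
"""
rows = parse_log(sample)
print(best_full_run(rows, 67108864))
```

On the sample lines, the fastest full-length run is the one with block length 8192 and chunk size 4194304, in line with the observation above.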
~/cuda-workspace/hybrid_sort/Release $ ./hybrid_sort
8
67108864 1023
Using device 0: GeForce GTX 765M (PTX version 300, SM300, 4 SMs, 2017 free / 2047 total MB physmem, ECC off)
cache size selected: 32768
selected block size on cpu: 2048
0.786732s wall, 0.820000s user + 0.000000s system = 0.820000s CPU (104.2%)
cache size selected: 32768
cache factor is: 1 block length is: 8192
chunk size is: 67108864 data length is: 67108864 average time for merge: 2.09265
chunk size is: 33554432 data length is: 67108864 average time for merge: 1.05356
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.588733
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.701094
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.410754
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.472931
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.548714
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.631338
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.386273
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.399879
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.420782
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.4391
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.473814
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.500381
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.538567
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.580517
cache size selected: 32768
cache factor is: 2 block length is: 4096
chunk size is: 67108864 data length is: 67108864 average time for merge: 2.0918
chunk size is: 33554432 data length is: 67108864 average time for merge: 1.05622
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.59736
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.719049
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.422236
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.491487
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.552852
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.66002
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.399904
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.412146
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.434303
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.458178
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.49662
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.528176
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.562708
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.603823
cache size selected: 32768
cache factor is: 4 block length is: 2048
chunk size is: 67108864 data length is: 67108864 average time for merge: 2.07545
chunk size is: 33554432 data length is: 67108864 average time for merge: 1.03659
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.596902
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.718368
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.425974
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.490601
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.572045
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.655591
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.407455
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.424598
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.445244
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.475201
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.514222
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.544235
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.58062
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.621772
cache size selected: 32768
cache factor is: 8 block length is: 1024
chunk size is: 67108864 data length is: 67108864 average time for merge: 2.0898
chunk size is: 33554432 data length is: 67108864 average time for merge: 1.05147
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.605295
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.734926
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.437639
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.506575
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.589041
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.691027
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.422249
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.433473
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.456193
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.487589
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.527793
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.562276
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.599604
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.641975
cache size selected: 32768
cache factor is: 16 block length is: 512
chunk size is: 67108864 data length is: 67108864 average time for merge: 2.1304
chunk size is: 33554432 data length is: 67108864 average time for merge: 1.08524
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.629332
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.773615
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.462019
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.539963
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.62904
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.720721
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.436158
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.451007
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.47729
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.503452
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.546306
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.582017
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.621248
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.66347
cache size selected: 32768
cache factor is: 32 block length is: 256
chunk size is: 67108864 data length is: 67108864 average time for merge: 2.1566
chunk size is: 33554432 data length is: 67108864 average time for merge: 1.09542
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.639681
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.793271
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.475254
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.554001
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.648141
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.745882
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.451599
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.464668
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.487042
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.520059
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.563853
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.603129
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.644905
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.686318
cache size selected: 32768
cache factor is: 64 block length is: 128
chunk size is: 67108864 data length is: 67108864 average time for merge: 2.16871
chunk size is: 33554432 data length is: 67108864 average time for merge: 1.11048
chunk size is: 16777216 data length is: 50331648 average time for merge: 0.646732
chunk size is: 16777216 data length is: 67108864 average time for merge: 0.808172
chunk size is: 8388608 data length is: 41943040 average time for merge: 0.486099
chunk size is: 8388608 data length is: 50331648 average time for merge: 0.569189
chunk size is: 8388608 data length is: 58720256 average time for merge: 0.670774
chunk size is: 8388608 data length is: 67108864 average time for merge: 0.766087
chunk size is: 4194304 data length is: 37748736 average time for merge: 0.462948
chunk size is: 4194304 data length is: 41943040 average time for merge: 0.471599
chunk size is: 4194304 data length is: 46137344 average time for merge: 0.503479
chunk size is: 4194304 data length is: 50331648 average time for merge: 0.538941
chunk size is: 4194304 data length is: 54525952 average time for merge: 0.580001
chunk size is: 4194304 data length is: 58720256 average time for merge: 0.623547
chunk size is: 4194304 data length is: 62914560 average time for merge: 0.668482
chunk size is: 4194304 data length is: 67108864 average time for merge: 0.712807
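Below is a test of the multi way merge in the second sorting step, that is, computing the quantile array and copying the data accordingly. The results show that both the chunk size and the block size affect the multi way merge, with the chunk size having the more pronounced effect: when the chunk size is so large that fewer threads than the maximum are used, reducing the chunk size greatly improves multi way merge performance, but once the maximum thread count is reached, reducing the chunk size further significantly lengthens the run time. Blindly reducing the chunk size is therefore not advisable. While the thread count is below the maximum, the block size has little effect on performance, but as the chunk size keeps shrinking, a smaller block size also noticeably degrades multi way merge performance. One option, given a suitable chunk size, is to use a variable block size: a somewhat larger block in the first stage to improve merge performance, and in the second stage a smaller block size for the multi way merge output, as long as the final merge step gains enough performance to justify it.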
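The idea of cutting sorted runs at quantile boundaries and then copying each slice to its output partition can be sketched as follows. This is not the hybrid_sort implementation: the splitters here are chosen by fully sorting the union of the runs, which is a deliberately simple stand-in for a real quantile computation, and `split_runs` and `multiway_merge` are hypothetical names.

```python
import bisect
import heapq


def split_runs(runs, parts):
    """For each sorted run, return the cut indices delimiting `parts` slices."""
    # Stand-in splitter choice: evenly spaced keys of the sorted union.
    # A real implementation would select quantiles without a full sort.
    merged = sorted(x for run in runs for x in run)
    if not merged:
        return [[0] * (parts + 1) for _ in runs]
    splitters = [merged[len(merged) * i // parts] for i in range(1, parts)]
    # Binary search locates each splitter's position inside every run.
    return [[0] + [bisect.bisect_left(run, s) for s in splitters] + [len(run)]
            for run in runs]


def multiway_merge(runs, parts):
    """Merge sorted runs into `parts` sorted, non-overlapping partitions."""
    cuts = split_runs(runs, parts)
    out = []
    for p in range(parts):
        # Slice p of every run holds only keys belonging to partition p,
        # so each partition can be merged independently (e.g. per thread).
        pieces = [run[c[p]:c[p + 1]] for run, c in zip(runs, cuts)]
        out.append(list(heapq.merge(*pieces)))
    return out


runs = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
partitions = multiway_merge(runs, 3)
print(partitions)
```

Each run needs one binary search per splitter, so the bookkeeping grows with the number of runs times the number of partitions. That is one plausible reason, though not one confirmed by the source, why the multi way time climbs once chunks and blocks become very small.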
~/cuda-workspace/hybrid_sort/Release $ ./hybrid_sort
8
67108864 1023
Using device 0: GeForce GTX 765M (PTX version 300, SM300, 4 SMs, 2017 free / 2047 total MB physmem, ECC off)
cache size selected: 32768
selected block size on cpu: 2048
cache size selected: 32768
cache factor is: 1 block length is: 8192
chunk size is: 67108864 average time for merge: 2.05768 for multi way: 0.076606
chunk size is: 33554432 average time for merge: 1.03715 for multi way: 0.0459809
chunk size is: 16777216 average time for merge: 0.692317 for multi way: 0.0425379
chunk size is: 8388608 average time for merge: 0.620033 for multi way: 0.0421687
chunk size is: 4194304 average time for merge: 0.576575 for multi way: 0.0448253
chunk size is: 2097152 average time for merge: 0.522828 for multi way: 0.0550891
chunk size is: 1048576 average time for merge: 0.471611 for multi way: 0.0983638
cache size selected: 32768
cache factor is: 2 block length is: 4096
chunk size is: 67108864 average time for merge: 2.05825 for multi way: 0.0758435
chunk size is: 33554432 average time for merge: 1.03814 for multi way: 0.0470952
chunk size is: 16777216 average time for merge: 0.711597 for multi way: 0.0431474
chunk size is: 8388608 average time for merge: 0.643027 for multi way: 0.0429882
chunk size is: 4194304 average time for merge: 0.598387 for multi way: 0.0454451
chunk size is: 2097152 average time for merge: 0.545002 for multi way: 0.0771396
chunk size is: 1048576 average time for merge: 0.492671 for multi way: 0.155252
cache size selected: 32768
cache factor is: 4 block length is: 2048
chunk size is: 67108864 average time for merge: 2.06221 for multi way: 0.0759497
chunk size is: 33554432 average time for merge: 1.04802 for multi way: 0.048682
chunk size is: 16777216 average time for merge: 0.730287 for multi way: 0.0432062
chunk size is: 8388608 average time for merge: 0.663259 for multi way: 0.0441254
chunk size is: 4194304 average time for merge: 0.61728 for multi way: 0.0504086
chunk size is: 2097152 average time for merge: 0.560828 for multi way: 0.100869
chunk size is: 1048576 average time for merge: 0.506825 for multi way: 0.226741
cache size selected: 32768
cache factor is: 8 block length is: 1024
chunk size is: 67108864 average time for merge: 2.06856 for multi way: 0.0765422
chunk size is: 33554432 average time for merge: 1.05385 for multi way: 0.0545177
chunk size is: 16777216 average time for merge: 0.747647 for multi way: 0.0440748
chunk size is: 8388608 average time for merge: 0.683525 for multi way: 0.0441462
chunk size is: 4194304 average time for merge: 0.638974 for multi way: 0.0681523
chunk size is: 2097152 average time for merge: 0.578149 for multi way: 0.168184
chunk size is: 1048576 average time for merge: 0.530845 for multi way: 0.356954
cache size selected: 32768
cache factor is: 16 block length is: 512
chunk size is: 67108864 average time for merge: 2.09433 for multi way: 0.0778313
chunk size is: 33554432 average time for merge: 1.05189 for multi way: 0.0538024
chunk size is: 16777216 average time for merge: 0.749907 for multi way: 0.0409918
chunk size is: 8388608 average time for merge: 0.684176 for multi way: 0.0481309
chunk size is: 4194304 average time for merge: 0.647002 for multi way: 0.0939164
chunk size is: 2097152 average time for merge: 0.600912 for multi way: 0.211201
chunk size is: 1048576 average time for merge: 0.548108 for multi way: 0.505834
cache size selected: 32768
cache factor is: 32 block length is: 256
chunk size is: 67108864 average time for merge: 2.08852 for multi way: 0.0794089
chunk size is: 33554432 average time for merge: 1.06319 for multi way: 0.0520707
chunk size is: 16777216 average time for merge: 0.767624 for multi way: 0.0438199
chunk size is: 8388608 average time for merge: 0.705063 for multi way: 0.0730891
chunk size is: 4194304 average time for merge: 0.666322 for multi way: 0.155514
chunk size is: 2097152 average time for merge: 0.612668 for multi way: 0.406442
chunk size is: 1048576 average time for merge: 0.579216 for multi way: 0.852891
cache size selected: 32768
cache factor is: 64 block length is: 128
chunk size is: 67108864 average time for merge: 2.14445 for multi way: 0.0872126
chunk size is: 33554432 average time for merge: 1.09435 for multi way: 0.0623436
chunk size is: 16777216 average time for merge: 0.800438 for multi way: 0.0586079
chunk size is: 8388608 average time for merge: 0.750248 for multi way: 0.101817
chunk size is: 4194304 average time for merge: 0.702751 for multi way: 0.229587
chunk size is: 2097152 average time for merge: 0.650344 for multi way: 0.521058
chunk size is: 1048576 average time for merge: 0.60246 for multi way: 1.40526
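Below is the performance test of the complete first and second stages. As before, once the thread count reaches its maximum, merge sort is affected only by the block size, so the results suggest that using a variable block size is feasible. On the current machine, performance is best with a block length of 512 and 8 threads.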
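The cache factor and block length pairs printed in these runs (1 and 8192, 2 and 4096, up to 64 and 128, all with `cache size selected: 32768`) are consistent with dividing the cache size, counted in elements, by the cache factor. A minimal sketch of that relationship follows; the function name `block_length` and the 4-byte element size are assumptions made for illustration and are not taken from the hybrid_sort source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the hybrid_sort source): derive the merge
// block length, in elements, from the selected cache size and a cache factor.
// Assumption: elements are 4 bytes wide, so a 32768-byte cache holds 8192.
inline std::size_t block_length(std::size_t cache_bytes,
                                std::size_t elem_bytes,
                                std::size_t cache_factor) {
    assert(elem_bytes > 0 && cache_factor > 0);
    return cache_bytes / elem_bytes / cache_factor;
}
```

With `cache_bytes = 32768` and `elem_bytes = 4`, cache factors 1, 16 and 64 give 8192, 512 and 128, matching the pairs in the log.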
~/cuda-workspace/hybrid_sort/Release $ ./hybrid_sort
8
67108864 1023
Using device 0: GeForce GTX 765M (PTX version 300, SM300, 4 SMs, 2017 free / 2047 total MB physmem, ECC off)
cache size selected: 32768
selected block size on cpu: 2048
successfully sorted!
cache size selected: 32768
cache factor is: 1 block length is: 8192
chunk size is: 67108864 average time for merge: 2.0759 for multi way: 0.954205
chunk size is: 33554432 average time for merge: 1.05111 for multi way: 0.515011
chunk size is: 16777216 average time for merge: 0.697046 for multi way: 0.293023
chunk size is: 8388608 average time for merge: 0.620012 for multi way: 0.262628
chunk size is: 4194304 average time for merge: 0.57007 for multi way: 0.266041
chunk size is: 2097152 average time for merge: 0.522759 for multi way: 0.279818
chunk size is: 1048576 average time for merge: 0.468054 for multi way: 0.327316
cache size selected: 32768
cache factor is: 2 block length is: 4096
chunk size is: 67108864 average time for merge: 2.05668 for multi way: 0.863459
chunk size is: 33554432 average time for merge: 1.02891 for multi way: 0.464659
chunk size is: 16777216 average time for merge: 0.697195 for multi way: 0.262909
chunk size is: 8388608 average time for merge: 0.633452 for multi way: 0.237818
chunk size is: 4194304 average time for merge: 0.590525 for multi way: 0.249546
chunk size is: 2097152 average time for merge: 0.539725 for multi way: 0.281025
chunk size is: 1048576 average time for merge: 0.493876 for multi way: 0.382417
cache size selected: 32768
cache factor is: 4 block length is: 2048
chunk size is: 67108864 average time for merge: 2.08842 for multi way: 0.791484
chunk size is: 33554432 average time for merge: 1.05603 for multi way: 0.435722
chunk size is: 16777216 average time for merge: 0.728837 for multi way: 0.255663
chunk size is: 8388608 average time for merge: 0.652369 for multi way: 0.220183
chunk size is: 4194304 average time for merge: 0.609063 for multi way: 0.233492
chunk size is: 2097152 average time for merge: 0.56105 for multi way: 0.282702
chunk size is: 1048576 average time for merge: 0.507073 for multi way: 0.402705
cache size selected: 32768
cache factor is: 8 block length is: 1024
chunk size is: 67108864 average time for merge: 2.06845 for multi way: 0.701639
chunk size is: 33554432 average time for merge: 1.03895 for multi way: 0.388728
chunk size is: 16777216 average time for merge: 0.732634 for multi way: 0.223842
chunk size is: 8388608 average time for merge: 0.682579 for multi way: 0.209163
chunk size is: 4194304 average time for merge: 0.640262 for multi way: 0.241701
chunk size is: 2097152 average time for merge: 0.583255 for multi way: 0.339769
chunk size is: 1048576 average time for merge: 0.525844 for multi way: 0.535193
cache size selected: 32768
cache factor is: 16 block length is: 512
chunk size is: 67108864 average time for merge: 2.11396 for multi way: 0.627418
chunk size is: 33554432 average time for merge: 1.05637 for multi way: 0.35159
chunk size is: 16777216 average time for merge: 0.752142 for multi way: 0.200458
chunk size is: 8388608 average time for merge: 0.693115 for multi way: 0.186347
chunk size is: 4194304 average time for merge: 0.649074 for multi way: 0.236358
chunk size is: 2097152 average time for merge: 0.600771 for multi way: 0.363825
chunk size is: 1048576 average time for merge: 0.551158 for multi way: 0.691156
cache size selected: 32768
cache factor is: 32 block length is: 256
chunk size is: 67108864 average time for merge: 2.14524 for multi way: 0.548194
chunk size is: 33554432 average time for merge: 1.09175 for multi way: 0.31035
chunk size is: 16777216 average time for merge: 0.780648 for multi way: 0.182848
chunk size is: 8388608 average time for merge: 0.712272 for multi way: 0.194734
chunk size is: 4194304 average time for merge: 0.668674 for multi way: 0.275312
chunk size is: 2097152 average time for merge: 0.617521 for multi way: 0.523597
chunk size is: 1048576 average time for merge: 0.568138 for multi way: 0.945387
cache size selected: 32768
cache factor is: 64 block length is: 128
chunk size is: 67108864 average time for merge: 2.12979 for multi way: 0.454656
chunk size is: 33554432 average time for merge: 1.10168 for multi way: 0.271524
chunk size is: 16777216 average time for merge: 0.804752 for multi way: 0.174431
chunk size is: 8388608 average time for merge: 0.750833 for multi way: 0.213945
chunk size is: 4194304 average time for merge: 0.705879 for multi way: 0.331116
chunk size is: 2097152 average time for merge: 0.656215 for multi way: 0.630073
chunk size is: 1048576 average time for merge: 0.610386 for multi way: 1.52348
Output/append to file test.
success!
Output/append to file test.
success!
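The data below were measured on openSUSE with the kernel-default kernel at init 3 (runlevel 3), covering each stage of the run and the GPU sort. Naturally, only the total data length and the chunk size can affect GPU performance. As the amount of data to sort grows, the performance difference caused by chunk size shrinks. This is probably because the sort kernel's compute time is much greater than the CPU-GPU data transfer time, or, put another way, because the former grows much faster than the latter. The larger the data, therefore, the larger the share of time spent sorting.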
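One way to quantify the shrinking effect of chunk size is the ratio between the slowest and fastest `gpu cuda` time measured for a given data length. The sketch below applies a hypothetical helper, `chunk_spread`, to the cache factor 1 rows of the tables that follow. The values are copied from this log; the helper and the accessor names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: ratio of the slowest to the fastest timing in a set
// of measurements taken at different chunk sizes. A value of 1.0 means the
// chunk size made no difference.
inline double chunk_spread(const std::vector<double>& times) {
    assert(!times.empty());
    const auto mm = std::minmax_element(times.begin(), times.end());
    assert(*mm.first > 0.0);
    return *mm.second / *mm.first;
}

// "gpu cuda" column, cache factor 1, copied from the tables in this log.
// Data length 1048576:
inline std::vector<double> gpu_cuda_1m() {
    return {6.03248, 7.2556, 9.08508, 14.6318};
}
// Data length 134217728:
inline std::vector<double> gpu_cuda_128m() {
    return {357.793, 350.508, 350.58, 360.989};
}
```

For the 1048576-element run the spread is about 2.4, while for the 134217728-element run it is about 1.03, in line with the observation that chunk size matters less as the data grows.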
tested data length: 1048576
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 131072 0.00487138 0.00363453 0.00648733 6.03248
1 8192 65536 0.00423352 0.00377865 0.00769448 7.2556
1 8192 32768 0.00392828 0.00405144 0.00955167 9.08508
1 8192 16384 0.00362956 0.00526234 0.0159779 14.6318
2 4096 131072 0.00491065 0.00333896 0.00697066 6.26632
2 4096 65536 0.00421809 0.00357182 0.00791373 7.22066
2 4096 32768 0.00391889 0.00399252 0.00981562 9.095
2 4096 16384 0.00361006 0.00587243 0.0160447 14.6501
4 2048 131072 0.00489066 0.00310974 0.00697101 6.26758
4 2048 65536 0.00423236 0.00337024 0.00788624 7.21431
4 2048 32768 0.00390177 0.00405635 0.00980582 9.09647
4 2048 16384 0.0035925 0.00651996 0.0160454 14.6445
8 1024 131072 0.00485467 0.00289253 0.00696577 6.26845
8 1024 65536 0.00424771 0.00351608 0.00789853 7.21515
8 1024 32768 0.00394923 0.00464937 0.00976582 9.05014
8 1024 16384 0.00360476 0.00855886 0.0160412 14.6372
16 512 131072 0.00495624 0.00279871 0.00696563 6.26643
16 512 65536 0.00428087 0.00348572 0.00788609 7.21096
16 512 32768 0.00397098 0.00522562 0.0097556 9.046
16 512 16384 0.00361457 0.0105033 0.0160606 14.6418
32 256 131072 0.00499474 0.0028894 0.00698929 6.27271
32 256 65536 0.00429606 0.00450605 0.00790412 7.2179
32 256 32768 0.00397847 0.00712372 0.00981689 9.0912
32 256 16384 0.00362382 0.0157277 0.0160693 14.6403
64 128 131072 0.00495582 0.00292037 0.0067566 6.04625
64 128 65536 0.00430662 0.00462144 0.00791006 7.21774
64 128 32768 0.00395806 0.00865346 0.0098237 9.09769
64 128 16384 0.00363915 0.0225313 0.0160743 14.6472
tested data length: 2097152
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 262144 0.012196 0.00773677 0.0102158 9.47895
1 8192 131072 0.00942467 0.00797621 0.012469 11.7193
1 8192 65536 0.00913769 0.00906054 0.0150786 14.3004
1 8192 32768 0.00912372 0.0124428 0.0194915 18.0256
2 4096 262144 0.0139643 0.00840732 0.0104748 9.73446
2 4096 131072 0.00971101 0.00751216 0.0124643 11.7192
2 4096 65536 0.00835846 0.00804394 0.0150647 14.2997
2 4096 32768 0.00775699 0.0114811 0.0195406 18.0605
4 2048 262144 0.0139481 0.00668829 0.0104786 9.73495
4 2048 131072 0.00949116 0.00712101 0.0124828 11.7259
4 2048 65536 0.00837367 0.00823736 0.0150809 14.2987
4 2048 32768 0.00776508 0.0122302 0.0195175 18.0562
8 1024 262144 0.0145367 0.00612258 0.0104813 9.74821
8 1024 131072 0.00963182 0.00721374 0.0124622 11.7291
8 1024 65536 0.00844565 0.00915575 0.0150688 14.2995
8 1024 32768 0.00777381 0.0164827 0.0194957 18.0114
16 512 262144 0.0152198 0.00558836 0.0104178 9.67542
16 512 131072 0.00957703 0.00718502 0.0124892 11.726
16 512 65536 0.00851703 0.0101167 0.0150963 14.3095
16 512 32768 0.00779887 0.0196049 0.0194811 18.0109
32 256 262144 0.015864 0.00593294 0.0102886 9.53576
32 256 131072 0.00989634 0.00788519 0.012481 11.7259
32 256 65536 0.00855493 0.0141695 0.0150837 14.3068
32 256 32768 0.00783991 0.0282505 0.0195094 18.0015
64 128 262144 0.0163962 0.00575398 0.0102195 9.47736
64 128 131072 0.00981508 0.00923537 0.0124964 11.7255
64 128 65536 0.00858702 0.016952 0.0150948 14.3075
64 128 32768 0.00785214 0.0430395 0.0195798 18.0714
tested data length: 4194304
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 524288 0.0279047 0.0155139 0.0162911 15.5148
1 8192 262144 0.0252721 0.0156679 0.0192199 18.456
1 8192 131072 0.0193564 0.0165237 0.024038 23.2299
1 8192 65536 0.0166791 0.0202201 0.0298918 28.4779
2 4096 524288 0.0303305 0.0167048 0.0163116 15.5103
2 4096 262144 0.0273572 0.0159376 0.0192422 18.4557
2 4096 131072 0.0196342 0.0163351 0.0240491 23.244
2 4096 65536 0.0166699 0.0232211 0.0298813 28.4742
4 2048 524288 0.0302064 0.0130035 0.0163174 15.5585
4 2048 262144 0.0278178 0.0138662 0.0192219 18.4577
4 2048 131072 0.0200031 0.0167939 0.024051 23.25
4 2048 65536 0.0166496 0.0258932 0.02985 28.4823
8 1024 524288 0.0314675 0.0118254 0.0162028 15.4891
8 1024 262144 0.0291323 0.0138612 0.0191739 18.4558
8 1024 131072 0.0206354 0.0191932 0.0239915 23.2374
8 1024 65536 0.0168242 0.0345776 0.0298733 28.4744
16 512 524288 0.0327912 0.0110251 0.0161851 15.4682
16 512 262144 0.0304462 0.0136792 0.0191865 18.458
16 512 131072 0.0209396 0.0214164 0.0240026 23.2436
16 512 65536 0.0169972 0.0398498 0.0298613 28.4829
32 256 524288 0.0341181 0.0108133 0.0162327 15.5107
32 256 262144 0.0316969 0.01825 0.0191805 18.4549
32 256 131072 0.0209644 0.0302623 0.0240065 23.2408
32 256 65536 0.0171079 0.0635552 0.0298918 28.4833
64 128 524288 0.0352896 0.0114389 0.0162708 15.5545
64 128 262144 0.0329054 0.017909 0.0191788 18.4512
64 128 131072 0.0210533 0.0356511 0.0240202 23.2482
64 128 65536 0.0184662 0.0910068 0.0299239 28.4974
tested data length: 8388608
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 1048576 0.059937 0.0307026 0.0283036 27.5608
1 8192 524288 0.0549318 0.0313641 0.0308126 30.0557
1 8192 262144 0.0493931 0.0330056 0.0372624 36.4722
1 8192 131072 0.0375226 0.039459 0.0476374 46.2324
2 4096 1048576 0.0624714 0.0285355 0.0282591 27.4986
2 4096 524288 0.0575286 0.0288871 0.0308125 30.0503
2 4096 262144 0.05193 0.0328776 0.0372658 36.4743
2 4096 131072 0.0381094 0.0435109 0.0476368 46.2221
4 2048 1048576 0.0648963 0.0260703 0.0282626 27.5021
4 2048 524288 0.0599672 0.0270587 0.0308128 30.0536
4 2048 262144 0.0546619 0.0323475 0.0372685 36.4686
4 2048 131072 0.0382416 0.0497282 0.0476518 46.2269
8 1024 1048576 0.0674212 0.023916 0.0282877 27.531
8 1024 524288 0.0624384 0.0272698 0.0308246 30.0619
8 1024 262144 0.0577473 0.039564 0.037309 36.4898
8 1024 131072 0.0391934 0.0643214 0.0476734 46.2322
16 512 1048576 0.070122 0.0219302 0.0283648 27.5458
16 512 524288 0.0650311 0.0271178 0.0308714 30.0498
16 512 262144 0.059377 0.0399486 0.0373292 36.4799
16 512 131072 0.039338 0.0773001 0.0477197 46.2307
32 256 1048576 0.0727761 0.0228141 0.0283575 27.5474
32 256 524288 0.0675214 0.0344929 0.0308968 30.0578
32 256 262144 0.0620052 0.0597419 0.0373508 36.4812
32 256 131072 0.0398287 0.116448 0.0477266 46.2314
64 128 1048576 0.0751036 0.0227745 0.0283242 27.5128
64 128 524288 0.0698059 0.0357082 0.0308704 30.0531
64 128 262144 0.0642449 0.0664894 0.0373374 36.4814
64 128 131072 0.0397425 0.168614 0.0477369 46.2283
tested data length: 16777216
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 2097152 0.129866 0.0631933 0.0525992 51.7231
1 8192 1048576 0.119776 0.0637263 0.0544561 53.5751
1 8192 524288 0.108936 0.0657341 0.0603127 59.3875
1 8192 262144 0.0957066 0.0775056 0.0741481 72.5846
2 4096 2097152 0.134404 0.0566834 0.0525916 51.7245
2 4096 1048576 0.124429 0.0581902 0.0544801 53.597
2 4096 524288 0.114245 0.0642952 0.0603161 59.3936
2 4096 262144 0.100571 0.0848281 0.0741621 72.5806
4 2048 2097152 0.139413 0.0520188 0.0525943 51.7201
4 2048 1048576 0.129354 0.0539953 0.0544706 53.5867
4 2048 524288 0.118455 0.0686951 0.0603372 59.4044
4 2048 262144 0.104969 0.0955208 0.0741647 72.5919
8 1024 2097152 0.144249 0.0471373 0.0526105 51.7354
8 1024 1048576 0.134071 0.053037 0.0544811 53.5952
8 1024 524288 0.123172 0.0715863 0.0603339 59.3976
8 1024 262144 0.109548 0.125418 0.0741654 72.5943
16 512 2097152 0.14959 0.0437156 0.0526136 51.7375
16 512 1048576 0.139316 0.0539873 0.0544935 53.5962
16 512 524288 0.129174 0.0836946 0.0603359 59.4078
16 512 262144 0.116661 0.156426 0.0741644 72.5997
32 256 2097152 0.154542 0.042816 0.0526071 51.7394
32 256 1048576 0.144404 0.0627544 0.0544788 53.5854
32 256 524288 0.134222 0.107509 0.0603393 59.4016
32 256 262144 0.120828 0.243699 0.0741503 72.5914
64 128 2097152 0.159731 0.0452704 0.0526128 51.7324
64 128 1048576 0.149435 0.0715368 0.0544724 53.5748
64 128 524288 0.140437 0.141051 0.0603193 59.3842
64 128 262144 0.123751 0.346028 0.0741729 72.5918
tested data length: 33554432
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 4194304 0.278833 0.125815 0.100632 99.614
1 8192 2097152 0.259103 0.12924 0.10121 100.186
1 8192 1048576 0.2361 0.136261 0.103403 102.346
1 8192 524288 0.213475 0.159085 0.111018 109.311
2 4096 4194304 0.288449 0.117659 0.0931261 92.1178
2 4096 2097152 0.268668 0.119233 0.100461 99.429
2 4096 1048576 0.247968 0.133786 0.102841 101.779
2 4096 524288 0.222759 0.172918 0.111166 109.462
4 2048 4194304 0.297904 0.107743 0.0931469 92.1405
4 2048 2097152 0.27817 0.111901 0.0988516 97.8268
4 2048 1048576 0.257102 0.134877 0.102852 101.778
4 2048 524288 0.231603 0.194206 0.11093 109.213
8 1024 4194304 0.309774 0.100113 0.0931435 92.1403
8 1024 2097152 0.287968 0.109861 0.0937207 92.714
8 1024 1048576 0.266657 0.155542 0.0988627 97.8103
8 1024 524288 0.242618 0.242718 0.11084 109.134
16 512 4194304 0.320873 0.09263 0.0931544 92.1387
16 512 2097152 0.298953 0.112682 0.0937494 92.7159
16 512 1048576 0.27673 0.170273 0.0988033 97.748
16 512 524288 0.248787 0.318203 0.11085 109.136
32 256 4194304 0.330793 0.0949726 0.0931729 92.1603
32 256 2097152 0.308388 0.130396 0.0937266 92.7054
32 256 1048576 0.286948 0.247291 0.102117 101.053
32 256 524288 0.260697 0.439918 0.111668 109.955
64 128 4194304 0.340502 0.097521 0.0931453 92.1365
64 128 2097152 0.318016 0.151475 0.093717 92.6961
64 128 1048576 0.296584 0.286995 0.0989219 97.9261
64 128 524288 0.273592 0.69696 0.110856 109.188
tested data length: 67108864
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 8388608 0.60076 0.254641 0.182465 181.238
1 8192 4194304 0.561249 0.262527 0.179568 178.312
1 8192 2097152 0.517825 0.27164 0.18349 182.197
1 8192 1048576 0.464335 0.315507 0.19626 194.437
2 4096 8388608 0.620315 0.239973 0.182121 180.908
2 4096 4194304 0.580375 0.243549 0.179552 178.308
2 4096 2097152 0.534566 0.273292 0.183449 182.159
2 4096 1048576 0.48217 0.346023 0.196234 194.435
4 2048 8388608 0.639609 0.22047 0.182135 180.912
4 2048 4194304 0.598808 0.226404 0.17959 178.339
4 2048 2097152 0.554694 0.276961 0.183408 182.155
4 2048 1048576 0.501608 0.386114 0.196164 194.35
8 1024 8388608 0.658034 0.200952 0.182106 180.894
8 1024 4194304 0.617488 0.222005 0.179588 178.337
8 1024 2097152 0.573111 0.316936 0.18333 182.105
8 1024 1048576 0.520596 0.489181 0.196166 194.355
16 512 8388608 0.678219 0.180858 0.182107 180.898
16 512 4194304 0.637669 0.226954 0.179569 178.339
16 512 2097152 0.59478 0.34052 0.183248 182.05
16 512 1048576 0.538786 0.623867 0.196158 194.33
32 256 8388608 0.698694 0.188241 0.182104 180.881
32 256 4194304 0.656813 0.262862 0.179444 178.247
32 256 2097152 0.612078 0.499933 0.183249 182.029
32 256 1048576 0.560623 0.893341 0.196177 194.401
64 128 8388608 0.716689 0.190323 0.182064 180.861
64 128 4194304 0.675118 0.305148 0.179438 178.255
64 128 2097152 0.632411 0.577998 0.183373 182.116
64 128 1048576 0.582943 1.38125 0.196306 194.564
tested data length: 134217728
cache factor block length chunk size merge time multi way gpu omp gpu cuda
1 8192 16777216 1.27867 0.508025 0.359494 357.793
1 8192 8388608 1.20184 0.521105 0.35224 350.508
1 8192 4194304 1.1111 0.539969 0.352384 350.58
1 8192 2097152 1.00117 0.616662 0.363419 360.989
2 4096 16777216 1.31744 0.473769 0.359071 357.357
2 4096 8388608 1.23921 0.485617 0.352289 350.561
2 4096 4194304 1.15135 0.535282 0.352414 350.63
2 4096 2097152 1.04113 0.719597 0.36344 361.083
4 2048 16777216 1.35732 0.436528 0.359094 357.402
4 2048 8388608 1.27743 0.452028 0.352289 350.57
4 2048 4194304 1.18916 0.543369 0.352439 350.649
4 2048 2097152 1.07381 0.758775 0.363406 361.055
8 1024 16777216 1.39427 0.397088 0.359053 357.337
8 1024 8388608 1.31477 0.444568 0.352289 350.564
8 1024 4194304 1.22528 0.611969 0.35252 350.733
8 1024 2097152 1.11211 1.0931 0.363478 361.008
16 512 16777216 1.43577 0.366237 0.359115 357.392
16 512 8388608 1.35591 0.457148 0.352238 350.535
16 512 4194304 1.2647 0.677423 0.352545 350.751
16 512 2097152 1.15066 1.231 0.363681 361.2
32 256 16777216 1.47393 0.367335 0.359121 357.4
32 256 8388608 1.39483 0.534617 0.352277 350.564
32 256 4194304 1.30422 0.949957 0.352515 350.708
32 256 2097152 1.19595 2.12184 0.36514 362.649
64 128 16777216 1.5095 0.381129 0.359166 357.452
64 128 8388608 1.43096 0.60651 0.352294 350.553
64 128 4194304 1.33906 1.13266 0.35332 351.499
64 128 2097152 1.24202 2.75111 0.364963 362.483
tested data length: 268435456
cache factor block length chunk size merge time multi way gpu omp gpu cuda
gpu kernel and transfer test
data length transfer time kernel time
131072 0.141468 0.707043
262144 0.271736 1.10394
524288 0.436042 1.78852
1048576 0.764523 2.99288
2097152 1.42352 5.44717
4194304 2.71637 10.5398
8388608 5.32359 20.7386
16777216 10.5362 41.1821
33554432 20.9584 81.8091
67108864 41.8165 163.285
134217728 83.5161 327.614
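The table above can be reduced to a single kernel-to-transfer ratio per data length. The sketch below does this for the first and last rows. `Sample` and `kernel_to_transfer` are hypothetical names, and the time unit is not stated in the log, so only the unit-free ratio is computed.

```cpp
#include <cassert>
#include <cstddef>

// One row of the "gpu kernel and transfer test" table (names are illustrative).
struct Sample {
    std::size_t data_length;
    double transfer_time;  // unit not stated in the log
    double kernel_time;    // same unit as transfer_time
};

// How many times longer the sort kernel runs than the CPU-GPU transfer.
inline double kernel_to_transfer(const Sample& s) {
    assert(s.transfer_time > 0.0);
    return s.kernel_time / s.transfer_time;
}

// First and last rows, copied from the table.
inline Sample smallest_sample() { return {131072, 0.141468, 0.707043}; }
inline Sample largest_sample()  { return {134217728, 83.5161, 327.614}; }
```

Both rows give a ratio of roughly 4 to 5, so at both ends of the measured range the kernel takes several times longer than the transfer.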
tested data length: 1048576
cache factor block length chunk size merge time multi way
1 8192 131072 0.00602409 0.0045537
1 8192 65536 0.00557754 0.00472258
1 8192 32768 0.00511307 0.00495993
1 8192 16384 0.0046876 0.00657198
2 4096 131072 0.00600161 0.00415643
2 4096 65536 0.0055379 0.00432099
2 4096 32768 0.00508193 0.00478401
2 4096 16384 0.00464252 0.00662644
4 2048 131072 0.00600902 0.00374336
4 2048 65536 0.00554577 0.00416817
4 2048 32768 0.00509218 0.00475376
4 2048 16384 0.00464915 0.00760079
8 1024 131072 0.00599442 0.00344255
8 1024 65536 0.0055441 0.00400054
8 1024 32768 0.00508095 0.00523432
8 1024 16384 0.00463455 0.00915217
16 512 131072 0.00614222 0.00335864
16 512 65536 0.00568835 0.00431853
16 512 32768 0.00521792 0.00577685
16 512 16384 0.00475626 0.0117665
32 256 131072 0.00628188 0.00351862
32 256 65536 0.00581948 0.0050944
32 256 32768 0.00535015 0.00788992
32 256 16384 0.0048513 0.0165558
64 128 131072 0.00660018 0.00371936
64 128 65536 0.00610987 0.00563271
64 128 32768 0.00563324 0.00947698
64 128 16384 0.00568752 0.0264414
tested data length: 2097152
cache factor block length chunk size merge time multi way
1 8192 262144 0.014345 0.0113728
1 8192 131072 0.0134979 0.0113008
1 8192 65536 0.0119044 0.0110461
1 8192 32768 0.0102111 0.0121481
2 4096 262144 0.0129686 0.00884728
2 4096 131072 0.0120852 0.00899056
2 4096 65536 0.0111828 0.00987708
2 4096 32768 0.0103009 0.0130314
4 2048 262144 0.0130215 0.00801777
4 2048 131072 0.0121378 0.00860134
4 2048 65536 0.0112402 0.00981418
4 2048 32768 0.0103103 0.0145749
8 1024 262144 0.0130685 0.00723398
8 1024 131072 0.0121647 0.00802321
8 1024 65536 0.0112197 0.0105306
8 1024 32768 0.0103161 0.0168381
16 512 262144 0.013421 0.00690192
16 512 131072 0.0125051 0.00829455
16 512 65536 0.0116068 0.011605
16 512 32768 0.0106959 0.0217141
32 256 262144 0.013643 0.0070729
32 256 131072 0.0127538 0.00892387
32 256 65536 0.0118368 0.016066
32 256 32768 0.0109315 0.02895
64 128 262144 0.0143631 0.00747054
64 128 131072 0.0134955 0.0112068
64 128 65536 0.0125042 0.0187191
64 128 32768 0.0114904 0.0457463
tested data length: 1048576
cache factor block length chunk size merge time multi way
1 8192 131072 0.121342 0.113118
1 8192 65536 0.118816 0.123142
1 8192 32768 0.117853 0.11721
1 8192 16384 0.123097 0.127823
2 4096 131072 0.0680541 0.0648217
2 4096 65536 0.0676248 0.0622498
2 4096 32768 0.0673001 0.0636358
2 4096 16384 0.0671043 0.0679831
4 2048 131072 0.0384989 0.03545
4 2048 65536 0.0377664 0.0363396
4 2048 32768 0.0374914 0.0375905
4 2048 16384 0.0372371 0.0413322
8 1024 131072 0.0204032 0.0220355
8 1024 65536 0.0198129 0.0220039
8 1024 32768 0.0196085 0.0216466
8 1024 16384 0.0192205 0.025368
16 512 131072 0.0118876 0.0123189
16 512 65536 0.0115054 0.0118618
16 512 32768 0.0124358 0.0145389
16 512 16384 0.012286 0.0207319
32 256 131072 0.00854379 0.00770199
32 256 65536 0.00750641 0.00850875
32 256 32768 0.00681624 0.0101911
32 256 16384 0.00646781 0.0200483
64 128 131072 0.0066008 0.00473508
64 128 65536 0.00597146 0.00640918
64 128 32768 0.00561516 0.0107694
64 128 16384 0.00527474 0.0260669
tested data length: 2097152
cache factor block length chunk size merge time multi way
1 8192 262144 0.26234 0.239683
1 8192 131072 0.261769 0.259552
1 8192 65536 0.261529 0.255509
1 8192 32768 0.258287 0.261699
2 4096 262144 0.138479 0.130411
2 4096 131072 0.136445 0.125503
2 4096 65536 0.134738 0.128016
2 4096 32768 0.134348 0.136572
4 2048 262144 0.081315 0.0712888
4 2048 131072 0.0777924 0.0728147
4 2048 65536 0.0797238 0.0803206
4 2048 32768 0.075474 0.0834344
8 1024 262144 0.0453339 0.0444124
8 1024 131072 0.0410857 0.0438799
8 1024 65536 0.0399156 0.0434617
8 1024 32768 0.0393088 0.0488539
16 512 262144 0.0292202 0.0241011
16 512 131072 0.0242414 0.0226941
16 512 65536 0.0228244 0.0252868
16 512 32768 0.0221705 0.0365184
32 256 262144 0.0213847 0.01296
32 256 131072 0.0156882 0.0140911
32 256 65536 0.0143421 0.0207917
32 256 32768 0.0137291 0.0359397
64 128 262144 0.0195758 0.0090111
64 128 131072 0.013431 0.0129914
64 128 65536 0.0120064 0.0214619
64 128 32768 0.0112919 0.0511211
tested data length: 1048576
cache factor block length chunk size merge time multi way
1 8192 131072 0.236767 0.33036
1 8192 65536 0.236785 0.374777
1 8192 32768 0.255734 0.40598
1 8192 16384 0.255636 0.386225
2 4096 131072 0.0992862 0.132289
2 4096 65536 0.0960152 0.124304
2 4096 32768 0.0959808 0.122356
2 4096 16384 0.0958709 0.114987
4 2048 131072 0.0465323 0.0515303
4 2048 65536 0.0457624 0.0515334
4 2048 32768 0.0455293 0.0495389
4 2048 16384 0.0451812 0.0497217
8 1024 131072 0.0232383 0.0260641
8 1024 65536 0.0226771 0.025741
8 1024 32768 0.0223552 0.0245505
8 1024 16384 0.022044 0.0282335
16 512 131072 0.0119359 0.0124295
16 512 65536 0.0114485 0.0112384
16 512 32768 0.0111385 0.0127718
16 512 16384 0.0107915 0.0188211
32 256 131072 0.00783912 0.00664715
32 256 65536 0.00712724 0.00827174
32 256 32768 0.00687033 0.0103448
32 256 16384 0.00694471 0.02176
64 128 131072 0.00722843 0.00555378
64 128 65536 0.00680596 0.00791386
64 128 32768 0.00670643 0.0122435
64 128 16384 0.0061358 0.0285379
tested data length: 2097152
cache factor block length chunk size merge time multi way
1 8192 262144 0.490607 0.698375
1 8192 131072 0.490691 0.779644
1 8192 65536 0.484839 0.775107
1 8192 32768 0.489366 0.750624
2 4096 262144 0.195104 0.261552
2 4096 131072 0.191814 0.257275
2 4096 65536 0.18941 0.24893
2 4096 32768 0.190019 0.232917
4 2048 262144 0.100145 0.109008
4 2048 131072 0.093168 0.103254
4 2048 65536 0.091735 0.09909
4 2048 32768 0.0911975 0.0989044
8 1024 262144 0.0513737 0.0525954
8 1024 131072 0.0467529 0.0507692
8 1024 65536 0.0455474 0.0498838
8 1024 32768 0.0450415 0.0546636
16 512 262144 0.0293773 0.0242984
16 512 131072 0.0245923 0.0230786
16 512 65536 0.0230445 0.0255712
16 512 32768 0.0222993 0.0367812
32 256 262144 0.0202023 0.0132944
32 256 131072 0.0157429 0.0144966
32 256 65536 0.0145488 0.0212395
32 256 32768 0.0138449 0.0368747
64 128 262144 0.0194232 0.00920082
64 128 131072 0.0134838 0.013343
64 128 65536 0.0122294 0.0215943
64 128 32768 0.0114047 0.0512116
tested data length: 1048576
cache factor block length chunk size merge time multi way
1 8192 131072 0.0048534 0.00368417
1 8192 65536 0.00423724 0.00377406
1 8192 32768 0.00393045 0.00412206
1 8192 16384 0.00363892 0.00543329
2 4096 131072 0.00489503 0.0033874
2 4096 65536 0.00421928 0.00361028
2 4096 32768 0.00393446 0.00407604
2 4096 16384 0.0036243 0.00614027
4 2048 131072 0.00480916 0.00323622
4 2048 65536 0.00424787 0.00342611
4 2048 32768 0.00394032 0.00429615
4 2048 16384 0.00363697 0.00688732
8 1024 131072 0.00482784 0.0029782
8 1024 65536 0.00427404 0.00360856
8 1024 32768 0.00396181 0.00469151
8 1024 16384 0.00364987 0.00900937
16 512 131072 0.00486156 0.00294817
16 512 65536 0.00430256 0.0035308
16 512 32768 0.00398346 0.0054282
16 512 16384 0.00367036 0.0109883
32 256 131072 0.0049264 0.00299514
32 256 65536 0.00432223 0.00466693
32 256 32768 0.00399711 0.00725013
32 256 16384 0.0036733 0.0166287
64 128 131072 0.0049413 0.00297956
64 128 65536 0.00432296 0.00465075
64 128 32768 0.00398838 0.00884144
64 128 16384 0.00367066 0.0232534
tested data length: 2097152
cache factor block length chunk size merge time multi way
1 8192 262144 0.0127335 0.00770138
1 8192 131072 0.0094724 0.00806621
1 8192 65536 0.00840297 0.00827371
1 8192 32768 0.00779985 0.0100782
2 4096 262144 0.0134172 0.00726578
2 4096 131072 0.00946837 0.00764026
2 4096 65536 0.00842861 0.00839692
2 4096 32768 0.00782274 0.0116928
4 2048 262144 0.0140309 0.00674548
4 2048 131072 0.00948214 0.00725478
4 2048 65536 0.00846204 0.0085484
4 2048 32768 0.00786876 0.0132078
8 1024 262144 0.0146008 0.00630693
8 1024 131072 0.0095915 0.00724758
8 1024 65536 0.00850327 0.00978144
8 1024 32768 0.00787939 0.0164755
16 512 262144 0.0146087 0.00605603
16 512 131072 0.00972174 0.00730591
16 512 65536 0.00855196 0.0104517
16 512 32768 0.00792411 0.02072
32 256 262144 0.0150389 0.00625562
32 256 131072 0.0107472 0.00983614
32 256 65536 0.00994982 0.0168794
32 256 32768 0.00908479 0.0319618