serial: Serial version
step1:
    Added simple acc directives:
        #pragma acc routine seq
        #pragma acc parallel loop
    The image array is transferred to the GPU unnecessarily (see the sketch below)
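    A minimal sketch of this step. The pragmas and the image array bound come from
    this README; render_pixel() and the RGB pixel layout are hypothetical names
    chosen for illustration:

        /* Per-pixel work compiled for the device, executed sequentially by each thread */
        #pragma acc routine seq
        unsigned char render_pixel(int x, int y, int image_width, int image_height);

        void render(unsigned char *image, int image_width, int image_height)
        {
            /* With no data clauses, the compiler copies image to and from the
               device around the launch; step 2 removes that transfer. */
            #pragma acc parallel loop
            for (int y = 0; y < image_height; ++y) {
                for (int x = 0; x < image_width; ++x) {
                    unsigned char v = render_pixel(x, y, image_width, image_height);
                    image[3 * (y * image_width + x) + 0] = v;
                    image[3 * (y * image_width + x) + 1] = v;
                    image[3 * (y * image_width + x) + 2] = v;
                }
            }
        }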
step2:
    Added an unstructured data region (enter, exit):
        #pragma acc enter data create(image[0:3*(image_width*image_height)])
        #pragma acc exit data delete(image[0:3*(image_width*image_height)])
    Added a present clause to the acc parallel loop:
        #pragma acc parallel loop present(image[0:3*(image_width*image_height)])
        -> This eliminates the data transfer to the device
    Added an explicit host update (see the combined sketch below):
        #pragma acc update self(image[0:3*(image_width*image_height)])
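    A sketch of the step 2 data management (a fragment, with the same hypothetical
    per-pixel work as in the step 1 sketch):

        /* Allocate the image on the device once, up front */
        #pragma acc enter data create(image[0:3*(image_width*image_height)])

        /* present() tells the compiler the array is already on the device,
           so no implicit copies are generated around the launch */
        #pragma acc parallel loop present(image[0:3*(image_width*image_height)])
        for (int y = 0; y < image_height; ++y) {
            /* ... per-pixel work as in step 1 ... */
        }

        /* Explicitly copy the finished image back before it is written to file */
        #pragma acc update self(image[0:3*(image_width*image_height)])

        /* Free the device copy at the end */
        #pragma acc exit data delete(image[0:3*(image_width*image_height)])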
step3:
    Added a block/tile loop that separates the image into chunks of rows
    -> Compute only; data transfers are not blocked yet
    Added a separate pragma for "data present" instead of attaching the clause to the loop
    (see the sketch below):
        #pragma acc data present(image[0:3*(image_width*image_height)])
        #pragma acc parallel loop
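    A sketch of the step 3 blocking. num_blocks, block_height, and the loop bounds
    are assumed names; the pragmas and update bound are the ones listed in this README:

        int num_blocks   = 8;                           /* block count chosen for illustration */
        int block_height = image_height / num_blocks;   /* rows per block */

        #pragma acc data present(image[0:3*(image_width*image_height)])
        for (int block = 0; block < num_blocks; ++block) {
            #pragma acc parallel loop
            for (int y = block * block_height; y < (block + 1) * block_height; ++y) {
                /* ... per-pixel work ... */
            }
        }

        /* The image still comes back in one piece; step 4 blocks this transfer too */
        #pragma acc update self(image[0:3*(image_width*image_height)])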
step4:
    Added blocking of data transfers by moving the update pragma inside
    the block loop and changing the bounds of the image update (see the sketch below):
        #pragma acc update self(image[block*(3*block_height*image_width):block_height*(3*image_width)])
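    A sketch of the step 4 blocked transfers, with the same assumed names as in
    the step 3 sketch and the update bounds shown above:

        #pragma acc data present(image[0:3*(image_width*image_height)])
        for (int block = 0; block < num_blocks; ++block) {
            #pragma acc parallel loop
            for (int y = block * block_height; y < (block + 1) * block_height; ++y) {
                /* ... per-pixel work ... */
            }

            /* Copy back only the rows belonging to this block */
            #pragma acc update self(image[block*(3*block_height*image_width):block_height*(3*image_width)])
        }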
step5:
    Now that computation and data transfers are blocked/tiled, allow asynchronous
    work by adding async clauses to the parallel loop and update self pragmas:
        async(block % 2 + 1)
    -> Use the modulus to assign blocks to 2 separate CUDA streams depending on the block id
    -> The "+1" ensures we are not using the default stream
    We must also synchronize before the data is accessed (written to file); see the
    sketch below:
        #pragma acc wait
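    A sketch of the step 5 asynchronous pipeline (a fragment; queue handling is as
    described above, the other names are assumed as before):

        #pragma acc data present(image[0:3*(image_width*image_height)])
        for (int block = 0; block < num_blocks; ++block) {
            int queue = block % 2 + 1;   /* alternate between queues 1 and 2; the +1 avoids the default queue */

            /* Launch this block's compute without blocking the host */
            #pragma acc parallel loop async(queue)
            for (int y = block * block_height; y < (block + 1) * block_height; ++y) {
                /* ... per-pixel work ... */
            }

            /* Enqueue the copy-back on the same queue: it runs after this block's
               compute but can overlap the other queue's compute and transfers */
            #pragma acc update self(image[block*(3*block_height*image_width):block_height*(3*image_width)]) async(queue)
        }

        /* Synchronize all queues before the image is accessed (written to file) */
        #pragma acc wait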
step6:
    Compute a portion of the blocks on each of the 4 GPUs of a Summitdev node
    Query the number of GPUs:
        int num_gpus = acc_get_num_devices(acc_device_nvidia);
    Add an OpenMP parallel region to divide the blocks among the GPUs:
        #pragma omp parallel
    Assign 1 GPU per OpenMP thread (this requires num_gpus == num_omp_threads):
        int omp_thread_id = omp_get_thread_num();
        acc_set_device_num(omp_thread_id % num_gpus, acc_device_nvidia);
    Move the "acc data create" inside the "omp parallel" region so that the image array
    is allocated on each GPU
    -> Each GPU does not actually need a copy of the entire array, but for now we leave it.
    Add an OpenMP parallel loop directive to the block loop to divide the blocks among the GPUs
    (see the sketch below):
        #pragma omp for
    NOTE: This problem is NOT load balanced. Some GPUs get more compute-intensive
    portions of the image to work on.
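    A sketch combining the pieces of step 6 (a fragment; it assumes the image array
    is already allocated on the host and that the number of OpenMP threads matches
    the GPU count, as required above; the blocking names are assumed as before):

        #include <openacc.h>
        #include <omp.h>

        int num_gpus = acc_get_num_devices(acc_device_nvidia);

        #pragma omp parallel
        {
            /* One OpenMP thread drives one GPU */
            int omp_thread_id = omp_get_thread_num();
            acc_set_device_num(omp_thread_id % num_gpus, acc_device_nvidia);

            /* Allocate the full image on this thread's GPU (more than each GPU
               strictly needs, as noted above) */
            #pragma acc enter data create(image[0:3*(image_width*image_height)])

            /* Divide the blocks among the threads, and therefore among the GPUs */
            #pragma omp for
            for (int block = 0; block < num_blocks; ++block) {
                #pragma acc parallel loop present(image[0:3*(image_width*image_height)]) async(block % 2 + 1)
                for (int y = block * block_height; y < (block + 1) * block_height; ++y) {
                    /* ... per-pixel work ... */
                }
                #pragma acc update self(image[block*(3*block_height*image_width):block_height*(3*image_width)]) async(block % 2 + 1)
            }

            /* Each thread waits on its own device's queues, then frees its copy */
            #pragma acc wait
            #pragma acc exit data delete(image[0:3*(image_width*image_height)])
        }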
step7:
    Balance the workload among the GPUs
step8:
    Decompose the grid using MPI