add test of probability-ordered keys branching and update doc of others test with result for Core M5 skylake cpu
szaghi committed Oct 22, 2016
1 parent ca11612 commit 6cfbbc3
Showing 13 changed files with 324 additions and 41 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -93,6 +93,7 @@ Currently DEFY collection includes:
+ [goto is fastest](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest):
+ [goto if select comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1);
+ [goto if select comparison 2](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_2);
+ [goto if select comparison 3](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_3);
+ [goto if block comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_block_comparison_1);
+ [powers naive definitions have overhead](https://github.com/szaghi/DEFY/tree/master/src/powers_naive_definitions_have_overhead):
+ [powers 1](https://github.com/szaghi/DEFY/tree/master/src/powers_naive_definitions_have_overhead/powers_1):
20 changes: 14 additions & 6 deletions src/goto_is_fastest/README.md
@@ -4,7 +4,7 @@
#### Myth example

The myth states that
The myth states that for a generic (possibly random) selector value, the `goto`-based branching-flow

```fortran
goto (10, 20, 30), selector
@@ -18,7 +18,7 @@ goto 40
40 continue
```

is compiled into a **faster** branching-flow than
is compiled into a **faster** selector than

```fortran
select case(selector)
@@ -41,7 +41,14 @@ elseif (selector==3)
end if
```
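
As a minimal, self-contained sketch (not part of the DEFY sources), the three forms above can be placed side by side to check that they select the same branch. The function names `by_goto`, `by_select` and `by_if` are hypothetical and are introduced only for this illustration; computed `goto` is obsolescent, so a compiler may emit a warning.

```fortran
program branching_forms
   implicit none
   integer :: selector

   ! Each form should report the same branch number for every selector value.
   do selector = 1, 3
      print '(A,I0,3(A,I0))', 'selector=', selector, &
         ' goto->', by_goto(selector), &
         ' select->', by_select(selector), &
         ' if->', by_if(selector)
   end do

contains
   integer function by_goto(s) result(branch)
      integer, intent(in) :: s
      branch = 0
      goto (10, 20, 30), s
      goto 40                 ! selector out of range: no branch taken
10    branch = 1 ; goto 40
20    branch = 2 ; goto 40
30    branch = 3
40    continue
   end function by_goto

   integer function by_select(s) result(branch)
      integer, intent(in) :: s
      select case (s)
      case (1)
         branch = 1
      case (2)
         branch = 2
      case (3)
         branch = 3
      case default
         branch = 0
      end select
   end function by_select

   integer function by_if(s) result(branch)
      integer, intent(in) :: s
      if (s == 1) then
         branch = 1
      else if (s == 2) then
         branch = 2
      else if (s == 3) then
         branch = 3
      else
         branch = 0
      end if
   end function by_if
end program branching_forms
```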

The myth originates from the good old days when other branching-flow models (e.g. `if elseif` and `select case`) were added to the language (the early Fortran 90 implementations) alongside `goto`: *probably* the early compiler implementations supporting the *new* (for those days) branching models were not able to optimize the compiled selection based on the models as well as they did for the very well supported (computed) `goto` model.
The myth originates from the good old days when other branching-flow models (e.g. `if elseif` and `select case`) were added to the language (the early Fortran 90 implementations) alongside `goto`: *probably* the early compiler implementations supporting the *new* (for those days) branching models were not able to optimize the compiled selection based on those models as well as they did for the very well supported (computed) `goto` model.

#### Variants

The simple branching-flow described above is also analyzed in some variants:

+ *flushed* branching-flow: the selector is used only to find the first worker to call, but all subsequent workers are also called; this is intended to favor `goto`, which follows this flow without the need for *nested checks*;
+ *probability-ordered* branching-flow: the selector values are (pre)ordered into a list from the most probable (to be called) selector value to the least probable; this is intended to help the optimizer to guess (e.g. by pre-fetching) the next most probable branch.
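
The probability-ordered variant can be sketched as follows: a short program that expands a list of key values and match counts into the sequence of selector values actually fed to the branching construct. The key values and counts are those used by test 3; the program and variable names (`ordered_selectors`, `sequence`) are assumptions made for this illustration only.

```fortran
program ordered_selectors
   implicit none
   integer, parameter :: keys(4)    = [3, 4, 1, 2]   ! key values, most probable first
   integer, parameter :: matches(4) = [10, 8, 6, 4]  ! how many times each key is selected
   integer, allocatable :: sequence(:)
   integer :: k, n

   allocate(sequence(sum(matches)))

   ! Fill the sequence with runs of identical keys, longest run first.
   n = 0
   do k = 1, size(keys)
      sequence(n+1:n+matches(k)) = keys(k)
      n = n + matches(k)
   end do

   print '(*(I0,1X))', sequence
end program ordered_selectors
```

The selector therefore changes only three times over 28 selections, which is the kind of regular pattern a branch predictor may exploit.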

### Demystified

@@ -63,8 +70,9 @@ The presupposed `goto` higher performance is a **myth** nowadays. Moreover, `got
### DEFY Tests

DEFY provides the following tests for this myth demystification:
+ [goto if select comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1);
+ [goto if select comparison 2](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_2);
+ [goto if block comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_block_comparison_1).
+ [goto if select comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1): the baseline test;
+ [goto if select comparison 2](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_2): a variant of the baseline test proposed by FortranFan;
+ [goto if select comparison 3](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_3): the baseline variation using pre-ordered most-probable selector values list;
+ [goto if block comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_block_comparison_1): the baseline variation with *flushed flow* bias.

See the README.md of each test for the results obtained.
28 changes: 19 additions & 9 deletions src/goto_is_fastest/goto_if_block_comparison_1/README.md
@@ -1,8 +1,16 @@
### Goto-if elseif-select case performance comparison, test 1

This test compares (computed) `goto` with the `if` branching-flow construct. The selector for the branching-jump is computed pseudo-randomly and the *work* done inside the *workers* called by each branch is not uniform.
This test compares (computed) `goto` with `if` and `block (if)` branching-flow constructs.

This is a modification of the [goto-if elseif-select case](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1) test proposed by Ron Shepard (`select case` is not considered in this test; the `block` construct is used instead). Essentially, the branching-flow is now *flushed*: the selector selects *from which keyword* to start calling the workers, and not only the worker corresponding to that keyword is called, but also all subsequent workers, e.g.
> The selector for the branching-jump is computed pseudo-randomly.
> The *work* done inside the *workers* called by each branch is not uniform; rather, it depends on the keyword value.

This is a modification of the [goto-if elseif-select case](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1) test proposed by Ron Shepard and further improved by FortranFan.

> `select case` is not considered in this test (because it generates a highly nested branching-flow, less clear than the others); the `block` construct is used instead.

Essentially, the branching-flow is now *flushed*: the selector selects *from which keyword* to start calling the workers, and not only the worker corresponding to that keyword is called, but also all subsequent workers, e.g.

```fortran
goto (1, 2, 3), keyword
@@ -26,7 +34,7 @@ selector: block
end block selector
```

In this case the `goto` should actually be advantaged, although the tests performed confirm again that the performances are almost identical.
In this case the `goto` should actually be advantaged, although the tests performed confirm (again) that the performances are almost identical.
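
The flushed flow can be illustrated with a small, self-contained sketch that prints which workers are called for each keyword. This is not the test source: `flushed_goto`, `flushed_if` and `worker` are hypothetical names, and ignoring out-of-range keywords is a choice made here only to keep the two forms equivalent.

```fortran
program flushed_flow
   implicit none
   integer :: keyword

   do keyword = 1, 3
      print '(A,I0)', 'keyword = ', keyword
      print '(A)', '  goto:'
      call flushed_goto(keyword)
      print '(A)', '  if:'
      call flushed_if(keyword)
   end do

contains
   subroutine flushed_goto(key)
      integer, intent(in) :: key
      ! Jump to the first worker, then fall through all the following ones.
      goto (1, 2, 3), key
      return                  ! out-of-range key: no worker is called
1     call worker(1)
2     call worker(2)
3     call worker(3)
   end subroutine flushed_goto

   subroutine flushed_if(key)
      integer, intent(in) :: key
      ! The same flow expressed with independent checks instead of a jump.
      if (key < 1 .or. key > 3) return
      if (key <= 1) call worker(1)
      if (key <= 2) call worker(2)
      call worker(3)
   end subroutine flushed_if

   subroutine worker(id)
      integer, intent(in) :: id
      print '(A,I0)', '    worker', id
   end subroutine worker
end program flushed_flow
```

With keyword 2, for instance, both forms call worker 2 and then worker 3.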

### Run test

@@ -39,9 +47,11 @@ Four bash scripts are provided to run the test:

### Results obtained

|Compiler|Optimizations|Architecture | goto | if |block |
|--------|-------------|-----------------------------------------------------|-----------|-----------|-----------|
| GNU | yes |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5480^10-4|0.5480^10-4|0.5480^10-4|
| GNU | no |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.7578^10-3|0.7578^10-3|0.7578^10-3|
| Intel | yes |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5228^10-4|0.5237^10-4|0.5237^10-4|
| Intel | no |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.9449^10-3|0.9550^10-3|0.9550^10-3|
|Compiler |Optimizations|Architecture | goto | if |block |
|----------------------|-------------|-----------------------------------------------------|-----------|-----------|-----------|
| GNU (6.2.0, 64bit) | -O3 |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5480^10-4|0.5480^10-4|0.5480^10-4|
| GNU (6.2.0, 64bit) | -Og |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.7578^10-3|0.7578^10-3|0.7578^10-3|
| Intel (16.0.3, 64bit)| -O3 |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5228^10-4|0.5237^10-4|0.5237^10-4|
| Intel (16.0.3, 64bit)| -O0 |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.9449^10-3|0.9550^10-3|0.9550^10-3|
| GNU (7.0.0, 32bit) | -?? |Intel Core [email protected], 4GB RAM, Windows 64-bit |0.1357^10-3|0.1356^10-3|0.1356^10-3|
| Intel (17.0.0, 64bit)| -?? |Intel Core [email protected], 4GB RAM, Windows 64-bit |0.4650^10-4|0.4400^10-4|0.4400^10-4|
18 changes: 11 additions & 7 deletions src/goto_is_fastest/goto_if_select_comparison_1/README.md
@@ -1,6 +1,10 @@
### Goto-if elseif-select case performance comparison, test 1

This test compares (computed) `goto` with `if elseif` and `select case` branching-flow constructs. The selector for the branching-jump is computed pseudo-randomly and the *work* done inside the *workers* called by each branch is not uniform.
This test compares (computed) `goto` with `if elseif` and `select case` branching-flow constructs.

> The selector for the branching-jump is computed pseudo-randomly.
> The *work* done inside the *workers* called by each branch is not uniform; rather, it depends on the keyword value.

### Run test

@@ -13,9 +17,9 @@ Four bash scripts are provided to run the test:

### Results obtained

|Compiler|Optimizations|Architecture | goto | if elseif | select case |
|--------|-------------|-----------------------------------------------------|-----------|-----------|-------------|
| GNU | yes |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3852^10-4|0.3856^10-4| 0.3857^10-4 |
| GNU | no |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5788^10-3|0.5778^10-3| 0.5783^10-3 |
| Intel | yes |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3896^10-4|0.3913^10-4| 0.3905^10-4 |
| Intel | no |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5796^10-3|0.5785^10-3| 0.5810^10-3 |
|Compiler |Optimizations|Architecture | goto | if elseif | select case |
|---------------|-------------|-----------------------------------------------------|-----------|-----------|-------------|
| GNU (6.2.0) | -O3 |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3852^10-4|0.3856^10-4| 0.3857^10-4 |
| GNU (6.2.0) | -Og |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5788^10-3|0.5778^10-3| 0.5783^10-3 |
| Intel (16.0.3)| -O3 |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3896^10-4|0.3913^10-4| 0.3905^10-4 |
| Intel (16.0.3)| -O0 |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5796^10-3|0.5785^10-3| 0.5810^10-3 |
7 changes: 5 additions & 2 deletions src/goto_is_fastest/goto_if_select_comparison_2/README.md
@@ -1,6 +1,6 @@
### Goto-if elseif-select case performance comparison, test 2

This test compares (computed) `goto` with `if elseif` and `select case` branching-flow constructs.
This test compares (computed) `goto` with `select case` branching-flow constructs.

To be completed.

@@ -15,4 +15,7 @@ Four bash scripts are provided to run the test:

### Results obtained

To be written.
|Compiler |Optimizations|Architecture | goto |select case |
|---------------|-------------|--------------------------------------------------|-----------|------------|
| Intel (16.0.3)| -O3 |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|2.0460^10-3|2.0394^10-3 |
| Intel (16.0.3)| -O0 |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|3.4972^10-3|4.0245^10-3 |
36 changes: 36 additions & 0 deletions src/goto_is_fastest/goto_if_select_comparison_3/README.md
@@ -0,0 +1,36 @@
### Goto-if elseif-select case performance comparison, test 3

This test compares (computed) `goto` with `if elseif` and `select case` branching-flow constructs.

The keywords are ordered as follows:

+ key values:
    + key(1) = 3
    + key(2) = 4
    + key(3) = 1
    + key(4) = 2
+ key probabilities:
    + key(1) ~ 36% (10 matches out of 28)
    + key(2) ~ 29% (8 matches out of 28)
    + key(3) ~ 21% (6 matches out of 28)
    + key(4) ~ 14% (4 matches out of 28)
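
The probabilities listed above follow directly from the match counts (10 + 8 + 6 + 4 = 28). A minimal sketch of that arithmetic, using a hypothetical `matches` array that is not part of the test source:

```fortran
program key_probabilities
   use iso_fortran_env, only: real64
   implicit none
   integer, parameter :: matches(4) = [10, 8, 6, 4]  ! matches per key, as listed above
   integer :: k

   ! Probability of each key = its matches over the total number of matches.
   do k = 1, size(matches)
      print '(A,I0,A,F5.1,A)', 'key(', k, ') ~ ', &
         100._real64*matches(k)/sum(matches), '%'
   end do
end program key_probabilities
```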

> The *work* done inside the *workers* called by each branch is not uniform; rather, it depends on the keyword value.

### Run test

Four bash scripts are provided to run the test:

1. `run_gnu.sh`, run the test with GNU gfortran compiler without optimizations;
2. `run_gnu_optimized.sh`, run the test with GNU gfortran compiler with optimizations;
3. `run_intel.sh`, run the test with Intel Fortran Compiler without optimizations;
4. `run_intel_optimized.sh`, run the test with Intel Fortran Compiler with optimizations;

### Results obtained

|Compiler       |Optimizations|Architecture                                      | goto      | if elseif | select case |
|---------------|-------------|--------------------------------------------------|-----------|-----------|-------------|
| GNU (6.2.0)   |     -O3     |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.1111^10-3|0.1111^10-3| 0.1111^10-3 |
| GNU (6.2.0)   |     -Og     |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.2136^10-2|0.2135^10-2| 0.2137^10-2 |
| Intel (16.0.3)|     -O3     |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.1143^10-3|0.1143^10-3| 0.1154^10-3 |
| Intel (16.0.3)|     -O0     |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.2691^10-2|0.2691^10-2| 0.2691^10-2 |
135 changes: 135 additions & 0 deletions src/goto_is_fastest/goto_if_select_comparison_3/defy.f90
@@ -0,0 +1,135 @@
! A DEFY (DEmystify Fortran mYths) test.
! Author: Stefano Zaghi
! Date: 2016-10-22
!
! License: this file is licensed under the Creative Commons Attribution 4.0 license,
! see http://creativecommons.org/licenses/by/4.0/ .

program defy
  use iso_fortran_env
  implicit none
  integer(int32), parameter :: tests_number = 3000
  integer(int32)            :: keyword
  integer(int32)            :: keywords(1:4,1:2)
  real(real64), allocatable :: key_work(:)
  integer(int64)            :: profiling(1:2)
  integer(int64)            :: count_rate
  real(real64)              :: system_clocks(1:3)
  integer(int32)            :: i
  integer(int32)            :: k
  integer(int32)            :: p

  keywords = 0
  ! keys value
  keywords(1,1) = 3
  keywords(2,1) = 4
  keywords(3,1) = 1
  keywords(4,1) = 2
  ! keys probability
  keywords(1,2) = 10
  keywords(2,2) = 8
  keywords(3,2) = 6
  keywords(4,2) = 4

  system_clocks = 0._real64
  do i=1, tests_number

    do k=1, size(keywords, dim=1)

      keyword = keywords(k, 1)

      do p=1, keywords(k, 2)

        call system_clock(profiling(1), count_rate)
        select case(keyword)
        case(1)
          call worker1(key=keyword, array=key_work)
        case(2)
          call worker2(key=keyword, array=key_work)
        case(3)
          call worker3(key=keyword, array=key_work)
        case(4)
          call worker4(key=keyword, array=key_work)
        endselect
        call system_clock(profiling(2), count_rate)
        system_clocks(1) = system_clocks(1) + real(profiling(2) - profiling(1), kind=real64)/count_rate

        call system_clock(profiling(1), count_rate)
        if (keyword==1) then
          call worker1(key=keyword, array=key_work)
        elseif (keyword==2) then
          call worker2(key=keyword, array=key_work)
        elseif (keyword==3) then
          call worker3(key=keyword, array=key_work)
        elseif (keyword==4) then
          call worker4(key=keyword, array=key_work)
        endif
        call system_clock(profiling(2), count_rate)
        system_clocks(2) = system_clocks(2) + real(profiling(2) - profiling(1), kind=real64)/count_rate

        call system_clock(profiling(1), count_rate)
        goto (10, 20, 30, 40), keyword
        goto 50
        10 call worker1(key=keyword, array=key_work) ; goto 50
        20 call worker2(key=keyword, array=key_work) ; goto 50
        30 call worker3(key=keyword, array=key_work) ; goto 50
        40 call worker4(key=keyword, array=key_work) ; goto 50
        50 continue
        call system_clock(profiling(2), count_rate)
        system_clocks(3) = system_clocks(3) + real(profiling(2) - profiling(1), kind=real64)/count_rate
      enddo
    enddo
  enddo
  print '(A,E23.15)', ' select case average performance: ', system_clocks(1)/tests_number
  print '(A,E23.15)', ' if elseif average performance: ', system_clocks(2)/tests_number
  print '(A,E23.15)', ' goto average performance: ', system_clocks(3)/tests_number

contains
  pure subroutine worker1(key, array)
    integer(int32), intent(in)             :: key
    real(real64), allocatable, intent(out) :: array(:)
    integer(int32)                         :: j

    allocate(array(1:key*tests_number))
    array = 0._real64
    do j=1, key*tests_number
      array(j) = key**2._real64 * tests_number * j
    enddo
  endsubroutine worker1

  pure subroutine worker2(key, array)
    integer(int32), intent(in)             :: key
    real(real64), allocatable, intent(out) :: array(:)
    integer(int32)                         :: j

    allocate(array(1:key*tests_number))
    array = 0._real64
    do j=1, key*tests_number
      array(j) = key**2._real64 * tests_number * j
    enddo
  endsubroutine worker2

  pure subroutine worker3(key, array)
    integer(int32), intent(in)             :: key
    real(real64), allocatable, intent(out) :: array(:)
    integer(int32)                         :: j

    allocate(array(1:key*tests_number))
    array = 0._real64
    do j=1, key*tests_number
      array(j) = key**2._real64 * tests_number * j
    enddo
  endsubroutine worker3

  pure subroutine worker4(key, array)
    integer(int32), intent(in)             :: key
    real(real64), allocatable, intent(out) :: array(:)
    integer(int32)                         :: j

    allocate(array(1:key*tests_number))
    array = 0._real64
    do j=1, key*tests_number
      array(j) = key**2._real64 * tests_number * j
    enddo
  endsubroutine worker4
endprogram defy
11 changes: 11 additions & 0 deletions src/goto_is_fastest/goto_if_select_comparison_3/run_gnu.sh
@@ -0,0 +1,11 @@
#!/bin/bash
# script to build and run DEFY tests.
#
# License: this file is licensed under the Creative Commons Attribution 4.0 license,
# see http://creativecommons.org/licenses/by/4.0/ .

test=$(basename $(pwd))/defy.f90
echo "Build and run $test by means of 'gfortran -Og'"
gfortran -Og defy.f90 -o defy
./defy
rm -f defy
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#!/bin/bash
# script to build and run DEFY tests.
#
# License: this file is licensed under the Creative Commons Attribution 4.0 license,
# see http://creativecommons.org/licenses/by/4.0/ .

test=$(basename $(pwd))/defy.f90
echo "Build and run $test by means of 'gfortran -O3'"
gfortran -O3 defy.f90 -o defy
./defy
rm -f defy
11 changes: 11 additions & 0 deletions src/goto_is_fastest/goto_if_select_comparison_3/run_intel.sh
@@ -0,0 +1,11 @@
#!/bin/bash
# script to build and run DEFY tests.
#
# License: this file is licensed under the Creative Commons Attribution 4.0 license,
# see http://creativecommons.org/licenses/by/4.0/ .

test=$(basename $(pwd))/defy.f90
echo "Build and run $test by means of 'ifort -O0'"
ifort -O0 defy.f90 -o defy
./defy
rm -f defy
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#!/bin/bash
# script to build and run DEFY tests.
#
# License: this file is licensed under the Creative Commons Attribution 4.0 license,
# see http://creativecommons.org/licenses/by/4.0/ .

test=$(basename $(pwd))/defy.f90
echo "Build and run $test by means of 'ifort -O3'"
ifort -O3 defy.f90 -o defy
./defy
rm -f defy
19 changes: 17 additions & 2 deletions src/powers_naive_definitions_have_overhead/README.md
@@ -1,10 +1,25 @@
### (Naive) definitions of powers (elevation) could have relevant overhead

To be written.
> A lazy (naive) definition of power elevations can generate relevant overhead, degrading the computational speed.

Power elevations can be written in different forms. Let us consider the square computation. It can be written as

+ `a*a`, by means of the multiplication operator;
+ `a**2`, by means of the power operator using the integer constant `2`;
+ `a**2.0`, by means of the power operator using the real constant `2.0` with the default kind;
+ `a**2.0_real64`, by means of the power operator using the real constant `2.0` with the 64 bits kind;

> These definitions are not equivalent in terms of computational speed: the list above should be ordered from the fastest to the slowest.

Similarly, the square root can be written as:

+ `sqrt(a)`, by means of the builtin `sqrt` function;
+ `a**0.5`, by means of the power operator using the real constant `0.5` with the default kind;
+ `a**0.5_real64`, by means of the power operator using the real constant `0.5` with the 64 bits kind;
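
The forms listed above can be timed with a sketch like the following. It is only an illustration under stated assumptions, not the DEFY test itself: the array size `n`, the `report` helper and the use of a checksum to keep the optimizer from discarding the work are choices made here, and the measured times will depend on compiler and flags.

```fortran
program power_forms
   use iso_fortran_env, only: int64, real64
   implicit none
   integer, parameter :: n = 5000000        ! array size, arbitrary for this sketch
   real(real64), allocatable :: a(:), b(:)
   integer(int64) :: t0, t1, rate

   allocate(a(n), b(n))
   call random_number(a)                    ! values in [0, 1), so sqrt is always defined

   call system_clock(t0, rate)
   b = a*a
   call system_clock(t1)
   call report('a*a')

   call system_clock(t0)
   b = a**2
   call system_clock(t1)
   call report('a**2')

   call system_clock(t0)
   b = a**2.0
   call system_clock(t1)
   call report('a**2.0')

   call system_clock(t0)
   b = a**2.0_real64
   call system_clock(t1)
   call report('a**2.0_real64')

   call system_clock(t0)
   b = sqrt(a)
   call system_clock(t1)
   call report('sqrt(a)')

   call system_clock(t0)
   b = a**0.5
   call system_clock(t1)
   call report('a**0.5')

   call system_clock(t0)
   b = a**0.5_real64
   call system_clock(t1)
   call report('a**0.5_real64')

contains
   subroutine report(label)
      character(*), intent(in) :: label
      ! Printing sum(b) uses the result, so the computation cannot be dropped.
      print '(A16,ES12.4,A,ES23.15)', label, &
         real(t1 - t0, real64)/real(rate, real64), ' s, checksum ', sum(b)
   end subroutine report
end program power_forms
```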

### Not demystified

To be written.
> The *myth* is confirmed (not demystified), but the overheads are somewhat smaller than expected.

### DEFY Tests
