add test of probability-ordered keys branching and update doc of others test with result for Core M5 skylake cpu
szaghi · Oct 22, 2016 · 6cfbbc3 · 6cfbbc3
1 parent ca11612
commit 6cfbbc3
Showing 13 changed files with 324 additions and 41 deletions.
diff --git a/README.md b/README.md
@@ -93,6 +93,7 @@ Currently DEFY collection includes:
 + [goto is fastest](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest):
   + [goto if select comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1);
   + [goto if select comparison 2](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_2);
+  + [goto if select comparison 3](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_3);
   + [goto if block comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_block_comparison_1);
 + [powers naive definitions have overhead](https://github.com/szaghi/DEFY/tree/master/src/powers_naive_definitions_have_overhead):
   + [powers 1](https://github.com/szaghi/DEFY/tree/master/src/powers_naive_definitions_have_overhead/powers_1):

diff --git a/src/goto_is_fastest/README.md b/src/goto_is_fastest/README.md
@@ -4,7 +4,7 @@
 
 #### Myth example
 
-The myth states that
+The myth states that, for a generic (possibly random) selector value, the `goto`-based branching-flow
 
 ```fortran
 goto (10, 20, 30), selector
@@ -18,7 +18,7 @@ goto 40
 40 continue
 ```
 
-is compiled into a **faster** branching-flow than
+is compiled into a **faster** selector than
 
 ```fortran
 select case(selector)
@@ -41,7 +41,14 @@ elseif (selector==3)
 end if
 ```
 
-The myth originates from the old-good days when other branching-flow models (e.g. `if elseif` and `select case`) were added to the language (the early Fortran 90 implementations) alongside `goto`: *probably* the early compilers implementations supporting the *new* (for those days) branching models were not able to optimized the compiled selection based on the models as well as they did for the very-well supported (computed) `goto` model.
+The myth originates from the good old days when other branching-flow models (e.g. `if elseif` and `select case`) were added to the language (the early Fortran 90 implementations) alongside `goto`: *probably* the early compiler implementations supporting the *new* (for those days) branching models were not able to optimize the compiled selection based on those models as well as they did for the very well supported (computed) `goto` model.
+
+#### Variants
+
+The simple branching-flow described above is also analyzed in some variants:
+
++ *flushed* branching-flow: the selector is used only to find the first worker to call, and all subsequent workers are then called as well; this is intended to favor `goto`, which follows this pattern without the need for *nested checks*;
++ *probability-ordered* branching-flow: the selector values are (pre) ordered into a list from the most probable (to be called) selector value to the least probable; this is intended to help the optimizer guess (e.g. by pre-fetching) the next most probable branch.
 
 ### Demystified
 
@@ -63,8 +70,9 @@ The presupposed `goto` higher performance is a **myth** nowadays. Moreover, `got
 ### DEFY Tests
 
 DEFY provides the following tests for this myth demystification:
-+ [goto if select comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1);
-+ [goto if select comparison 2](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_2);
-+ [goto if block comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_block_comparison_1).
++ [goto if select comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1): the baseline test;
++ [goto if select comparison 2](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_2): a variant of the baseline test proposed by FortranFan;
++ [goto if select comparison 3](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_3): the baseline variation using a list of selector values pre-ordered from the most probable to the least probable;
++ [goto if block comparison 1](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_block_comparison_1): the baseline variation with *flushed flow* bias.
 
 See their README.md to see the results obtained.
diff --git a/src/goto_is_fastest/goto_if_block_comparison_1/README.md b/src/goto_is_fastest/goto_if_block_comparison_1/README.md
@@ -1,8 +1,16 @@
 ### Goto-if elseif-select case performance comparison, test 1
 
-This test compare (computed) `goto` with `if` branching-flow construct. The selector for the branching-jump is computed pseudo-randomically and the *work* done inside the *workers* called by each branch is not uniform.
+This test compares (computed) `goto` with `if` and `block (if)` branching-flow constructs.
 
-This is a modification of [goto-if elseif-select case](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1) test proposed by Ron Shepard (select case is not considered into this test, rather the `block` construct). Essentially, the branching-flow is now *flushed*: the selector selects *from which keyword* start to call the workers and call not only the worker corresponding to that keyword, but also all subsequent workers, e.g.
+> The selector for the branching-jump is computed pseudo-randomly.
+
+> The *work* done inside the *workers* called by each branch is not uniform; rather, it depends on the keyword value.
+
+This is a modification of the [goto-if elseif-select case](https://github.com/szaghi/DEFY/tree/master/src/goto_is_fastest/goto_if_select_comparison_1) test, proposed by Ron Shepard and further improved by FortranFan.
+
+> `select case` is not considered in this test (because it generates a highly nested branching-flow that is less clear than the others); the `block` construct is used instead.
+
+Essentially, the branching-flow is now *flushed*: the selector selects *from which keyword* to start calling the workers, and not only the worker corresponding to that keyword is called, but also all subsequent workers, e.g.
 
 ```fortran
 goto (1, 2, 3), keyword
@@ -26,7 +34,7 @@ selector: block
 end block selector
 ```
 
-In this case the `goto` should actually be advantaged, although the tests performed confirm again that the performance are almost identical.
+In this case the `goto` should actually have an advantage, although the tests performed confirm (again) that the performances are almost identical.
 
 ### Run test
 
@@ -39,9 +47,11 @@ Four bash scripts are provided to run the test:
 
 ### Results obtained
 
-|Compiler|Optimizations|Architecture                                         | goto      | if        |block      |
-|--------|-------------|-----------------------------------------------------|-----------|-----------|-----------|
-| GNU    |   yes       |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5480^10-4|0.5480^10-4|0.5480^10-4|
-| GNU    |   no        |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.7578^10-3|0.7578^10-3|0.7578^10-3|
-| Intel  |   yes       |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5228^10-4|0.5237^10-4|0.5237^10-4|
-| Intel  |   no        |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.9449^10-3|0.9550^10-3|0.9550^10-3|
+|Compiler              |Optimizations|Architecture                                         | goto      | if        |block      |
+|----------------------|-------------|-----------------------------------------------------|-----------|-----------|-----------|
+| GNU (6.2.0, 64bit)   | -O3         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5480^10-4|0.5480^10-4|0.5480^10-4|
+| GNU (6.2.0, 64bit)   | -Og         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.7578^10-3|0.7578^10-3|0.7578^10-3|
+| Intel (16.0.3, 64bit)| -O3         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5228^10-4|0.5237^10-4|0.5237^10-4|
+| Intel (16.0.3, 64bit)| -O0         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.9449^10-3|0.9550^10-3|0.9550^10-3|
+| GNU (7.0.0, 32bit)   | -??         |Intel Core [email protected], 4GB RAM, Windows 64-bit  |0.1357^10-3|0.1356^10-3|0.1356^10-3|
+| Intel (17.0.0, 64bit)| -??         |Intel Core [email protected], 4GB RAM, Windows 64-bit  |0.4650^10-4|0.4400^10-4|0.4400^10-4|
diff --git a/src/goto_is_fastest/goto_if_select_comparison_1/README.md b/src/goto_is_fastest/goto_if_select_comparison_1/README.md
@@ -1,6 +1,10 @@
 ### Goto-if elseif-select case performance comparison, test 1
 
-This test compare (computed) `goto` with `if elseif` and `select case` branching-flow constructs. The selector for the branching-jump is computed pseudo-randomically and the *work* done inside the *workers* called by each branch is not uniform.
+This test compares (computed) `goto` with `if elseif` and `select case` branching-flow constructs.
+
+> The selector for the branching-jump is computed pseudo-randomly.
+
+> The *work* done inside the *workers* called by each branch is not uniform; rather, it depends on the keyword value.
 
 ### Run test
 
@@ -13,9 +17,9 @@ Four bash scripts are provided to run the test:
 
 ### Results obtained
 
-|Compiler|Optimizations|Architecture                                         | goto      | if elseif | select case |
-|--------|-------------|-----------------------------------------------------|-----------|-----------|-------------|
-| GNU    |   yes       |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3852^10-4|0.3856^10-4| 0.3857^10-4 |
-| GNU    |   no        |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5788^10-3|0.5778^10-3| 0.5783^10-3 |
-| Intel  |   yes       |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3896^10-4|0.3913^10-4| 0.3905^10-4 |
-| Intel  |   no        |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5796^10-3|0.5785^10-3| 0.5810^10-3 |
+|Compiler       |Optimizations|Architecture                                         | goto      | if elseif | select case |
+|---------------|-------------|-----------------------------------------------------|-----------|-----------|-------------|
+| GNU (6.2.0)   | -O3         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3852^10-4|0.3856^10-4| 0.3857^10-4 |
+| GNU (6.2.0)   | -Og         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5788^10-3|0.5778^10-3| 0.5783^10-3 |
+| Intel (16.0.3)| -O3         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.3896^10-4|0.3913^10-4| 0.3905^10-4 |
+| Intel (16.0.3)| -O0         |Intel Xeon [email protected], 24GB RAM, x86_64 Arch Linux|0.5796^10-3|0.5785^10-3| 0.5810^10-3 |
diff --git a/src/goto_is_fastest/goto_if_select_comparison_2/README.md b/src/goto_is_fastest/goto_if_select_comparison_2/README.md
@@ -1,6 +1,6 @@
 ### Goto-if elseif-select case performance comparison, test 1
 
-This test compare (computed) `goto` with `if elseif` and `select case` branching-flow constructs.
+This test compares (computed) `goto` with `select case` branching-flow constructs.
 
 To be completed.
 
@@ -15,4 +15,7 @@ Four bash scripts are provided to run the test:
 
 ### Results obtained
 
-To be written.
+|Compiler       |Optimizations|Architecture                                      | goto      |select case |
+|---------------|-------------|--------------------------------------------------|-----------|------------|
+| Intel (16.0.3)| -O3         |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|2.0460^10-3|2.0394^10-3 |
+| Intel (16.0.3)| -O0         |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|3.4972^10-3|4.0245^10-3 |
diff --git a/src/goto_is_fastest/goto_if_select_comparison_3/README.md b/src/goto_is_fastest/goto_if_select_comparison_3/README.md
@@ -0,0 +1,36 @@
+### Goto-if elseif-select case performance comparison, test 3
+
+This test compares (computed) `goto` with `if elseif` and `select case` branching-flow constructs.
+
+The keywords are ordered as follows:
+
++ keys value:
+  + key(1) = 3
+  + key(2) = 4
+  + key(3) = 1
+  + key(4) = 2
++ keys probability:
+  + key(1) ~ 36% (10 matches out of 28)
+  + key(2) ~ 29% (8  matches out of 28)
+  + key(3) ~ 21% (6  matches out of 28)
+  + key(4) ~ 14% (4  matches out of 28)
+
+> The *work* done inside the *workers* called by each branch is not uniform; rather, it depends on the keyword value.
+
+### Run test
+
+Four bash scripts are provided to run the test:
+
+1. `run_gnu.sh`, run the test with GNU gfortran compiler without optimizations;
+2. `run_gnu_optimized.sh`, run the test with GNU gfortran compiler with optimizations;
+3. `run_intel.sh`, run the test with Intel Fortran Compiler without optimizations;
+4. `run_intel_optimized.sh`, run the test with Intel Fortran Compiler with optimizations.
+
+### Results obtained
+
+|Compiler       |Optimizations|Architecture                                      | goto      | if elseif | select case |
+|---------------|-------------|--------------------------------------------------|-----------|-----------|-------------|
+| GNU (6.2.0)   | -O3         |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.1111^10-3|0.1111^10-3| 0.1111^10-3 |
+| GNU (6.2.0)   | -Og         |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.2136^10-2|0.2135^10-2| 0.2137^10-2 |
+| Intel (16.0.3)| -O3         |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.1143^10-3|0.1143^10-3| 0.1154^10-3 |
+| Intel (16.0.3)| -O0         |Intel Core [email protected], 4GB RAM, x86_64 Ubuntu|0.2691^10-2|0.2691^10-2| 0.2691^10-2 |
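
The percentages quoted in this README can be cross-checked against the repetition counts stored in `keywords(:,2)` of `defy.f90`. The following is a minimal sketch, not part of the commit; the program name `key_probabilities` and the output format are assumptions made for illustration.

```fortran
program key_probabilities
  ! Illustrative sketch (not part of the commit): derive the share of each key
  ! from the repetition counts used by defy.f90.
  use iso_fortran_env, only: int32, real64
  implicit none
  integer(int32) :: keywords(1:4,1:2)
  integer(int32) :: k, total
  real(real64)   :: share

  keywords(:,1) = [3, 4, 1, 2]    ! keys value, as in defy.f90
  keywords(:,2) = [10, 8, 6, 4]   ! repetitions of each key, as in defy.f90

  total = sum(keywords(:,2))
  if (total /= 28) error stop 'unexpected total number of matches'

  do k = 1, size(keywords, dim=1)
    ! share of key k in percent, e.g. 10/28 for key(1)
    share = 100._real64 * keywords(k,2) / total
    print '(A,I0,A,I0,A,F5.1,A)', 'key(', k, ') = ', keywords(k,1), ' : ', share, '%'
  end do
end program key_probabilities
```

Note that `defy.f90` visits the keys in consecutive blocks (10 times key 3, then 8 times key 4, and so on) rather than sampling them at random, so these shares describe how often each key occurs among the 28 dispatches of one outer iteration.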
diff --git a/src/goto_is_fastest/goto_if_select_comparison_3/defy.f90 b/src/goto_is_fastest/goto_if_select_comparison_3/defy.f90
@@ -0,0 +1,135 @@
+! A DEFY (DEmystyfy Fortran mYths) test.
+! Author: Stefano Zaghi
+! Date: 2016-10-22
+!
+! License: this file is licensed under the Creative Commons Attribution 4.0 license,
+! see http://creativecommons.org/licenses/by/4.0/ .
+
+program defy
+  use iso_fortran_env
+  implicit none
+  integer(int32), parameter :: tests_number = 3000
+  integer(int32)            :: keyword
+  integer(int32)            :: keywords(1:4,1:2)
+  real(real64), allocatable :: key_work(:)
+  integer(int64)            :: profiling(1:2)
+  integer(int64)            :: count_rate
+  real(real64)              :: system_clocks(1:3)
+  integer(int32)            :: i
+  integer(int32)            :: k
+  integer(int32)            :: p
+
+  keywords = 0
+  ! keys value
+  keywords(1,1) = 3
+  keywords(2,1) = 4
+  keywords(3,1) = 1
+  keywords(4,1) = 2
+  ! keys probability
+  keywords(1,2) = 10
+  keywords(2,2) = 8
+  keywords(3,2) = 6
+  keywords(4,2) = 4
+
+  system_clocks = 0._real64
+  do i=1, tests_number
+
+    do k=1, size(keywords, dim=1)
+
+      keyword = keywords(k, 1)
+
+      do p=1, keywords(k, 2)
+
+        call system_clock(profiling(1), count_rate)
+        select case(keyword)
+        case(1)
+          call worker1(key=keyword, array=key_work)
+        case(2)
+          call worker2(key=keyword, array=key_work)
+        case(3)
+          call worker3(key=keyword, array=key_work)
+        case(4)
+          call worker4(key=keyword, array=key_work)
+        endselect
+        call system_clock(profiling(2), count_rate)
+        system_clocks(1) = system_clocks(1) + real(profiling(2) - profiling(1), kind=real64)/count_rate
+
+        call system_clock(profiling(1), count_rate)
+        if (keyword==1) then
+          call worker1(key=keyword, array=key_work)
+        elseif (keyword==2) then
+          call worker2(key=keyword, array=key_work)
+        elseif (keyword==3) then
+          call worker3(key=keyword, array=key_work)
+        elseif (keyword==4) then
+          call worker4(key=keyword, array=key_work)
+        endif
+        call system_clock(profiling(2), count_rate)
+        system_clocks(2) = system_clocks(2) + real(profiling(2) - profiling(1), kind=real64)/count_rate
+
+        call system_clock(profiling(1), count_rate)
+        goto (10, 20, 30, 40), keyword
+        goto 50
+        10 call worker1(key=keyword, array=key_work) ; goto 50
+        20 call worker2(key=keyword, array=key_work) ; goto 50
+        30 call worker3(key=keyword, array=key_work) ; goto 50
+        40 call worker4(key=keyword, array=key_work) ; goto 50
+        50 continue
+        call system_clock(profiling(2), count_rate)
+        system_clocks(3) = system_clocks(3) + real(profiling(2) - profiling(1), kind=real64)/count_rate
+      enddo
+    enddo
+  enddo
+  print '(A,E23.15)', ' select case average performance: ', system_clocks(1)/tests_number
+  print '(A,E23.15)', ' if elseif   average performance: ', system_clocks(2)/tests_number
+  print '(A,E23.15)', ' goto        average performance: ', system_clocks(3)/tests_number
+
+  contains
+    pure subroutine worker1(key, array)
+      integer(int32),            intent(in)  :: key
+      real(real64), allocatable, intent(out) :: array(:)
+      integer(int32)                         :: j
+
+      allocate(array(1:key*tests_number))
+      array = 0._real64
+      do j=1, key*tests_number
+        array(j) = key**2._real64 * tests_number * j
+      enddo
+    endsubroutine worker1
+
+    pure subroutine worker2(key, array)
+      integer(int32),            intent(in)  :: key
+      real(real64), allocatable, intent(out) :: array(:)
+      integer(int32)                         :: j
+
+      allocate(array(1:key*tests_number))
+      array = 0._real64
+      do j=1, key*tests_number
+        array(j) = key**2._real64 * tests_number * j
+      enddo
+    endsubroutine worker2
+
+    pure subroutine worker3(key, array)
+      integer(int32),            intent(in)  :: key
+      real(real64), allocatable, intent(out) :: array(:)
+      integer(int32)                         :: j
+
+      allocate(array(1:key*tests_number))
+      array = 0._real64
+      do j=1, key*tests_number
+        array(j) = key**2._real64 * tests_number * j
+      enddo
+    endsubroutine worker3
+
+    pure subroutine worker4(key, array)
+      integer(int32),            intent(in)  :: key
+      real(real64), allocatable, intent(out) :: array(:)
+      integer(int32)                         :: j
+
+      allocate(array(1:key*tests_number))
+      array = 0._real64
+      do j=1, key*tests_number
+        array(j) = key**2._real64 * tests_number * j
+      enddo
+    endsubroutine worker4
+endprogram defy
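
Each of the three timed sections in `defy.f90` repeats the same `system_clock` arithmetic. As a hedged sketch of that idiom (the helper `elapsed`, the repetition count and the dummy workload are hypothetical and not part of the commit), the conversion from clock counts to seconds can be isolated as follows:

```fortran
program timing_sketch
  ! Illustrative sketch (not part of the commit): accumulate elapsed seconds
  ! over repeated runs of a dummy workload using system_clock.
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: repetitions = 100   ! hypothetical number of repetitions
  integer(int64)     :: tic, toc, rate
  real(real64)       :: total, x
  integer            :: i, rep

  total = 0._real64
  x = 0._real64
  do rep = 1, repetitions
    call system_clock(tic, rate)
    do i = 1, 10000                 ! dummy workload standing in for a worker
      x = x + sqrt(real(i, real64))
    end do
    call system_clock(toc)
    total = total + elapsed(tic, toc, rate)
  end do

  if (total < 0._real64) error stop 'negative elapsed time'
  ! x is printed so that the workload is not optimized away
  print '(A,E23.15)', ' average seconds per repetition: ', total/repetitions
  print '(A,E23.15)', ' workload result:                ', x

contains
  pure function elapsed(start, finish, count_rate) result(seconds)
    ! Convert a pair of clock counts into seconds.
    integer(int64), intent(in) :: start, finish, count_rate
    real(real64)               :: seconds

    seconds = real(finish - start, real64) / real(count_rate, real64)
  end function elapsed
end program timing_sketch
```

Keeping the conversion in one place makes it harder for the three timed sections to drift apart; whether that matters for a short test like this one is a judgment call.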
diff --git a/src/goto_is_fastest/goto_if_select_comparison_3/run_gnu.sh b/src/goto_is_fastest/goto_if_select_comparison_3/run_gnu.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+# script to build and run DEFY tests.
+#
+# License: this file is licensed under the Creative Commons Attribution 4.0 license,
+# see http://creativecommons.org/licenses/by/4.0/ .
+
+test=$(basename $(pwd))/defy.f90
+echo "Build and run $test by means of 'gfortran -Og'"
+gfortran -Og defy.f90 -o defy
+./defy
+rm -f defy
diff --git a/src/goto_is_fastest/goto_if_select_comparison_3/run_gnu_optimized.sh b/src/goto_is_fastest/goto_if_select_comparison_3/run_gnu_optimized.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+# script to build and run DEFY tests.
+#
+# License: this file is licensed under the Creative Commons Attribution 4.0 license,
+# see http://creativecommons.org/licenses/by/4.0/ .
+
+test=$(basename $(pwd))/defy.f90
+echo "Build and run $test by means of 'gfortran -O3'"
+gfortran -O3 defy.f90 -o defy
+./defy
+rm -f defy
diff --git a/src/goto_is_fastest/goto_if_select_comparison_3/run_intel.sh b/src/goto_is_fastest/goto_if_select_comparison_3/run_intel.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+# script to build and run DEFY tests.
+#
+# License: this file is licensed under the Creative Commons Attribution 4.0 license,
+# see http://creativecommons.org/licenses/by/4.0/ .
+
+test=$(basename $(pwd))/defy.f90
+echo "Build and run $test by means of 'ifort -O0'"
+ifort -O0 defy.f90 -o defy
+./defy
+rm -f defy
diff --git a/src/goto_is_fastest/goto_if_select_comparison_3/run_intel_optimized.sh b/src/goto_is_fastest/goto_if_select_comparison_3/run_intel_optimized.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+# script to build and run DEFY tests.
+#
+# License: this file is licensed under the Creative Commons Attribution 4.0 license,
+# see http://creativecommons.org/licenses/by/4.0/ .
+
+test=$(basename $(pwd))/defy.f90
+echo "Build and run $test by means of 'ifort -O3'"
+ifort -O3 defy.f90 -o defy
+./defy
+rm -f defy
diff --git a/src/powers_naive_definitions_have_overhead/README.md b/src/powers_naive_definitions_have_overhead/README.md
@@ -1,10 +1,25 @@
 ### (Naive) definitions of powers (elevation) could have relevant overhead
 
-To be written.
+> A lazy (naive) definition of power elevations can generate significant overhead that degrades the computational speed.
+
+Power elevations can be written in different forms. Let us consider the square computation. It can be written as
+
++ `a*a`, by means of the multiplication operator;
++ `a**2`, by means of the power operator using the integer constant `2`;
++ `a**2.0`, by means of the power operator using the real constant `2.0` with the default kind;
++ `a**2.0_real64`, by means of the power operator using the real constant `2.0` with the 64-bit kind.
+
+> These definitions are not equivalent in terms of computational speed: as listed, they are expected to run from the fastest to the slowest.
+
+Similarly, the square root can be written as:
+
++ `sqrt(a)`, by means of the builtin `sqrt` function;
++ `a**0.5`, by means of the power operator using the real constant `0.5` with the default kind;
++ `a**0.5_real64`, by means of the power operator using the real constant `0.5` with the 64-bit kind.
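
A minimal sketch of how three of the square variants listed above could be timed against each other follows. It is not the DEFY test itself: the program name, the array size `n` and the checksum-based consistency check are assumptions chosen for illustration, and the timings it prints depend on the compiler and flags used.

```fortran
program powers_sketch
  ! Illustrative sketch (not the DEFY test): time a*a, a**2 and a**2.0_real64.
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter        :: n = 1000000   ! hypothetical problem size
  real(real64), allocatable :: a(:), b(:)
  real(real64)              :: checksum(3), seconds(3)
  integer(int64)            :: tic, toc, rate
  integer                   :: i

  allocate(a(n), b(n))
  do i = 1, n
    a(i) = 1._real64 + real(i, real64) / real(n, real64)
  end do

  call system_clock(tic, rate)
  b = a * a                       ! multiplication operator
  call system_clock(toc)
  seconds(1) = real(toc - tic, real64) / real(rate, real64)
  checksum(1) = sum(b)

  call system_clock(tic, rate)
  b = a ** 2                      ! power operator, integer exponent
  call system_clock(toc)
  seconds(2) = real(toc - tic, real64) / real(rate, real64)
  checksum(2) = sum(b)

  call system_clock(tic, rate)
  b = a ** 2.0_real64             ! power operator, 64-bit real exponent
  call system_clock(toc)
  seconds(3) = real(toc - tic, real64) / real(rate, real64)
  checksum(3) = sum(b)

  ! The three forms must agree up to rounding; using the checksums also
  ! prevents the compiler from discarding the computations.
  if (any(abs(checksum - checksum(1)) > 1.e-10_real64 * abs(checksum(1)))) &
    error stop 'square variants disagree'

  print '(A,ES12.4)', ' a*a           seconds: ', seconds(1)
  print '(A,ES12.4)', ' a**2          seconds: ', seconds(2)
  print '(A,ES12.4)', ' a**2.0_real64 seconds: ', seconds(3)
end program powers_sketch
```

A single pass over one array is a coarse measurement; repeating each section and averaging, as the DEFY tests do, would give steadier numbers.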
 
 ### Not demystified
 
-To be written.
+> The *myth* is confirmed (not demystified), but the overheads are somewhat smaller than expected.
 
 ### DEFY Tests