simulation performance #46

Open
wky opened this issue Aug 21, 2014 · 1 comment

wky commented Aug 21, 2014

Problem:
Running a simple profiler, I see that the program spends a large share of its time in this function:
unsigned dramsim_log2(unsigned) in SystemConfiguration.h

dramsim_log2 is called too frequently. For example, it is invoked 7 times in the function addressMapping in AddressMapping.cpp, on variables that never change during a simulation.

My suggestion:

  1. Use __builtin_clz to calculate log2.
  2. Use precomputed log2 values in the program.
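
The two suggestions above can be sketched roughly as follows. This is a minimal illustration, not DRAMSim2 code: `fast_log2`, `AddressBitWidths`, and `takeLowBits` are hypothetical names, and the sketch assumes floor-log2 semantics on non-zero inputs (the rounding of the real dramsim_log2 for non-powers of two would need to be checked before swapping it out).

```cpp
#include <cassert>
#include <climits>

// floor(log2(x)) via the GCC/Clang count-leading-zeros builtin.
// __builtin_clz(0) is undefined, so x must be non-zero.
static inline unsigned fast_log2(unsigned x)
{
    assert(x != 0);
    return static_cast<unsigned>(sizeof(unsigned) * CHAR_BIT - 1)
         - static_cast<unsigned>(__builtin_clz(x));
}

// Hypothetical container for bit widths that stay fixed during one
// simulation. They are computed once here instead of on every call.
struct AddressBitWidths
{
    unsigned channelBits;
    unsigned rankBits;
    unsigned bankBits;

    AddressBitWidths(unsigned numChannels, unsigned numRanks, unsigned numBanks)
        : channelBits(fast_log2(numChannels)),
          rankBits(fast_log2(numRanks)),
          bankBits(fast_log2(numBanks))
    {
    }
};

// Hot-path helper: returns the low `bits` bits of addr and shifts them out.
// Assumes bits < 64. No log2 is computed here.
static inline unsigned takeLowBits(unsigned long long &addr, unsigned bits)
{
    unsigned field = static_cast<unsigned>(addr & ((1ULL << bits) - 1));
    addr >>= bits;
    return field;
}
```

With this shape, the bit widths would be built once after the configuration is loaded, and the per-request mapping code would only mask and shift.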

Thanks

dramninjasUMD (Collaborator) commented

I have been through DRAMSim2 with a not-so-simple profiler and I agree there are some pretty serious issues. At some point I was putting together a series of patches to try to cut out some of the more egregious inefficiencies. However, I got busy with my thesis and work and so they never made it to the light of day because I never quite got them working. I could push these unfinished patches, but I'm not sure anyone would want to take on the task of finishing them off ...

Do you know if there's a way to check at compile time whether a builtin is supported by the compiler? I don't want to add this in and then have it break builds for people using compilers other than some specific version of gcc.
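
One possible answer to the compile-time question, as a hedged sketch: Clang (and recent GCC releases) provide the `__has_builtin` preprocessor check, and compilers that define `__GNUC__` have shipped `__builtin_clz` for a long time. The macro `HAVE_BUILTIN_CLZ` and the functions `portable_log2` and `checked_log2` below are hypothetical names, and the fallback loop computes floor(log2(x)), which may not match dramsim_log2's rounding for non-powers of two.

```cpp
#include <cassert>
#include <climits>

// Compilers without __has_builtin get a stub that always reports "no".
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

// Use the builtin if the compiler reports it, or if it is GCC-compatible.
#if __has_builtin(__builtin_clz) || defined(__GNUC__)
#  define HAVE_BUILTIN_CLZ 1
#else
#  define HAVE_BUILTIN_CLZ 0
#endif

// Portable fallback: floor(log2(x)) by repeated shifting; returns 0 for x == 0.
static inline unsigned portable_log2(unsigned x)
{
    unsigned result = 0;
    while (x >>= 1)
        ++result;
    return result;
}

// Picks the builtin when available, otherwise the portable loop.
static inline unsigned checked_log2(unsigned x)
{
#if HAVE_BUILTIN_CLZ
    if (x == 0)
        return 0; // __builtin_clz(0) is undefined
    return static_cast<unsigned>(sizeof(unsigned) * CHAR_BIT - 1)
         - static_cast<unsigned>(__builtin_clz(x));
#else
    return portable_log2(x);
#endif
}
```

Builds on other compilers would then fall through to the loop rather than fail, at the cost of keeping two code paths that must agree.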

cota added a commit to cota/DRAMSim2 that referenced this issue Nov 5, 2014
The log2 function is slow, but that is not the real issue; calling it
in the hot path is the real performance issue.

The appended patch precomputes frequently used values so they are not
computed in the hot path. This results in far fewer instructions
executed and a significant increase in performance.

This closes issue umd-memsys#46: simulation performance.

Before the patch:

 Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

      20507.071226 task-clock                #    1.000 CPUs utilized            ( +-  1.09% )
                61 context-switches          #    0.000 M/sec                    ( +-  0.95% )
                 0 CPU-migrations            #    0.000 M/sec
               468 page-faults               #    0.000 M/sec
    58,683,786,689 cycles                    #    2.862 GHz                      ( +-  0.69% ) [83.33%]
    13,434,240,170 stalled-cycles-frontend   #   22.89% frontend cycles idle     ( +-  2.80% ) [83.34%]
     5,915,970,070 stalled-cycles-backend    #   10.08% backend  cycles idle     ( +-  4.87% ) [66.67%]
   120,280,797,002 instructions              #    2.05  insns per cycle
                                             #    0.11  stalled cycles per insn  ( +-  0.00% ) [83.33%]
    23,425,385,282 branches                  # 1142.308 M/sec                    ( +-  0.01% ) [83.34%]
       226,637,631 branch-misses             #    0.97% of all branches          ( +-  1.03% ) [83.33%]

      20.514895432 seconds time elapsed                                          ( +-  1.09% )

After:

 Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

      14015.117288 task-clock                #    1.000 CPUs utilized            ( +-  1.02% )
                58 context-switches          #    0.000 M/sec                    ( +- 10.11% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
               467 page-faults               #    0.000 M/sec
    39,609,908,343 cycles                    #    2.826 GHz                      ( +-  0.50% ) [83.33%]
     9,797,983,314 stalled-cycles-frontend   #   24.74% frontend cycles idle     ( +-  2.11% ) [83.33%]
     5,119,893,396 stalled-cycles-backend    #   12.93% backend  cycles idle     ( +-  1.57% ) [66.67%]
    75,126,820,352 instructions              #    1.90  insns per cycle
                                             #    0.13  stalled cycles per insn  ( +-  0.00% ) [83.34%]
    13,072,793,821 branches                  #  932.764 M/sec                    ( +-  0.01% ) [83.35%]
       172,600,788 branch-misses             #    1.32% of all branches          ( +-  3.76% ) [83.34%]

      14.020889029 seconds time elapsed                                          ( +-  1.02% )

Signed-off-by: Emilio G. Cota <[email protected]>
cota added a commit to cota/DRAMSim2 that referenced this issue Nov 5, 2014
The log2 function is slow, but that is not the real issue; calling it
in the hot path is the real performance issue.

The appended patch precomputes frequently used values so they are not
computed in the hot path. This results in far fewer instructions
executed and a significant increase in performance.

This closes issue umd-memsys#46: simulation performance.

Before the patch:

 Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

      20507.071226 task-clock                #    1.000 CPUs utilized            ( +-  1.09% )
                61 context-switches          #    0.000 M/sec                    ( +-  0.95% )
                 0 CPU-migrations            #    0.000 M/sec
               468 page-faults               #    0.000 M/sec
    58,683,786,689 cycles                    #    2.862 GHz                      ( +-  0.69% ) [83.33%]
    13,434,240,170 stalled-cycles-frontend   #   22.89% frontend cycles idle     ( +-  2.80% ) [83.34%]
     5,915,970,070 stalled-cycles-backend    #   10.08% backend  cycles idle     ( +-  4.87% ) [66.67%]
   120,280,797,002 instructions              #    2.05  insns per cycle
                                             #    0.11  stalled cycles per insn  ( +-  0.00% ) [83.33%]
    23,425,385,282 branches                  # 1142.308 M/sec                    ( +-  0.01% ) [83.34%]
       226,637,631 branch-misses             #    0.97% of all branches          ( +-  1.03% ) [83.33%]

      20.514895432 seconds time elapsed                                          ( +-  1.09% )

After:

 Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

      15562.506598 task-clock                #    1.000 CPUs utilized            ( +-  0.72% )
                55 context-switches          #    0.000 M/sec
                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
               469 page-faults               #    0.000 M/sec                    ( +-  0.07% )
    43,650,612,082 cycles                    #    2.805 GHz                      ( +-  0.58% ) [83.33%]
    11,878,548,969 stalled-cycles-frontend   #   27.21% frontend cycles idle     ( +-  1.46% ) [83.33%]
     6,125,126,936 stalled-cycles-backend    #   14.03% backend  cycles idle     ( +-  3.74% ) [66.67%]
    82,655,485,444 instructions              #    1.89  insns per cycle
                                             #    0.14  stalled cycles per insn  ( +-  0.01% ) [83.33%]
    14,515,927,254 branches                  #  932.750 M/sec                    ( +-  0.02% ) [83.34%]
       235,566,078 branch-misses             #    1.62% of all branches          ( +-  1.87% ) [83.34%]

      15.568698124 seconds time elapsed                                          ( +-  0.72% )

Signed-off-by: Emilio G. Cota <[email protected]>