simulation performance #46
Comments
I have been through DRAMSim2 with a not-so-simple profiler and I agree there are some pretty serious issues. At some point I was putting together a series of patches to try to cut out some of the more egregious inefficiencies. However, I got busy with my thesis and work, and they never saw the light of day because I never quite got them working. I could push these unfinished patches, but I'm not sure anyone would want to take on the task of finishing them off ...

Do you know if there's a way to check at compile time whether a builtin is supported by a compiler? I don't want to add this in and then have it break builds for people using compilers other than some specific version of gcc.
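One possible answer to the compile-time question, as a minimal sketch rather than anything taken from the DRAMSim2 tree: Clang (and GCC 10 or newer) expose `__has_builtin` to the preprocessor, and older GCC releases can be detected through `__GNUC__`. The thread does not say which builtin was meant, so `__builtin_clz` is an assumption here, and `portable_log2` and `HAVE_BUILTIN_CLZ` are hypothetical names. The helper computes floor(log2), which may not match the rounding of `dramsim_log2`.

```cpp
#include <cassert>
#include <climits>

// Compilers without __has_builtin get a fallback that always answers "no",
// so the #if below is valid everywhere.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Assumption: any compiler defining __GNUC__ (GCC, Clang, ICC) provides
// __builtin_clz, even if it predates __has_builtin.
#if __has_builtin(__builtin_clz) || defined(__GNUC__)
#define HAVE_BUILTIN_CLZ 1
#else
#define HAVE_BUILTIN_CLZ 0
#endif

// Hypothetical helper: floor(log2(value)) for value > 0, and 0 for value == 0.
inline unsigned portable_log2(unsigned value)
{
    if (value == 0)
        return 0;  // __builtin_clz(0) is undefined, so handle it up front
#if HAVE_BUILTIN_CLZ
    // Index of the highest set bit = (bit width - 1) - leading zero count.
    return static_cast<unsigned>(sizeof(unsigned) * CHAR_BIT - 1
                                 - __builtin_clz(value));
#else
    // Portable fallback: count how many times the value can be halved.
    unsigned result = 0;
    while (value >>= 1)
        ++result;
    return result;
#endif
}
```

With this shape, a compiler that lacks the builtin silently takes the loop-based path instead of failing the build.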
The log2 function is slow. But this is not the issue; calling it in the hot path is the real (performance) issue.

The appended precomputes frequently-used values to avoid computing them in the hot path. This results in a lot less instructions executed, and a significant increase in performance.

This closes issue umd-memsys#46: simulation performance.

Before the patch:

    Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

         20507.071226 task-clock                #    1.000 CPUs utilized            ( +-  1.09% )
                   61 context-switches          #    0.000 M/sec                    ( +-  0.95% )
                    0 CPU-migrations            #    0.000 M/sec
                  468 page-faults               #    0.000 M/sec
       58,683,786,689 cycles                    #    2.862 GHz                      ( +-  0.69% ) [83.33%]
       13,434,240,170 stalled-cycles-frontend   #   22.89% frontend cycles idle     ( +-  2.80% ) [83.34%]
        5,915,970,070 stalled-cycles-backend    #   10.08% backend cycles idle      ( +-  4.87% ) [66.67%]
      120,280,797,002 instructions              #    2.05 insns per cycle
                                                #    0.11 stalled cycles per insn   ( +-  0.00% ) [83.33%]
       23,425,385,282 branches                  # 1142.308 M/sec                    ( +-  0.01% ) [83.34%]
          226,637,631 branch-misses             #    0.97% of all branches          ( +-  1.03% ) [83.33%]

         20.514895432 seconds time elapsed                                          ( +-  1.09% )

After:

    Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

         14015.117288 task-clock                #    1.000 CPUs utilized            ( +-  1.02% )
                   58 context-switches          #    0.000 M/sec                    ( +- 10.11% )
                    0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
                  467 page-faults               #    0.000 M/sec
       39,609,908,343 cycles                    #    2.826 GHz                      ( +-  0.50% ) [83.33%]
        9,797,983,314 stalled-cycles-frontend   #   24.74% frontend cycles idle     ( +-  2.11% ) [83.33%]
        5,119,893,396 stalled-cycles-backend    #   12.93% backend cycles idle      ( +-  1.57% ) [66.67%]
       75,126,820,352 instructions              #    1.90 insns per cycle
                                                #    0.13 stalled cycles per insn   ( +-  0.00% ) [83.34%]
       13,072,793,821 branches                  #  932.764 M/sec                    ( +-  0.01% ) [83.35%]
          172,600,788 branch-misses             #    1.32% of all branches          ( +-  3.76% ) [83.34%]

         14.020889029 seconds time elapsed                                          ( +-  1.02% )

Signed-off-by: Emilio G. Cota <[email protected]>
The log2 function is slow. But this is not the issue; calling it in the hot path is the real (performance) issue.

The appended precomputes frequently-used values to avoid computing them in the hot path. This results in a lot less instructions executed, and a significant increase in performance.

This closes issue umd-memsys#46: simulation performance.

Before the patch:

    Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

         20507.071226 task-clock                #    1.000 CPUs utilized            ( +-  1.09% )
                   61 context-switches          #    0.000 M/sec                    ( +-  0.95% )
                    0 CPU-migrations            #    0.000 M/sec
                  468 page-faults               #    0.000 M/sec
       58,683,786,689 cycles                    #    2.862 GHz                      ( +-  0.69% ) [83.33%]
       13,434,240,170 stalled-cycles-frontend   #   22.89% frontend cycles idle     ( +-  2.80% ) [83.34%]
        5,915,970,070 stalled-cycles-backend    #   10.08% backend cycles idle      ( +-  4.87% ) [66.67%]
      120,280,797,002 instructions              #    2.05 insns per cycle
                                                #    0.11 stalled cycles per insn   ( +-  0.00% ) [83.33%]
       23,425,385,282 branches                  # 1142.308 M/sec                    ( +-  0.01% ) [83.34%]
          226,637,631 branch-misses             #    0.97% of all branches          ( +-  1.03% ) [83.33%]

         20.514895432 seconds time elapsed                                          ( +-  1.09% )

After:

    Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

         15562.506598 task-clock                #    1.000 CPUs utilized            ( +-  0.72% )
                   55 context-switches          #    0.000 M/sec
                    0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
                  469 page-faults               #    0.000 M/sec                    ( +-  0.07% )
       43,650,612,082 cycles                    #    2.805 GHz                      ( +-  0.58% ) [83.33%]
       11,878,548,969 stalled-cycles-frontend   #   27.21% frontend cycles idle     ( +-  1.46% ) [83.33%]
        6,125,126,936 stalled-cycles-backend    #   14.03% backend cycles idle      ( +-  3.74% ) [66.67%]
       82,655,485,444 instructions              #    1.89 insns per cycle
                                                #    0.14 stalled cycles per insn   ( +-  0.01% ) [83.33%]
       14,515,927,254 branches                  #  932.750 M/sec                    ( +-  0.02% ) [83.34%]
          235,566,078 branch-misses             #    1.62% of all branches          ( +-  1.87% ) [83.34%]

         15.568698124 seconds time elapsed                                          ( +-  0.72% )

Signed-off-by: Emilio G. Cota <[email protected]>
Problem:
Running a simple profiler, I see that the program spends a large share of its time in this function:
unsigned dramsim_log2(unsigned) in SystemConfiguration.h
dramsim_log2 is used too frequently. For example, it is invoked 7 times in function addressMapping in AddressMapping.cpp, on variables that do not change at all during one simulation.
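To illustrate the point about values that never change during a run, here is a rough sketch of computing the bit widths once and reusing them for every address. It is not the code from AddressMapping.cpp: `ceil_log2`, `MemoryGeometry`, `AddressBits`, `DecodedAddress`, `takeLowBits` and `decode` are all hypothetical names, and the field order (column, bank, rank, row, channel, from the low bits up) is just one assumed mapping scheme.

```cpp
#include <cassert>

// Hypothetical stand-in for the simulator's log2 helper: the smallest n with
// (1u << n) >= value. Assumes value <= 2^31.
inline unsigned ceil_log2(unsigned value)
{
    unsigned bits = 0;
    while (bits < 31 && (1u << bits) < value)
        ++bits;
    return bits;
}

// Hypothetical configuration; these counts are fixed for a whole simulation.
struct MemoryGeometry {
    unsigned numChannels, numRanks, numBanks, numRows, numCols;
};

// Bit widths derived from the geometry, computed once at start-up.
struct AddressBits {
    unsigned channelBits, rankBits, bankBits, rowBits, colBits;

    explicit AddressBits(const MemoryGeometry &g)
        : channelBits(ceil_log2(g.numChannels)),
          rankBits(ceil_log2(g.numRanks)),
          bankBits(ceil_log2(g.numBanks)),
          rowBits(ceil_log2(g.numRows)),
          colBits(ceil_log2(g.numCols))
    {
    }
};

struct DecodedAddress {
    unsigned channel, rank, bank, row, col;
};

// Strip the lowest `bits` bits from addr and return them.
inline unsigned takeLowBits(unsigned long long &addr, unsigned bits)
{
    unsigned field = static_cast<unsigned>(addr & ((1ULL << bits) - 1));
    addr >>= bits;
    return field;
}

// Hot path: only shifts and masks, no log2 calls per address.
inline DecodedAddress decode(unsigned long long addr, const AddressBits &b)
{
    DecodedAddress d;
    d.col = takeLowBits(addr, b.colBits);
    d.bank = takeLowBits(addr, b.bankBits);
    d.rank = takeLowBits(addr, b.rankBits);
    d.row = takeLowBits(addr, b.rowBits);
    d.channel = takeLowBits(addr, b.channelBits);
    return d;
}
```

An `AddressBits` object would be built once after the configuration is loaded and then passed by reference to every `decode` call, so the per-transaction cost no longer depends on how expensive the log2 routine is.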
My suggestion:
Thanks