Contributors: Kyriafinis Vasilis, Nikolaos Giannopoulos
Winter Semester 2021 - 2022
- 1. Contents
- 2. Simulation Parameters (Question 1)
- 3. Run Statistics (Question 2)
- 4. CPI (Question 3)
- 5. Documentation References (Question 4)
- 6. Compile and simulate a C program for ARM (Question 4a)
- 7. Change parameters and detect differences (Question 4b)
- 8. Sources
- 9. Comments on the assignment
The file starter_se.py contains the script that defines the characteristics of the system to be emulated. The command to run the simulation is:
./build/ARM/gem5.opt configs/example/arm/starter_se.py --cpu="minor" "tests/test-progs/hello/bin/arm/linux/hello"
- The configs/example/arm/starter_se.py argument is the path to the configuration script of the simulation.
- The --cpu="minor" and "tests/test-progs/hello/bin/arm/linux/hello" arguments are the command line arguments for the configuration script:
  - --cpu="minor": Defines the type of the CPU to be used (default="atomic").
  - "tests/test-progs/hello/bin/arm/linux/hello": The path to the executable binary that will run in the emulator.
In lines 189 to 208 of the file starter_se.py the arguments are parsed. For this simulation only the above arguments were provided, so all the other arguments defaulted to the values specified in the script (a sketch of how such options are typically declared is shown after the list):
- --cpu-freq: CPU clock frequency. Default = 1GHz
- --num-cores: Number of cores of the CPU. Default = 1
- --mem-type: Type of system memory. Default = DDR3_1600_8x8, meaning transfer rate = 1.6 x 8 x 8 x 1/8 = 12.8GBps
- --mem-channels: The number of memory channels. Default = 2
- --mem-ranks: Number of ranks per memory channel. Default = None
- --mem-size: Memory size. Default = 2GB
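For illustration, the following is a minimal argparse sketch of how options like these are typically declared. It mirrors the names and defaults listed above but is not the exact code from starter_se.py; in particular, the choices offered for --cpu are an assumption.

```python
import argparse

# Minimal sketch of the option parsing described above; the names and
# defaults mirror the list, the real starter_se.py may declare them differently.
parser = argparse.ArgumentParser()
parser.add_argument("binary", type=str,
                    help="Executable binary to run in the simulator")
parser.add_argument("--cpu", choices=["atomic", "minor", "hpi"], default="atomic",
                    help="CPU model to simulate")  # choices are an assumption
parser.add_argument("--cpu-freq", type=str, default="1GHz")
parser.add_argument("--num-cores", type=int, default=1)
parser.add_argument("--mem-type", type=str, default="DDR3_1600_8x8")
parser.add_argument("--mem-channels", type=int, default=2)
parser.add_argument("--mem-ranks", type=int, default=None)
parser.add_argument("--mem-size", type=str, default="2GB")

args = parser.parse_args()
```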
- System Clock: The system clock has a default frequency of 1GHz and it is not the same clock as the CPU clock. To change the system clock frequency the value in the line below must be changed.
self.clk_domain = SrcClockDomain(clock="1GHz", voltage_domain=self.voltage_domain)
- Memory bus: The memory bus connects the CPU with the system memory.
self.membus = SystemXBar()
- DRAM: The Dynamic Random Access Memory of the system. By default the starter_se.py script specifies DDR3_1600_8x8, but this can be changed by setting the flag --mem-type {type of DRAM}.
- CPU: The type and the frequency of the CPU are determined by the command line arguments. To change the default frequency of the CPU the --cpu-freq argument must be passed to the script.
self.cpu_cluster = devices.CpuCluster(self, args.num_cores, args.cpu_freq, "1.2V", *cpu_types[args.cpu])
Depending on the type of the CPU it is determined whether the system will have cache memories. If the memory model is not Atomic, which means data are read from memory with no delay, 2 cache levels are created.
- L1 Cache: The L1 cache is private to each core, so every core can only access its own L1 cache.
self.cpu_cluster.addL1()
- L2 Cache: The L2 cache is shared between all of the cores and is unified, meaning data and instructions use the same memory.
self.cpu_cluster.addL2(self.cpu_cluster.clk_domain)
A condensed sketch of how these pieces fit together is given below.
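The sketch is assembled from the lines quoted above and is not the complete starter_se.py; in particular, devices.CpuCluster and the cpu_types mapping come from gem5's configs/example/arm/devices.py and starter_se.py, and the condition used for adding caches is an assumption.

```python
# Condensed illustration only, not the full starter_se.py.
from m5.objects import System, SrcClockDomain, VoltageDomain, SystemXBar

import devices  # gem5's configs/example/arm/devices.py (must be on the path)


class SimpleSeSystem(System):
    def __init__(self, args, cpu_types, **kwargs):
        super().__init__(**kwargs)

        # System clock: 1GHz by default, separate from the CPU clock
        self.voltage_domain = VoltageDomain()
        self.clk_domain = SrcClockDomain(clock="1GHz",
                                         voltage_domain=self.voltage_domain)

        # Memory bus connecting the CPU cluster to system memory
        self.membus = SystemXBar()

        # CPU cluster built from the command-line arguments
        self.cpu_cluster = devices.CpuCluster(self, args.num_cores,
                                              args.cpu_freq, "1.2V",
                                              *cpu_types[args.cpu])

        # Two cache levels are only created for non-atomic CPU models
        # (assumption: the atomic model reads memory with no delay)
        if args.cpu != "atomic":
            self.cpu_cluster.addL1()                             # per-core private L1 I/D caches
            self.cpu_cluster.addL2(self.cpu_cluster.clk_domain)  # shared, unified L2
```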
The stats.txt file contains information about the simulation from all the SimObjects. At the end of the simulation the statistics are automatically dumped to the file. Some important information can be derived from the values below; a small script for extracting them is shown after the list:
- sim_seconds: Number of seconds simulated (0.000035 s). This is the time the binary took to execute on the simulated system.
- sim_insts: The total number of instructions that were simulated (5027 instructions).
- host_inst_rate: The instructions per second of the simulator on the host machine (118842 inst/s). Basically this is the performance of the gem5 simulator.
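As a quick way to pull these values out, the following is a minimal sketch assuming the usual stats.txt layout (key, value, description) and the statistic names quoted above; the path may differ if -d was used to redirect the output.

```python
# Sketch: print a few headline statistics from a gem5 stats.txt file.
wanted = {"sim_seconds", "sim_insts", "host_inst_rate"}

with open("m5out/stats.txt") as f:  # default output directory
    for line in f:
        parts = line.split()
        if len(parts) >= 2 and parts[0] in wanted:
            print(parts[0], "=", parts[1])
```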
In order to calculate the CPI (cycles per instruction) we need the total cache misses for the L1 instruction and data caches and the total cache misses for the L2 cache. We also need the miss penalties and hit penalties for both cache levels and, finally, the total number of instructions executed. The formula is very simple:
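Written out, the formula described here takes roughly the following form (a reconstruction from the description above, with penalty_L1 meaning the extra cycles paid when a request misses in the L1 and is served by the L2, and penalty_L2 the extra cycles paid when it also misses in the L2 and goes to DRAM):

$$\mathrm{CPI} = 1 + \frac{\text{misses}_{L1}\cdot\text{penalty}_{L1} + \text{misses}_{L2}\cdot\text{penalty}_{L2}}{\text{total instructions}}$$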
The first 1 is there because on an L1 miss we have already paid the L1 hit penalty. After that, all the cycles spent dealing with the misses are counted, and finally the average per instruction is obtained by dividing by the total number of instructions.
In our case the numbers of misses in the L1 cache are:
- 327 misses for the instruction cache:
system.cpu_cluster.cpus.icache.demand_misses::.cpu_cluster.cpus.inst 327 # number of demand (read+write) misses
- 177 misses for the data cache:
system.cpu_cluster.cpus.dcache.demand_misses::.cpu_cluster.cpus.data 177 # number of demand (read+write) misses
The L2 cache is unified, meaning there is no separate instruction and data part, so the misses for the L2 cache are calculated by adding its instruction-side and data-side misses.
In our case the numbers of misses in the L2 cache are:
- 327 misses for instructions:
system.cpu_cluster.l2.demand_misses::.cpu_cluster.cpus.inst 327 # number of demand (read+write) misses
- 147 misses for data:
system.cpu_cluster.l2.demand_misses::.cpu_cluster.cpus.data 147 # number of demand (read+write) misses
The number of instruction misses is the same for the L1 and L2 caches because the distinct instructions executed fit inside both the L1 instruction cache and the L2 cache. That means each instruction was requested by the CPU and loaded from DRAM once, so for every instruction fetched there was an initial compulsory miss and no further misses in either level.
On the other hand, there are more data misses in the L1 cache than in the L2 cache. Data were used more than once: after the first compulsory miss and the fetch from DRAM into the L2 cache, the working set was small enough to stay in the L2 cache but not small enough to fit in the L1 cache all at once. So there were cases where data requested by the CPU missed in the L1 cache but hit in the L2 cache.
Finally the total instructions simulated were:
sim_insts 5027 # Number of instructions simulated
Applying the above data to the equation gives CPI = 6.32 cycles per instruction
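As a quick sanity check, the totals that enter the formula and the overall stall cycles implied by the reported CPI work out as follows (plain arithmetic on the numbers listed above; the individual penalty values are not restated here):

$$\text{misses}_{L1} = 327 + 177 = 504, \qquad \text{misses}_{L2} = 327 + 147 = 474$$

$$(\mathrm{CPI} - 1)\cdot \text{sim\_insts} = (6.32 - 1)\times 5027 \approx 26{,}744 \ \text{stall cycles in total}$$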
- SimpleCPU - A good place to start learning about how to fetch, decode, execute, and complete instructions in M5.
- O3CPU - Specific documentation on how all of the pipeline stages work, and how to modify and create new CPU models based on it.
- Checker - Details how to use it in your CPU model.
- InOrderCPU - Specific documentation on how all of the pipeline stages work, and how to modify and create new CPU models based on it.
The InOrder CPU model was designed to provide a generic framework to simulate in-order pipelines with an arbitrary ISA and with arbitrary pipeline descriptions. The model was originally conceived by closely mirroring the O3CPU model to provide a simulation framework that operates at the "Tick" granularity. The individual stages of the O3 model were then abstracted to provide generic pipeline stages that the InOrder CPU leverages to create a user-defined number of pipeline stages. Additionally, each component that a CPU might need to access (ALU, Branch Predictor, etc.) is abstracted into a "resource" that needs to be requested by each instruction, according to the resource-request model that was implemented. This potentially allows researchers to model custom pipelines without the cost of designing the complete CPU from scratch.
Pipeline stages in the InOrder CPU are implemented as abstract implementations of what a pipeline stage would be in any CPU model. Typically, one would imagine a particular pipeline stage being responsible for:
- (1) Performing specific operations such as "Decode" or "Execute" and either
- (2a) Sending that instruction to the next stage if that operation was successful and the next stage's buffer has room for incoming instructions, or
- (2b) Keeping that instruction in the pipeline's instruction buffer if that operation was unsuccessful or there is no room in the next stage's buffer
The "PipelineStage" class maintains the functionality of (2a) and (2b) but abstracts (1) out of the implementation. More specifically, no pipeline stage is explicitly marked "Decode" or "Execute". Instead, the PipelineStage class allows the instruction and it's corresponding instruction schedule to define what tasks they want to do in a particular stage.
At the heart of the InOrderCPU model is the concept of Instruction Schedules (IS). Instruction schedules create the generic framework that allows developers to build a custom pipeline. A pipeline definition can be seen as a collection of instruction schedules that govern what an instruction will do in any given stage and what stage that instruction will go to next. In general, each instruction has a stage-by-stage list of tasks that need to be accomplished before moving on to the next stage; we refer to this list as the instruction's schedule. Each list is composed of "ScheduleEntry"s that define a task for the instruction to do in a given pipeline stage. Instruction scheduling is then divided into a front-end schedule (e.g. Instruction Fetch and Decode), which is uniform for all instructions, and a back-end schedule, which varies across the different instructions (e.g. an 'addu' instruction and a 'mult' instruction need to access different resources). The combination of a front-end schedule and a back-end schedule makes up the instruction schedule. Ideally, changing the pipeline can be as simple as editing how a certain class of instructions operates, by editing the instruction schedule functions. A small illustrative sketch of this idea is given below.
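To make the idea more concrete, here is a purely illustrative Python sketch of the schedule concept described above. The real InOrderCPU is implemented in C++ inside gem5; the names used here (ScheduleEntry, front_end, back_end) only stand in for the concepts and are not gem5's actual API.

```python
from dataclasses import dataclass

# Illustrative model of the "instruction schedule" idea, not gem5's C++ code.
@dataclass
class ScheduleEntry:
    stage: int       # pipeline stage this task belongs to
    resource: str    # resource requested in that stage (ALU, Branch Predictor, ...)

# Front-end schedule: identical for every instruction (fetch, decode).
front_end = [
    ScheduleEntry(stage=0, resource="Fetch Unit"),
    ScheduleEntry(stage=1, resource="Decode Unit"),
]

# Back-end schedules differ per instruction class: an add uses the ALU,
# a multiply requests a separate multiplier resource.
back_end = {
    "addu": [ScheduleEntry(stage=2, resource="ALU")],
    "mult": [ScheduleEntry(stage=2, resource="Multiplier")],
}

def instruction_schedule(opcode: str) -> list:
    """Full schedule = uniform front-end + instruction-specific back-end."""
    return front_end + back_end[opcode]

print([entry.resource for entry in instruction_schedule("mult")])
```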
The simpleExample.c file is a simple program that generates two 3x3 tables of random floating-point values. It was compiled with the following command:
arm-linux-gnueabihf-gcc --static tests/test-progs/simplyTableExample/simpleExample.c -o tests/test-progs/simplyTableExample/simpleExample.out
Then, having previously built gem5 with the command:
scons build/ARM/gem5.opt -j 2 --force-lto
we run the following command:
./build/ARM/gem5.opt -d TimeSimpleCPU configs/example/se.py --cmd=tests/test-progs/simplyTableExample/simpleExample.out --cpu-type=TimingSimpleCPU --caches
After the end of the simulation we get the TimeSimpleCPU_stats.txt file. The process is repeated for the MinorCPU model:
./build/ARM/gem5.opt -d MinorCPU configs/example/se.py --cmd=tests/test-progs/simplyTableExample/simpleExample.out --cpu-type=MinorCPU --caches
And we get the MinorCPU_stats.txt file after the end of the simulation.
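Before changing any parameters it is convenient to pull the final tick count out of each run so the two CPU models can be compared directly. The sketch below assumes the output directories created by the -d flag above (TimeSimpleCPU and MinorCPU) and reads the final_tick statistic from each stats.txt; adjust the paths if the stats files were renamed as mentioned above.

```python
# Sketch: compare the final tick count of the two runs described above.
def final_tick(stats_path):
    with open(stats_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == "final_tick":
                return int(parts[1])
    return None

for run in ("TimeSimpleCPU", "MinorCPU"):  # directories created by -d above
    print(run, final_tick(f"{run}/stats.txt"))
```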
We'll try to change the frequency of operation and the memory technology in both cases.
- To change the operating frequency for the TimingSimpleCPU we execute the following command:
./build/ARM/gem5.opt -d TimeSimpleCPU_changed configs/example/se.py --cmd=tests/test-progs/simplyTableExample/simpleExample.out --cpu-type=TimingSimpleCPU --cpu-clock=2.5GHz --caches
In the output we notice the difference in the final tick count:
Exiting @ tick 99151200 because exiting with last active thread context
compared with the original run, where we had not touched the frequency at all:
Exiting @ tick 114677000 because exiting with last active thread context
Subtracting, the difference between the run at the original frequency and the run at the new frequency is 114,677,000 - 99,151,200 = 15,525,800 ticks. This is due to the higher clock frequency while the size and architecture of the memory stay the same: the TimingSimpleCPU finishes faster at 2.5GHz than at 1GHz.
- To change the operating frequency for the MinorCPU we execute the following command:
./build/ARM/gem5.opt -d MinorCPU_changed configs/example/se.py --cmd=tests/test-progs/simplyTableExample/simpleExample.out --cpu-type=MinorCPU --cpu-clock=2.5GHz --caches
The simulation result taken from the output is:
Exiting @ tick 77432400 because exiting with last active thread context
Comparing it with the original run without any change in frequency:
Exiting @ tick 84846000 because exiting with last active thread context
As before, taking the difference gives 84,846,000 - 77,432,400 = 7,413,600 ticks, so the higher frequency again helps the MinorCPU, but by a smaller margin than before.
- We see that the clock frequency affects the behaviour of the processor considerably, but not equally in both cases: the effect is larger for the TimingSimpleCPU and smaller for the MinorCPU. The TimingSimpleCPU uses timing memory accesses: it delays on cache accesses and waits for the memory system to respond before proceeding. Like the AtomicSimpleCPU, the TimingSimpleCPU is derived from the BaseSimpleCPU and implements the same set of functions; it defines the port used to connect to the memory, connects the CPU to the cache, and defines the functions needed to handle the responses to the accesses it sends. The conclusion is that the TimingSimpleCPU benefits more from the frequency increase than the MinorCPU, as can be seen from the larger drop in ticks after the frequency change; a short worked comparison of the two speedups is given below.
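In numbers, the relative speedups implied by the tick counts above are (plain division of the quoted tick counts):

$$\text{TimingSimpleCPU: } \frac{114{,}677{,}000}{99{,}151{,}200} \approx 1.16, \qquad \text{MinorCPU: } \frac{84{,}846{,}000}{77{,}432{,}400} \approx 1.10$$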
- To change the RAM architecture for the TimingSimpleCPU we execute the following command:
./build/ARM/gem5.opt -d TimingSimple__changed_mem_type configs/example/se.py --cmd=tests/test-progs/simplyTableExample/simpleExample.out --cpu-type=TimingSimpleCPU --mem-type=DDR4_2400_16x4 --caches
As before, in the output we notice the difference in the final tick count:
Exiting @ tick 114281000 because exiting with last active thread context
compared to the initial results, where we had not touched the memory technology at all:
Exiting @ tick 114677000 because exiting with last active thread context
Subtracting the tick counts, the difference between the original memory technology and the new one is 114,677,000 - 114,281,000 = 396,000 ticks. This is due to the higher transfer rate of the new memory, 2.4 x 16 x 4 x 1/8 = 19.2GBps, while the CPU frequency stays at the original value.
- To change the RAM architecture for the MinorCPU we execute the following command:
./build/ARM/gem5.opt -d MinorCPU_changed_mem_type configs/example/se.py --cmd=tests/test-progs/simplyTableExample/simpleExample.out --cpu-type=MinorCPU --mem-type=DDR4_2400_16x4 --caches
And the output is:
Exiting @ tick 80777000 because exiting with last active thread context
compared to the initial results, where we had not touched the memory technology at all:
Exiting @ tick 84846000 because exiting with last active thread context
As before, taking the difference gives 84,846,000 - 80,777,000 = 4,069,000 ticks, a smaller improvement for the MinorCPU than the one produced by the frequency change, this time coming from the different memory technology.
- So we understand that the memory technology moderately affects the operation of the processor. The TimingSimpleCPU uses timing memory accesses, so here the faster RAM with its higher GBps transfer rate helps it and the disadvantages seen earlier are reduced. The MinorCPU, on the other hand, is an in-order model with a fixed pipeline (although with configurable data structures and execution behaviour); because the pipeline is fixed, once the speed of transfers to and from memory passes the upper limit the pipeline can absorb, further memory speed stops helping it, and in that respect it scales worse than the TimingSimpleCPU. The conclusion is that the MinorCPU remains faster than the TimingSimpleCPU, as can be seen from both tick counts after the memory technology change; a short worked comparison follows.
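In the same way, the relative improvements implied by the tick counts of the memory-technology experiment are:

$$\text{TimingSimpleCPU: } \frac{114{,}677{,}000}{114{,}281{,}000} \approx 1.003, \qquad \text{MinorCPU: } \frac{84{,}846{,}000}{80{,}777{,}000} \approx 1.05$$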
The introduction to the GEM5 simulator was a very nice experience that revealed aspects of the field that are not so visible to everyone. Every architecture is different, and every new or old architecture has required a great deal of research and study to get many things right; for example, changing a single memory-related sub-parameter gave results that were sometimes dramatically bad and sometimes spectacularly good. There has to be a "golden mean" between all the subsystems, and everything must work in harmony without any distortion, so that the real product ends up very close to what we see in the simulation. The conclusion of the whole process is that creating a new architecture that meets the needs of technological progress and of the exploration of many aspects of nature requires a whole team. Of course there are also physical limits, such as the memory gap between the speed of processors and the speed of memories, which sooner or later will be pushed further by new innovations.
In general there was no particular difficulty with the simulator, apart from having to compile it with the scons command, where there were linking problems and other problems with the memory being exhausted even when we limited the build to N = 1 jobs. This problem was solved by providing us with a ready-made VirtualMachine with GEM5 already installed.