QEMU execution flow
Starting from this section, we explain the high-level ideas introduced in the previous sections by walking through the code. We start from the QEMU “main” function. The main execution loop of QEMU is in a function named cpu_exec, which is reached in a roundabout way. For the x86 architecture, the call trace to cpu_exec starts from the pc_init_pci function. This function is the init function of an abstraction object called QEMUMachine. After initialization, QEMU calls it from “main” through a machine->init call (line 3397, vl.c). Setting pc_init_pci as the init function of QEMUMachine happens even before the QEMU “main” function is called. QEMU sets a constructor function that is called automatically before the main module is loaded. This constructor calls the register_module_init function to register QEMUMachine objects with their pc_init_pci function. Figure 3 shows the execution trace in the .init constructor of the main module. find_type finds the relevant abstraction struct, QEMUMachine in this case, and sets the function pointer. QEMU works this way in order to be as platform neutral as possible. Many of the initializations above are based on configuration choices made at compile time, before QEMU is ever executed.
Figure 3. .init execution
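The registration-before-main mechanism described above can be sketched in a few lines of C. This is a minimal illustration, not QEMU's code: the Machine struct, register_machine, find_machine and pc_init are hypothetical stand-ins for QEMUMachine, register_module_init, find_type and pc_init_pci, and the sketch assumes a GCC or Clang compiler, since `__attribute__((constructor))` is a compiler extension.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for QEMUMachine: a name plus an init hook. */
typedef struct Machine {
    const char *name;
    void (*init)(void);
} Machine;

#define MAX_MACHINES 8

static Machine *machines[MAX_MACHINES];
static size_t machine_count;
static int pc_initialized;

/* Adds a machine to the registry (the role the text gives to register_module_init). */
static void register_machine(Machine *m)
{
    assert(machine_count < MAX_MACHINES);
    machines[machine_count++] = m;
}

/* Stand-in for pc_init_pci: only records that it was called. */
static void pc_init(void)
{
    pc_initialized = 1;
}

static Machine pc_machine = { "pc", pc_init };

/* GCC/Clang extension: this function runs before main() is entered. */
__attribute__((constructor))
static void pc_machine_register(void)
{
    register_machine(&pc_machine);
}

/* Looks up a registered machine by name; returns NULL if there is none. */
static Machine *find_machine(const char *name)
{
    for (size_t i = 0; i < machine_count; i++) {
        if (strcmp(machines[i]->name, name) == 0) {
            return machines[i];
        }
    }
    return NULL;
}
```

By the time main starts, the registry already contains the "pc" entry, so main only needs to look the machine up and call its init pointer, which mirrors the machine->init call mentioned above.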
The “main” function initializes several parameters and prepares the QEMU environment for execution. The “main” function for the emulator lives in the vl.c file. The list of initialization actions is long. We discuss four important initialization actions in this section. We discuss two more in the section on __ldb_mmu and the glue mechanism (see softmmu_template.h, line 94), and we encourage the interested reader to read the code for the complete list. Figure 4 shows the state diagram for the four important initializations. Through the MODULE_INIT_MACHINE macro, QEMU calls the pc_machine_init function, which registers the functions that eventually lead to the execution of the QEMU main loop. Through the configure_accelerator function, QEMU accomplishes several tasks, among them allocating tcg translation blocks, allocating code caches in the form of an executable page, and planting the tcg_prologue function in memory. Further, QEMU calls the cpu_exec_init_all function, which, among other things, initializes the mapped IO memories. Finally, QEMU calls the init function of the QEMUMachine object, which is pc_init_pci. The calls made from this function lead to the execution of the main loop, which runs in a separate thread for the whole life cycle of QEMU.
Figure 4. Main initialization flow
The execution of the main loop starts after the init function of the QEMUMachine, pc_init_pci, is called. This call leads to the creation of a thread that executes the main loop. The thread creation trace is shown in Figure 5. The new thread runs the qemu_tcg_cpu_thread_fn function.
Figure 5. thread creation
qemu_tcg_cpu_thread_fn finally calls the cpu_exec function, which contains the main execution loop. The execution trace leading to cpu_exec is depicted in Figure 6. In each iteration, cpu_exec does the following. First, it serves any pending exception or interrupt. In both cases, it calls the cpu_loop_exit function. If there is no pending exception or interrupt, the main loop within cpu_exec fetches the next block and executes it. Fetching and executing basic blocks (blocks of code without a branch or jump) involves several mechanisms. In the next subsections, we explain three of these mechanisms: block translation, block interpretation, and block chaining (patching).
Figure 6. trace to cpu_exec
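The shape of this loop can be sketched as follows. This is a toy model under stated assumptions, not the cpu_exec source: ToyCPU and all toy_ functions are hypothetical, exceptions are raised artificially after every third block, and cpu_loop_exit is modelled as a longjmp back to a setjmp point at the top of the loop, which is our assumption about how control returns there.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical toy CPU state; the real state object is far larger. */
typedef struct ToyCPU {
    int blocks_left;        /* blocks still to execute before the guest "halts" */
    int blocks_executed;
    int exception_pending;
    int exceptions_served;
    jmp_buf jmp_env;        /* re-entry point of the main loop */
} ToyCPU;

/* Stand-in for cpu_loop_exit: abandon the current iteration and restart the loop. */
static void toy_cpu_loop_exit(ToyCPU *cpu)
{
    longjmp(cpu->jmp_env, 1);
}

/* Serves the pending exception, then leaves through toy_cpu_loop_exit. */
static void toy_serve_exception(ToyCPU *cpu)
{
    cpu->exception_pending = 0;
    cpu->exceptions_served++;
    toy_cpu_loop_exit(cpu);
}

/* Pretends to fetch and run one block; every third block raises an exception. */
static void toy_exec_next_block(ToyCPU *cpu)
{
    cpu->blocks_executed++;
    cpu->blocks_left--;
    if (cpu->blocks_executed % 3 == 0) {
        cpu->exception_pending = 1;
    }
}

/* Toy main loop: serve exceptions first, otherwise execute the next block. */
static void toy_cpu_exec(ToyCPU *cpu)
{
    for (;;) {
        if (setjmp(cpu->jmp_env) == 0) {
            for (;;) {
                if (cpu->exception_pending) {
                    toy_serve_exception(cpu);   /* does not return here */
                }
                if (cpu->blocks_left == 0) {
                    return;
                }
                toy_exec_next_block(cpu);
            }
        }
        /* A longjmp from toy_cpu_loop_exit lands here; start a new iteration. */
    }
}
```

Using a non-local jump lets exception handling abandon a block from anywhere in the call chain without threading an error code through every caller.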
Block translation
The execution of the guest instructions is based on the translation of the basic blocks of the executable. In the beginning, QEMU translates each basic block it encounters. Afterwards, if a basic block is to run again (for instance, on a second call to a function), QEMU returns a reference to the already translated block. The former is done via tb_find_slow and the latter via tb_find_fast. More precisely, cpu_exec makes only one call, to tb_find_fast, which in turn calls tb_find_slow if it cannot serve the request from already translated code blocks. The lookup is done via the Address Lookup Table (ALT). The ALT is implemented as a hash table referenced by the tb_jmp_cache field of the CPUState object. tb_jmp_cache hashes the next PC value (which tracks the CPU instruction pointer) and searches based on this hash value.
There are in fact two levels of caching for acceleration. The first level is in tb_find_fast, as explained above. This cache lets us find the reference to the cached machine code if we have it. The second level is in tb_find_slow. tb_find_slow caches the tcg code and serves the request from this cache whenever it can. This happens in lines 100 to 121 of cpu-exec.c.
tb_find_slow starts the translation process if the guest code has not been translated before. To translate, tb_find_slow calls the tb_gen_code function. tb_gen_code first gets a reference to an empty translation block cache. Then, it calls the get_page_addr_code function, which gives a reference to the guest binary code. tb_gen_code then calls cpu_gen_code, which does the tcg translation and writes it to the translation block cache. Finally, tb_gen_code creates an entry in its cache table by calling tb_link_page with the PC address, the translation block pointer and a pointer to the physical page of the guest code.
The actual disassembly of the instructions happens in the disas_insn function, which is called indirectly by cpu_gen_code: cpu_gen_code calls the gen_intermediate_code function, which calls disas_insn for disassembly. gen_intermediate_code takes into consideration special cases such as exceptions, traps and interrupts and acts accordingly. The translation block is described by struct TranslationBlock.
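The two-level lookup can be sketched as below. This is a simplified model rather than QEMU's implementation: ToyTB and the toy_ functions are hypothetical, blocks are keyed by the guest PC alone, and "translation" only allocates a block and counts the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down translation block: only the guest PC is kept. */
typedef struct ToyTB {
    uint32_t pc;          /* guest PC this block was translated from */
    struct ToyTB *next;   /* next block in the same slow-table bucket */
} ToyTB;

#define JMP_CACHE_SIZE  16    /* must be a power of two */
#define SLOW_TABLE_SIZE 64
#define MAX_TBS         128

static ToyTB *jmp_cache[JMP_CACHE_SIZE];    /* first level: direct-mapped by PC */
static ToyTB *slow_table[SLOW_TABLE_SIZE];  /* second level: chained hash of all blocks */
static ToyTB tb_pool[MAX_TBS];
static size_t tb_count;
static int translations;                    /* how often the "translator" ran */

/* Stand-in for tb_gen_code: allocate a block and link it into the slow table. */
static ToyTB *toy_tb_gen_code(uint32_t pc)
{
    assert(tb_count < MAX_TBS);
    ToyTB *tb = &tb_pool[tb_count++];
    tb->pc = pc;
    tb->next = slow_table[pc % SLOW_TABLE_SIZE];
    slow_table[pc % SLOW_TABLE_SIZE] = tb;
    translations++;
    return tb;
}

/* Second level: search all known blocks, translate only on a miss. */
static ToyTB *toy_tb_find_slow(uint32_t pc)
{
    ToyTB *tb;
    for (tb = slow_table[pc % SLOW_TABLE_SIZE]; tb != NULL; tb = tb->next) {
        if (tb->pc == pc) {
            break;
        }
    }
    if (tb == NULL) {
        tb = toy_tb_gen_code(pc);
    }
    jmp_cache[pc & (JMP_CACHE_SIZE - 1)] = tb;   /* refill the first level */
    return tb;
}

/* First level: one array access; fall back to the slow path on a miss. */
static ToyTB *toy_tb_find_fast(uint32_t pc)
{
    ToyTB *tb = jmp_cache[pc & (JMP_CACHE_SIZE - 1)];
    if (tb == NULL || tb->pc != pc) {
        tb = toy_tb_find_slow(pc);
    }
    return tb;
}
```

Two PCs that collide in the small first-level cache evict each other there, but the evicted block is still found by the slow path, so it is never translated twice.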
After disassembly and generation of the tcg code, it is time to interpret the tcg code and generate the host executable code. This happens in tcg_gen_code. tcg_gen_code calls tcg_gen_code_common, which reads the tcg instructions in a for loop and produces the machine code. More specifically, this function writes to the code cache pointed to by the code_ptr field of a TCGContext data structure. Most instructions are not handled directly in this function but in the tcg_reg_alloc_op function. In fact, only a few instructions, such as MOV, MOVI and CALL, are handled directly in tcg_gen_code_common. For the rest, tcg_gen_code_common calls tcg_reg_alloc_op. After some preprocessing, tcg_reg_alloc_op calls tcg_out_op, which checks the operation and the operands in the data structure and writes the binary code.
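The structure of that loop can be illustrated with a toy emitter. Everything here is hypothetical: the ToyOp instruction set, the byte values written to the buffer (they are not real host encodings) and the toy_ function names. Only the shape follows the description above: a for loop over intermediate instructions, a few opcodes handled inline, the rest delegated to a helper, and all output written through a code_ptr cursor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical intermediate instruction set. */
typedef enum {
    TOY_OP_MOVI, TOY_OP_MOV, TOY_OP_ADD, TOY_OP_SUB, TOY_OP_END
} ToyOpcode;

typedef struct {
    ToyOpcode opc;
    uint8_t args[2];
} ToyOp;

/* Like the code_ptr of TCGContext: a cursor into the code buffer. */
typedef struct {
    uint8_t *code_ptr;
} ToyContext;

static void toy_out8(ToyContext *s, uint8_t v)
{
    *s->code_ptr++ = v;
}

/* Generic path (the role of tcg_out_op): pick an encoding, then write the operands. */
static void toy_out_op(ToyContext *s, const ToyOp *op)
{
    switch (op->opc) {
    case TOY_OP_ADD: toy_out8(s, 0xA0); break;   /* made-up encoding */
    case TOY_OP_SUB: toy_out8(s, 0xA1); break;   /* made-up encoding */
    default: assert(0 && "unexpected opcode");
    }
    toy_out8(s, op->args[0]);
    toy_out8(s, op->args[1]);
}

/* Main loop: every op emits exactly 3 bytes; returns the number of bytes written. */
static size_t toy_gen_code(ToyContext *s, const ToyOp *ops, uint8_t *buf, size_t cap)
{
    s->code_ptr = buf;
    for (const ToyOp *op = ops; op->opc != TOY_OP_END; op++) {
        assert((size_t)(s->code_ptr - buf) + 3 <= cap);
        switch (op->opc) {
        case TOY_OP_MOVI:                        /* handled directly in the loop */
            toy_out8(s, 0x10);
            toy_out8(s, op->args[0]);
            toy_out8(s, op->args[1]);
            break;
        case TOY_OP_MOV:                         /* handled directly in the loop */
            toy_out8(s, 0x11);
            toy_out8(s, op->args[0]);
            toy_out8(s, op->args[1]);
            break;
        default:                                 /* everything else is delegated */
            toy_out_op(s, op);
            break;
        }
    }
    return (size_t)(s->code_ptr - buf);
}
```

Register allocation, which tcg_reg_alloc_op performs before emitting, is left out of the sketch entirely.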
- Talking about the x86, tcg_op and tcg_opc
QEMU executes one block of the guest code at a time. After a block of code is executed on the host, control must be transferred to the next block; this is done through block chaining. With block chaining, we either jump directly to the next translated block or jump back to the emulation manager (the QEMU main loop). Initially, control is often transferred back to the QEMU main loop, but after each block has been seen once, the blocks are chained together and we no longer have to return to the emulation manager.
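The effect of chaining can be shown with a small model. This is a sketch under simplifying assumptions, not QEMU's patching code: ToyBlock and the toy_ functions are hypothetical, every block has a single fixed successor, and the patched jump is represented by a pointer instead of a rewritten jump instruction in generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical translated block with one successor. */
typedef struct ToyBlock {
    int next_id;               /* index of the guest successor, -1 ends the program */
    struct ToyBlock *chained;  /* patched direct jump; NULL means "return to main loop" */
    int run_count;
} ToyBlock;

/* Runs tb and every block chained behind it; returns the last block executed. */
static ToyBlock *toy_tb_exec(ToyBlock *tb)
{
    for (;;) {
        tb->run_count++;
        if (tb->chained == NULL) {
            return tb;         /* unchained exit: back to the main loop */
        }
        tb = tb->chained;      /* chained exit: go straight to the next block */
    }
}

/* Toy main loop; returns how many times control came back to it. */
static int toy_run(ToyBlock *blocks, int start)
{
    int returns = 0;
    ToyBlock *tb = &blocks[start];
    for (;;) {
        ToyBlock *last = toy_tb_exec(tb);
        returns++;
        if (last->next_id < 0) {
            return returns;
        }
        tb = &blocks[last->next_id];
        last->chained = tb;    /* patch the exit so the next run skips the main loop */
    }
}
```

On a three-block straight-line program, the first run returns to the main loop after every block, while a second run over the same blocks returns only once, at the end.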
How a block is executed depends on the host architecture. If the host architecture is supported by QEMU, the block has been translated to binary code, and control must be transferred to that binary block. This is done through the tcg_qemu_tb_exec call in the cpu_exec function (see Figure 7). For an architecture-specific execution, this call leads to the execution of code_gen_prologue, which is essentially a call to the beginning of a binary code block. A binary code block, or code buffer as it is called in the QEMU source, is composed of three parts (see Figure 7). Execution of a translation block starts with the prologue. The prologue does the following:
- Saving the registers (including architecture specific registers such as xmm and mmx registers)
- Adjusting the stack size
- Setting the frame pointer
- Transferring the control to the translation block
Figure 7. Translation block
After the block finishes executing, control is transferred to the epilogue. The jump to the epilogue is inserted at the end of each block. The end of a block is marked by jump-like instructions such as jump or call. The tcg instruction that denotes the jump to the epilogue is INDEX_op_exit_tb. On seeing this instruction, the interpreter inserts a jump to the epilogue as below:
tcg_out_op () {
    switch (opc) {
    case INDEX_op_exit_tb:
        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_EAX, args[0]);  // sina: the return value in EAX
        tcg_out_jmp(s, (tcg_target_long) tb_ret_addr);        // sina: note that inserting jump to Epilogue happens here
        break;
    }
}