Optimizing VMCS reads/writes with a new cache implementation #130
base: master
Conversation
(Force-pushed from b938039 to ba7cd13.)
Thanks! You've identified an important problem, and I like your systematic approach to solving it. I'm anxious to see how much performance improvement this PR will eventually make, and maybe not just for the nested virtualization use case :-)
So far I've only finished reviewing the first commit, and please allow me more time for the rest. Meanwhile, I have a question about the VMREAD/VMWRITE data you collected, which I'll ask in a separate post.
@@ -187,7 +173,6 @@ struct vcpu_state_t {
     uint32_t _activity_state;
     uint32_t pad;
-    interruptibility_state_t _interruptibility_state;
I agree with the removal of this unused field, but vcpu_state_t is part of the HAXM API, so let's handle this with care:
- Replace this field with a placeholder, uint64_t pad2, so HAXM and QEMU always agree on the size of vcpu_state_t.
- Update api.md accordingly.
- Prepare a corresponding patch for QEMU's target/i386/hax-interface.h.
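The suggestion above can be sketched as follows. This is a hypothetical illustration, not the real HAXM layout: the field names, the placeholder, and the asserted size are assumptions chosen for the example. The idea is that the removed field is replaced by a same-size placeholder, and a compile-time check pins the ABI size shared between HAXM and QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical excerpt of vcpu_state_t: the removed field is replaced
 * by a placeholder of the same size, so the struct that HAXM and QEMU
 * exchange keeps an identical layout. */
typedef struct vcpu_state_t {
    uint32_t _activity_state;
    uint32_t pad;
    uint64_t pad2;  /* placeholder replacing the removed field */
} vcpu_state_t;

/* A compile-time check pins the ABI size, so any accidental layout
 * change breaks the build instead of silently desyncing HAXM and QEMU. */
_Static_assert(sizeof(vcpu_state_t) == 16,
               "vcpu_state_t must keep its API size");
```

The same `_Static_assert` would also belong in the corresponding QEMU header, so both sides fail loudly if they ever disagree.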
First of all, let me share the "decoder" you sent me offline, in case other people are wondering how to interpret the logs:
I'm just surprised by how small [...]. But if you look at [...]. For the ease of discussion, let me provide an outline of the current VM entry-exit workflow:

Nice catch! On HAXM, the counter is incremented in vmread/vmx_vmwrite, which is fine, since on 64-bit each will generate one single vmread/vmwrite instruction. The KVM counters also increment correctly. However, on HAXM, I had placed the counter display/reset before and after [...]. I repeated the benchmarks for the unpatched code, and these are the updated results:

The updated full logs (and the diffs to replicate them!) are available at: vmx-measurements.zip.
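The counting discussed above can be sketched roughly as follows. This is a hypothetical illustration of the instrumentation approach, not the actual patch: the function names, counter placement, and the stubbed-out vmread/vmwrite bodies are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical instrumentation: count vmread/vmwrite calls between a
 * guest exit and the next guest entry, then dump and reset the counters
 * at the entry boundary. */
static uint64_t nr_vmreads, nr_vmwrites;

static uint64_t vmread_counted(uint64_t component) {
    nr_vmreads++;
    (void)component;
    /* return asm_vmread(component); -- stubbed for illustration */
    return 0;
}

static void vmwrite_counted(uint64_t component, uint64_t value) {
    nr_vmwrites++;
    (void)component; (void)value;
    /* asm_vmwrite(component, value); -- stubbed for illustration */
}

/* Called just before VM entry: the window since the previous entry
 * covers exactly one exit-to-entry round trip. */
static void dump_and_reset_counters(void) {
    printf("vmreads=%llu vmwrites=%llu\n",
           (unsigned long long)nr_vmreads,
           (unsigned long long)nr_vmwrites);
    nr_vmreads = nr_vmwrites = 0;
}
```

Placing the dump/reset exactly at the entry boundary matters: as noted above, putting it elsewhere attributes reads/writes to the wrong window and skews the per-exit averages.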
- Moved vcpu_vmx_data and interruptibility_state_t from {vcpu,vcpu_state}.h to vmx.h.
- Removed vcpu_state_t::interruptibility_state_t since it's not used.

Signed-off-by: Alexandro Sanchez Bach <[email protected]>
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
- Replaced vmcs_pending* fields with the new automatically-generated function that updates the VMCS based on vmcs_cache_w flags. Only `vmcs_pending_guest_cr3` has been preserved, since it requires extra logic.
- Removed unnecessary vmread's at the end of `cpu_vmx_execute` that are not always required. Corresponding members in `vcpu_vmx_data` have been removed, and functions depending on them now call `vmcs_read`.

Signed-off-by: Alexandro Sanchez Bach <[email protected]>
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
(Force-pushed from ad2dd49 to 5636f62.)
core/vmx.c
Outdated
#define COMP_PENDING(cache_r, cache_w, width, name) \
    COMP_PENDING_##cache_w(name)

void vcpu_handle_vmcs_pending(struct vcpu_t* vcpu)
You have changed the semantics of this function, but not where it is called. Now that it has the responsibility of flushing all the dirty VMCS values, we should delay it as much as possible, in case vmcs_write gets called again after the flush but before VM entry (asm_vmxrun). Probably we need to draw a line somewhere in cpu_vmx_execute or in cpu_vmx_run:
// Cached vmcs_write() is preferred over uncached vmwrite()
vmcs_write(..);
vmcs_write(..);
...
vcpu_handle_vmcs_pending(vcpu);
// *** No cached vmcs_write allowed after this point ***
vmwrite(..);
vmwrite(..);
...
// VM entry
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
Thanks, the new data is plausible. Clearly there's a lot of room for improvement for HAXM, to look on the bright side :-)
I've fixed some of the issues you mentioned. To make reviewing easier, I'll avoid rewriting the commit history until this is ready to merge. Then, I'll squash and re-sign everything into the first three commits, i.e.:
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
EDIT: Sorry, I clicked the wrong button and ended up closing/reopening the pull request.
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
(Force-pushed from aa20c04 to 8e49240.)
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
(Force-pushed from c12dd15 to 689c04d.)
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
(Force-pushed from 689c04d to 61779ca.)
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
Signed-off-by: Alexandro Sanchez Bach <[email protected]>
(Force-pushed from e616045 to b00b798.)
What is the status of this PR? My Windows 7 guest boots and runs pretty slowly compared to VMware, and I want to make some enhancements to speed it up. This PR is a good starting point.
Thanks for @nevilad's comments. I noticed that @AlexAltea added new commits a month ago, and @AlexAltea once stated that the pull request was not complete. @AlexAltea, could you help rebase the current pull request onto the head of the master branch first? Then we can consider proceeding with it, and @nevilad can start the enhancements based on it. Thanks to both of you.
I don't see new comments; the latest is from November 2018.
(Force-pushed from 563eb1b to 6b942e3.)
(Force-pushed from b73a231 to da1b8ec.)
Disclaimer: This patch isn't finished, but it's available for early reviews/discussion.
Motivation
During development, running HAXM in a virtual machine protects your "bare-metal" host system from kernel panics when runtime errors occur. The downside of nested virtualization is that during L2-guest VM-exit events, the L1-hypervisor (i.e. HAXM) might need to perform actions, e.g. vmread/vmwrite, that trigger another exit to the L0-hypervisor (i.e. KVM, VMware, etc.). These recursive exits add latency, so it's desirable to minimize them.
While booting a few kernels using both HAXM and KVM as the L1-hypervisor, under the same conditions, HAXM appears to be 4-5 times slower than KVM (related: #40). After measuring the number of vmread/vmwrite's between guest-exit and guest-enter events, I noticed HAXM's count is (unsurprisingly) 4-5 times higher. Specifically:

- KVM: does 4-8 reads and 0-4 writes on average, depending on the VMX exit handler (originally measured as 6-10 reads and 2-6 writes).
- HAXM: does 32 reads and 1 write, nearly always (originally measured as 32-33 reads and 21-23 writes).

Here's the raw data, corresponding to around the first few seconds of trying to boot a Windows 7 install disc: vmx-measurements.zip (haxm-vmx.log and kvm-vmx.log). Efforts at 96af3d2 (thanks @junxiaoc!) have improved this situation, but there's still room for improvement.

Changes
Previous efforts, and other hypervisors like KVM, have minimized vmread/vmwrite usage by caching the component values with hand-written helper functions. While this works, it doesn't scale well to the 120+ components. Moreover, it's not trivial to decide what should be cached: caching unused components is counterproductive, and fine-tuning the set of cached components in manually-implemented code is too time-consuming.
Instead, this patch relies on preprocessor X-Macros to automatically generate the cache structures, alongside cached/dirty flags and helper functions to access all VMCS components. This Godbolt snippet allows inspecting the preprocessor output for a clear view of the resulting code (ideally, after applying clang-format).
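The X-Macro approach can be sketched as follows. This is a heavily trimmed, hypothetical illustration of the technique, not the actual patch: the component list, field names, and helper names are assumptions, and the vmread/vmwrite fallbacks are stubbed out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One row per VMCS component: COMP(width, name). The real patch lists
 * 120+ components here. */
#define VMCS_COMPS(COMP)          \
    COMP(uint32_t, exit_reason)   \
    COMP(uint64_t, guest_rip)     \
    COMP(uint64_t, guest_rsp)

/* Expand the list once into cache storage plus cached/dirty flags. */
struct vmcs_cache {
#define DECL_FIELD(width, name) \
    width name; bool name##_cached; bool name##_dirty;
    VMCS_COMPS(DECL_FIELD)
#undef DECL_FIELD
};

/* Expand the same list again into per-component accessors. A real
 * implementation would issue vmread on a cache miss and defer vmwrite
 * until the dirty flags are flushed before VM entry. */
#define DECL_ACCESSORS(width, name)                                   \
    static inline width vmcs_read_##name(struct vmcs_cache *c) {      \
        if (!c->name##_cached) { /* c->name = vmread(...); */         \
            c->name##_cached = true;                                  \
        }                                                             \
        return c->name;                                               \
    }                                                                 \
    static inline void vmcs_write_##name(struct vmcs_cache *c,        \
                                         width value) {               \
        c->name = value;                                              \
        c->name##_cached = true;                                      \
        c->name##_dirty = true; /* flushed before VM entry */         \
    }
VMCS_COMPS(DECL_ACCESSORS)
#undef DECL_ACCESSORS
```

The key property is that adding or removing a cached component is a one-line change to the component list; the storage, flags, and accessors all regenerate automatically, instead of requiring hand-written helpers per component.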
Summarizing the changes, this patch:
- [...] cpu_vmx_execute.

Benchmarks
Measuring the time elapsed from booting a virtual machine with a Windows XP SP2 installation disc until the installer shows the "welcome screen", using an Ubuntu 18.04 (+ Linux 4.17) host. This is a particularly bad scenario, showing up to a 10x slowdown. (Disclaimer: time was measured manually with my phone.)
Pending